Re: [Haskell-cafe] Text.JSON and utf8

2013-02-16 Thread Iavor Diatchki
Hello Martin,

the change that you propose seems to already be in json-0.7.  Perhaps you
just need to 'cabal update' and install the most recent version?

About your other question:  I have not used CouchDB, but a common mistake is
to mix up strings and bytes.  Perhaps the `getDoc` function does not do
utf-8 decoding and so it is giving you back a list of bytes (as a String)?

In general, the JSON package only converts between JSON and String, and is
agnostic to what encoding is used to represent the strings.   There are
other packages that convert Strings into bytes (e.g.,
http://hackage.haskell.org/package/utf8-string), so typically you want to
encode the string to bytes before you export it (say to CouchDB), and
decode it back into a string just after you've imported it.

-Iavor






___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Text.JSON and utf8

2013-02-11 Thread Martin Hilbig

hi,

tl;dr: i propose this patch to Text/JSON/String.hs and would like to
know why it is needed:

@@ -375,7 +375,7 @@
   where
   go s1 =
     case s1 of
-      (x   :xs) | x < '\x20' || x > '\x7e' -> '\\' : encControl x (go xs)
+      (x   :xs) | x < '\x20' -> '\\' : encControl x (go xs)
       ('"' :xs)              -> '\\' : '"'  : go xs
       ('\\':xs)              -> '\\' : '\\' : go xs
       (x   :xs)              -> x    : go xs


i recently stumbled upon CouchDB telling me i'm sending invalid json.

i basically read lines from a utf8 file with german umlauts and send
them to CouchDB using Text.JSON and Database.CouchDB.

  $ file lines.txt
  lines.txt: UTF-8 Unicode text

let's take 'ö' as an example. i use LANG=de_DE.utf8

ghci tells

> 'ö'
'\246'

> putChar '\246'
ö

> putChar 'ö'
ö

> :m + Text.JSON Database.CouchDB
> runCouchDB' $ newNamedDoc (db "foo") (doc "bar")
    (showJSON $ toJSObject [("test","ö")])

*** Exception: HTTP/1.1 400 Bad Request
Server: CouchDB/1.2.1 (Erlang OTP/R15B03)
Date: Mon, 11 Feb 2013 13:24:49 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 48
Cache-Control: must-revalidate

couchdb log says:

  Invalid JSON: {{error,{10,lexical error: invalid bytes in UTF8 string.\n}},{\test\:\F6\}}


this is indeed hex ö:

> :m + Numeric
> putChar $ toEnum $ fst $ head $ readHex "f6"
ö

if i apply the above patch and reinstall JSON and CouchDB the doc
creation works:

> runCouchDB' $ newNamedDoc (db "db") (doc "foo")
    (showJSON $ toJSObject [("test", "ö")])
Right someRev

but i don't get back the ö i expected:

> Just (_,_,x) <- runCouchDB' $ getDoc (db "foo") (doc "bar")
    :: IO (Maybe (Doc,Rev,JSObject String))
> let Ok y = valFromObj "test" =<< readJSON x :: Result String
> y
"\195\188"
> putStrLn y
ü
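
A value such as "\195\188" is what a two-byte UTF-8 sequence looks like when each byte is stored in its own Char instead of being decoded. Assuming that is what comes back here (not confirmed in this thread), a minimal sketch of decoding such a String could look like this. `decodeByteString` is a hypothetical helper written against base only; it expects every Char to be in the range 0..255, replaces anything malformed with U+FFFD, and does not reject overlong encodings.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)

-- | Decode a String whose Chars are really UTF-8 bytes (hypothetical helper).
decodeByteString :: String -> String
decodeByteString = go . map ord
  where
    go [] = []
    go (b : bs)
      | b < 0x80              = chr b : go bs            -- plain ASCII
      | b >= 0xC0 && b < 0xE0 = multi 1 (b .&. 0x1F) bs  -- 2-byte sequence
      | b >= 0xE0 && b < 0xF0 = multi 2 (b .&. 0x0F) bs  -- 3-byte sequence
      | b >= 0xF0 && b < 0xF8 = multi 3 (b .&. 0x07) bs  -- 4-byte sequence
      | otherwise             = '\xFFFD' : go bs         -- stray or invalid byte

    -- Consume n continuation bytes and fold their payload bits into acc.
    multi n acc bs =
      let (cs, rest) = splitAt n bs
      in if length cs == n && all isCont cs
           then safeChr (foldl (\a c -> (a `shiftL` 6) .|. (c .&. 0x3F)) acc cs)
                  : go rest
           else '\xFFFD' : go bs

    -- Continuation bytes have the bit pattern 10xxxxxx.
    isCont c = c .&. 0xC0 == 0x80

    -- Guard against values beyond the Unicode range.
    safeChr n
      | n > 0x10FFFF = '\xFFFD'
      | otherwise    = chr n
```

With this definition, `decodeByteString "\195\188"` evaluates to `"\252"`, a single Char.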

apparently with curl everything works fine:

$ curl localhost:5984/db/foo -XPUT -d '{"test": "ö"}'
{"ok":true,"id":"foo","rev":"someOtherRev"}
$ curl localhost:5984/db/foo
{"_id":"bars","_rev":"someOtherRev","test":"ö"}

so how can i get my precious ö back? what am i doing wrong or does 
Text.JSON need another patch?


another question: why does encControl in Text/JSON/String.hs handle the
cases x < '\x100' and x < '\x1000' even though they can never be
reached with the old predicate in encJSString (x < '\x20')?

finally: is '\x7e' the right literal for the job?
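
For reference, here is a sketch of what a predicate like `x < '\x20' || x > '\x7e'` buys when the escaping routine writes `\uXXXX`: the output is pure ASCII, so it no longer matters how the transport encodes characters. This is not the json package's code; `escapeAscii` is a hypothetical stand-in written against base only. It also suggests one possible reason for cases like x < '\x100' and x < '\x1000': a `\u` escape needs exactly four hex digits, so smaller code points need zero padding. As for the literal, '\x7e' is '~', the last printable ASCII character ('\x7f' is DEL).

```haskell
import Data.Char (ord)
import Numeric (showHex)

-- | Escape a String as the body of an ASCII-only JSON string literal
-- (hypothetical helper, not taken from Text.JSON).
escapeAscii :: String -> String
escapeAscii = concatMap esc
  where
    esc '"'  = "\\\""
    esc '\\' = "\\\\"
    esc c
      | c < '\x20' || c > '\x7e' = concatMap unit (utf16 (ord c))
      | otherwise                = [c]

    -- \uXXXX always takes exactly four hex digits, hence the zero padding.
    unit n = "\\u" ++ replicate (4 - length h) '0' ++ h
      where h = showHex n ""

    -- Code points above U+FFFF are written as a UTF-16 surrogate pair.
    utf16 n
      | n < 0x10000 = [n]
      | otherwise   = [0xD800 + (m `div` 0x400), 0xDC00 + (m `mod` 0x400)]
      where m = n - 0x10000
```

With this definition, `putStrLn (escapeAscii "ö")` prints `\u00f6`.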

thanks for reading

have fun
martin



Re: [Haskell-cafe] Text.JSON and utf8

2013-02-11 Thread Gregory Collins
Don't use the json package; use aeson instead. (It's much faster and
handles encoding issues correctly.)

G






-- 
Gregory Collins g...@gregorycollins.net