Hello,

I wanted to ask about a bizarre situation I've run into lately while
decoding a particularly weird kind of JSON. I have some JSON documents that
come from a web crawling service, which fetches webpages from all over the
world. These pages can be in various insane text encodings, which I wish
had never existed in the first place, such as Latin1, LatinX,
Windows-12XX, and in my current case, EUC-JP and Shift-JIS.

The crawler generates JSON from these pages and sends it to my system.
That JSON contains lots of hostile and sort-of-illegal input, such as
unpaired or otherwise invalid surrogates and other malformed byte
sequences in what is supposed to be UTF-8.

Technically, Jackson can deserialize this "just fine", except not really:
you end up with a whole ton of Java String instances in the tree that
contain bogus / unknown / invalid / illegal characters, and some tools
farther downstream from me, which use my APIs, explode when they try to
deal with them. So I need to clean them up first. I could write something
that walks the tree and un-corrupts all the Strings, but by that point the
original raw bytes have already been decoded, and it is very hard to
recover them from inside the damaged Strings and fix them the way they
should be.
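
To make that last point concrete, here is a small JDK-only sketch (my own
illustration, not Jackson code; the class and method names are made up).
It takes the Shift-JIS bytes for three Japanese characters and checks
whether decoding and re-encoding gives the original bytes back. Decoding
as UTF-8 replaces the malformed sequences with U+FFFD, so the original
bytes are gone, whereas ISO-8859-1 maps every byte to one char and stays
reversible.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyDecodeDemo {
    // Shift-JIS encoding of the three characters U+65E5 U+672C U+8A9E.
    static final byte[] SJIS_SAMPLE = {
        (byte) 0x93, (byte) 0xFA, (byte) 0x96, 0x7B, (byte) 0x8C, (byte) 0xEA
    };

    // True if decoding with cs and re-encoding yields the same bytes again.
    static boolean roundTrips(byte[] raw, Charset cs) {
        String decoded = new String(raw, cs);
        return Arrays.equals(raw, decoded.getBytes(cs));
    }

    public static void main(String[] args) {
        // Not valid UTF-8: malformed input becomes U+FFFD, so this is lossy.
        System.out.println(roundTrips(SJIS_SAMPLE, StandardCharsets.UTF_8));      // false
        // ISO-8859-1 is a 1:1 byte-to-char mapping, so nothing is lost.
        System.out.println(roundTrips(SJIS_SAMPLE, StandardCharsets.ISO_8859_1)); // true
    }
}
```

So once a value has been decoded as UTF-8, there is nothing left in the
String for an encoding detector to work with.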

The good news is that Mozilla and some open-source hackers have made a
library for detecting the encoding of this kind of mangled text:
https://github.com/albfernandez/juniversalchardet . However, every String
in a single JSON input from the crawler may be in a different encoding.
So, instead of guessing the encoding of the entire raw JSON, I need to
guess the encoding of each String before deserializing.
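
Roughly what I mean by guessing per String, as a JDK-only sketch. This is
a stand-in for the real detector, not juniversalchardet itself: it just
tries a list of candidate charsets with strict decoders and takes the
first one that decodes without errors. The class and method names are
made up, it assumes I can get at the raw bytes of each value, and it
assumes the JDK ships the Shift_JIS charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

class PerValueDecoder {
    // Strictly decode raw with cs; empty if anything is malformed or unmappable.
    static Optional<String> strictDecode(byte[] raw, Charset cs) {
        try {
            return Optional.of(cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    // Result of the first candidate charset that decodes cleanly, else empty.
    static Optional<String> decodeWithFirstMatch(byte[] raw, List<Charset> candidates) {
        for (Charset cs : candidates) {
            Optional<String> decoded = strictDecode(raw, cs);
            if (decoded.isPresent()) {
                return decoded;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Charset> candidates =
                List.of(StandardCharsets.UTF_8, Charset.forName("Shift_JIS"));
        // Shift-JIS bytes for U+65E5 U+672C U+8A9E; not valid UTF-8.
        byte[] sjis = {(byte) 0x93, (byte) 0xFA, (byte) 0x96, 0x7B, (byte) 0x8C, (byte) 0xEA};
        Optional<String> decoded = decodeWithFirstMatch(sjis, candidates);
        System.out.println(decoded.isPresent()
                && decoded.get().equals("\u65e5\u672c\u8a9e"));
    }
}
```

The obvious weakness is that plenty of byte sequences are valid in more
than one encoding, so the order of the candidates matters. That is why I
would rather use a statistical detector for the real thing.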

So, I wanted to ask whether Jackson will let me register a custom
StdDeserializer that takes over the deserialization of String, even though
it's a kind-of-magic built-in Java type and not a regular POJO. That way I
could pass each String through the encoding detector and un-corrupt it, so
that by the time Jackson assembles the whole structure, the corrupt
Strings have been eliminated as much as possible and re-encoded into
proper UTF-8, the way they always should have been.

Thanks,
Matthew.

