Hello, I wanted to ask about a bizarre situation I've run into lately while decoding a uniquely weird kind of JSON. The JSON comes from a web crawling service that fetches webpages from all over the world. These pages can be in various insane text encodings, which I wish had never existed in the first place, such as Latin1, LatinX, Windows-12XX, and in my current case, EUC-JP and Shift-JIS.
The web crawler generates JSON from these pages and sends it into my system. That JSON contains lots of hostile and sort-of illegal input, such as corrupted, unpaired, or otherwise invalid surrogates and other invalid byte sequences in the UTF-8. Technically, Jackson can deserialize this "just fine", except not really, because you end up with a whole ton of Java String instances in the tree that contain bogus / unknown / invalid / illegal data. Some tools farther downstream from me, which use my APIs, explode when they try to deal with this data, so I need to clean it up first.

I could try to write something that walks the tree and un-corrupts all the Strings afterwards, but it's very hard to get at the original raw bytes from inside these damaged Strings and fix them the way they should be. The good news is that Mozilla and some open-source hackers have made a library for detecting the encoding of this kind of mangled text: https://github.com/albfernandez/juniversalchardet . However, every String in a single JSON input from the crawler may have a different encoding, so instead of guessing the encoding of the entire raw JSON, I need to guess the encoding of each String before it is deserialized.

So, I wanted to ask whether Jackson will let me create a custom StdDeserializer that takes over the deserialization of String, even though it's a kind-of magic builtin Java type and not a regular POJO. I could then pass each String through the encoding detector and un-corrupt it, so that by the time Jackson assembles the whole structure, the corrupt Strings have been eliminated as far as possible and re-encoded into proper UTF-8, the way they always should have been.

Thanks,
Matthew.
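To make the "unpaired surrogates" part of the problem concrete, here is a minimal sketch of the kind of per-String cleanup described above. It only covers one narrow case: it replaces lone UTF-16 surrogates with U+FFFD and leaves valid pairs alone. It does no encoding detection, and the class name `SurrogateScrubber` and method `scrub` are hypothetical names made up for this illustration.

```java
// Hypothetical helper, not part of Jackson or juniversalchardet.
final class SurrogateScrubber {
    private SurrogateScrubber() {}

    /** Returns a copy of s in which every unpaired surrogate is replaced by U+FFFD. */
    static String scrub(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean validPair = Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1));
            if (validPair) {
                // Keep a well-formed high/low pair untouched and skip the low half.
                out.append(c).append(s.charAt(i + 1));
                i++;
            } else if (Character.isSurrogate(c)) {
                // Lone high or lone low surrogate: substitute the replacement character.
                out.append('\uFFFD');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Assuming Jackson 2.x, a helper like this could presumably be called from a `StdDeserializer<String>` subclass registered with `SimpleModule.addDeserializer(String.class, ...)`. Whether that override reaches every path (for example, values read into a `JsonNode` tree) is something that would need checking.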