Hi,
I'm trying to build a small Perl utility (it could be in any scripting
language) that takes the output of nutch readseg -dump as its input,
decodes the content field to UTF-8 (independent of the encoding the raw
page was in) and outputs that decoded content. After a little
experimentation, I find myself unable to decode the content field, even
when I use the various charset hints that are available either in
the content metadata or in the raw content itself.
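To make the goal concrete, here is a minimal sketch of the decoding step
described above, written in Python rather than Perl. It assumes the raw
bytes of the content field and the Content-Type value from the metadata
have already been pulled out of the dump. The names charset_candidates
and to_utf8 are hypothetical and are not part of Nutch, and the fallback
order is only one reasonable choice.

```python
import re

# Hypothetical helpers for illustration only; nothing here is a Nutch API.
HEADER_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def charset_candidates(raw, content_type=None):
    """Yield charset names to try, most trusted hint first."""
    # 1. Hint from the content metadata (e.g. "text/html; charset=ISO-8859-1").
    if content_type:
        m = HEADER_CHARSET_RE.search(content_type)
        if m:
            yield m.group(1)
    # 2. Hint from a <meta ... charset=...> tag near the top of the raw page.
    m = META_CHARSET_RE.search(raw[:4096])
    if m:
        yield m.group(1).decode("ascii")
    # 3. Assumption: try UTF-8 when no usable hint was found.
    yield "utf-8"


def to_utf8(raw, content_type=None):
    """Decode raw page bytes using the hints above and re-encode as UTF-8."""
    for name in charset_candidates(raw, content_type):
        try:
            return raw.decode(name).encode("utf-8")
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes that are invalid in that charset.
            continue
    # Last resort: latin-1 maps every byte, so this never fails,
    # although the result may be wrong for pages in other encodings.
    return raw.decode("latin-1").encode("utf-8")
```

This only works if the bytes handed to to_utf8 are still the original
page bytes. If they were already converted to text with the wrong
charset somewhere upstream, no hint can recover them.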
I was wondering whether someone on the list has already succeeded in
building this type of functionality. Or is the content returned by
readseg in a specific encoding that I don't know of?
cheers,
-y