Hi,

I'm trying to build a small Perl utility (it could be in any scripting language) that takes the output of Nutch's readseg -dump as its input, decodes the content field to UTF-8 (regardless of the encoding of the raw page), and outputs the decoded content. After a bit of experimentation, I find myself unable to decode the content field, even when I use the charset hints available in the content metadata or in the raw content itself.
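
A minimal sketch of the decoding step described above, in Python rather than Perl. It assumes the raw bytes of the content field and the Content-Type metadata value have already been pulled out of the dump unmodified, which the dump may not guarantee. The function names (candidate_charsets, decode_content, to_utf8) are made up for illustration and are not part of Nutch.

```python
import re

# Matches "charset=..." in an HTTP Content-Type value or an HTML <meta> tag.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE)


def candidate_charsets(raw, content_type=None):
    """Yield charset names to try, most trusted hint first (hypothetical helper)."""
    if content_type:
        m = CHARSET_RE.search(content_type.encode("ascii", "ignore"))
        if m:
            yield m.group(1).decode("ascii")
    # Look for a <meta ... charset=...> hint near the top of the page.
    m = CHARSET_RE.search(raw[:4096])
    if m:
        yield m.group(1).decode("ascii")
    # No usable hint: assume UTF-8 before giving up.
    yield "utf-8"


def decode_content(raw, content_type=None):
    """Return (text, charset_used) for the raw page bytes."""
    for name in candidate_charsets(raw, content_type):
        try:
            return raw.decode(name), name
        except (LookupError, UnicodeDecodeError):
            continue  # unknown charset name, or the bytes do not fit it
    # Last resort: latin-1 maps every byte value, so this never raises.
    return raw.decode("latin-1"), "latin-1"


def to_utf8(raw, content_type=None):
    """Re-encode the page as UTF-8 bytes, whatever it was originally."""
    text, _ = decode_content(raw, content_type)
    return text.encode("utf-8")
```

For example, to_utf8(page_bytes, "text/html; charset=ISO-8859-1") would re-encode a Latin-1 page as UTF-8. If every hinted charset fails on the dumped content, that would suggest the bytes were already altered before they reached the script.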

I was wondering whether anyone on the list has already built this kind of functionality, or whether the content returned by readseg uses a specific encoding that I don't know of?

cheers,

-y
