decoding nutch readseg -dump 's output

Yves Petinot Mon, 16 Nov 2009 11:31:38 -0800

Hi,

I'm trying to build a small Perl utility (it could be any scripting language) that takes the output of nutch readseg -dump as its input, decodes the content field to UTF-8 (independent of what encoding the raw page was in), and outputs that decoded content. After a little bit of experimentation, I find myself unable to decode the content field, even when I try using the various charset hints that are available either in the content metadata or in the raw content itself.
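
The decoding step described above could be sketched roughly as follows, in Python since any scripting language would do. This is only an illustration under stated assumptions: it presumes the content field has already been pulled out of the dump as raw, unmodified bytes, and the names decode_content and find_charset_hints are hypothetical helpers, not part of Nutch.

```python
import re

# Matches "charset=..." as it might appear in an HTML <meta> tag or a
# Content-Type value. The pattern is an assumption about typical pages.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def find_charset_hints(raw):
    """Return charset names mentioned in the first 4 KB of raw, in order."""
    return [m.decode("ascii") for m in CHARSET_RE.findall(raw[:4096])]


def decode_content(raw, hints=()):
    """Decode raw bytes to str, trying each candidate charset in turn.

    Order: caller-supplied hints (e.g. from the content metadata), then
    hints found in the bytes themselves, then UTF-8.
    """
    for name in list(hints) + find_charset_hints(raw) + ["utf-8"]:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes invalid for that charset.
            continue
    # Latin-1 maps every byte to a code point, so this cannot fail,
    # although the result may be wrong for the actual page.
    return raw.decode("latin-1")


def to_utf8(raw, hints=()):
    """Return the content re-encoded as UTF-8 bytes."""
    return decode_content(raw, hints).encode("utf-8")
```

A sketch like this only works if the bytes reaching it are the original page bytes; if they were already converted to text with some other encoding along the way, no charset hint will recover them.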

I was wondering if someone on the list has already succeeded in building this type of functionality, or is the content returned by readseg using a specific encoding that I don't know of?


cheers,

-y
