What is the current state of, and plan for, multibyte
character support in Nutch?

As far as I can tell...

The PDF plugin uses PDFBox (www.pdfbox.org), which does not
work with Japanese and probably other multibyte characters
and code sets.

The Word plugin uses POI (http://jakarta.apache.org/poi/),
which doesn't seem to support Japanese. Some patches to
make it possible to support Japanese (and hopefully other
code sets) have been submitted to the POI project but
they have not been integrated because the project currently
has no committer.

The RTF document plugin and the PowerPoint plugin use home-grown
parsers.  What is the status of support for multibyte code sets
(and single-byte code sets other than ISO-8859-1) in
these plugins?

-Kuro


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general