Hi list,

I am using nutch-0.8.1, which uses POI to parse MS Word files.
It works well with English documents, even when the .doc file is quite large.

But it throws StringIndexOutOfBoundsException when the document (only one page long) is written in Chinese characters.

I tried to isolate the problem and found that reading a local Chinese file with HWPFDocument.getRange().text() works fine. But along Nutch's path, DocumentInputStream -> CHPBinTable -> ComplexFileTable -> TextPieceTable ..., it eventually throws StringIndexOutOfBoundsException because the argument passed to TextPiece.substring() is negative.
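
To make the failure mode concrete, here is a minimal stand-alone sketch that uses no POI code at all. It rests on an assumption I have not confirmed: that somewhere a byte offset is mixed with a character offset. In a single-byte encoding the two agree, but for text stored as 2-byte UTF-16LE the byte offset runs ahead of the character offset, so a subtraction can go negative. The class and method names (SubstringOffsetDemo, byteOffsetAfter, tail, describe) are hypothetical and exist only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SubstringOffsetDemo {

    // Byte offset at which a second piece would start if the first piece
    // is stored in the given charset.
    static int byteOffsetAfter(String firstPiece, Charset storage) {
        return firstPiece.getBytes(storage).length;
    }

    // Hypothetical buggy lookup (an assumption, not POI's actual code):
    // docCharPos is a character position in the whole document, while
    // pieceStart is a byte offset. Mixing the two units can make the
    // index passed to substring() negative.
    static String tail(String piece, int docCharPos, int pieceStart) {
        return piece.substring(docCharPos - pieceStart);
    }

    // Asks for the text from the second character of the second piece onward
    // and reports either the result or the exception name.
    static String describe(String first, String second, Charset storage) {
        int pieceStart = byteOffsetAfter(first, storage);
        int docCharPos = first.length() + 1;
        try {
            return tail(second, docCharPos, pieceStart);
        } catch (StringIndexOutOfBoundsException e) {
            return "StringIndexOutOfBoundsException";
        }
    }

    public static void main(String[] args) {
        // One byte per character: byte and character offsets agree.
        System.out.println(describe("abcd", "efgh", StandardCharsets.ISO_8859_1));
        // Two bytes per character: the byte offset (8) exceeds the
        // character position (5), so the substring index is -3.
        System.out.println(describe("\u4f60\u597d\u4e16\u754c",
                "\u6587\u6863\u6d4b\u8bd5", StandardCharsets.UTF_16LE));
    }
}
```

If this assumption holds, it would explain why English documents parse cleanly while a one-page Chinese document fails, but the real cause still needs to be checked against the POI source.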

I am going to study this further, but I wonder whether anyone else has had a similar
experience?

Thanks,


TKDD
