https://bz.apache.org/bugzilla/show_bug.cgi?id=66245

--- Comment #1 from PJ Fanning <[email protected]> ---
I don't know much about the ancient binary formats of MS documents. In this
case, the SprmIterator is iterating over a byte[] and there are a few spare
bytes at the end of the array. There could be a bug, but I don't feel qualified
to look for it.

Another option would be to introduce an optional lax mode on the SprmIterator
and have it ignore SprmOperations where the initSize method fails. The
WordExtractor could have an option to enable this lax mode. It is not
guaranteed, but this mode might allow some documents that can't be parsed today
to be parsed, at the risk of some text being lost. Of course, the whole output
could be wrong, because the issue in the doc could start early in the iteration
and throw off everything after the first problem.
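
To make the lax-mode idea concrete, here is a rough, self-contained sketch. It
is not the POI code: the class LaxRecordSketch, the parse method and the
[1-byte length][payload] record layout are all hypothetical stand-ins for the
real SPRM encoding. It also stops at the first record whose size overruns the
array rather than trying to skip it, which is only one possible reading of
"ignore".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not POI's SprmIterator. */
final class LaxRecordSketch {

    /**
     * Splits buf into records of the ASSUMED form [1-byte length][payload].
     * Strict mode throws when a record would run past the end of the array;
     * lax mode stops there and returns the records read so far.
     */
    static List<byte[]> parse(byte[] buf, boolean lax) {
        List<byte[]> records = new ArrayList<>();
        int offset = 0;
        while (offset < buf.length) {
            int size = buf[offset] & 0xFF; // length byte, read as unsigned
            int start = offset + 1;
            if (start + size > buf.length) {
                if (lax) {
                    break; // give up on the spare trailing bytes
                }
                throw new IllegalArgumentException(
                        "record at offset " + offset + " overruns the buffer");
            }
            byte[] payload = new byte[size];
            System.arraycopy(buf, start, payload, 0, size);
            records.add(payload);
            offset = start + size;
        }
        return records;
    }

    public static void main(String[] args) {
        // Two well-formed records followed by two spare bytes.
        byte[] data = {2, 10, 11, 1, 20, 5, 99};
        System.out.println(parse(data, true).size() + " records read in lax mode");
    }
}
```

Stopping at the first failure keeps whatever was read before the bad bytes, but
it cannot help when the damage occurs early in the array, which is the risk
described above.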
