https://bz.apache.org/bugzilla/show_bug.cgi?id=66245
--- Comment #1 from PJ Fanning <[email protected]> ---
I don't know much about the ancient binary formats of MS documents. In this case, the SprmIterator is iterating over a byte[] and there are a few spare bytes at the end of the array. There could be a bug, but I don't feel qualified to look for it.

Another option would be to introduce an optional lax mode on the SprmIterator and have it ignore SprmOperations where the initSize method fails. The WordExtractor could have an option to enable this lax mode. There is no guarantee, but this mode might allow some documents that can't be parsed today to be parsed, at the risk of some text being lost. Of course, the whole output could be wrong, because the problem in the doc could start early in the iteration and everything after the first problem could be thrown off.
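
To make the lax-mode idea concrete, here is a rough sketch of what strict versus lax iteration over a byte[] could look like. It is not the POI implementation and does not use the real sprm encoding: the class LaxRecordIterator, the method readRecords, and the record layout (one opcode byte, one length byte, then the payload) are all assumptions made up for illustration. In lax mode it simply stops at the first record that does not fit in the remaining bytes, which is one possible reading of "ignore SprmOperations where initSize fails".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Apache POI.
final class LaxRecordIterator {

    /**
     * Splits data into records using an assumed layout:
     * 1 opcode byte, 1 length byte, then `length` payload bytes.
     *
     * @param lax if true, stop quietly at the first record that does not fit;
     *            if false, throw on the first such record
     */
    static List<byte[]> readRecords(byte[] data, boolean lax) {
        List<byte[]> records = new ArrayList<>();
        int offset = 0;
        while (offset < data.length) {
            // A record needs at least its two header bytes.
            if (offset + 2 > data.length) {
                if (lax) {
                    break; // spare bytes at the end of the array are dropped
                }
                throw new IllegalArgumentException(
                        "Truncated record header at offset " + offset);
            }
            int length = data[offset + 1] & 0xFF; // unsigned payload length
            int end = offset + 2 + length;
            if (end > data.length) {
                if (lax) {
                    break; // payload runs past the array; keep what we have
                }
                throw new IllegalArgumentException(
                        "Record at offset " + offset + " runs past end of data");
            }
            byte[] record = new byte[end - offset];
            System.arraycopy(data, offset, record, 0, record.length);
            records.add(record);
            offset = end;
        }
        return records;
    }
}
```

With the input {1, 1, 0x2A, 2, 0, 9}, lax mode returns the two complete records and drops the trailing spare byte, while strict mode throws. Note that this sketch only protects against a bad tail: if a length is wrong earlier in the array, every later record is read from the wrong offset, which is the risk described above.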
