yuying zhang created TIKA-4492:
----------------------------------
Summary: Large file parsing fails (RecordFormatException): using
FileInputStream throws an exception, but using
TikaInputStream.get(Path) parses successfully
Key: TIKA-4492
URL: https://issues.apache.org/jira/browse/TIKA-4492
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 3.0.0
Reporter: yuying zhang
I encountered a {{org.apache.tika.exception.TikaException:
org.apache.poi.ooxml.util.RecordFormatException}} exception when using
{{AutoDetectParser}} to parse a 20MB full text {{docx}} file.
Parsing with the following code snippet throws the exception:
{code:java}
FileInputStream fileInputStream = new FileInputStream(file);
autoDetectParser.parse(fileInputStream,handler,metadata,context);{code}
Wrapping the file in a {{TikaInputStream}} instead parses successfully:
{code:java}
// Use the static factory named in the summary; assumes file is a java.io.File
TikaInputStream tikaInputStream = TikaInputStream.get(file.toPath());
autoDetectParser.parse(tikaInputStream,handler,metadata,context); {code}
I looked at the source code of {{AutoDetectParser.parse(InputStream,
ContentHandler, Metadata, ParseContext)}} and found that it internally calls
{{TikaInputStream tis = TikaInputStream.get(stream, tmp, metadata)}}.
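For reference, one observable difference between a bare {{FileInputStream}} and a wrapped stream can be reproduced with the JDK alone: the bare stream cannot be rewound after its leading bytes are read, while a buffering wrapper can. The sketch below uses no Tika classes; the class name {{StreamCapabilities}}, its helper methods, and the six-byte sample file are hypothetical, and it is not claimed to explain the {{RecordFormatException}} itself.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamCapabilities {

    /** Reads up to n leading bytes, then rewinds so the next reader starts at byte 0. */
    static String peek(InputStream in, int n) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream cannot rewind (no mark/reset)");
        }
        try {
            in.mark(n);
            byte[] head = in.readNBytes(n); // Java 11+
            in.reset();
            return new String(head, StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a small temp file that starts with the ZIP magic bytes "PK" (illustrative data). */
    static Path sampleFile() {
        try {
            Path p = Files.createTempFile("sample", ".bin");
            Files.write(p, new byte[] {'P', 'K', 3, 4, 0, 0});
            p.toFile().deleteOnExit();
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reports whether a bare FileInputStream over the sample file supports mark/reset. */
    static boolean rawFileStreamSupportsMark() {
        try (InputStream raw = new FileInputStream(sampleFile().toFile())) {
            return raw.markSupported();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Peeks twice through a BufferedInputStream; the second peek sees the same bytes. */
    static String peekTwiceBuffered() {
        try (InputStream in =
                new BufferedInputStream(new FileInputStream(sampleFile().toFile()))) {
            return peek(in, 2) + "," + peek(in, 2);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("raw supports mark: " + rawFileStreamSupportsMark());
        System.out.println("buffered peeks: " + peekTwiceBuffered());
    }
}
```

A stream opened this way also carries no reference to the file behind it, whereas an input obtained from a {{Path}} does. Whether that distinction is what triggers the failure here is the open question.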
Why does passing a {{FileInputStream}} directly cause the parsing of a 20MB
{{docx}} file to fail? Why does parsing succeed when the input is a
{{TikaInputStream}} obtained from {{TikaInputStream.get()}}?