Earlier metadata extraction in ParsingReader
--------------------------------------------
Key: TIKA-203
URL: https://issues.apache.org/jira/browse/TIKA-203
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Jukka Zitting
Priority: Minor

The normal parse() method guarantees that all extracted metadata will be
available in the metadata object once the method returns. But since the
ParsingReader class runs the parse() method in a background thread, one can
only assume that extracted metadata is available once the entire character
stream has been consumed. This is troublesome, for example, when creating Lucene
Document objects, because Lucene postpones reading the given character stream
until the already constructed Document is passed to an IndexWriter. As a result,
depending on thread scheduling and the structure of the input document format,
metadata may not be available for inclusion in the indexed Document.

One way of fixing this issue is to add a small character buffer in
ParsingReader, and to make sure that the buffer is filled with extracted text
before the ParsingReader constructor returns. This should ensure that relevant
document metadata is almost always available, since the majority of document
formats have all or most metadata at the beginning of the document stream.
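
The buffering idea can be sketched with the standard library alone. This is
only an illustration under stated assumptions, not Tika's ParsingReader: the
names ToyParser, PrefilledReader and demo are hypothetical, and the toy parser
simply records one metadata entry before emitting text. The constructor marks
the buffered stream, reads one character (which blocks until the background
thread has produced text or reached the end of the stream), and then resets so
that the consumer still sees the full text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PrefilledReaderSketch {

    /** Hypothetical stand-in for a parser: records metadata and writes text. */
    interface ToyParser {
        void parse(Map<String, String> metadata, Writer out) throws IOException;
    }

    /** Runs the parser in a background thread; the constructor waits for the first character. */
    static class PrefilledReader extends Reader {
        private final BufferedReader reader;

        PrefilledReader(ToyParser parser, Map<String, String> metadata) throws IOException {
            PipedWriter writer = new PipedWriter();
            reader = new BufferedReader(new PipedReader(writer));

            Thread worker = new Thread(() -> {
                // Closing the writer signals end of stream to the reading side.
                try (Writer out = writer) {
                    parser.parse(metadata, out);
                } catch (IOException e) {
                    // The consumer closed the reader early; nothing more to do in this sketch.
                }
            });
            worker.setDaemon(true);
            worker.start();

            // Block until the first character (or end of stream) has been buffered,
            // then rewind so that no text is lost to the consumer.
            reader.mark(1);
            reader.read();
            reader.reset();
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            return reader.read(buf, off, len);
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }
    }

    /** Returns the title seen right after construction, followed by the full text. */
    static String[] demo() {
        Map<String, String> metadata = new ConcurrentHashMap<>();
        ToyParser parser = (meta, out) -> {
            meta.put("title", "Example title"); // metadata comes first, as in most formats
            out.write("Hello, body text");
            out.flush();                        // wake up the waiting reader promptly
        };
        try (PrefilledReader in = new PrefilledReader(parser, metadata)) {
            String titleSeenEarly = metadata.get("title"); // checked before consuming any text
            StringBuilder text = new StringBuilder();
            char[] chunk = new char[64];
            for (int n = in.read(chunk); n != -1; n = in.read(chunk)) {
                text.append(chunk, 0, n);
            }
            return new String[] { titleSeenEarly, text.toString() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] result = demo();
        System.out.println("title after constructor: " + result[0]);
        System.out.println("text: " + result[1]);
    }
}
```

In this sketch the guarantee covers only metadata that the parser records
before it emits its first character of text, which matches the "almost always"
wording above: formats that place metadata at the end of the stream would
still need the whole stream to be consumed.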