[jira] Created: (TIKA-203) Earlier metadata extraction in ParsingReader

Jukka Zitting (JIRA) Wed, 11 Feb 2009 08:15:30 -0800

Earlier metadata extraction in ParsingReader
--------------------------------------------


                 Key: TIKA-203
                 URL: https://issues.apache.org/jira/browse/TIKA-203
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Jukka Zitting
            Priority: Minor


The normal parse() method guarantees that all extracted metadata will be 
available in the metadata object once the method returns. But since the 
ParsingReader class runs the parse() method in a background thread, one can 
only assume that extracted metadata is available once the entire character 
stream has been consumed. This is troublesome, for example, when creating Lucene 
Document objects, as Lucene postpones reading the given character stream until 
the already constructed Document is passed to an IndexWriter. The result 
is that (depending on thread scheduling and the structure of the input document 
format) metadata may not be available for inclusion in the indexed Document.

One way of fixing this issue is to add a small character buffer in 
ParsingReader, and to make sure that the buffer is filled with extracted text 
before the ParsingReader constructor returns. This should ensure that relevant 
document metadata is almost always available, since the majority of document 
formats have all or most metadata at the beginning of the document stream.
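
A minimal sketch of the buffering idea above, using only JDK classes rather
than Tika's own. PrefillingReader and TextProducer are hypothetical names
standing in for ParsingReader and a parser; the sketch assumes the producer
records its metadata before it writes the first character of text, which is
the "metadata at the beginning of the stream" case described above. It is an
illustration of the approach, not the actual ParsingReader implementation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.Map;

/** Hypothetical stand-in for ParsingReader; not Tika's actual class. */
class PrefillingReader extends Reader {

    /** Hypothetical stand-in for a parser: records metadata, then writes text. */
    interface TextProducer {
        void produce(Map<String, String> metadata, Writer out) throws IOException;
    }

    private final BufferedReader reader;

    PrefillingReader(TextProducer producer, Map<String, String> metadata)
            throws IOException {
        PipedWriter writer = new PipedWriter();
        // The BufferedReader is the "small character buffer" in front of the pipe.
        this.reader = new BufferedReader(new PipedReader(writer));

        // Run the producer in a background thread, as ParsingReader does with parse().
        Thread worker = new Thread(() -> {
            try (Writer out = writer) {
                producer.produce(metadata, out);
            } catch (IOException e) {
                // The reading side was closed early; nothing more to do in this sketch.
            }
        });
        worker.setDaemon(true);
        worker.start();

        // Block until the first character (or end of stream) has been buffered,
        // then rewind so that callers still see the complete text.
        reader.mark(1);
        reader.read();
        reader.reset();
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
```

With this arrangement, a caller can read the metadata map as soon as the
constructor returns, and hand the reader to a consumer that only reads it
later. Metadata that the producer sets after the first character of text is
still not guaranteed to be visible at that point, hence "almost always".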

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
