[ https://issues.apache.org/jira/browse/TIKA-203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732522#action_12732522 ]
Daan de Wit commented on TIKA-203:
----------------------------------
Created TIKA-262. The problem is also reproducible on WinXP, so it does not
seem to be related to the OS.
> Earlier metadata extraction in ParsingReader
> --------------------------------------------
>
> Key: TIKA-203
> URL: https://issues.apache.org/jira/browse/TIKA-203
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Priority: Minor
> Fix For: 0.3
>
> Attachments: lipsum.doc
>
>
> The normal parse() method guarantees that all extracted metadata will be
> available in the metadata object once the method returns. But since the
> ParsingReader class runs the parse() method in a background thread, one can
> only assume that extracted metadata is available once the entire character
> stream has been consumed. This is troublesome for example when creating
> Lucene Document objects, as Lucene postpones reading the given character
> stream until the already constructed Document is passed to an IndexWriter.
> The result is that (depending on thread scheduling and the structure of the
> input document format) metadata may not be available for inclusion in the
> indexed Document.
> One way of fixing this issue is to add a small character buffer in
> ParsingReader, and to make sure that the buffer is filled with extracted text
> before the ParsingReader constructor returns. This should ensure that
> relevant document metadata is almost always available, since the majority of
> document formats have all or most metadata at the beginning of the document
> stream.
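
The buffering idea described above could be sketched roughly as follows. This
is only an illustration and not the actual ParsingReader code: the class name
PrefetchingReader, the prefetched() accessor and the bufferSize parameter are
hypothetical. It assumes the wrapped Reader is the stream that the background
parse() thread writes extracted text into, so that blocking in the constructor
gives the parser time to reach the metadata that precedes the first characters
of text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/**
 * Hypothetical sketch: a Reader that fills a small character buffer from its
 * source before the constructor returns, then serves reads from that buffer
 * first and from the source afterwards.
 */
class PrefetchingReader extends Reader {
    private final Reader source;
    private final char[] buffer;
    private int position = 0; // next buffered character to hand out
    private int limit = 0;    // number of characters actually prefetched

    PrefetchingReader(Reader source, int bufferSize) throws IOException {
        this.source = source;
        this.buffer = new char[bufferSize];
        // Block until the buffer is full or the source reaches end of stream.
        while (limit < buffer.length) {
            int n = source.read(buffer, limit, buffer.length - limit);
            if (n == -1) {
                break;
            }
            limit += n;
        }
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (position < limit) {
            // Serve the prefetched characters first.
            int n = Math.min(len, limit - position);
            System.arraycopy(buffer, position, cbuf, off, n);
            position += n;
            return n;
        }
        // Buffer exhausted: delegate to the underlying stream.
        return source.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        source.close();
    }

    /** Number of characters that were buffered by the constructor. */
    int prefetched() {
        return limit;
    }
}
```

With a wrapper like this, a caller that reads the metadata object right after
construction would see whatever the parser had extracted by the time the first
bufferSize characters of text were produced. As the description notes, that
covers most formats but is not a guarantee.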