[ https://issues.apache.org/jira/browse/TIKA-203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732512#action_12732512 ]
Daan de Wit edited comment on TIKA-203 at 7/17/09 6:05 AM:
-----------------------------------------------------------
Does not work for me on Ubuntu 8.04 with Sun Java 1.5.0_16 on a single
processor when parsing certain Word documents.
was (Author: d.de.wit):
Does not work for me on Ubuntu 8.04 with Sun Java 1.5.0_16 on a single processor.
> Earlier metadata extraction in ParsingReader
> --------------------------------------------
>
> Key: TIKA-203
> URL: https://issues.apache.org/jira/browse/TIKA-203
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Priority: Minor
> Fix For: 0.3
>
>
> The normal parse() method guarantees that all extracted metadata will be
> available in the metadata object once the method returns. But since the
> ParsingReader class runs the parse() method in a background thread, one can
> only assume that extracted metadata is available once the entire character
> stream has been consumed. This is troublesome, for example, when creating
> Lucene Document objects, as Lucene postpones reading the given character
> stream until the already constructed Document is passed to an IndexWriter.
> The result is that (depending on thread scheduling and the structure of the
> input document format) metadata may not be available for inclusion in the
> indexed Document.
> One way of fixing this issue is to add a small character buffer in
> ParsingReader, and to make sure that the buffer is filled with extracted text
> before the ParsingReader constructor returns. This should ensure that
> relevant document metadata is almost always available, since the majority of
> document formats have all or most metadata at the beginning of the document
> stream.
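
The buffering idea described above can be sketched with plain java.io
classes. This is only an illustration under stated assumptions, not Tika's
actual code: EagerReader and Producer are hypothetical names standing in for
ParsingReader and a parser, and the metadata object is modelled as a simple
Map. The constructor starts the background thread and then blocks until the
first character (or end of stream) has arrived in the buffer, so anything the
producer recorded before emitting text is visible when the constructor
returns.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for ParsingReader; illustrative only, not Tika code.
class EagerReader extends Reader {

    // Hypothetical stand-in for a parser: records metadata and writes text.
    interface Producer {
        void produce(Writer out, Map<String, String> metadata) throws IOException;
    }

    private final BufferedReader reader;

    EagerReader(Producer producer, Map<String, String> metadata) throws IOException {
        PipedReader pipe = new PipedReader();
        PipedWriter writer = new PipedWriter(pipe);
        // The BufferedReader provides the small character buffer.
        this.reader = new BufferedReader(pipe);

        // Run the producer in a background thread, as ParsingReader does
        // with parse(). Closing the writer signals end of stream.
        Thread worker = new Thread(() -> {
            try (Writer out = writer) {
                producer.produce(out, metadata);
            } catch (IOException e) {
                // Sketch only: a real implementation would keep the
                // exception and rethrow it from read().
            }
        });
        worker.setDaemon(true);
        worker.start();

        // Block until the first character (or end of stream) has been
        // buffered, then rewind so that no text is lost to the caller.
        reader.mark(1);
        reader.read();
        reader.reset();
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> metadata = new ConcurrentHashMap<>();
        try (EagerReader in = new EagerReader((out, md) -> {
            md.put("title", "Example");   // metadata first
            out.write("extracted text");  // then the text
        }, metadata)) {
            // Metadata is readable before any text has been consumed.
            System.out.println("title = " + metadata.get("title"));
            char[] buf = new char[64];
            int n = in.read(buf, 0, buf.length);
            System.out.println("text = " + new String(buf, 0, Math.max(n, 0)));
        }
    }
}
```

Note that the sketch only covers metadata recorded before the first character
of text is produced. Metadata that a format stores after the text begins may
still be missing when the constructor returns, which matches the "almost
always" wording in the description.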