[jira] Issue Comment Edited: (TIKA-203) Earlier metadata extraction in ParsingReader

Daan de Wit (JIRA) Fri, 17 Jul 2009 06:07:43 -0700

    [ https://issues.apache.org/jira/browse/TIKA-203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732512#action_12732512 ]


Daan de Wit edited comment on TIKA-203 at 7/17/09 6:05 AM:
-----------------------------------------------------------

Does not work for me on Ubuntu 8.04 with Sun Java 1.5.0_16 on a single 
processor when parsing certain Word documents.

      was (Author: d.de.wit):
    does not work for me on Ubuntu 8.04 with Sun java 1.5.0_16 on 1 processor
  
> Earlier metadata extraction in ParsingReader
> --------------------------------------------
>
>                 Key: TIKA-203
>                 URL: https://issues.apache.org/jira/browse/TIKA-203
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.3
>
>
> The normal parse() method guarantees that all extracted metadata will be 
> available in the metadata object once the method returns. But since the 
> ParsingReader class runs the parse() method in a background thread, one can 
> only assume that extracted metadata is available once the entire character 
> stream has been consumed. This is troublesome for example when creating 
> Lucene Document objects, as Lucene postpones reading the given character 
> stream until the already constructed Document is passed to an IndexWriter. 
> The result is that (depending on thread scheduling and the structure of the 
> input document format) metadata may not be available for inclusion in the 
> indexed Document.
>
> One way of fixing this issue is to add a small character buffer in 
> ParsingReader, and to make sure that the buffer is filled with extracted text 
> before the ParsingReader constructor returns. This should ensure that 
> relevant document metadata is almost always available, since the majority of 
> document formats have all or most metadata at the beginning of the document 
> stream.
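
The buffering idea in the description above can be sketched with plain JDK classes. This is only an illustration under stated assumptions: PrefetchingReader and TextProducer are hypothetical names, not Tika classes, and the producer callback stands in for a parser that records its metadata before it emits any text. The constructor starts the background thread and then blocks until the first character (or the end of the stream) has reached the buffer, so metadata recorded before that point is visible as soon as the constructor returns.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration only; this is not Tika's ParsingReader. */
final class PrefetchingReader extends Reader {

    /** Hypothetical stand-in for a parser: records metadata, then writes text. */
    interface TextProducer {
        void produce(Map<String, String> metadata, Writer out) throws IOException;
    }

    private final Map<String, String> metadata = new ConcurrentHashMap<>();
    private final BufferedReader reader;

    PrefetchingReader(TextProducer producer) throws IOException {
        PipedReader pipeIn = new PipedReader();
        PipedWriter pipeOut = new PipedWriter(pipeIn);
        reader = new BufferedReader(pipeIn);

        // Run the producer in a background thread; closing the writer
        // signals end of stream to the reading side.
        Thread worker = new Thread(() -> {
            try (Writer out = pipeOut) {
                producer.produce(metadata, out);
            } catch (IOException e) {
                // A real implementation would report this to the reader.
            }
        });
        worker.setDaemon(true);
        worker.start();

        // Block until the first character (or end of stream) has been
        // buffered, then rewind so the caller still sees the whole text.
        reader.mark(1);
        reader.read();
        reader.reset();
    }

    Map<String, String> getMetadata() {
        return metadata;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    public static void main(String[] args) throws IOException {
        try (PrefetchingReader r = new PrefetchingReader((meta, out) -> {
            meta.put("title", "Example");  // metadata is recorded first
            out.write("Hello, world");     // text follows
        })) {
            // Metadata is already available, before any text is consumed.
            System.out.println("title = " + r.getMetadata().get("title"));
            StringBuilder text = new StringBuilder();
            char[] buf = new char[64];
            for (int n = r.read(buf, 0, buf.length); n != -1; n = r.read(buf, 0, buf.length)) {
                text.append(buf, 0, n);
            }
            System.out.println("text = " + text);
        }
    }
}
```

The sketch only guarantees metadata that the producer records before its first character of output, which matches the "almost always available" wording of the description; metadata found later in the stream still requires reading further.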

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
