[jira] Commented: (TIKA-203) Earlier metadata extraction in ParsingReader

Daan de Wit (JIRA) Fri, 17 Jul 2009 06:25:42 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732522#action_12732522
 ]


Daan de Wit commented on TIKA-203:
----------------------------------

Created TIKA-262. The problem is also reproducible on WinXP, so it does not 
seem to be related to the OS.

> Earlier metadata extraction in ParsingReader
> --------------------------------------------
>
>                 Key: TIKA-203
>                 URL: https://issues.apache.org/jira/browse/TIKA-203
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.3
>
>         Attachments: lipsum.doc
>
>
> The normal parse() method guarantees that all extracted metadata will be 
> available in the metadata object once the method returns. But since the 
> ParsingReader class runs the parse() method in a background thread, one can 
> only assume that extracted metadata is available once the entire character 
> stream has been consumed. This is troublesome, for example, when creating 
> Lucene Document objects, as Lucene postpones reading the given character 
> stream until the already constructed Document is passed to an IndexWriter. 
> The result is that (depending on thread scheduling and the structure of the 
> input document format) metadata may not be available for inclusion in the 
> indexed Document.
> One way of fixing this issue is to add a small character buffer in 
> ParsingReader, and to make sure that the buffer is filled with extracted text 
> before the ParsingReader constructor returns. This should ensure that 
> relevant document metadata is almost always available, since the majority of 
> document formats have all or most metadata at the beginning of the document 
> stream.
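
The buffering idea in the quoted description could be sketched roughly as 
follows. This is a minimal illustration using only java.io, not Tika's 
actual ParsingReader code: the TextProducer interface and the EagerReader 
class are hypothetical names standing in for the parser and the reader. 
The sketch assumes the producer records its metadata before it writes any 
text, which mirrors the "metadata at the beginning of the stream" case.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.Map;

/** Hypothetical stand-in for a parser: records metadata, then writes text. */
interface TextProducer {
    void produce(Writer out, Map<String, String> metadata) throws IOException;
}

/** Sketch of a reader that fills a small buffer before its constructor returns. */
class EagerReader extends Reader {
    private final BufferedReader reader;

    EagerReader(TextProducer producer, Map<String, String> metadata) throws IOException {
        PipedReader pipe = new PipedReader();
        PipedWriter writer = new PipedWriter(pipe);
        this.reader = new BufferedReader(pipe);

        // Run the producer in a background thread; closing the writer
        // signals end of stream to the reading side.
        Thread worker = new Thread(() -> {
            try (Writer out = writer) {
                producer.produce(out, metadata);
            } catch (IOException e) {
                // A real implementation would report this to the consumer.
            }
        });
        worker.setDaemon(true);
        worker.start();

        // Block until the first character (or end of stream) arrives,
        // then rewind so the caller still sees the complete text.
        reader.mark(1);
        reader.read();
        reader.reset();
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
```

With this arrangement, any metadata the producer stores before emitting its 
first character is visible to the caller as soon as the constructor returns. 
Metadata that is only discovered later in the stream is still not 
guaranteed, which matches the "almost always" wording above. The metadata 
map should be safe for concurrent use, for example a ConcurrentHashMap.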

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
