[jira] [Comment Edited] (TIKA-1353) OpenDocumentParser doesn't correctly process metadata

Steve R (JIRA) Mon, 23 Jun 2014 20:11:29 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041639#comment-14041639
 ]


Steve R edited comment on TIKA-1353 at 6/24/14 3:09 AM:
--------------------------------------------------------

Ignore my suggested code example; it clearly doesn't work.

My question now is: why is the following code commented out? It seems to 
work.

        /*
         ZipFile zipFile;
         if (stream instanceof TikaInputStream) {
             TikaInputStream tis = (TikaInputStream) stream;
             Object container = ((TikaInputStream) stream).getOpenContainer();
             if (container instanceof ZipFile) {
                 zipFile = (ZipFile) container;
             } else if (tis.hasFile()) {
                 zipFile = new ZipFile(tis.getFile());
             }
         }
         */

        // TODO: if incoming IS is a TIS with a file
        // associated, we should open ZipFile so we can
        // visit metadata, mimetype first; today we lose
        // all the metadata if meta.xml is hit after
        // content.xml in the stream. Then we can still
        // read-once for the content.xml.
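
For illustration, here is a minimal sketch of what the TODO describes, using
only java.util.zip rather than any Tika classes. The class and method names
(MetaFirstReader, readEntry, readMetaThenContent) are hypothetical and not part
of Tika. The point is only that a ZipFile allows entries to be fetched by name,
so meta.xml can be read before content.xml regardless of their order in the
archive.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not Tika code: shows random access by entry name.
public class MetaFirstReader {

    /** Reads one named entry fully as UTF-8 text, or returns null if it is absent. */
    static String readEntry(ZipFile zipFile, String name) throws IOException {
        ZipEntry entry = zipFile.getEntry(name);
        if (entry == null) {
            return null;
        }
        try (InputStream in = zipFile.getInputStream(entry)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Returns {meta, content}; meta.xml is always fetched first. */
    static String[] readMetaThenContent(File file) throws IOException {
        try (ZipFile zipFile = new ZipFile(file)) {
            String meta = readEntry(zipFile, "meta.xml");
            String content = readEntry(zipFile, "content.xml");
            return new String[] { meta, content };
        }
    }
}
```

This only applies when a file is available behind the stream, which is the
case the TODO singles out; a pure stream still has to be read in archive order.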


was (Author: svramusi):
Ignore my suggested code example, it clearly doesn't work.

My question is now this, why is the following code commented out? It seems to 
work.

        /*
         * ZipFile zipFile; if (stream instanceof TikaInputStream) { 
TikaInputStream tis = (TikaInputStream) stream;
         * Object container = ((TikaInputStream) stream).getOpenContainer(); if 
(container instanceof ZipFile) { zipFile
         * = (ZipFile) container; } else if (tis.hasFile()) { zipFile = new 
ZipFile(tis.getFile()); } }
         */

        // TODO: if incoming IS is a TIS with a file
        // associated, we should open ZipFile so we can
        // visit metadata, mimetype first; today we lose
        // all the metadata if meta.xml is hit after
        // content.xml in the stream. Then we can still
        // read-once for the content.xml.

> OpenDocumentParser doesn't correctly process metadata
> -----------------------------------------------------
>
>                 Key: TIKA-1353
>                 URL: https://issues.apache.org/jira/browse/TIKA-1353
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 1.5
>            Reporter: Steve R
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using OpenDocumentParser, the metadata isn't set correctly. When using 
> it to write an HTML file, the only metadata that it knows about is the 
> content type, because it is set ahead of time.
> The problem is that when iterating over the zip contents, meta.xml isn't 
> processed before content.xml. The metadata set on the parse object is correct 
> after parse() returns; however, the contents of the resulting HTML file are 
> missing all of the metadata.
> Changing the code to be 
> boolean parsedMetaData = false;
> boolean delayLoadContent = false;
> while (entry != null) {
> ...
> } else if (entry.getName().equals("meta.xml")) {
>                 meta.parse(zip, new DefaultHandler(), metadata, context);
>                 parsedMetaData = true;
>                 if (delayLoadContent) {
>                     if (content instanceof OpenDocumentContentParser) {
>                         ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
>                     } else {
>                         // Foreign content parser was set:
>                         content.parse(zip, handler, metadata, context);
>                     }
>                 }
>             } else if (entry.getName().endsWith("content.xml")) {
>                 if (!parsedMetaData) {
>                     delayLoadContent = true;
>                 } else {
>                     if (content instanceof OpenDocumentContentParser) {
>                         ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
>                     } else {
>                         // Foreign content parser was set:
>                         content.parse(zip, handler, metadata, context);
>                     }
>                 }
>             }
> works as expected.
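
A note on the quoted proposal: on a ZipInputStream, by the time the meta.xml
entry is reached the stream is positioned inside meta.xml, so a deferred
content parser that is handed `zip` at that point would be reading meta.xml's
bytes rather than content.xml's. One way to defer content handling on a pure
stream is to buffer content.xml in memory and replay it once meta.xml has been
handled. The sketch below assumes that approach; everything in it
(BufferedContentOrder, handle, processingOrder) is hypothetical and uses only
java.util.zip. handle() stands in for the real parsers, and only a top-level
content.xml is considered.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch, not Tika code: buffer content.xml until meta.xml is seen.
public class BufferedContentOrder {

    /** Drains the stream to its end (for a ZipInputStream: end of the current entry). */
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Stand-in for a parser: consumes the stream and records "name(byteCount)". */
    static void handle(String name, InputStream in, List<String> order) throws IOException {
        order.add(name + "(" + drain(in).length + ")");
    }

    /** Returns the order in which entries were handed to handle(). */
    static List<String> processingOrder(InputStream stream) throws IOException {
        List<String> order = new ArrayList<>();
        byte[] pendingContent = null;   // content.xml bytes seen before meta.xml
        boolean parsedMetaData = false;
        try (ZipInputStream zip = new ZipInputStream(stream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("meta.xml")) {
                    handle(name, zip, order);
                    parsedMetaData = true;
                    if (pendingContent != null) {
                        // Replay the buffered copy; the zip stream itself cannot rewind.
                        handle("content.xml", new ByteArrayInputStream(pendingContent), order);
                        pendingContent = null;
                    }
                } else if (name.equals("content.xml")) {
                    if (parsedMetaData) {
                        handle(name, zip, order);
                    } else {
                        pendingContent = drain(zip);
                    }
                }
            }
        }
        if (pendingContent != null) {
            // The archive had no meta.xml; handle the content anyway.
            handle("content.xml", new ByteArrayInputStream(pendingContent), order);
        }
        return order;
    }
}
```

The trade-off is memory: content.xml is held in full until meta.xml turns up,
which may be unacceptable for large documents.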



