[ https://issues.apache.org/jira/browse/TIKA-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041639#comment-14041639 ]
Steve R edited comment on TIKA-1353 at 6/24/14 3:09 AM:
--------------------------------------------------------

Ignore my suggested code example; it clearly doesn't work.

My question is now this: why is the following code commented out? It seems to work.

    /*
    ZipFile zipFile;
    if (stream instanceof TikaInputStream) {
        TikaInputStream tis = (TikaInputStream) stream;
        Object container = ((TikaInputStream) stream).getOpenContainer();
        if (container instanceof ZipFile) {
            zipFile = (ZipFile) container;
        } else if (tis.hasFile()) {
            zipFile = new ZipFile(tis.getFile());
        }
    }
    */

    // TODO: if incoming IS is a TIS with a file
    // associated, we should open ZipFile so we can
    // visit metadata, mimetype first; today we lose
    // all the metadata if meta.xml is hit after
    // content.xml in the stream. Then we can still
    // read-once for the content.xml.

was (Author: svramusi):

Ignore my suggested code example, it clearly doesn't work.

My question is now this, why is the following code commented out? It seems to work.

    /*
     * ZipFile zipFile; if (stream instanceof TikaInputStream) { TikaInputStream tis = (TikaInputStream) stream;
     * Object container = ((TikaInputStream) stream).getOpenContainer(); if (container instanceof ZipFile) { zipFile
     * = (ZipFile) container; } else if (tis.hasFile()) { zipFile = new ZipFile(tis.getFile()); } }
     */

    // TODO: if incoming IS is a TIS with a file
    // associated, we should open ZipFile so we can
    // visit metadata, mimetype first; today we lose
    // all the metadata if meta.xml is hit after
    // content.xml in the stream. Then we can still
    // read-once for the content.xml.

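For illustration, the idea in the TODO above (open the archive as a ZipFile so meta.xml can be visited before content.xml, whatever order the entries were stored in) could look roughly like the sketch below. It uses only java.util.zip and does not touch the Tika API; the class name MetaFirstSketch and the helpers readEntry, visitMetaFirst and writeSampleArchive are hypothetical names made up for this example, and the sketch assumes the document is available as a file on disk.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch, not Tika code: random access through ZipFile lets us
// read meta.xml first even when content.xml is stored earlier in the archive.
public class MetaFirstSketch {

    /** Reads one named entry fully, or returns null if the archive lacks it. */
    public static String readEntry(ZipFile zip, String name) throws IOException {
        ZipEntry entry = zip.getEntry(name);
        if (entry == null) {
            return null;
        }
        try (InputStream in = zip.getInputStream(entry)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Returns the entry names in the order visited: meta.xml, then content.xml. */
    public static List<String> visitMetaFirst(File file) throws IOException {
        List<String> visited = new ArrayList<>();
        try (ZipFile zip = new ZipFile(file)) {
            for (String name : new String[] {"meta.xml", "content.xml"}) {
                String xml = readEntry(zip, name);
                if (xml != null) {
                    // A real parser would pass the entry to its metadata or
                    // content handler here; the sketch only records the order.
                    visited.add(name);
                }
            }
        }
        return visited;
    }

    /** Writes a tiny archive that stores content.xml BEFORE meta.xml. */
    public static void writeSampleArchive(File file) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(file))) {
            for (String name : new String[] {"content.xml", "meta.xml"}) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(("<" + name + "/>").getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File sample = File.createTempFile("meta-first", ".zip");
        sample.deleteOnExit();
        writeSampleArchive(sample);
        System.out.println(visitMetaFirst(sample)); // prints [meta.xml, content.xml]
    }
}
```

Because ZipFile looks entries up by name, the stored order stops mattering. A plain stream offers no such lookup, which is why the TODO limits this to the case where a file is associated with the input.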
> OpenDocumentParser doesn't correctly process metadata
> -----------------------------------------------------
>
>                 Key: TIKA-1353
>                 URL: https://issues.apache.org/jira/browse/TIKA-1353
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 1.5
>            Reporter: Steve R
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using OpenDocumentParser, the metadata isn't set correctly. When using
> it to write an HTML file, the only metadata that it knows about is content
> type because it is set ahead of time.
> The problem is that when iterating over the zip contents, meta.xml isn't
> processed before content.xml. The metadata set on the parse object is correct
> after parse() returns, however the contents of the resulting HTML file are
> missing all of the metadata.
> Changing the code to be
>     boolean parsedMetaData = false;
>     boolean delayLoadContent = false;
>     while (entry != null) {
>         ...
>         } else if (entry.getName().equals("meta.xml")) {
>             meta.parse(zip, new DefaultHandler(), metadata, context);
>             parsedMetaData = true;
>             if (delayLoadContent) {
>                 if (content instanceof OpenDocumentContentParser) {
>                     ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
>                 } else {
>                     // Foreign content parser was set:
>                     content.parse(zip, handler, metadata, context);
>                 }
>             }
>         } else if (entry.getName().endsWith("content.xml")) {
>             if (!parsedMetaData) {
>                 delayLoadContent = true;
>             } else {
>                 if (content instanceof OpenDocumentContentParser) {
>                     ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
>                 } else {
>                     // Foreign content parser was set:
>                     content.parse(zip, handler, metadata, context);
>                 }
>             }
>         }
> works as expected.

--
This message was sent by Atlassian JIRA
(v6.2#6252)