[jira] [Comment Edited] (TIKA-1854) Include the storage class ID of documents embedded in MS Office documents
[ https://issues.apache.org/jira/browse/TIKA-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134180#comment-15134180 ]

Daniel Bonniot de Ruisselet edited comment on TIKA-1854 at 2/5/16 1:47 PM:
---

The documents I'm processing sometimes have embedded files representing chemical structures. Having the storage class IDs allows me to know which embedded documents contain chemical structures, and which format they are in, so that I can process them accordingly.

What I mean about the content type is that metadata.get(Metadata.CONTENT_TYPE) already returns, for instance, "application/vnd.ms-excel" for embedded Excel documents. However, it is not populated for chemical or other formats. I might be mistaken, but it seems to me that the documentation you linked to is about the MIME type of the main (container) document. Is the same mechanism used to determine the MIME type of the embedded documents?

I think the specific formats I'm interested in are not in widespread use, so for contributions to Tika I'm rather focused on a generic solution. Getting the storage class IDs will definitely be useful in such cases. If custom MIME types worked for embedded documents, that could also be useful.

was (Author: dbr):
The documents I'm processing sometimes have embedded files representing chemical structures. Having the storage class IDs allows me to know which embedded chemical structures, and which format they are in, so that I can process them accordingly.

What I mean about the content type is that metadata.get(Metadata.CONTENT_TYPE) already returns, for instance, "application/vnd.ms-excel" for embedded Excel documents. However, it is not populated for chemical or other formats. I might be mistaken, but it seems to me that the documentation you linked to is about the MIME type of the main (container) document. Is the same mechanism used to determine the MIME type of the embedded documents?

I think the specific formats I'm interested in are not in widespread use, so for contributions to Tika I'm rather focused on a generic solution. Getting the storage class IDs will definitely be useful in such cases. If custom MIME types worked for embedded documents, that could also be useful.

> Include the storage class ID of documents embedded in MS Office documents
> ---
>
> Key: TIKA-1854
> URL: https://issues.apache.org/jira/browse/TIKA-1854
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Daniel Bonniot de Ruisselet
> Assignee: Tim Allison
> Attachments: class-id.patch
>
> When processing embedded documents using an EmbeddedDocumentExtractor, the
> storage class ID of the embedded document would be useful metadata to have,
> but it's currently missing.
> I'll promptly attach a patch implementing and testing this new feature.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
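Editorial sketch (not part of the thread): the use case described in the comment above, routing embedded objects by storage class ID, could look roughly like the lookup below. It is standalone JDK code and does not call Tika. The class name `StorageClassIdRouter` and the method `mimeTypeFor` are invented for illustration, the two Office entries are the class IDs commonly documented for Excel 97-2003 sheets and Word 97-2003 documents, and the last entry is a made-up placeholder standing in for a chemical-structure format.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Standalone sketch: pick a MIME type for an embedded object from its storage class ID. */
final class StorageClassIdRouter {

    // Class ID -> MIME type. The last entry is a hypothetical placeholder, not a
    // real registration; replace it with the class ID your documents actually carry.
    private static final Map<String, String> MIME_BY_CLASS_ID = Map.of(
            "{00020820-0000-0000-C000-000000000046}", "application/vnd.ms-excel",
            "{00020906-0000-0000-C000-000000000046}", "application/msword",
            "{11111111-2222-3333-4444-555555555555}", "chemical/x-hypothetical");

    /** Returns the MIME type registered for the class ID, or empty if it is unknown. */
    static Optional<String> mimeTypeFor(String classId) {
        if (classId == null) {
            return Optional.empty();
        }
        // Normalise case and braces so "0002...46" and "{0002...46}" both match.
        String key = classId.trim().toUpperCase(Locale.ROOT);
        if (!key.startsWith("{")) {
            key = "{" + key + "}";
        }
        return Optional.ofNullable(MIME_BY_CLASS_ID.get(key));
    }

    public static void main(String[] args) {
        System.out.println(mimeTypeFor("{00020820-0000-0000-c000-000000000046}").orElse("unknown"));
        System.out.println(mimeTypeFor("{DEADBEEF-0000-0000-0000-000000000000}").orElse("unknown"));
    }
}
```

A table like this is generic in the sense the comment asks for: the parser only has to expose the class ID, and each client keeps its own mapping for formats that are too rare to ship with Tika.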
[jira] [Updated] (TIKA-1854) Include the storage class ID of documents embedded in MS Office documents
[ https://issues.apache.org/jira/browse/TIKA-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Bonniot de Ruisselet updated TIKA-1854:
---
    Attachment: class-id.patch

> Include the storage class ID of documents embedded in MS Office documents
> ---
>
> Key: TIKA-1854
> URL: https://issues.apache.org/jira/browse/TIKA-1854
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Daniel Bonniot de Ruisselet
> Attachments: class-id.patch
>
> When processing embedded documents using an EmbeddedDocumentExtractor, the
> storage class ID of the embedded document would be useful metadata to have,
> but it's currently missing.
> I'll promptly attach a patch implementing and testing this new feature.
[jira] [Commented] (TIKA-1854) Include the storage class ID of documents embedded in MS Office documents
[ https://issues.apache.org/jira/browse/TIKA-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133916#comment-15133916 ]

Daniel Bonniot de Ruisselet commented on TIKA-1854:
---

By the way, the Content-Type of the embedded document IS already available, but this only works for some popular formats (e.g. embedded MS Office documents). Is there a way for clients to configure the Content-Type detection for more exotic formats?

> Include the storage class ID of documents embedded in MS Office documents
> ---
>
> Key: TIKA-1854
> URL: https://issues.apache.org/jira/browse/TIKA-1854
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Daniel Bonniot de Ruisselet
> Attachments: class-id.patch
>
> When processing embedded documents using an EmbeddedDocumentExtractor, the
> storage class ID of the embedded document would be useful metadata to have,
> but it's currently missing.
> I'll promptly attach a patch implementing and testing this new feature.
[jira] [Created] (TIKA-1854) Include the storage class ID of documents embedded in MS Office documents
Daniel Bonniot de Ruisselet created TIKA-1854:
---

Summary: Include the storage class ID of documents embedded in MS Office documents
Key: TIKA-1854
URL: https://issues.apache.org/jira/browse/TIKA-1854
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Daniel Bonniot de Ruisselet

When processing embedded documents using an EmbeddedDocumentExtractor, the storage class ID of the embedded document would be useful metadata to have, but it's currently missing.
I'll promptly attach a patch implementing and testing this new feature.
[jira] [Comment Edited] (TIKA-1017) DefaultHtmlMapper misses some safe elements
[ https://issues.apache.org/jira/browse/TIKA-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346702#comment-14346702 ]

Daniel Bonniot de Ruisselet edited comment on TIKA-1017 at 3/4/15 10:36 AM:
---

If we want to preserve the semantics, maybe at least SUB and SUP should be added? For instance, in a scientific document, "a<sub>b</sub>" and "a<sup>b</sup>" might be different concepts, a distinction that is lost if you only get "a b".

If we want to keep all "safe" elements, we could also add at least I, B, EM and STRONG.

It's easy enough to use a custom mapper, so this is not a huge issue, but a good default is always nice. Given the above, maybe only add SUB and SUP?

was (Author: dbr):
If we want to preserve the semantics, maybe at least SUB and SUP should be added? For instance, in a scientific document, "a<sub>b</sub>" and "a<sup>b</sup>" might be different concepts, a distinction that is lost if you only get "a b".

If we want to keep all "safe" elements, we could also add at least I, B, EM and STRONG.

It's easy enough to use another mapper, so this is not a huge issue, but a good default is always nice. Given the above, maybe only add SUB and SUP?

> DefaultHtmlMapper misses some safe elements
> ---
>
> Key: TIKA-1017
> URL: https://issues.apache.org/jira/browse/TIKA-1017
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Daniel Bonniot de Ruisselet
>
> The code of DefaultHtmlMapper says that the list of "safe" elements is based
> on http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
> Elements like and are not included in the safe list. Is this
> intentional (a comment with the rationale would be useful) or should they be
> added?
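Editorial sketch (not part of the thread): the "custom mapper" workaround mentioned above amounts to extending the table of safe elements. The class below is a standalone stand-in and is not Tika's DefaultHtmlMapper; it only borrows the method name `mapSafeElement`, which, as far as I know, is how Tika's HtmlMapper interface names this lookup. The element names passed in `main` are illustrative and are not the actual default list.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Standalone sketch of a "safe element" table; a stand-in, not Tika's own mapper. */
final class SafeElementTable {

    private final Map<String, String> safe = new HashMap<>();

    SafeElementTable(String... elementNames) {
        for (String name : elementNames) {
            // Incoming names are matched upper-case and emitted lower-case.
            safe.put(name.toUpperCase(Locale.ROOT), name.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns the element name to emit, or null if the element is not in the safe list. */
    String mapSafeElement(String name) {
        return safe.get(name.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Hypothetical baseline ("P", "H1") plus the two elements proposed in the comment.
        SafeElementTable table = new SafeElementTable("P", "H1", "SUB", "SUP");
        System.out.println(table.mapSafeElement("sub"));
        System.out.println(table.mapSafeElement("STRONG"));
    }
}
```

With SUB and SUP in the table, the subscript and superscript markup survives mapping; STRONG is still dropped, which matches the narrower proposal at the end of the comment.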
[jira] [Comment Edited] (TIKA-1167) Embedded object not extracted
[ https://issues.apache.org/jira/browse/TIKA-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752209#comment-13752209 ]

Daniel Bonniot de Ruisselet edited comment on TIKA-1167 at 8/28/13 11:47 AM:
---

After further analysis, I think support for such cases probably needs to be done in POI (but comments are welcome if someone has further insight). I posted comments and a tentative patch to this POI bug:
https://issues.apache.org/bugzilla/show_bug.cgi?id=51891

Even if that works out well, it would probably be useful to add a test at the Tika level as well. The OLE parsing seems rather sensitive (understandably, since the format itself looks messy and poorly documented). Also, the integration of POI and Tika seems tight. So it can only help to test that things work at different levels.

was (Author: dbr):
After further analysis, I think support for such cases probably needs to be done in POI (but comments are welcome if someone has further insight). I'm working on submitting an issue and probably a tentative patch there. Will link to it here when it exists.

Even if that works out well, it would probably be useful to add a test at the Tika level as well. The OLE parsing seems rather sensitive (understandably, since the format itself looks messy and poorly documented). Also, the integration of POI and Tika seems tight. So it can only help to test that things work at different levels.

> Embedded object not extracted
> ---
>
> Key: TIKA-1167
> URL: https://issues.apache.org/jira/browse/TIKA-1167
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.4
> Reporter: Daniel Bonniot de Ruisselet
> Priority: Critical
> Fix For: 1.5
>
> Attachments: Doc w Structure that wont extract.docx
>
> For the attached docx, Tika seems to detect the embedded object, as shown by
> this tag:
> {{}}
> However, extraction itself (using -z on the command line, or using the API)
> does not seem to work for this object:
> {{java -jar tika-app-1.4.jar -z Doc\ w\ Structure\ that\ wont\ extract.docx}}
> {{Extracting 'rId9_image1.wmf' (application/x-msmetafile) to /tmp/tika/rId9_image1.wmf}}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1167) Embedded object not extracted
[ https://issues.apache.org/jira/browse/TIKA-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Bonniot de Ruisselet updated TIKA-1167:
---
    Attachment: Doc w Structure that wont extract.docx

> Embedded object not extracted
> ---
>
> Key: TIKA-1167
> URL: https://issues.apache.org/jira/browse/TIKA-1167
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.4
> Reporter: Daniel Bonniot de Ruisselet
> Priority: Critical
> Fix For: 1.5
>
> Attachments: Doc w Structure that wont extract.docx
>
> For the attached docx, Tika seems to detect the embedded object, as shown by
> this tag:
> {{}}
> However, extraction itself (using -z on the command line, or using the API)
> does not seem to work for this object:
> {{java -jar tika-app-1.4.jar -z Doc\ w\ Structure\ that\ wont\ extract.docx}}
> {{Extracting 'rId9_image1.wmf' (application/x-msmetafile) to /tmp/tika/rId9_image1.wmf}}
[jira] [Created] (TIKA-1167) Embedded object not extracted
Daniel Bonniot de Ruisselet created TIKA-1167:
---

Summary: Embedded object not extracted
Key: TIKA-1167
URL: https://issues.apache.org/jira/browse/TIKA-1167
Project: Tika
Issue Type: Bug
Affects Versions: 1.4
Reporter: Daniel Bonniot de Ruisselet
Priority: Critical
Fix For: 1.5
Attachments: Doc w Structure that wont extract.docx

For the attached docx, Tika seems to detect the embedded object, as shown by this tag:
{{}}
However, extraction itself (using -z on the command line, or using the API) does not seem to work for this object:
{{java -jar tika-app-1.4.jar -z Doc\ w\ Structure\ that\ wont\ extract.docx}}
{{Extracting 'rId9_image1.wmf' (application/x-msmetafile) to /tmp/tika/rId9_image1.wmf}}
[jira] [Comment Edited] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694659#comment-13694659 ]

Daniel Bonniot de Ruisselet edited comment on TIKA-1109 at 6/27/13 12:27 PM:
---

I tried it. It broke two tests (same cause): as you mentioned, in Excel the metadata for TikaMetadataKeys.PROTECTED is populated during parsing. I made a change in how that is implemented, and:

{{[INFO] }}
{{[INFO] Building Apache Tika 1.5-SNAPSHOT}}
{{[INFO] }}
{{[INFO]}}
{{[INFO] --- maven-remote-resources-plugin:1.2.1:process (default) @ tika ---}}
{{[INFO] }}
{{[INFO] Reactor Summary:}}
{{[INFO]}}
{{[INFO] Apache Tika parent SUCCESS [0.806s]}}
{{[INFO] Apache Tika core .. SUCCESS [8.418s]}}
{{[INFO] Apache Tika parsers ... SUCCESS [26.857s]}}
{{[INFO] Apache Tika XMP ... SUCCESS [0.789s]}}
{{[INFO] Apache Tika application ... SUCCESS [3.336s]}}
{{[INFO] Apache Tika OSGi bundle ... SUCCESS [1.204s]}}
{{[INFO] Apache Tika server SUCCESS [5.312s]}}
{{[INFO] Apache Tika ... SUCCESS [0.014s]}}
{{[INFO] }}
{{[INFO] BUILD SUCCESS}}
{{[INFO] }}
{{[INFO] Total time: 47.498s}}
{{[INFO] Finished at: Thu Jun 27 14:10:50 CEST 2013}}
{{[INFO] Final Memory: 27M/1930M}}
{{[INFO] }}
{{dbonniot@naming:~/world/tika$ svn diff | diffstat}}
{{main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java | 11 -}}
{{main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java | 36 ++}}
{{test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java | 56 ++}}
{{3 files changed, 74 insertions(+), 29 deletions(-)}}
{{dbonniot@naming:~/world/tika$ svn diff > /tmp/TIKA-1109.patch}}

The logic in OOXMLExtractorFactory is now simpler, since I could remove the extra shielding, which I suppose was made necessary by the previous ordering. And the metadata for OOXML formats is now available at parse time, as tested by the added test to OOXMLParserTest :)

was (Author: dbr):
I tried it. It broke two tests (same cause): as you mentioned, in Excel the metadata for TikaMetadataKeys.PROTECTED is populated during parsing. I made a change in how that is implemented, and:

{{[INFO] }}
{{[INFO] Building Apache Tika 1.5-SNAPSHOT}}
{{[INFO] }}
{{[INFO]}}
{{[INFO] --- maven-remote-resources-plugin:1.2.1:process (default) @ tika ---}}
{{[INFO] }}
{{[INFO] Reactor Summary:}}
{{[INFO]}}
{{[INFO] Apache Tika parent SUCCESS [0.806s]}}
{{[INFO] Apache Tika core .. SUCCESS [8.418s]}}
{{[INFO] Apache Tika parsers ... SUCCESS [26.857s]}}
{{[INFO] Apache Tika XMP ... SUCCESS [0.789s]}}
{{[INFO] Apache Tika application ... SUCCESS [3.336s]}}
{{[INFO] Apache Tika OSGi bundle ... SUCCESS [1.204s]}}
{{[INFO] Apache Tika server SUCCESS [5.312s]}}
{{[INFO] Apache Tika ... SUCCESS [0.014s]}}
{{[INFO] }}
{{[INFO] BUILD SUCCESS}}
{{[INFO] }}
{{[INFO] Total time: 47.498s}}
{{[INFO] Finished at: Thu Jun 27 14:10:50 CEST 2013}}
{{[INFO] Final Memory: 27M/1930M}}
{{[INFO] }}
{{dbonniot@naming:~/world/tika$ svn diff | diffstat}}
{{ main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java | 11 -}}
{{ main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java | 36 ++}}
{{ test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java | 56 ++}}
{{ 3 files changed, 74 insertions(+), 29 deletions(-)}}
{{dbonniot@naming:~/world/tika$ svn diff > /tmp/TIKA-1109.patch}}

The logic in OOXMLExtractorFactory is now sim
[jira] [Updated] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Bonniot de Ruisselet updated TIKA-1109:
---
    Attachment: TIKA-1109.patch

> Metadata not extracted before the content in OOXML (pptx)
> ---
>
> Key: TIKA-1109
> URL: https://issues.apache.org/jira/browse/TIKA-1109
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Daniel Bonniot de Ruisselet
> Priority: Critical
> Fix For: 1.5
>
> Attachments: TIKA-1109.patch
>
> It seems that when processing OOXML documents, the metadata is only read
> after the text. This means it's impossible to use the metadata while processing
> the text. I think it would be more useful to have the metadata populated
> first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only as metadata:
>
> content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
>
> while there is more metadata in the file (e.g. Attachment Test).
[jira] [Updated] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Bonniot de Ruisselet updated TIKA-1109:
---
    Summary: Metadata not extracted before the content in OOXML (pptx) (was: Metadata not extracted before the context in OOXML (pptx))

> Metadata not extracted before the content in OOXML (pptx)
> ---
>
> Key: TIKA-1109
> URL: https://issues.apache.org/jira/browse/TIKA-1109
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Daniel Bonniot de Ruisselet
> Priority: Critical
> Fix For: 1.5
>
> It seems that when processing OOXML documents, the metadata is only read
> after the text. This means it's impossible to use the metadata while processing
> the text. I think it would be more useful to have the metadata populated
> first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only as metadata:
>
> content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
>
> while there is more metadata in the file (e.g. Attachment Test).
[jira] [Commented] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694521#comment-13694521 ]

Daniel Bonniot de Ruisselet commented on TIKA-1109:
---

Nick, thanks a lot for your explanation. If I understand correctly, what you are saying is that in general it cannot be guaranteed that the metadata is available during parsing, since whether that is possible depends on the format. That makes complete sense.

Here I am asking specifically about the OOXML formats, with an example pptx file. As I understand it, the OOXML formats are zip files containing XML files. In test-classes/test-documents/testPPT.pptx, the metadata seems to be inside docProps/core.xml. Would it be possible to read the metadata from there first, before starting the parsing?

> Metadata not extracted before the context in OOXML (pptx)
> ---
>
> Key: TIKA-1109
> URL: https://issues.apache.org/jira/browse/TIKA-1109
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Daniel Bonniot de Ruisselet
> Priority: Critical
> Fix For: 1.5
>
> It seems that when processing OOXML documents, the metadata is only read
> after the text. This means it's impossible to use the metadata while processing
> the text. I think it would be more useful to have the metadata populated
> first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only as metadata:
>
> content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
>
> while there is more metadata in the file (e.g. Attachment Test).
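Editorial sketch (not part of the thread): the idea in the comment above, reading docProps/core.xml out of the zip before any content is parsed, can be tried with the JDK alone. The code below does not use Tika or POI and is not how either library implements it. The class name `CorePropertiesPeek`, its methods, and the generated sample package are assumptions made for illustration; the part name and the Dublin Core namespace are those used by OOXML core properties.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** JDK-only sketch: read dc:title from docProps/core.xml without parsing the body. */
final class CorePropertiesPeek {

    private static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns the dc:title found in docProps/core.xml, or null if there is none. */
    static String readTitle(InputStream ooxml) {
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                if (!"docProps/core.xml".equals(entry.getName())) {
                    continue; // skip slides, relationships and other parts
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations, since the file may be untrusted.
                factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
                NodeList titles = factory.newDocumentBuilder().parse(zip)
                        .getElementsByTagNameNS(DC_NS, "title");
                return titles.getLength() == 0 ? null : titles.item(0).getTextContent();
            }
            return null;
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not read docProps/core.xml", e);
        }
    }

    /** Builds a tiny stand-in package holding only core.xml (demo data, title not escaped). */
    static byte[] samplePackage(String title) {
        String xml = "<cp:coreProperties xmlns:cp=\"http://schemas.openxmlformats.org/package/"
                + "2006/metadata/core-properties\" xmlns:dc=\"" + DC_NS + "\">"
                + "<dc:title>" + title + "</dc:title></cp:coreProperties>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("docProps/core.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] pkg = samplePackage("Attachment Test");
        System.out.println(readTitle(new ByteArrayInputStream(pkg)));
    }
}
```

One caveat on the design: a ZipInputStream visits entries in storage order, so on a pure stream the slide parts may come before core.xml. Reading the metadata first is straightforward only when the package is available as a file or buffer that allows random access.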
[jira] [Comment Edited] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635096#comment-13635096 ]

Daniel Bonniot de Ruisselet edited comment on TIKA-1109 at 4/18/13 11:24 AM:
---

Nick, thanks a lot for your answer. I do use the API, and I see the same behaviour: when my ContentHandler is called on the text data, the metadata is not set (yet). It is only set when endDocument is called. Am I doing something wrong?

Looking at:
http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java?revision=1339390&view=markup
and specifically:

    // We need to get the content first, but not end
    // the document just yet
    EndDocumentShieldingContentHandler handler =
        new EndDocumentShieldingContentHandler(baseHandler);
    extractor.getXHTML(handler, metadata, context);

    // Now we can get the metadata
    extractor.getMetadataExtractor().extract(metadata);

It seems to me that the metadata is read only after the text, which explains the behaviour. Why is this needed? Am I misunderstanding something?

was (Author: dbr):
Nick, thanks a lot for your answer. I do use the API, and I see the same behaviour: when my ContentHandler is called on the text data, the metadata is not set (yet). It is only set when endDocument is called. Am I doing something wrong?

Looking at:
http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java?revision=1339390&view=markup
and specifically:

> Metadata not extracted before the context in OOXML (pptx)
> ---
>
> Key: TIKA-1109
> URL: https://issues.apache.org/jira/browse/TIKA-1109
> Project: Tika
> Issue Type: Bug
> Reporter: Daniel Bonniot de Ruisselet
> Priority: Critical
> Fix For: 1.4
>
> It seems that when processing OOXML documents, the metadata is only read
> after the text. This means it's impossible to use the metadata while processing
> the text. I think it would be more useful to have the metadata populated
> first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only as metadata:
>
> content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
>
> while there is more metadata in the file (e.g. Attachment Test).
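Editorial sketch (not part of the thread): the symptom reported above, a ContentHandler that sees text before any metadata exists, follows directly from the order of the two calls quoted from OOXMLExtractorFactory. The standalone demo below reproduces only that ordering effect using the SAX classes bundled with the JDK. `emitContent` and `extractMetadata` are hypothetical stand-ins for the two steps, not Tika methods, and a plain Map replaces Tika's Metadata object.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** JDK-only sketch: what a ContentHandler observes under the two call orders. */
final class MetadataOrderingDemo {

    /** Records, for each text event, whether the title was already populated. */
    static final class ProbeHandler extends DefaultHandler {
        private final Map<String, String> metadata;
        final List<Boolean> titleSeenDuringText = new ArrayList<>();

        ProbeHandler(Map<String, String> metadata) {
            this.metadata = metadata;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            titleSeenDuringText.add(metadata.containsKey("title"));
        }
    }

    // Hypothetical stand-in for the content step: emits one run of text.
    static void emitContent(ContentHandler handler) throws SAXException {
        char[] text = "slide text".toCharArray();
        handler.startDocument();
        handler.characters(text, 0, text.length);
        handler.endDocument();
    }

    // Hypothetical stand-in for the metadata step.
    static void extractMetadata(Map<String, String> metadata) {
        metadata.put("title", "Attachment Test");
    }

    /** Runs both steps in the requested order and reports what the handler saw. */
    static boolean titleVisibleDuringText(boolean metadataFirst) {
        Map<String, String> metadata = new HashMap<>();
        ProbeHandler probe = new ProbeHandler(metadata);
        try {
            if (metadataFirst) {
                extractMetadata(metadata);
                emitContent(probe);
            } else {
                emitContent(probe);
                extractMetadata(metadata);
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return probe.titleSeenDuringText.get(0);
    }

    public static void main(String[] args) {
        System.out.println("content first:  " + titleVisibleDuringText(false));
        System.out.println("metadata first: " + titleVisibleDuringText(true));
    }
}
```

In the content-first order the handler never sees the title while text is flowing, which is the behaviour described in the comment; swapping the two steps is the change the issue asks for.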
[jira] [Updated] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Bonniot de Ruisselet updated TIKA-1109: -- Description: It seems that when processing OOXML documents, the metadata is only read after the text. This means it's impossible to use the metadata while processing the text. I think it would be more useful to have the metadata populated first. As a symptom: java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx outputs only as metadata: while there is more metadata in the file (e.g. Attachment Test). was: It seems that when processing OOXML documents, the metadata is only read after the text. This means it's impossible to use the metadata while processing the text. I think it would be more useful to have the metadata populated first. As a symptom: java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx outputs only as metadata: > Metadata not extracted before the context in OOXML (pptx) > - > > Key: TIKA-1109 > URL: https://issues.apache.org/jira/browse/TIKA-1109 > Project: Tika > Issue Type: Bug >Reporter: Daniel Bonniot de Ruisselet >Priority: Critical > Fix For: 1.4 > > > It seems that when processing OOXML documents, the metadata is only read > after the text. This means it's impossible to use the metadata while processing > the text. I think it would be more useful to have the metadata populated > first. > As a symptom: > java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx > outputs only as metadata: > > content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/> > > while there is more metadata in the file (e.g. Attachment > Test).
[jira] [Created] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)
Daniel Bonniot de Ruisselet created TIKA-1109: - Summary: Metadata not extracted before the context in OOXML (pptx) Key: TIKA-1109 URL: https://issues.apache.org/jira/browse/TIKA-1109 Project: Tika Issue Type: Bug Reporter: Daniel Bonniot de Ruisselet Priority: Critical Fix For: 1.4 It seems that when processing OOXML documents, the metadata is only read after the text. This means it's impossible to use the metadata while processing the text. I think it would be more useful to have the metadata populated first. As a symptom: java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx outputs only as metadata:
[jira] [Created] (TIKA-1108) Represent individual slides in pptx
Daniel Bonniot de Ruisselet created TIKA-1108: - Summary: Represent individual slides in pptx Key: TIKA-1108 URL: https://issues.apache.org/jira/browse/TIKA-1108 Project: Tika Issue Type: Improvement Reporter: Daniel Bonniot de Ruisselet Fix For: 1.4 When parsing ppt, Tika produces for each slide: However, for pptx these seem to be missing; all the text is directly under .
[jira] [Created] (TIKA-1017) DefaultHtmlMapper misses some safe elements
Daniel Bonniot de Ruisselet created TIKA-1017: - Summary: DefaultHtmlMapper misses some safe elements Key: TIKA-1017 URL: https://issues.apache.org/jira/browse/TIKA-1017 Project: Tika Issue Type: Bug Reporter: Daniel Bonniot de Ruisselet The code of DefaultHtmlMapper says that the list of "safe" elements is based on http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd Elements like and are not included in the safe list. Is this intentional (a comment with the rationale would be useful) or should they be added?
[jira] [Commented] (TIKA-820) Locator is unset for HTML parser
[ https://issues.apache.org/jira/browse/TIKA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460573#comment-13460573 ] Daniel Bonniot de Ruisselet commented on TIKA-820: -- Hi Ken - Thanks for looking at the patch. I have no idea whether this is the only missing delegating call; it just seemed wrong to me not to do it in TextContentHandler. > Locator is unset for HTML parser > > > Key: TIKA-820 > URL: https://issues.apache.org/jira/browse/TIKA-820 > Project: Tika > Issue Type: Bug > Components: general, parser >Affects Versions: 1.0 >Reporter: Daniel Bonniot de Ruisselet >Assignee: Ken Krugler > Labels: patch > Fix For: 1.3 > > Attachments: text-locator.patch > > > The HtmlParser does not call setDocumentLocator(Locator locator) on the > user's content handler. > Patch and unit test attached.
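To illustrate what a "delegating call" for the locator looks like, here is a minimal sketch using only the JDK's SAX classes. It is not the attached text-locator.patch and not Tika's TextContentHandler; `ForwardingHandler` and `UserHandler` are hypothetical names. The point is that a wrapping handler which does not override setDocumentLocator silently drops the locator, so the user's handler never sees line information.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class LocatorForwardingDemo {

    /** Hypothetical decorator that forwards events to a wrapped handler. */
    static class ForwardingHandler extends DefaultHandler {
        private final ContentHandler delegate;

        ForwardingHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void setDocumentLocator(Locator locator) {
            // The delegating call: without it the delegate's locator stays null.
            delegate.setDocumentLocator(locator);
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            delegate.startElement(uri, localName, qName, atts);
        }
    }

    /** Hypothetical user handler that records the line of the <item> element. */
    static class UserHandler extends DefaultHandler {
        private Locator locator;
        int lineOfItem = -1; // stays -1 if no locator was ever received

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            if ("item".equals(qName) && locator != null) {
                lineOfItem = locator.getLineNumber();
            }
        }
    }

    static int lineOfItem(String xml) throws Exception {
        UserHandler user = new UserHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new ForwardingHandler(user));
        return user.lineOfItem;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(lineOfItem("<root>\n<item/>\n</root>"));
    }
}
```

Removing the setDocumentLocator override from `ForwardingHandler` leaves `lineOfItem` at -1, which is the symptom the issue describes for the user's content handler.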
[jira] [Commented] (TIKA-946) Improve how the PPTX parser uses XLSF from POI
[ https://issues.apache.org/jira/browse/TIKA-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452973#comment-13452973 ] Daniel Bonniot de Ruisselet commented on TIKA-946: -- Is it also in scope for this task that the output should represent the structure of the slides (one element per slide)? > Improve how the PPTX parser uses XLSF from POI > -- > > Key: TIKA-946 > URL: https://issues.apache.org/jira/browse/TIKA-946 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.2 >Reporter: Nick Burch > > One last bit from TIKA-757 and TIKA-805 - the current way that PPTX files are > parsed using XSLF from Apache POI has a couple of last remaining low level > parts. > We should avoid the need to go from the usermodel XMLSlideShow to the low > level XSLFSlideShow to do the text extraction (occurs in > XSLFPowerPointExtractorDecorator). > We should also update the usermodel slide support to extract out the slide > names from docProps/app.xml, so that these can be included in the text output > easily (in XSLFPowerPointExtractor)