[jira] [Comment Edited] (TIKA-1854) Include the storage class ID of documents embedded in MS Office documents

2016-02-05 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134180#comment-15134180
 ] 

Daniel Bonniot de Ruisselet edited comment on TIKA-1854 at 2/5/16 1:47 PM:
---

The documents I'm processing sometimes have embedded files representing 
chemical structures. Having the storage class IDs allows me to know which 
embedded documents contain chemical structures, and in which format they are, 
so that I can process them accordingly.

What I mean about the content type is that metadata.get(Metadata.CONTENT_TYPE) 
already returns, for instance, "application/vnd.ms-excel" for embedded Excel 
documents. However, it is not populated for chemical or other formats.

I might be mistaken, but it seems to me that the documentation you linked to is 
about the mime type of the main (container) document. Is the same mechanism 
used to determine the mime type of the embedded documents?

I think the specific formats I'm interested in are not in widespread use, so 
for contributions to Tika I'm rather focused on a generic solution. Getting the 
storage class IDs will definitely be useful in such cases. If custom mime types 
worked for embedded documents that could also be useful.
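
A minimal sketch of the client-side use case described above: once the storage class ID of an embedded document is available as a string, a plain lookup table can map it to a format label. Everything here is an assumption for illustration: the class name ClassIdFormats, the class ID strings (placeholders, not real CLSIDs) and the format labels (made-up MIME types). How the ID is read from the Tika metadata depends on the key name chosen by the patch and is not shown.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: maps OLE storage class IDs to format labels. */
class ClassIdFormats {
    private final Map<String, String> formatsByClassId = new HashMap<>();

    /** Registers a class ID; surrounding braces and letter case are ignored. */
    void register(String classId, String format) {
        formatsByClassId.put(normalize(classId), format);
    }

    /** Returns the format registered for this class ID, if any. */
    Optional<String> formatFor(String classId) {
        if (classId == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(formatsByClassId.get(normalize(classId)));
    }

    // Class IDs are GUIDs, often printed as {XXXXXXXX-XXXX-...}; compare them
    // without surrounding braces and in upper case.
    private static String normalize(String classId) {
        String s = classId.trim();
        if (s.startsWith("{") && s.endsWith("}")) {
            s = s.substring(1, s.length() - 1);
        }
        return s.toUpperCase(Locale.ROOT);
    }
}
```

A client would register its own IDs once, for example register("{00000000-0000-0000-0000-00000000000a}", "chemical/x-example-a") with a placeholder ID, and then call formatFor(...) on each embedded document to decide how to process it.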





> Include the storage class ID of documents embedded in MS Office documents
> -
>
> Key: TIKA-1854
> URL: https://issues.apache.org/jira/browse/TIKA-1854
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Daniel Bonniot de Ruisselet
>Assignee: Tim Allison
> Attachments: class-id.patch
>
>
> When processing embedded documents using an EmbeddedDocumentExtractor, the 
> storage class ID of the embedded document would be useful metadata to have, 
> but it's currently missing.
> I'll promptly attach a patch implementing and testing this new feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1854) Include the storage class ID of documents embedded in MS Office documents

2016-02-05 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134180#comment-15134180
 ] 

Daniel Bonniot de Ruisselet commented on TIKA-1854:
---

The documents I'm processing sometimes have embedded files representing 
chemical structures. Having the storage class IDs allows me to know which 
embedded documents contain chemical structures, and in which format they are, 
so that I can process them accordingly.

What I mean about the content type is that metadata.get(Metadata.CONTENT_TYPE) 
already returns, for instance, "application/vnd.ms-excel" for embedded Excel 
documents. However, it is not populated for chemical or other formats.

I might be mistaken, but it seems to me that the documentation you linked to is 
about the mime type of the main (container) document. Is the same mechanism 
used to determine the mime type of the embedded documents?

I think the specific formats I'm interested in are not in widespread use, so 
for contributions to Tika I'm rather focused on a generic solution. Getting the 
storage class IDs will definitely be useful in such cases. If custom mime types 
worked for embedded documents that could also be useful.




[jira] [Updated] (TIKA-1854) Include the storage class ID of documents embedded in MS Office documents

2016-02-05 Thread Daniel Bonniot de Ruisselet (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Bonniot de Ruisselet updated TIKA-1854:
--
Attachment: class-id.patch

> Include the storage class ID of documents embedded in MS Office documents
> -
>
> Key: TIKA-1854
> URL: https://issues.apache.org/jira/browse/TIKA-1854
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Daniel Bonniot de Ruisselet
> Attachments: class-id.patch
>
>
> When processing embedded documents using an EmbeddedDocumentExtractor, the 
> storage class ID of the embedded document would be useful metadata to have, 
> but it's currently missing.
> I'll promptly attach a patch implementing and testing this new feature.





[jira] [Commented] (TIKA-1854) Include the storage class ID of documents embedded in MS Office documents

2016-02-05 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133916#comment-15133916
 ] 

Daniel Bonniot de Ruisselet commented on TIKA-1854:
---

By the way, the Content-Type of the embedded document IS already available, but 
this only works for some popular formats (e.g. embedded MS Office documents). 
Is there a way for clients to configure the Content-Type detection for more 
exotic formats?




[jira] [Created] (TIKA-1854) Include the storage class ID of documents embedded in MS Office documents

2016-02-05 Thread Daniel Bonniot de Ruisselet (JIRA)
Daniel Bonniot de Ruisselet created TIKA-1854:
-

 Summary: Include the storage class ID of documents embedded in MS 
Office documents
 Key: TIKA-1854
 URL: https://issues.apache.org/jira/browse/TIKA-1854
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Daniel Bonniot de Ruisselet


When processing embedded documents using an EmbeddedDocumentExtractor, the 
storage class ID of the embedded document would be useful metadata to have, 
but it's currently missing.

I'll promptly attach a patch implementing and testing this new feature.






[jira] [Comment Edited] (TIKA-1017) DefaultHtmlMapper misses some safe elements

2015-03-04 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346702#comment-14346702
 ] 

Daniel Bonniot de Ruisselet edited comment on TIKA-1017 at 3/4/15 10:36 AM:


If we want to preserve the semantics, maybe at least SUB and SUP should be 
added? For instance in a scientific document, "a<sub>b</sub>" and 
"a<sup>b</sup>" might be different concepts, a distinction that is lost if you 
only get "a b".

If we want to keep all "safe" elements, we could also add at least I, B, EM and 
STRONG.

It's easy enough to use a custom mapper, so this is not a huge issue, but a 
good default is always nice. Given the above, maybe only add SUB and SUP?
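
The custom-mapper route mentioned above can be sketched as a thin wrapper that keeps an existing element mapping and additionally lets SUB and SUP through. The class below is a standalone stand-in and does not implement Tika's mapper interface: the name SubSupElementMapping and the Function-based base mapping are assumptions for illustration, with null meaning "not a safe element".

```java
import java.util.Locale;
import java.util.function.Function;

/** Hypothetical stand-in for a custom HTML element mapper. */
class SubSupElementMapping {
    private final Function<String, String> base;

    /** @param base existing mapping; returns null for elements it drops */
    SubSupElementMapping(Function<String, String> base) {
        this.base = base;
    }

    /** Returns the output element name, or null if the element is dropped. */
    String mapSafeElement(String name) {
        String mapped = base.apply(name);
        if (mapped != null) {
            return mapped; // the base mapping already keeps this element
        }
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.equals("sub") || lower.equals("sup")) {
            return lower; // keep subscript/superscript semantics
        }
        return null;
    }
}
```

Delegating to the base mapping first means the wrapper only ever adds elements and never changes how already-safe elements are mapped.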





> DefaultHtmlMapper misses some safe elements
> ---
>
> Key: TIKA-1017
> URL: https://issues.apache.org/jira/browse/TIKA-1017
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Daniel Bonniot de Ruisselet
>
> The code of DefaultHtmlMapper says that the list of "safe" elements is based 
> on http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
> Elements like  and  are not included in the safe list. Is this 
> intentional (a comment with the rationale would be useful) or should they be 
> added?





[jira] [Commented] (TIKA-1017) DefaultHtmlMapper misses some safe elements

2015-03-04 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346702#comment-14346702
 ] 

Daniel Bonniot de Ruisselet commented on TIKA-1017:
---

If we want to preserve the semantics, maybe at least SUB and SUP should be 
added? For instance in a scientific document, "a<sub>b</sub>" and 
"a<sup>b</sup>" might be different concepts, a distinction that is lost if you 
only get "a b".

If we want to keep all "safe" elements, we could also add at least I, B, EM and 
STRONG.

It's easy enough to use another mapper, so this is not a huge issue, but a good 
default is always nice. Given the above, maybe only add SUB and SUP?




[jira] [Comment Edited] (TIKA-1167) Embedded object not extracted

2013-08-28 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752209#comment-13752209
 ] 

Daniel Bonniot de Ruisselet edited comment on TIKA-1167 at 8/28/13 11:47 AM:
-

After further analysis, I think support for such cases probably needs to be 
done in POI (but comments are welcome if someone has further insight). I posted 
comments and a tentative patch to this POI bug: 
https://issues.apache.org/bugzilla/show_bug.cgi?id=51891

Even if that works out well, it would probably be useful to add a test at the 
Tika level as well. The OLE parsing seems rather sensitive (with good reason: 
the format itself looks messy and poorly documented). Also, the integration of 
POI and Tika seems tight, so it can only help to test that things work at 
different levels.

  
> Embedded object not extracted
> -
>
> Key: TIKA-1167
> URL: https://issues.apache.org/jira/browse/TIKA-1167
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.4
>Reporter: Daniel Bonniot de Ruisselet
>Priority: Critical
> Fix For: 1.5
>
> Attachments: Doc w Structure that wont extract.docx
>
>
> For the attached docx, tika seems to detect the embedded object, as shown by 
> this tag:
> {{}}
> However, extraction itself (using -z on the command line, or using the API) 
> does not seem to work for this object:
> {{java -jar tika-app-1.4.jar -z Doc\ w\ Structure\ that\ wont\ extract.docx}}
> {{Extracting 'rId9_image1.wmf' (application/x-msmetafile) to 
> /tmp/tika/rId9_image1.wmf}}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1167) Embedded object not extracted

2013-08-28 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752209#comment-13752209
 ] 

Daniel Bonniot de Ruisselet commented on TIKA-1167:
---

After further analysis, I think support for such cases probably needs to be 
done in POI (but comments are welcome if someone has further insight). I'm 
working on submitting an issue and probably a tentative patch there. I will 
link to it here when it exists.

Even if that works out well, it would probably be useful to add a test at the 
Tika level as well. The OLE parsing seems rather sensitive (with good reason: 
the format itself looks messy and poorly documented). Also, the integration of 
POI and Tika seems tight, so it can only help to test that things work at 
different levels.



[jira] [Updated] (TIKA-1167) Embedded object not extracted

2013-08-27 Thread Daniel Bonniot de Ruisselet (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Bonniot de Ruisselet updated TIKA-1167:
--

Attachment: Doc w Structure that wont extract.docx



[jira] [Created] (TIKA-1167) Embedded object not extracted

2013-08-27 Thread Daniel Bonniot de Ruisselet (JIRA)
Daniel Bonniot de Ruisselet created TIKA-1167:
-

 Summary: Embedded object not extracted
 Key: TIKA-1167
 URL: https://issues.apache.org/jira/browse/TIKA-1167
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Daniel Bonniot de Ruisselet
Priority: Critical
 Fix For: 1.5
 Attachments: Doc w Structure that wont extract.docx

For the attached docx, tika seems to detect the embedded object, as shown by 
this tag:

{{}}

However, extraction itself (using -z on the command line, or using the API) 
does not seem to work for this object:

{{java -jar tika-app-1.4.jar -z Doc\ w\ Structure\ that\ wont\ extract.docx}}
{{Extracting 'rId9_image1.wmf' (application/x-msmetafile) to 
/tmp/tika/rId9_image1.wmf}}




[jira] [Comment Edited] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)

2013-06-27 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694659#comment-13694659
 ] 

Daniel Bonniot de Ruisselet edited comment on TIKA-1109 at 6/27/13 12:27 PM:
-

I tried it. It broke two tests (same cause): as you mentioned, in Excel the 
metadata for TikaMetadataKeys.PROTECTED is populated during parsing. I made a 
change in how that is implemented, and:

{{[INFO]}}
{{[INFO] Building Apache Tika 1.5-SNAPSHOT}}
{{[INFO]}}
{{[INFO]}}
{{[INFO] --- maven-remote-resources-plugin:1.2.1:process (default) @ tika ---}}
{{[INFO]}}
{{[INFO] Reactor Summary:}}
{{[INFO]}}
{{[INFO] Apache Tika parent  SUCCESS [0.806s]}}
{{[INFO] Apache Tika core .. SUCCESS [8.418s]}}
{{[INFO] Apache Tika parsers ... SUCCESS [26.857s]}}
{{[INFO] Apache Tika XMP ... SUCCESS [0.789s]}}
{{[INFO] Apache Tika application ... SUCCESS [3.336s]}}
{{[INFO] Apache Tika OSGi bundle ... SUCCESS [1.204s]}}
{{[INFO] Apache Tika server  SUCCESS [5.312s]}}
{{[INFO] Apache Tika ... SUCCESS [0.014s]}}
{{[INFO]}}
{{[INFO] BUILD SUCCESS}}
{{[INFO]}}
{{[INFO] Total time: 47.498s}}
{{[INFO] Finished at: Thu Jun 27 14:10:50 CEST 2013}}
{{[INFO] Final Memory: 27M/1930M}}
{{[INFO]}}

{{dbonniot@naming:~/world/tika$ svn diff | diffstat}}
{{main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java | 11 -}}
{{main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java | 36 ++}}
{{test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java | 56 ++}}
{{3 files changed, 74 insertions(+), 29 deletions(-)}}

{{dbonniot@naming:~/world/tika$ svn diff > /tmp/TIKA-1109.patch}}


The logic in OOXMLExtractorFactory is now simpler, since I could remove the 
extra shielding, which I suppose was made necessary by the previous ordering.

And the metadata for OOXML formats is now available at parse time, as tested by 
the added test to OOXMLParserTest :)



[jira] [Updated] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)

2013-06-27 Thread Daniel Bonniot de Ruisselet (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Bonniot de Ruisselet updated TIKA-1109:
--

Attachment: TIKA-1109.patch

> Metadata not extracted before the content in OOXML (pptx)
> -
>
> Key: TIKA-1109
> URL: https://issues.apache.org/jira/browse/TIKA-1109
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Daniel Bonniot de Ruisselet
>Priority: Critical
> Fix For: 1.5
>
> Attachments: TIKA-1109.patch
>
>
> It seems that when processing OOXML documents, the metadata is only read 
> after the text. This means it's impossible to use the metadata while 
> processing the text. I think it would be more useful to have the metadata 
> populated first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only as metadata:
> 
>  content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> 
> while there is more metadata in the file (e.g. Attachment 
> Test).



[jira] [Commented] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)

2013-06-27 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694659#comment-13694659
 ] 

Daniel Bonniot de Ruisselet commented on TIKA-1109:
---

I tried it. It broke two tests (same cause): as you mentioned, in Excel the 
metadata for TikaMetadataKeys.PROTECTED is populated during parsing. I made a 
change in how that is implemented, and:

{{[INFO]}}
{{[INFO] Building Apache Tika 1.5-SNAPSHOT}}
{{[INFO]}}
{{[INFO]}}
{{[INFO] --- maven-remote-resources-plugin:1.2.1:process (default) @ tika ---}}
{{[INFO]}}
{{[INFO] Reactor Summary:}}
{{[INFO]}}
{{[INFO] Apache Tika parent  SUCCESS [0.806s]}}
{{[INFO] Apache Tika core .. SUCCESS [8.418s]}}
{{[INFO] Apache Tika parsers ... SUCCESS [26.857s]}}
{{[INFO] Apache Tika XMP ... SUCCESS [0.789s]}}
{{[INFO] Apache Tika application ... SUCCESS [3.336s]}}
{{[INFO] Apache Tika OSGi bundle ... SUCCESS [1.204s]}}
{{[INFO] Apache Tika server  SUCCESS [5.312s]}}
{{[INFO] Apache Tika ... SUCCESS [0.014s]}}
{{[INFO]}}
{{[INFO] BUILD SUCCESS}}
{{[INFO]}}
{{[INFO] Total time: 47.498s}}
{{[INFO] Finished at: Thu Jun 27 14:10:50 CEST 2013}}
{{[INFO] Final Memory: 27M/1930M}}
{{[INFO]}}

{{dbonniot@naming:~/world/tika$ svn diff | diffstat}}
{{main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java | 11 -}}
{{main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java | 36 ++}}
{{test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java | 56 ++}}
{{3 files changed, 74 insertions(+), 29 deletions(-)}}

{{dbonniot@naming:~/world/tika$ svn diff > /tmp/TIKA-1109.patch}}


The logic in OOXMLExtractorFactory is now simpler, since I could remove the 
extra shielding, which I suppose was made necessary by the previous ordering.

And the metadata for OOXML formats is now available at parse time, as tested by 
the added test to OOXMLParserTest :)




[jira] [Updated] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)

2013-06-27 Thread Daniel Bonniot de Ruisselet (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Bonniot de Ruisselet updated TIKA-1109:
--

Summary: Metadata not extracted before the content in OOXML (pptx)  (was: 
Metadata not extracted before the context in OOXML (pptx))



[jira] [Commented] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)

2013-06-26 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694521#comment-13694521
 ] 

Daniel Bonniot de Ruisselet commented on TIKA-1109:
---

Nick, thanks a lot for your explanation. If I understand correctly, what you 
are saying is that in general it cannot be guaranteed that the metadata is 
available during parsing, since whether that is possible depends on the 
format. That makes complete sense.

Here I am asking specifically about the OOXML formats, with an example pptx 
file. As I understand it, the OOXML formats are zip files containing XML files. 
In test-classes/test-documents/testPPT.pptx, the metadata seems to be inside 
docProps/core.xml. Would it be possible to read the metadata first from there, 
before starting the parsing?
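
To illustrate the idea of reading the metadata up front, here is a minimal sketch that pulls docProps/core.xml out of the zip container using only the JDK. It is not how Tika or POI read these properties; the class and method names (CorePropertiesPeek, readCoreProperties) are hypothetical, and the sketch assumes the properties are direct child elements of the root element of that file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads docProps/core.xml from an OOXML package. */
class CorePropertiesPeek {

    /** Returns property local names mapped to their text, in document order. */
    static Map<String, String> readCoreProperties(InputStream ooxml) throws Exception {
        Map<String, String> props = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"docProps/core.xml".equals(entry.getName())) {
                    continue;
                }
                // Copy the entry first so the XML parser cannot close the zip stream.
                byte[] xml = zip.readAllBytes(); // requires Java 9 or later
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
                Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
                NodeList children = doc.getDocumentElement().getChildNodes();
                for (int i = 0; i < children.getLength(); i++) {
                    Node child = children.item(i);
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        props.put(child.getLocalName(), child.getTextContent().trim());
                    }
                }
                break; // nothing else is needed from the package
            }
        }
        return props;
    }
}
```

Because the zip is read as a stream, this consumes the input; a caller that also wants to parse the text afterwards would need a re-readable source such as a file or a byte array.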


> Metadata not extracted before the context in OOXML (pptx)
> -
>
> Key: TIKA-1109
> URL: https://issues.apache.org/jira/browse/TIKA-1109
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Daniel Bonniot de Ruisselet
>Priority: Critical
> Fix For: 1.5
>
>
> It seems that when processing OOXML documents, the metadata is only read 
> after the text. This means it's impossible to use the metadata while 
> processing the text. I think it would be more useful to have the metadata 
> populated first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only as metadata:
> 
>  content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> 
> while there is more metadata in the file (e.g. Attachment 
> Test).



[jira] [Comment Edited] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)

2013-04-18 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635096#comment-13635096
 ] 

Daniel Bonniot de Ruisselet edited comment on TIKA-1109 at 4/18/13 11:24 AM:
-

Nick, thanks a lot for your answer. I do use the API, and I see the same 
behaviour: when my ContentHandler is called on the text data, the metadata is 
not set (yet). It is only set when endDocument is called. Am I doing something 
wrong?

Looking at:
http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java?revision=1339390&view=markup

and specifically:

    // We need to get the content first, but not end
    //  the document just yet
    EndDocumentShieldingContentHandler handler =
        new EndDocumentShieldingContentHandler(baseHandler);
    extractor.getXHTML(handler, metadata, context);

    // Now we can get the metadata
    extractor.getMetadataExtractor().extract(metadata);

It seems to me that metadata is read only after the text, which explains the 
behaviour. Why is this needed? Am I misunderstanding something?
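
The symptom can be made concrete with a small SAX handler that records whether a given metadata key was already set when the first text arrived. This is a sketch under assumptions: a plain Map stands in for Tika's Metadata object, and the class name MetadataTimingProbe and any key passed to it are hypothetical.

```java
import java.util.Map;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical probe: was a metadata key set before the first text event? */
class MetadataTimingProbe extends DefaultHandler {
    private final Map<String, String> metadata; // stand-in for Tika's Metadata
    private final String key;
    private Boolean presentAtFirstText; // null until text has been seen

    MetadataTimingProbe(Map<String, String> metadata, String key) {
        this.metadata = metadata;
        this.key = key;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the first non-empty text event is recorded.
        if (presentAtFirstText == null && length > 0) {
            presentAtFirstText = metadata.containsKey(key);
        }
    }

    /** True only if text was seen and the key was already set at that point. */
    boolean metadataWasReadyForText() {
        return Boolean.TRUE.equals(presentAtFirstText);
    }
}
```

With content extracted before metadata, as in the snippet quoted above, a probe like this would be expected to report false; with the order reversed it would report true.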


  
> Metadata not extracted before the context in OOXML (pptx)
> -
>
> Key: TIKA-1109
> URL: https://issues.apache.org/jira/browse/TIKA-1109
> Project: Tika
>  Issue Type: Bug
>Reporter: Daniel Bonniot de Ruisselet
>Priority: Critical
> Fix For: 1.4
>
>
> It seems that when processing OOXML documents, the metadata is only read 
> after the text. This means it's impossible to use the metadata while 
> processing the text. I think it would be more useful to have the metadata 
> populated first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only as metadata:
> 
>  content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> 
> while there is more metadata in the file (e.g. Attachment 
> Test).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)

2013-04-18 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635096#comment-13635096
 ] 

Daniel Bonniot de Ruisselet commented on TIKA-1109:
---

Nick, thanks a lot for your answer. I do use the API, and I see the same 
behaviour: when my ContentHandler is called on the text data, the metadata is 
not set (yet). It is only set when endDocument is called. Am I doing something 
wrong?

Looking at:
http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java?revision=1339390&view=markup

and specifically:


> Metadata not extracted before the context in OOXML (pptx)
> -
>
> Key: TIKA-1109
> URL: https://issues.apache.org/jira/browse/TIKA-1109
> Project: Tika
> Issue Type: Bug
> Reporter: Daniel Bonniot de Ruisselet
> Priority: Critical
> Fix For: 1.4
>
>
> It seems that when processing OOXML documents, the metadata is only read 
> after the text. This means it's impossible to use the metadata while processing 
> the text. I think it would be more useful to have the metadata populated 
> first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only as metadata:
> 
> <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> 
> while there is more metadata in the file (e.g. Attachment 
> Test).



[jira] [Updated] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)

2013-04-18 Thread Daniel Bonniot de Ruisselet (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Bonniot de Ruisselet updated TIKA-1109:
--

Description: 
It seems that when processing OOXML documents, the metadata is only read after 
the text. This means it's impossible to use the metadata while processing the 
text. I think it would be more useful to have the metadata populated first.

As a symptom:

java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx

outputs only as metadata:

<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>

while there is more metadata in the file (e.g. Attachment 
Test).


  was:
It seems that when processing OOXML documents, the metadata is only read after 
the text. This means it's impossible to use the metadata while processing the 
text. I think it would be more useful to have the metadata populated first.

As a symptom:

java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx

outputs only as metadata:



> Metadata not extracted before the context in OOXML (pptx)
> -
>
> Key: TIKA-1109
> URL: https://issues.apache.org/jira/browse/TIKA-1109
> Project: Tika
> Issue Type: Bug
> Reporter: Daniel Bonniot de Ruisselet
> Priority: Critical
> Fix For: 1.4
>
>
> It seems that when processing OOXML documents, the metadata is only read 
> after the text. This means it's impossible to use the metadata while processing 
> the text. I think it would be more useful to have the metadata populated 
> first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only as metadata:
> 
> <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> 
> while there is more metadata in the file (e.g. Attachment 
> Test).



[jira] [Created] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)

2013-04-18 Thread Daniel Bonniot de Ruisselet (JIRA)
Daniel Bonniot de Ruisselet created TIKA-1109:
-

 Summary: Metadata not extracted before the context in OOXML (pptx)
 Key: TIKA-1109
 URL: https://issues.apache.org/jira/browse/TIKA-1109
 Project: Tika
  Issue Type: Bug
Reporter: Daniel Bonniot de Ruisselet
Priority: Critical
 Fix For: 1.4


It seems that when processing OOXML documents, the metadata is only read after 
the text. This means it's impossible to use the metadata while processing the 
text. I think it would be more useful to have the metadata populated first.
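
Until the order changes, one possible workaround (a sketch, not something taken 
from Tika) is to buffer the text in the handler and do the metadata-dependent 
work in endDocument, by which point the metadata has been set. The sketch uses 
only the JDK's SAX classes. The Map is a stand-in for the parser's metadata 
object, and the "title" key, DeferredTextHandler and DeferredDemo are 
assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Buffers text and defers metadata-dependent work until endDocument. */
class DeferredTextHandler extends DefaultHandler {
    // Assumption: a plain Map stands in for the parser's metadata object.
    private final Map<String, String> metadata;
    private final StringBuilder text = new StringBuilder();
    private String result;

    DeferredTextHandler(Map<String, String> metadata) {
        this.metadata = metadata;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Do not look at the metadata yet; it may still be empty here.
        text.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        // By now the metadata has been populated, so it is safe to combine both.
        String title = metadata.getOrDefault("title", "(untitled)");
        result = title + ": " + text;
    }

    String getResult() {
        return result;
    }
}

class DeferredDemo {
    static String run() throws SAXException {
        Map<String, String> metadata = new HashMap<>();
        DeferredTextHandler handler = new DeferredTextHandler(metadata);
        char[] body = "Hello".toCharArray();

        handler.startDocument();
        handler.characters(body, 0, body.length);  // text arrives first
        metadata.put("title", "Attachment Test");  // metadata is set afterwards
        handler.endDocument();                     // handler can now use both
        return handler.getResult();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(run());
    }
}
```

The cost is that the whole text is held in memory, which is the reason I would 
prefer to have the metadata populated first.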

As a symptom:

java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx

outputs only as metadata:




[jira] [Created] (TIKA-1108) Represent individual slides in pptx

2013-04-17 Thread Daniel Bonniot de Ruisselet (JIRA)
Daniel Bonniot de Ruisselet created TIKA-1108:
-

 Summary: Represent individual slides in pptx
 Key: TIKA-1108
 URL: https://issues.apache.org/jira/browse/TIKA-1108
 Project: Tika
  Issue Type: Improvement
Reporter: Daniel Bonniot de Ruisselet
 Fix For: 1.4


When parsing ppt, tika produces for each slide:


However for pptx these seem to be missing, all the text is directly under 
.



[jira] [Created] (TIKA-1017) DefaultHtmlMapper misses some safe elements

2012-11-06 Thread Daniel Bonniot de Ruisselet (JIRA)
Daniel Bonniot de Ruisselet created TIKA-1017:
-

 Summary: DefaultHtmlMapper misses some safe elements
 Key: TIKA-1017
 URL: https://issues.apache.org/jira/browse/TIKA-1017
 Project: Tika
  Issue Type: Bug
Reporter: Daniel Bonniot de Ruisselet


The code of DefaultHtmlMapper says that the list of "safe" elements is based on 
http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd

Elements like  and  are not included in the safe list. Is this 
intentional (a comment with the rationale would be useful) or should they be 
added?



[jira] [Commented] (TIKA-820) Locator is unset for HTML parser

2012-09-21 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460573#comment-13460573
 ] 

Daniel Bonniot de Ruisselet commented on TIKA-820:
--

Hi Ken - Thanks for looking at the patch. I have no idea whether this is the 
only missing delegating call; it just seemed wrong to me not to do it in 
TextContentHandler.

> Locator is unset for HTML parser
> 
>
> Key: TIKA-820
> URL: https://issues.apache.org/jira/browse/TIKA-820
> Project: Tika
> Issue Type: Bug
> Components: general, parser
> Affects Versions: 1.0
> Reporter: Daniel Bonniot de Ruisselet
> Assignee: Ken Krugler
> Labels: patch
> Fix For: 1.3
>
> Attachments: text-locator.patch
>
>
> The HtmlParser does not call setDocumentLocator(Locator locator) on the 
> user's content handler.
> Patch and unit test attached.
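
For readers without the patch at hand, here is a minimal sketch of what a 
missing delegating call means, using only the JDK's SAX classes. 
ForwardingHandler, LineReportingHandler and LocatorDemo are hypothetical names 
and this is not the attached patch; it only assumes a wrapper that sits between 
the parser and the user's handler and must pass setDocumentLocator along.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.LocatorImpl;

/** Hypothetical wrapper; the point is only that setDocumentLocator is forwarded. */
class ForwardingHandler extends DefaultHandler {
    private final ContentHandler delegate;

    ForwardingHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void setDocumentLocator(Locator locator) {
        // Without this call the user's handler never receives a Locator.
        delegate.setDocumentLocator(locator);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length);
    }
}

/** User handler that wants to know the line number of each text event. */
class LineReportingHandler extends DefaultHandler {
    private Locator locator;
    int lastLine = -1;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // -1 signals that no Locator was ever handed to this handler.
        lastLine = (locator == null) ? -1 : locator.getLineNumber();
    }
}

class LocatorDemo {
    static int run() throws SAXException {
        LineReportingHandler user = new LineReportingHandler();
        ForwardingHandler wrapper = new ForwardingHandler(user);

        LocatorImpl locator = new LocatorImpl();
        locator.setLineNumber(7);
        wrapper.setDocumentLocator(locator);

        char[] text = "x".toCharArray();
        wrapper.characters(text, 0, text.length);
        return user.lastLine;
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(run());
    }
}
```

If the setDocumentLocator override is removed from ForwardingHandler, the 
inherited no-op swallows the call and the user's handler reports -1.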



[jira] [Commented] (TIKA-946) Improve how the PPTX parser uses XLSF from POI

2012-09-11 Thread Daniel Bonniot de Ruisselet (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452973#comment-13452973
 ] 

Daniel Bonniot de Ruisselet commented on TIKA-946:
--

Would it also be in scope for this task to have the output represent the 
structure of the slides (one  element per slide)?

> Improve how the PPTX parser uses XLSF from POI
> --
>
> Key: TIKA-946
> URL: https://issues.apache.org/jira/browse/TIKA-946
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.2
> Reporter: Nick Burch
>
> One last bit from TIKA-757 and TIKA-805 - the current way that PPTX files are 
> parsed using XSLF from Apache POI has a couple of last remaining low level 
> parts.
> We should avoid the need to go from the usermodel XMLSlideShow to the low 
> level XSLFSlideShow to do the text extraction (occurs in 
> XSLFPowerPointExtractorDecorator).
> We should also update the usermodel slide support to extract out the slide 
> names from docProps/app.xml, so that these can be included in the text output 
> easily (in XSLFPowerPointExtractor)
