Re: Appending Mime Types

2011-08-23 Thread Tom Grant
The Parser conflict Jira issue I was referring to is TIKA-527 (Allow override mapping mime<-->parsers through config). We would need something similar for mime types. The updates to MimeTypesFactory might address TIKA-87 (MimeTypes should allow modif

Re: Appending Mime Types

2011-08-23 Thread Tom Grant
Nick, I'm happy to do the work and contribute it as a patch. I guess I'm just looking for advice on the approach to ensure that what I provide does actually get incorporated. My particular use case is solved by adding the update methods to MimeTypesFactory (See last message), but I'm in a scenari

[jira] [Created] (TIKA-697) Tika reports the content type of AR archives as "text/plain"

2011-08-23 Thread PNS (JIRA)
Tika reports the content type of AR archives as "text/plain" Key: TIKA-697 URL: https://issues.apache.org/jira/browse/TIKA-697 Project: Tika Issue Type: Bug Environment: L

[jira] [Commented] (TIKA-648) Parsing HTML anchors with embedded div faulty

2011-08-23 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089506#comment-13089506 ] Ken Krugler commented on TIKA-648: I think this should be closed, and an improvement reque

[jira] [Commented] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089501#comment-13089501 ] Nick Burch commented on TIKA-696: That should be fairly easy to add for .docx, for .doc may

[jira] [Commented] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089498#comment-13089498 ] Julien Nioche commented on TIKA-696: The text of the watermark can be found towards the

[jira] [Updated] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-696: Attachment: Demo+with+watermark.docx .docx version generated with MS Office Can't see the watermark wi

[jira] [Commented] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089480#comment-13089480 ] Julien Nioche commented on TIKA-696: Can't see the watermark when saving and reopening t

[jira] [Updated] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-696: Attachment: Demo with watermark.doc Attached doc file containing a watermark > Extract watermarks from

[jira] [Created] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Julien Nioche (JIRA)
Extract watermarks from Word documents -- Key: TIKA-696 URL: https://issues.apache.org/jira/browse/TIKA-696 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 0.9

Re: Appending Mime Types

2011-08-23 Thread Antoni Mylka
On 2011-08-22 20:37, Tom Grant wrote: Here's the use case that I'm attempting to solve. I have a customer with many legacy systems, some of which are completely custom. These systems have data files that will never be seen outside of their environment. For example, some are XML files with

buildbot success in ASF Buildbot on tika-trunk

2011-08-23 Thread buildbot
The Buildbot has detected a restored build on builder tika-trunk while building ASF Buildbot. Full details are available at: http://ci.apache.org/builders/tika-trunk/builds/456 Buildbot URL: http://ci.apache.org/ Buildslave for this Build: isis_ubuntu Build Reason: scheduler Build Source Stamp

[jira] [Commented] (TIKA-648) Parsing HTML anchors with embedded div faulty

2011-08-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089410#comment-13089410 ] Markus Jelsma commented on TIKA-648: Thanks. I assume this is not something that needs t

[jira] [Commented] (TIKA-676) Boilerpipe fails

2011-08-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089408#comment-13089408 ] Markus Jelsma commented on TIKA-676: Makes sense, thanks! > Boilerpipe fails >

[jira] [Commented] (TIKA-676) Boilerpipe fails

2011-08-23 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089401#comment-13089401 ] Jukka Zitting commented on TIKA-676: See [1] for why can't/shouldn't depend on external

Re: Appending Mime Types

2011-08-23 Thread Nick Burch
On Mon, 22 Aug 2011, Tom Grant wrote: Here's the use case that I'm attempting to solve. I have a customer with many legacy systems, some of which are completely custom. These systems have data files that will never be seen outside of their environment. For example, some are XML files with th

[jira] [Commented] (TIKA-694) On extraction, get properties AND / OR content extraction

2011-08-23 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089389#comment-13089389 ] Nick Burch commented on TIKA-694: For some parsers it may be possible to skip some parts if

[jira] [Commented] (TIKA-676) Boilerpipe fails

2011-08-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089385#comment-13089385 ] Markus Jelsma commented on TIKA-676: I've asked Christian to push it to central but he a

buildbot failure in ASF Buildbot on tika-trunk

2011-08-23 Thread buildbot
The Buildbot has detected a new failure on builder tika-trunk while building ASF Buildbot. Full details are available at: http://ci.apache.org/builders/tika-trunk/builds/455 Buildbot URL: http://ci.apache.org/ Buildslave for this Build: isis_ubuntu Build Reason: scheduler Build Source Stamp: [

[jira] [Commented] (TIKA-434) Bug in TagSoup causes IOException

2011-08-23 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089369#comment-13089369 ] Jukka Zitting commented on TIKA-434: TagSoup 1.2.1 is finally available, so in revision

[jira] [Updated] (TIKA-695) Custom properties on xlsx, docx, pptx

2011-08-23 Thread Etienne Jouvin (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Jouvin updated TIKA-695: Description: Parser on office Xfiles do not get custom properties. In class MetadataExtractor, metho

[jira] [Created] (TIKA-695) Custom properties on xlsx, docx, pptx

2011-08-23 Thread Etienne Jouvin (JIRA)
Custom properties on xlsx, docx, pptx - Key: TIKA-695 URL: https://issues.apache.org/jira/browse/TIKA-695 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.0 Envir

[jira] [Created] (TIKA-694) On extraction, get properties AND / OR content extraction

2011-08-23 Thread Etienne Jouvin (JIRA)
On extraction, get properties AND / OR content extraction - Key: TIKA-694 URL: https://issues.apache.org/jira/browse/TIKA-694 Project: Tika Issue Type: Wish Components: parser