[jira] [Commented] (TIKA-853) java.io.IOException with TikaGUI and testMP4.m4a

2012-01-31 Thread John Mastarone (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197564#comment-13197564 ] John Mastarone commented on TIKA-853: - I tried debugging but I couldn't see what was hol

[jira] [Issue Comment Edited] (TIKA-842) IPTC Properties Should be Defined Completely and Independently of the Drew Library

2012-01-31 Thread Ray Gauss II (Issue Comment Edited) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197361#comment-13197361 ] Ray Gauss II edited comment on TIKA-842 at 1/31/12 11:06 PM: - If

[jira] [Commented] (TIKA-842) IPTC Properties Should be Defined Completely and Independently of the Drew Library

2012-01-31 Thread Ray Gauss II (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197361#comment-13197361 ] Ray Gauss II commented on TIKA-842: --- If we're going to have Metadata implement all metadat

Re: % of different content types out there on the web

2012-01-31 Thread Markus Jelsma
On Tuesday 31 January 2012 15:55:06 Mattmann, Chris A (388J) wrote: > Hi Markus, > > Thanks for the FYI. Any idea at specific %'s for those unwanted suffixes > compared to the size of the entire corpus? Unfortunately no, we don't keep record of those, just filter them away as soon as we can.

Re: % of different content types out there on the web

2012-01-31 Thread Mattmann, Chris A (388J)
Hi Markus, Thanks for the FYI. Any idea at specific %'s for those unwanted suffixes compared to the size of the entire corpus? Cheers, Chris On Jan 31, 2012, at 4:39 AM, Markus Jelsma wrote: > We only crawl HTML and PDF files for a lot of cc-TLD's so we only have data > on > those two. Howev

[jira] [Resolved] (TIKA-854) No text extraction for Word macroenabled template

2012-01-31 Thread Maxim Valyanskiy (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Valyanskiy resolved TIKA-854. --- Resolution: Fixed Fix Version/s: 1.1 > No text extraction for Word macroenabled temp

[jira] [Commented] (TIKA-850) Consistent way to supply document passwords to parsers

2012-01-31 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196953#comment-13196953 ] Nick Burch commented on TIKA-850: - PasswordProvider added in r1238616, based on the above de

[jira] [Commented] (TIKA-854) No text extraction for Word macroenabled template

2012-01-31 Thread Antoni Mylka (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196952#comment-13196952 ] Antoni Mylka commented on TIKA-854: --- Remember TIKA-560. It's best if media types are all l

[jira] [Updated] (TIKA-854) No text extraction for Word macroenabled template

2012-01-31 Thread Maxim Valyanskiy (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Valyanskiy updated TIKA-854: -- Summary: No text extraction for Word macroenabled template (was: No text extraction Word macroen

[jira] [Updated] (TIKA-854) No text extraction Word macroenabled template

2012-01-31 Thread Maxim Valyanskiy (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Valyanskiy updated TIKA-854: -- Attachment: cat50.dotm test data > No text extraction Word macroenabled template

[jira] [Created] (TIKA-854) No text extraction Word macroenabled template

2012-01-31 Thread Maxim Valyanskiy (Created) (JIRA)
No text extraction Word macroenabled template - Key: TIKA-854 URL: https://issues.apache.org/jira/browse/TIKA-854 Project: Tika Issue Type: Bug Affects Versions: 1.1 Reporter: Maxim

Re: % of different content types out there on the web

2012-01-31 Thread Markus Jelsma
We only crawl HTML and PDF files for a lot of cc-TLD's so we only have data on those two. However, we also explicitly filter out all/most unwanted suffixes. We do have a lot of suffixes that we encountered so far. On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote: > (sorry for

[jira] [Commented] (TIKA-853) java.io.IOException with TikaGUI and testMP4.m4a

2012-01-31 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196841#comment-13196841 ] Nick Burch commented on TIKA-853: - I've looked at the code again, and I can't spot anything