[
https://issues.apache.org/jira/browse/TIKA-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197564#comment-13197564
]
John Mastarone commented on TIKA-853:
-
I tried debugging but I couldn't see what was hol
[
https://issues.apache.org/jira/browse/TIKA-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197361#comment-13197361
]
Ray Gauss II edited comment on TIKA-842 at 1/31/12 11:06 PM:
-
If
[
https://issues.apache.org/jira/browse/TIKA-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197361#comment-13197361
]
Ray Gauss II commented on TIKA-842:
---
If we're going to have Metadata implement all metadat
On Tuesday 31 January 2012 15:55:06 Mattmann, Chris A (388J) wrote:
> Hi Markus,
>
> Thanks for the FYI. Any idea at specific %'s for those unwanted suffixes
> compared to the size of the entire corpus?
Unfortunately no, we don't keep a record of those; we just filter them out as
soon as we can.
Hi Markus,
Thanks for the FYI. Any idea at specific %'s for those unwanted suffixes
compared to the size of the entire corpus?
Cheers,
Chris
On Jan 31, 2012, at 4:39 AM, Markus Jelsma wrote:
> We only crawl HTML and PDF files for a lot of cc-TLD's so we only have data
> on those two. Howev
[
https://issues.apache.org/jira/browse/TIKA-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Maxim Valyanskiy resolved TIKA-854.
---
Resolution: Fixed
Fix Version/s: 1.1
> No text extraction for Word macroenabled temp
[
https://issues.apache.org/jira/browse/TIKA-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196953#comment-13196953
]
Nick Burch commented on TIKA-850:
-
PasswordProvider added in r1238616, based on the above de
[
https://issues.apache.org/jira/browse/TIKA-854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196952#comment-13196952
]
Antoni Mylka commented on TIKA-854:
---
Remember TIKA-560. It's best if media types are all l
[
https://issues.apache.org/jira/browse/TIKA-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Maxim Valyanskiy updated TIKA-854:
--
Summary: No text extraction for Word macroenabled template (was: No text
extraction Word macroen
[
https://issues.apache.org/jira/browse/TIKA-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Maxim Valyanskiy updated TIKA-854:
--
Attachment: cat50.dotm
test data
> No text extraction Word macroenabled template
No text extraction Word macroenabled template
-
Key: TIKA-854
URL: https://issues.apache.org/jira/browse/TIKA-854
Project: Tika
Issue Type: Bug
Affects Versions: 1.1
Reporter: Maxim
We only crawl HTML and PDF files for a lot of cc-TLD's so we only have data on
those two. However, we also explicitly filter out all/most unwanted suffixes.
We do have a lot of suffixes that we encountered so far.
On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote:
> (sorry for
[
https://issues.apache.org/jira/browse/TIKA-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196841#comment-13196841
]
Nick Burch commented on TIKA-853:
-
I've looked at the code again, and I can't spot anything