Re: Gearing up for Tika 0.8

2010-10-26 Thread reinhard schwab
On 26.10.2010 19:03, Ken Krugler wrote: > > On Oct 21, 2010, at 12:28pm, Jukka Zitting wrote: > >> Hi, >> >> We're planning to release Jackrabbit 2.2 at the end of November, and >> it would be great to have Tika 0.8 out by then for use as a >> dependency. Ideally I'd like to see 0.8 out within th

Re: Google's Compact Language Detector

2011-10-26 Thread reinhard schwab
i have also compared tika performance with the nutch language detector in version 1.0. it seems that nutch is far better in performance than tika (5 to 6 times faster than tika). but my use case is so special (short texts ~ 140 characters length) and i don't have time to investigate, so i have no

[jira] [Created] (TIKA-1500) FeedParser extracts XML markup with BodyContentHandler

2014-12-21 Thread Reinhard Schwab (JIRA)
Reinhard Schwab created TIKA-1500: - Summary: FeedParser extracts XML markup with BodyContentHandler Key: TIKA-1500 URL: https://issues.apache.org/jira/browse/TIKA-1500 Project: Tika Issue

[jira] [Updated] (TIKA-1500) FeedParser extracts XML markup with BodyContentHandler

2014-12-21 Thread Reinhard Schwab (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reinhard Schwab updated TIKA-1500: -- Attachment: TIKA-1500.patch Patch, which contains the trivial fix. > FeedParser extracts

[jira] Created: (TIKA-532) missing spaces in text extraction of BodyContentHandler

2010-10-16 Thread Reinhard Schwab (JIRA)
Reporter: Reinhard Schwab Fix For: 0.8 BodyContentHandler works fine to extract the text from pages, except this page: http://www.lucidimagination.com/developers/whitepapers/whats-new-solr-14 there is a selection, the text returned by BodyContentHandler contains "...Co

[jira] Created: (TIKA-539) Encoding detection is too biased by encoding in meta tag

2010-10-26 Thread Reinhard Schwab (JIRA)
Reporter: Reinhard Schwab Fix For: 0.8 if the encoding in the meta tag is wrong, this encoding is detected, even if there is the right encoding set in metadata before (which can be from http response header). test code to reproduce: static String content = "

[jira] Updated: (TIKA-539) Encoding detection is too biased by encoding in meta tag

2010-10-26 Thread Reinhard Schwab (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reinhard Schwab updated TIKA-539: - Attachment: TIKA-539.patch > Encoding detection is too biased by encoding in meta

[jira] Updated: (TIKA-539) Encoding detection is too biased by encoding in meta tag

2010-10-26 Thread Reinhard Schwab (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reinhard Schwab updated TIKA-539: - Attachment: TIKA-539_2.patch ignore my first version of the patch. the encoding detection in the

[jira] Commented: (TIKA-539) Encoding detection is too biased by encoding in meta tag

2010-10-26 Thread Reinhard Schwab (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925288#action_12925288 ] Reinhard Schwab commented on TIKA-539: -- hi ken, in other words: it trusts the se

[jira] Commented: (TIKA-548) PDF content extracted as single line

2010-11-28 Thread Reinhard Schwab (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964590#action_12964590 ] Reinhard Schwab commented on TIKA-548: -- there is still a regression there: i miss

[jira] Updated: (TIKA-548) PDF content extracted as single line

2010-11-28 Thread Reinhard Schwab (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reinhard Schwab updated TIKA-548: - Attachment: test.pdf this is a sample pdf document to reproduce the regression. > PDF cont

[jira] Commented: (TIKA-548) PDF content extracted as single line

2010-11-28 Thread Reinhard Schwab (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964592#action_12964592 ] Reinhard Schwab commented on TIKA-548: -- i have generated this document with openof