[jira] [Updated] (TIKA-2021) Improving accuracy of Tesseract parser

2016-06-24 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-2021: Component/s: parser ocr

[jira] [Updated] (TIKA-2021) Improving accuracy of Tesseract parser

2016-06-24 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-2021: Fix Version/s: 1.14

[jira] [Assigned] (TIKA-2021) Improving accuracy of Tesseract parser

2016-06-24 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned TIKA-2021: Assignee: Chris A. Mattmann

[jira] [Updated] (TIKA-2021) Improving accuracy of Tesseract parser

2016-06-24 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-2021: Labels: memex (was: )

[jira] [Commented] (TIKA-2021) Improving accuracy of Tesseract parser

2016-06-24 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349029#comment-15349029 ] ASF GitHub Bot commented on TIKA-2021: GitHub user Zarana-Parekh opened a pull request:

[GitHub] tika pull request #126: fix for TIKA-2021 contributed by Zarana Parekh

2016-06-24 Thread Zarana-Parekh
GitHub user Zarana-Parekh opened a pull request: https://github.com/apache/tika/pull/126 "fix for TIKA-2021 contributed by Zarana Parekh". Improving accuracy of Tesseract for better extraction of numeric and alphanumeric text from images. You can merge this pull request into a Git

[jira] [Created] (TIKA-2021) Improving accuracy of Tesseract parser

2016-06-24 Thread Zarana Parekh (JIRA)
Zarana Parekh created TIKA-2021: Summary: Improving accuracy of Tesseract parser; Key: TIKA-2021; URL: https://issues.apache.org/jira/browse/TIKA-2021; Project: Tika; Issue Type: Improvement

[jira] [Commented] (TIKA-2018) Attempt to get Title from Full text if not present in MetaData ( Application/Pdf )

2016-06-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348551#comment-15348551 ] Tim Allison commented on TIKA-2018: I'm not against implementing some basic heuristics based on font

[jira] [Commented] (TIKA-2020) Tika 2.0 - remove AbstractParser's 3 parameter parse

2016-06-24 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348546#comment-15348546 ] Hudson commented on TIKA-2020: SUCCESS: Integrated in tika-2.x #113 (See

[jira] [Commented] (TIKA-2018) Attempt to get Title from Full text if not present in MetaData ( Application/Pdf )

2016-06-24 Thread Florent Valdelievre (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348506#comment-15348506 ] Florent Valdelievre commented on TIKA-2018: Tika is doing a good job in getting Metadata when

[jira] [Commented] (TIKA-2020) Tika 2.0 - remove AbstractParser's 3 parameter parse

2016-06-24 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348496#comment-15348496 ] Hudson commented on TIKA-2020: FAILURE: Integrated in tika-2.x-windows #17 (See

tika-2.x-windows - Build # 17 - Still Failing

2016-06-24 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #17) Status: Still Failing Check console output at https://builds.apache.org/job/tika-2.x-windows/17/ to view the results.

[jira] [Updated] (TIKA-2020) Tika 2.0 - remove AbstractParser's 3 parameter parse

2016-06-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2020: Description: If I understand correctly, AbstractParser was added to allow an easier transition from the

[jira] [Resolved] (TIKA-2020) Tika 2.0 - remove AbstractParser's 3 parameter parse

2016-06-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2020. Resolution: Fixed. I initially thought we could remove the AbstractParser entirely, but that contains

[jira] [Updated] (TIKA-2020) Tika 2.0 - remove AbstractParser's 3 parameter parse

2016-06-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2020: Fix Version/s: 2.0

[jira] [Commented] (TIKA-2019) WordMLParser and SpreadsheetMLParser incorrectly concatenate tokens with ToTextHandler

2016-06-24 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348442#comment-15348442 ] Hudson commented on TIKA-2019: SUCCESS: Integrated in tika-2.x #112 (See

[jira] [Commented] (TIKA-2019) WordMLParser and SpreadsheetMLParser incorrectly concatenate tokens with ToTextHandler

2016-06-24 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348413#comment-15348413 ] Hudson commented on TIKA-2019: FAILURE: Integrated in tika-2.x-windows #16 (See

tika-2.x-windows - Build # 16 - Still Failing

2016-06-24 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #16) Status: Still Failing Check console output at https://builds.apache.org/job/tika-2.x-windows/16/ to view the results.

[jira] [Updated] (TIKA-2020) Tika 2.0 - remove AbstractParser's 3 parameter parse

2016-06-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2020: Summary: Tika 2.0 - remove AbstractParser's 3 parameter parse (was: Tika 2.0 - remove AbstractParser)

[jira] [Commented] (TIKA-2019) WordMLParser and SpreadsheetMLParser incorrectly concatenate tokens with ToTextHandler

2016-06-24 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348378#comment-15348378 ] Hudson commented on TIKA-2019: SUCCESS: Integrated in Tika-trunk #1069 (See

[jira] [Created] (TIKA-2020) Tika 2.0 - remove AbstractParser

2016-06-24 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2020: Summary: Tika 2.0 - remove AbstractParser; Key: TIKA-2020; URL: https://issues.apache.org/jira/browse/TIKA-2020; Project: Tika; Issue Type: Task; Reporter:

[jira] [Resolved] (TIKA-2019) WordMLParser and SpreadsheetMLParser incorrectly concatenate tokens with ToTextHandler

2016-06-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2019. Resolution: Fixed; Fix Version/s: 1.14, 2.0

[jira] [Updated] (TIKA-2019) WordMLParser and SpreadsheetMLParser incorrectly concatenate tokens with ToTextHandler

2016-06-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2019: Description: The XML generated by these parsers was good, but when using the ToTextHandler, spaces/tabs

[jira] [Created] (TIKA-2019) WordMLParser and SpreadsheetMLParser incorrectly concatenate tokens with ToTextHandler

2016-06-24 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2019: Summary: WordMLParser and SpreadsheetMLParser incorrectly concatenate tokens with ToTextHandler; Key: TIKA-2019; URL: https://issues.apache.org/jira/browse/TIKA-2019

[jira] [Commented] (TIKA-2018) Attempt to get Title from Full text if not present in MetaData ( Application/Pdf )

2016-06-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348286#comment-15348286 ] Tim Allison commented on TIKA-2018: "A vast majority of pdf documents don't fill meta information."

[jira] [Created] (TIKA-2018) Attempt to get Title from Full text if not present in MetaData ( Application/Pdf )

2016-06-24 Thread Florent Valdelievre (JIRA)
Florent Valdelievre created TIKA-2018: Summary: Attempt to get Title from Full text if not present in MetaData ( Application/Pdf ); Key: TIKA-2018; URL: https://issues.apache.org/jira/browse/TIKA-2018

[jira] [Updated] (TIKA-2017) Tika Server Cannot handle large files; add option for metadata only

2016-06-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2017: Summary: Tika Server Cannot handle large files; add option for metadata only (was: Tika Server Cannot

[jira] [Comment Edited] (TIKA-2017) Tika Server Cannot handle large files

2016-06-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348153#comment-15348153 ] Tim Allison edited comment on TIKA-2017 at 6/24/16 11:18 AM: I thought I had

[jira] [Commented] (TIKA-2017) Tika Server Cannot handle large files

2016-06-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348153#comment-15348153 ] Tim Allison commented on TIKA-2017: I thought I had documented this on our wiki, but it isn't there now.

[vm] mimes of files in our corpus

2016-06-24 Thread Allison, Timothy B.
Hi Dominik, As you mentioned, it is a pain for each of us to run mime-detection on the files in our corpus to select those we're interested in. This is somewhat out of date, but should be reasonable for now: http://162.242.228.174/mimes/mime_comparisons.html I'll dump mimes into a tab

[jira] [Commented] (TIKA-2017) Tika Server Cannot handle large files

2016-06-24 Thread Sergey Beryozkin (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348132#comment-15348132 ] Sergey Beryozkin commented on TIKA-2017: Might also be worth trying multiparts, I've updated the