[
https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris A. Mattmann updated TIKA-2021:
Component/s: parser
ocr
> Improving accuracy of Tesseract parser
>
[
https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris A. Mattmann updated TIKA-2021:
Fix Version/s: 1.14
> Improving accuracy of Tesseract parser
>
[
https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris A. Mattmann reassigned TIKA-2021:
---
Assignee: Chris A. Mattmann
> Improving accuracy of Tesseract parser
>
[
https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris A. Mattmann updated TIKA-2021:
Labels: memex (was: )
> Improving accuracy of Tesseract parser
>
[
https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349029#comment-15349029
]
ASF GitHub Bot commented on TIKA-2021:
--
GitHub user Zarana-Parekh opened a pull request:
https://github.com/apache/tika/pull/126
fix for TIKA-2021 contributed by Zarana Parekh
Improving accuracy of Tesseract for better extraction of numeric and
alphanumeric text from images.
You can merge this pull request into a Git
Zarana Parekh created TIKA-2021:
---
Summary: Improving accuracy of Tesseract parser
Key: TIKA-2021
URL: https://issues.apache.org/jira/browse/TIKA-2021
Project: Tika
Issue Type: Improvement
[
https://issues.apache.org/jira/browse/TIKA-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348551#comment-15348551
]
Tim Allison commented on TIKA-2018:
---
I'm not against implementing some basic heuristics based on font
[
https://issues.apache.org/jira/browse/TIKA-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348546#comment-15348546
]
Hudson commented on TIKA-2020:
--
SUCCESS: Integrated in tika-2.x #113 (See
[
https://issues.apache.org/jira/browse/TIKA-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348506#comment-15348506
]
Florent Valdelievre commented on TIKA-2018:
---
Tika is doing a good job in getting Metadata when
[
https://issues.apache.org/jira/browse/TIKA-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348496#comment-15348496
]
Hudson commented on TIKA-2020:
--
FAILURE: Integrated in tika-2.x-windows #17 (See
The Apache Jenkins build system has built tika-2.x-windows (build #17)
Status: Still Failing
Check console output at https://builds.apache.org/job/tika-2.x-windows/17/ to
view the results.
[
https://issues.apache.org/jira/browse/TIKA-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-2020:
--
Description:
If I understand correctly, AbstractParser was added to allow an easier
transition from the
[
https://issues.apache.org/jira/browse/TIKA-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison resolved TIKA-2020.
---
Resolution: Fixed
I initially thought we could remove the AbstractParser entirely, but that
contains
[
https://issues.apache.org/jira/browse/TIKA-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-2020:
--
Fix Version/s: 2.0
> Tika 2.0 - remove AbstractParser's 3 parameter parse
>
[
https://issues.apache.org/jira/browse/TIKA-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348442#comment-15348442
]
Hudson commented on TIKA-2019:
--
SUCCESS: Integrated in tika-2.x #112 (See
[
https://issues.apache.org/jira/browse/TIKA-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348413#comment-15348413
]
Hudson commented on TIKA-2019:
--
FAILURE: Integrated in tika-2.x-windows #16 (See
The Apache Jenkins build system has built tika-2.x-windows (build #16)
Status: Still Failing
Check console output at https://builds.apache.org/job/tika-2.x-windows/16/ to
view the results.
[
https://issues.apache.org/jira/browse/TIKA-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-2020:
--
Summary: Tika 2.0 - remove AbstractParser's 3 parameter parse (was: Tika
2.0 - remove AbstractParser)
[
https://issues.apache.org/jira/browse/TIKA-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348378#comment-15348378
]
Hudson commented on TIKA-2019:
--
SUCCESS: Integrated in Tika-trunk #1069 (See
Tim Allison created TIKA-2020:
-
Summary: Tika 2.0 - remove AbstractParser
Key: TIKA-2020
URL: https://issues.apache.org/jira/browse/TIKA-2020
Project: Tika
Issue Type: Task
Reporter:
[
https://issues.apache.org/jira/browse/TIKA-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison resolved TIKA-2019.
---
Resolution: Fixed
Fix Version/s: 1.14
2.0
> WordMLParser and
[
https://issues.apache.org/jira/browse/TIKA-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-2019:
--
Description: The xml generated by these parsers was good, but when using
the ToTextHandler, spaces/tabs
Tim Allison created TIKA-2019:
-
Summary: WordMLParser and SpreadsheetMLParser incorrectly
concatenate tokens with ToTextHandler
Key: TIKA-2019
URL: https://issues.apache.org/jira/browse/TIKA-2019
[
https://issues.apache.org/jira/browse/TIKA-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348286#comment-15348286
]
Tim Allison commented on TIKA-2018:
---
bq. A vast majority of pdf documents don't fill meta information.
Florent Valdelievre created TIKA-2018:
-
Summary: Attempt to get Title from Full text if not present in
MetaData ( Application/Pdf )
Key: TIKA-2018
URL: https://issues.apache.org/jira/browse/TIKA-2018
[
https://issues.apache.org/jira/browse/TIKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-2017:
--
Summary: Tika Server Cannot handle large files; add option for metadata
only (was: Tika Server Cannot
[
https://issues.apache.org/jira/browse/TIKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348153#comment-15348153
]
Tim Allison edited comment on TIKA-2017 at 6/24/16 11:18 AM:
-
I thought I had
[
https://issues.apache.org/jira/browse/TIKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348153#comment-15348153
]
Tim Allison commented on TIKA-2017:
---
I thought I had documented this on our wiki, but it isn't there now.
Hi Dominik,
As you mentioned, it is a pain for each of us to run mime-detection on the
files in our corpus to select those we're interested in.
This is somewhat out of date, but should be reasonable for now:
http://162.242.228.174/mimes/mime_comparisons.html
I'll dump mimes into a tab
[
https://issues.apache.org/jira/browse/TIKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348132#comment-15348132
]
Sergey Beryozkin commented on TIKA-2017:
Might also be worth trying multiparts, I've updated the