I looked at the results and found a new NPE, which I've fixed in TIKA-1894.
Aside from the known increase in PDF exceptions (because of the diff in how
PDFBox 2.0's parser handles truncated files and how PDFBox 1.x's parser handled
them), there are a few areas for investigation, but nothing
[
https://issues.apache.org/jira/browse/TIKA-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-1960:
--
Component/s: languageidentifier
> Put legacy language detection code back into 1.x=trunk
>
[
https://issues.apache.org/jira/browse/TIKA-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-1960:
--
Summary: Put legacy language detection code back into 1.x=trunk (was: Put
classic language detection
[
https://issues.apache.org/jira/browse/TIKA-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison reassigned TIKA-1960:
-
Assignee: Tim Allison
> Put legacy language detection code back into 1.x=trunk
>
[
https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259290#comment-15259290
]
Hudson commented on TIKA-1894:
--
SUCCESS: Integrated in tika-trunk-jdk1.7 #971 (See
[
https://issues.apache.org/jira/browse/TIKA-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259262#comment-15259262
]
Tim Allison edited comment on TIKA-1959 at 4/27/16 12:32 AM:
-
Committed with
[
https://issues.apache.org/jira/browse/TIKA-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison resolved TIKA-1959.
---
Resolution: Fixed
Fix Version/s: 1.13
Committed with incorrect commit message...argh.
>
Tim Allison created TIKA-1959:
-
Summary: Upgrade to PDFBox 2.0.1/JempBox 1.8.12
Key: TIKA-1959
URL: https://issues.apache.org/jira/browse/TIKA-1959
Project: Tika
Issue Type: Improvement
[
https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison resolved TIKA-1894.
---
Resolution: Fixed
> Add XMPMM metadata extraction to JempboxExtractor
>
[
https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison reopened TIKA-1894:
---
NPE discovered during TIKA-1302 regression tests in prep for 1.13 release.
> Add XMPMM metadata
[
https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-1894:
--
Fix Version/s: 1.13
> Add XMPMM metadata extraction to JempboxExtractor
>
Results are here:
http://162.242.228.174/reports/tika_1_12_v_tika_1_13-SNAPSHOTv3.tar.bz2
I haven't had a chance to look.
I stopped the run slightly early because of time constraints.
I made further modifications based on an OOM related to TIKA-1924 and committed
those this morning.
Should
[
https://issues.apache.org/jira/browse/TIKA-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258429#comment-15258429
]
Nick Burch commented on TIKA-1958:
--
On the detection, can't remember, probably best just try + unit test!
[
https://issues.apache.org/jira/browse/TIKA-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258357#comment-15258357
]
Tim Allison edited comment on TIKA-1958 at 4/26/16 4:40 PM:
Y, that's what I
[
https://issues.apache.org/jira/browse/TIKA-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258357#comment-15258357
]
Tim Allison commented on TIKA-1958:
---
Y, that's what I was thinking. For mime detection, can we specify
Please check this approach [1]; it could be useful to combine
a labeled seed set with unlabeled Fisher CallHome.
Since it may be a long read, there's a shorter ppt as well [2]
[1] link.springer.com/article/10.1023%2FA%3A1007692713085
[2] cseweb.ucsd.edu/~atsmith/presentation_final.ppt
On Tue, Apr
[
https://issues.apache.org/jira/browse/TIKA-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258206#comment-15258206
]
Hudson commented on TIKA-1924:
--
SUCCESS: Integrated in tika-trunk-jdk1.7 #970 (See
Sentiment analysis discussion doc:
https://docs.google.com/document/d/1Gi59YqtisY4NLaVY3B7CNLMTgCRZm9JEk17kmBmWXqQ/edit?usp=sharing
On Tue, Apr 26, 2016 at 10:56 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:
> Hi,
>
> Sure here is the link:
>
>
[
https://issues.apache.org/jira/browse/TIKA-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-1958:
--
Attachment: 2010-cal-eu.xls
Original file submitted via link in POI user's mail.
> Add mime detection
[
https://issues.apache.org/jira/browse/TIKA-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-1958:
--
Attachment: excel_msword_2003.tar.bz2
Output of grep on our corpus as it is today. We have several
[
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258147#comment-15258147
]
Tim Allison commented on TIKA-1513:
---
Great. Frankly, the initial regex looked quite good...small handful
> Are the tests hosted and executed on the Infra hosted VM?
I don't think we're using the infra-hosted vm for anything at the moment. The
regression testing and corpus is all happening on our Rackspace server. We
have roughly 3 million files (about 1TB). The corpus is in constant flux, though,
Hi,
Sure here is the link:
https://hangouts.google.com/call/a2w5cgdtirf6jgfb4ww5l2l64ee
Sorry for the delay.
Cheers,
Chris
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA
Hi Lewis,
Y, they are on the vm. The first pre-pre-comparisons were placed here:
http://162.242.228.174/reports/tika_1_12_v_tika_1_13-SNAPSHOTv2.tar.bz2
I announced this to the dev list and on twitter...
One quick and dirty metric (recommended by Tilman Hausherr over on PDFBox) is
to sum
Hi Tim,
What does this consist of? Are the tests hosted and executed on the Infra
hosted VM?
It would be great to see what the outcome of the integration tests is... I've
never seen this before and it would be very helpful for making a positive
case for upgrading Tika in projects such as Solr cf.