RE: pre-release 1.13 regression testing

2016-04-26 Thread Allison, Timothy B.
I looked at the results and found a new NPE, which I've fixed in TIKA-1894. Aside from the known increase in PDF exceptions (because of the diff in how PDFBox 2.0's parser handles truncated files and how PDFBox 1.x's parser handled them), there are a few areas for investigation, but nothing

[jira] [Updated] (TIKA-1960) Put legacy language detection code back into 1.x=trunk

2016-04-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1960: -- Component/s: languageidentifier > Put legacy language detection code back into 1.x=trunk >

[jira] [Updated] (TIKA-1960) Put legacy language detection code back into 1.x=trunk

2016-04-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1960: -- Summary: Put legacy language detection code back into 1.x=trunk (was: Put classic language detection

[jira] [Assigned] (TIKA-1960) Put legacy language detection code back into 1.x=trunk

2016-04-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-1960: - Assignee: Tim Allison > Put legacy language detection code back into 1.x=trunk >

[jira] [Commented] (TIKA-1894) Add XMPMM metadata extraction to JempboxExtractor

2016-04-26 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259290#comment-15259290 ] Hudson commented on TIKA-1894: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #971 (See

[jira] [Comment Edited] (TIKA-1959) Upgrade to PDFBox 2.0.1/JempBox 1.8.12

2016-04-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259262#comment-15259262 ] Tim Allison edited comment on TIKA-1959 at 4/27/16 12:32 AM: - Committed with

[jira] [Resolved] (TIKA-1959) Upgrade to PDFBox 2.0.1/JempBox 1.8.12

2016-04-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1959. --- Resolution: Fixed Fix Version/s: 1.13 Committed with incorrect commit message...argh. >

[jira] [Created] (TIKA-1959) Upgrade to PDFBox 2.0.1/JempBox 1.8.12

2016-04-26 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1959: - Summary: Upgrade to PDFBox 2.0.1/JempBox 1.8.12 Key: TIKA-1959 URL: https://issues.apache.org/jira/browse/TIKA-1959 Project: Tika Issue Type: Improvement

[jira] [Resolved] (TIKA-1894) Add XMPMM metadata extraction to JempboxExtractor

2016-04-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1894. --- Resolution: Fixed > Add XMPMM metadata extraction to JempboxExtractor >

[jira] [Reopened] (TIKA-1894) Add XMPMM metadata extraction to JempboxExtractor

2016-04-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-1894: --- NPE discovered during TIKA-1302 regression tests in prep for 1.13 release. > Add XMPMM metadata

[jira] [Updated] (TIKA-1894) Add XMPMM metadata extraction to JempboxExtractor

2016-04-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1894: -- Fix Version/s: 1.13 > Add XMPMM metadata extraction to JempboxExtractor >

RE: pre-release 1.13 regression testing

2016-04-26 Thread Allison, Timothy B.
Results are here: http://162.242.228.174/reports/tika_1_12_v_tika_1_13-SNAPSHOTv3.tar.bz2 I haven't had a chance to look. I stopped the run slightly early because of time constraints. I made further modifications based on an OOM related to TIKA-1924 and committed those this morning. Should

[jira] [Commented] (TIKA-1958) Add mime detection and lightweight parsers for Office 2003 Word and Excel formats

2016-04-26 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258429#comment-15258429 ] Nick Burch commented on TIKA-1958: -- On the detection, can't remember, probably best just try + unit test!

[jira] [Comment Edited] (TIKA-1958) Add mime detection and lightweight parsers for Office 2003 Word and Excel formats

2016-04-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258357#comment-15258357 ] Tim Allison edited comment on TIKA-1958 at 4/26/16 4:40 PM: Y, that's what I

[jira] [Commented] (TIKA-1958) Add mime detection and lightweight parsers for Office 2003 Word and Excel formats

2016-04-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258357#comment-15258357 ] Tim Allison commented on TIKA-1958: --- Y, that's what I was thinking. For mime detection, can we specify

Re: GSoC 2016: OpenNLP Sentiment Analysis

2016-04-26 Thread Anthony Beylerian
Please check this approach [1]; it could be useful to combine a labeled seed set with unlabeled Fisher CallHome. Since it may be a long read, there's a shorter ppt as well [2]. [1] link.springer.com/article/10.1023%2FA%3A1007692713085 [2] cseweb.ucsd.edu/~atsmith/presentation_final.ppt On Tue, Apr

[jira] [Commented] (TIKA-1924) Upgrade com.googlecode.mp4parser's isoparser to 1.1.18

2016-04-26 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258206#comment-15258206 ] Hudson commented on TIKA-1924: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #970 (See

Re: GSoC 2016: OpenNLP Sentiment Analysis

2016-04-26 Thread Anthony Beylerian
sentiment analysis discussion doc : https://docs.google.com/document/d/1Gi59YqtisY4NLaVY3B7CNLMTgCRZm9JEk17kmBmWXqQ/edit?usp=sharing On Tue, Apr 26, 2016 at 10:56 PM, Mattmann, Chris A (3980) < chris.a.mattm...@jpl.nasa.gov> wrote: > Hi, > > Sure here is the link: > >

[jira] [Updated] (TIKA-1958) Add mime detection and lightweight parsers for Office 2003 Word and Excel formats

2016-04-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1958: -- Attachment: 2010-cal-eu.xls Original file submitted via link in POI user's mail. > Add mime detection

[jira] [Updated] (TIKA-1958) Add mime detection and lightweight parsers for Office 2003 Word and Excel formats

2016-04-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1958: -- Attachment: excel_msword_2003.tar.bz2 Output of grep on our corpus as it is today. We have several

[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258147#comment-15258147 ] Tim Allison commented on TIKA-1513: --- Great. Frankly, the initial regex looked quite good...small handful

RE: pre-release 1.13 regression testing

2016-04-26 Thread Allison, Timothy B.
> Are the tests hosted and executed on the Infra hosted VM? I don't think we're using the infra-hosted vm for anything at the moment. The regression testing and the corpus are both on our Rackspace server. We have roughly 3 million files (about 1 TB). The corpus is in constant flux, though,

Re: GSoC 2016: OpenNLP Sentiment Analysis

2016-04-26 Thread Mattmann, Chris A (3980)
Hi, Sure here is the link: https://hangouts.google.com/call/a2w5cgdtirf6jgfb4ww5l2l64ee Sorry for the delay. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA

RE: pre-release 1.13 regression testing

2016-04-26 Thread Allison, Timothy B.
Hi Lewis, Y, they are on the vm. The first pre-pre-comparisons were placed here: http://162.242.228.174/reports/tika_1_12_v_tika_1_13-SNAPSHOTv2.tar.bz2 I announced this to the dev list and on twitter... One quick and dirty metric (recommended by Tilman Hausherr over on PDFBox) is to sum

Re: pre-release 1.13 regression testing

2016-04-26 Thread Lewis John Mcgibbney
Hi Tim, What does this consist of? Are the tests hosted and executed on the Infra hosted VM? It would be great to see what the outcomes of the integration tests are... I've never seen this before and it would be very helpful for making a positive case for upgrading Tika in projects such as Solr cf.