Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-08-31 Thread Mattmann, Chris A (3980)
Ok thanks 

Sent from my iPhone

On Aug 31, 2014, at 1:35 PM, "Tyler Palsulich"  wrote:

>> Commit it to trunk and then yes
> Already in there (thanks, Nick!).


Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-08-31 Thread Tyler Palsulich
>Commit it to trunk and then yes
Already in there (thanks, Nick!).


Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-08-31 Thread Mattmann, Chris A (3980)
Commit it to trunk and then yes 

Sent from my iPhone

> On Aug 31, 2014, at 1:11 PM, "Tyler Palsulich"  wrote:
> 
> Can we get TIKA-1404 in 1.6? Simple, but significant, fix.
> 
> Tyler
> On Aug 31, 2014 3:54 PM, "Mattmann, Chris A (3980)" <
> chris.a.mattm...@jpl.nasa.gov> wrote:
> 
>> Ugh, sorry. Maven release plugin issues, going to have to clean some
>> stuff up here. Don't mind me folks.
>> 
>> ++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++
>> 
>> 
>> 
>> 
>> 
>> 
>> -Original Message-
>> From: , Chris Mattmann 
>> Reply-To: "dev@tika.apache.org" 
>> Date: Sunday, August 31, 2014 12:37 PM
>> To: "dev@tika.apache.org" 
>> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>> 
>>> OK RC #2 coming up shortly, just brought the branch up to date in
>>> r1621623. Also cleaned up JIRA.
>>> 
>>> Here goes..
>>> 
>>> ++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattm...@nasa.gov
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -Original Message-
>>> From: , Chris Mattmann 
>>> Date: Thursday, July 31, 2014 11:29 AM
>>> To: "dev@tika.apache.org" 
>>> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>> 
>>>> Guys, based on all the comments here, I am going to roll another
>>>> RC #2 to address:
>>>> 
>>>> - Tyler's comment about getting the MicrosoftTranslator fix incorporated.
>>>> - Dave's Lingo24 API plugin for translate
>>>> - Nick's POI updates
>>>> 
>>>> I'll roll another RC #2 probably on Monday.
>>>> 
>>>> Thanks!
>>>> 
>>>> Cheers,
>>>> Chris
>>>> 
>>>> P.S. When I do, I'll diff trunk against the branch and then roll any
>>>> trunk updates post branch to 1.6 into the new 1.6 RC #2.
>>>> 
>>>> ++
>>>> Chris Mattmann, Ph.D.
>>>> Chief Architect
>>>> Instrument Software and Science Data Systems Section (398)
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 168-519, Mailstop: 168-527
>>>> Email: chris.a.mattm...@nasa.gov
>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>> ++
>>>> Adjunct Associate Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -Original Message-
>>>> From: , Chris Mattmann 
>>>> Reply-To: "dev@tika.apache.org" 
>>>> Date: Monday, July 28, 2014 11:45 AM
>>>> To: "dev@tika.apache.org" 
>>>> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>>> 
>>>>> Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>>>>> thread for a few weeks about getting 1.6 out. Do you have a patch right
>>>>> now for TIKA-1367? If so I'm happy to incorporate it a

Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-08-31 Thread Tyler Palsulich
Can we get TIKA-1404 in 1.6? Simple, but significant, fix.

Tyler
On Aug 31, 2014 3:54 PM, "Mattmann, Chris A (3980)" <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Ugh, sorry. Maven release plugin issues, going to have to clean some
> stuff up here. Don't mind me folks.
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: , Chris Mattmann 
> Reply-To: "dev@tika.apache.org" 
> Date: Sunday, August 31, 2014 12:37 PM
> To: "dev@tika.apache.org" 
> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
> >OK RC #2 coming up shortly, just brought the branch up to date in
> >r1621623. Also cleaned up JIRA.
> >
> >Here goes..
> >
> >++
> >Chris Mattmann, Ph.D.
> >Chief Architect
> >Instrument Software and Science Data Systems Section (398)
> >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >Office: 168-519, Mailstop: 168-527
> >Email: chris.a.mattm...@nasa.gov
> >WWW:  http://sunset.usc.edu/~mattmann/
> >++
> >Adjunct Associate Professor, Computer Science Department
> >University of Southern California, Los Angeles, CA 90089 USA
> >++
> >
> >
> >
> >
> >
> >
> >-Original Message-
> >From: , Chris Mattmann 
> >Date: Thursday, July 31, 2014 11:29 AM
> >To: "dev@tika.apache.org" 
> >Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
> >
> >>Guys, based on all the comments here, I am going to roll another
> >>RC #2 to address:
> >>
> >>- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
> >>- Dave's Lingo24 API plugin for translate
> >>- Nick's POI updates
> >>
> >>I'll roll another RC #2 probably on Monday.
> >>
> >>Thanks!
> >>
> >>Cheers,
> >>Chris
> >>
> >>P.S. When I do, I'll diff trunk against the branch and then roll any
> >>trunk updates post branch to 1.6 into the new 1.6 RC #2.
> >>
> >>++
> >>Chris Mattmann, Ph.D.
> >>Chief Architect
> >>Instrument Software and Science Data Systems Section (398)
> >>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>Office: 168-519, Mailstop: 168-527
> >>Email: chris.a.mattm...@nasa.gov
> >>WWW:  http://sunset.usc.edu/~mattmann/
> >>++
> >>Adjunct Associate Professor, Computer Science Department
> >>University of Southern California, Los Angeles, CA 90089 USA
> >>++
> >>
> >>
> >>
> >>
> >>
> >>
> >>-Original Message-
> >>From: , Chris Mattmann 
> >>Reply-To: "dev@tika.apache.org" 
> >>Date: Monday, July 28, 2014 11:45 AM
> >>To: "dev@tika.apache.org" 
> >>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
> >>
> >>>Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
> >>>thread for a few weeks about getting 1.6 out. Do you have a patch right
> >>>now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
> >>>to get it in. If you don't have a patch yet, would you mind terribly if
> >>>we pushed out 1.6, which already today has a ton of great updates, then
> >>>shortly thereafter rolled a 1.7 (or did so when you finished with
> >>>TIKA-1367)?
> >>>
> >>>Cheers,
> >>>Chris
> >>>
> >>>
> >>>++
> >>>Chris Mattmann, Ph.D.
> >>>Chief Architect

Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-08-31 Thread Mattmann, Chris A (3980)
Ugh, sorry. Maven release plugin issues, going to have to clean some
stuff up here. Don't mind me folks.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: , Chris Mattmann 
Reply-To: "dev@tika.apache.org" 
Date: Sunday, August 31, 2014 12:37 PM
To: "dev@tika.apache.org" 
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>OK RC #2 coming up shortly, just brought the branch up to date in
>r1621623. Also cleaned up JIRA.
>
>Here goes..
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattm...@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++
>
>
>
>
>
>
>-Original Message-
>From: , Chris Mattmann 
>Date: Thursday, July 31, 2014 11:29 AM
>To: "dev@tika.apache.org" 
>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
>>Guys, based on all the comments here, I am going to roll another
>>RC #2 to address:
>>
>>- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
>>- Dave's Lingo24 API plugin for translate
>>- Nick's POI updates
>>
>>I'll roll another RC #2 probably on Monday.
>>
>>Thanks!
>>
>>Cheers,
>>Chris
>>
>>P.S. When I do, I'll diff trunk against the branch and then roll any
>>trunk updates post branch to 1.6 into the new 1.6 RC #2.
>>
>>++
>>Chris Mattmann, Ph.D.
>>Chief Architect
>>Instrument Software and Science Data Systems Section (398)
>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>Office: 168-519, Mailstop: 168-527
>>Email: chris.a.mattm...@nasa.gov
>>WWW:  http://sunset.usc.edu/~mattmann/
>>++
>>Adjunct Associate Professor, Computer Science Department
>>University of Southern California, Los Angeles, CA 90089 USA
>>++
>>
>>
>>
>>
>>
>>
>>-Original Message-
>>From: , Chris Mattmann 
>>Reply-To: "dev@tika.apache.org" 
>>Date: Monday, July 28, 2014 11:45 AM
>>To: "dev@tika.apache.org" 
>>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>
>>>Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>>>thread for a few weeks about getting 1.6 out. Do you have a patch right
>>>now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>>>to get it in. If you don't have a patch yet, would you mind terribly if
>>>we pushed out 1.6, which already today has a ton of great updates, then
>>>shortly thereafter rolled a 1.7 (or did so when you finished with
>>>TIKA-1367)?
>>>
>>>Cheers,
>>>Chris
>>>
>>>
>>>++
>>>Chris Mattmann, Ph.D.
>>>Chief Architect
>>>Instrument Software and Science Data Systems Section (398)
>>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>Office: 168-519, Mailstop: 168-527
>>>Email: chris.a.mattm...@nasa.gov
>>>WWW:  http://sunset.usc.edu/~mattmann/
>>>++
>>>Adjunct Associate Professor, Computer Science Department
>>>University of Southern California, Los Angeles, CA 90089 USA
>>>++
>>>
>>>
>>>
>>>
>>>
>>>
>>>-Original Message-
>>>From: Sergey Beryozkin 
>>>Re

Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-08-31 Thread Mattmann, Chris A (3980)
OK RC #2 coming up shortly, just brought the branch up to date in
r1621623. Also cleaned up JIRA.

Here goes..

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: , Chris Mattmann 
Date: Thursday, July 31, 2014 11:29 AM
To: "dev@tika.apache.org" 
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>Guys, based on all the comments here, I am going to roll another
>RC #2 to address:
>
>- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
>- Dave's Lingo24 API plugin for translate
>- Nick's POI updates
>
>I'll roll another RC #2 probably on Monday.
>
>Thanks!
>
>Cheers,
>Chris
>
>P.S. When I do, I'll diff trunk against the branch and then roll any
>trunk updates post branch to 1.6 into the new 1.6 RC #2.
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattm...@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++
>
>
>
>
>
>
>-Original Message-
>From: , Chris Mattmann 
>Reply-To: "dev@tika.apache.org" 
>Date: Monday, July 28, 2014 11:45 AM
>To: "dev@tika.apache.org" 
>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
>>Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>>thread for a few weeks about getting 1.6 out. Do you have a patch right
>>now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>>to get it in. If you don't have a patch yet, would you mind terribly if
>>we pushed out 1.6, which already today has a ton of great updates, then
>>shortly thereafter rolled a 1.7 (or did so when you finished with
>>TIKA-1367)?
>>
>>Cheers,
>>Chris
>>
>>
>>++
>>Chris Mattmann, Ph.D.
>>Chief Architect
>>Instrument Software and Science Data Systems Section (398)
>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>Office: 168-519, Mailstop: 168-527
>>Email: chris.a.mattm...@nasa.gov
>>WWW:  http://sunset.usc.edu/~mattmann/
>>++
>>Adjunct Associate Professor, Computer Science Department
>>University of Southern California, Los Angeles, CA 90089 USA
>>++
>>
>>
>>
>>
>>
>>
>>-Original Message-
>>From: Sergey Beryozkin 
>>Reply-To: "dev@tika.apache.org" 
>>Date: Monday, July 28, 2014 11:38 AM
>>To: "dev@tika.apache.org" 
>>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>
>>>+0 given that it appears that the tika-parsers dependencies
>>>documentation issue has been pushed away. I'm getting confused why.
>>>
>>>Thanks. Sergey
>>>
>>>[1] https://issues.apache.org/jira/browse/TIKA-1367
>>>
>>>On 28/07/14 17:16, Tyler Palsulich wrote:
>>>> +1
>>>>
>>>> OSX 10.9.3, Java 1.7
>>>>
>>>> Tyler
>>>>
>>>>
>>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B.
>>>>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>>> Windows 7, Java 1.7
>>>>>
>>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>>> 10,413 docs.  There were several improvements in text extraction 

RE: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-31 Thread Allison, Timothy B.
Nick,
  Just to be clear -- that wasn't a veiled complaint that you hadn't cut the 
3.11-beta!  I really just have not had a chance to start the run with my local 
build of poi-trunk.
  Thank you, as always!

Best,

Tim

-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org] 
Sent: Thursday, July 31, 2014 3:06 PM
To: dev@tika.apache.org
Subject: RE: [VOTE] Apache Tika 1.6 release candidate #1

On Thu, 31 Jul 2014, Allison, Timothy B. wrote:
>  On a related note, I did some digging on the one regression I found in 
> the pptx, and that will be solved if we wait for POI 3.11 beta 1.  I 
> haven't yet had a chance to rerun on the random sample with the updated 
> POI...

I'm currently on a train to France, but fingers crossed I'll be able to 
upload the POI 3.11 beta 1 artifacts for you to test with before I run out 
of English mobile phone signal...

Nick


RE: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-31 Thread Nick Burch

On Thu, 31 Jul 2014, Allison, Timothy B. wrote:
 On a related note, I did some digging on the one regression I found in 
the pptx, and that will be solved if we wait for POI 3.11 beta 1.  I 
haven't yet had a chance to rerun on the random sample with the updated 
POI...


I'm currently on a train to France, but fingers crossed I'll be able to 
upload the POI 3.11 beta 1 artifacts for you to test with before I run out 
of English mobile phone signal...


Nick


RE: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-31 Thread Allison, Timothy B.
All,
  On a related note, I did some digging on the one regression I found in the 
pptx, and that will be solved if we wait for POI 3.11 beta 1.  I haven't yet 
had a chance to rerun on the random sample with the updated POI...  

 Best,

   Tim

-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Thursday, July 31, 2014 2:30 PM
To: dev@tika.apache.org
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

Guys, based on all the comments here, I am going to roll another
RC #2 to address:

- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
- Dave's Lingo24 API plugin for translate
- Nick's POI updates

I'll roll another RC #2 probably on Monday.

Thanks!

Cheers,
Chris

P.S. When I do, I'll diff trunk against the branch and then roll any
trunk updates post branch to 1.6 into the new 1.6 RC #2.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: , Chris Mattmann 
Reply-To: "dev@tika.apache.org" 
Date: Monday, July 28, 2014 11:45 AM
To: "dev@tika.apache.org" 
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>thread for a few weeks about getting 1.6 out. Do you have a patch right
>now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>to get it in. If you don't have a patch yet, would you mind terribly if
>we pushed out 1.6, which already today has a ton of great updates, then
>shortly thereafter rolled a 1.7 (or did so when you finished with
>TIKA-1367)?
>
>Cheers,
>Chris
>
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattm...@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++
>
>
>
>
>
>
>-Original Message-
>From: Sergey Beryozkin 
>Reply-To: "dev@tika.apache.org" 
>Date: Monday, July 28, 2014 11:38 AM
>To: "dev@tika.apache.org" 
>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
>>+0 given that it appears that the tika-parsers dependencies
>>documentation issue has been pushed away. I'm getting confused why.
>>
>>Thanks. Sergey
>>
>>[1] https://issues.apache.org/jira/browse/TIKA-1367
>>
>>On 28/07/14 17:16, Tyler Palsulich wrote:
>>> +1
>>>
>>> OSX 10.9.3, Java 1.7
>>>
>>> Tyler
>>>
>>>
>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B.
>>>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>> Windows 7, Java 1.7
>>>>
>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>> 10,413 docs.  There were several improvements in text extraction for PDFs
>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>>
>>>> There was one regression:
>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>
>>>> Stacktrace:
>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>>  at java.lang.String.checkBounds(String.java:371)
>>>>  at java.lang.String.<init>(String.java:415)
>>>>  at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)

Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-31 Thread Nick Burch
Another quick thought on the artifacts in 
http://people.apache.org/~mattmann/apache-tika-1.6/rc1/ - as well as 
needing to ditch original-tika-app.jar, shouldn't we have the Tika 
Server standalone jar in there too as another released + easily 
downloadable jar?


Thanks
Nick

On 28/07/14 05:22, Mattmann, Chris A (3980) wrote:

Hi Folks,

A candidate for the Tika 1.6 release is available at:

http://people.apache.org/~mattmann/apache-tika-1.6/rc1/


The release candidate is a zip archive of the sources in:

 http://svn.apache.org/repos/asf/tika/tags/1.6/

The SHA1 checksum of the archive is
076ad343be56a540a4c8e395746fa4fda5b5b6d3.

A Maven staging repository is available at:

https://repository.apache.org/content/repositories/orgapachetika-1003/


Please vote on releasing this package as Apache Tika 1.6.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

 [ ] +1 Release this package as Apache Tika 1.6
 [ ] -1 Do not release this package because...

Thank you!

Cheers,
Chris

P.S. Here is my +1!
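
For anyone checking the candidate before voting, the SHA1 checksum quoted above can be compared against a locally downloaded archive. The following is only a minimal sketch using the JDK's MessageDigest; the class name Sha1Check and the command-line arguments (a local file path and the expected checksum) are assumptions for illustration, not part of the Tika release tooling.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Check {

    // Streams the input through SHA-1 and returns the digest as lowercase hex.
    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive (hypothetical local file)
        // args[1]: checksum quoted in the vote email
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha1Hex(in);
            if (actual.equalsIgnoreCase(args[1])) {
                System.out.println("checksum OK");
            } else {
                System.out.println("checksum MISMATCH: " + actual);
            }
        }
    }
}
```

Run it with the archive path as the first argument and the checksum from the email as the second.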









Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-31 Thread Mattmann, Chris A (3980)
Guys, based on all the comments here, I am going to roll another
RC #2 to address:

- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
- Dave's Lingo24 API plugin for translate
- Nick's POI updates

I'll roll another RC #2 probably on Monday.

Thanks!

Cheers,
Chris

P.S. When I do, I'll diff trunk against the branch and then roll any
trunk updates post branch to 1.6 into the new 1.6 RC #2.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: , Chris Mattmann 
Reply-To: "dev@tika.apache.org" 
Date: Monday, July 28, 2014 11:45 AM
To: "dev@tika.apache.org" 
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>thread for a few weeks about getting 1.6 out. Do you have a patch right
>now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>to get it in. If you don't have a patch yet, would you mind terribly if
>we pushed out 1.6, which already today has a ton of great updates, then
>shortly thereafter rolled a 1.7 (or did so when you finished with
>TIKA-1367)?
>
>Cheers,
>Chris
>
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattm...@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++
>
>
>
>
>
>
>-Original Message-
>From: Sergey Beryozkin 
>Reply-To: "dev@tika.apache.org" 
>Date: Monday, July 28, 2014 11:38 AM
>To: "dev@tika.apache.org" 
>Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>
>>+0 given that it appears that the tika-parsers dependencies
>>documentation issue has been pushed away. I'm getting confused why.
>>
>>Thanks. Sergey
>>
>>[1] https://issues.apache.org/jira/browse/TIKA-1367
>>
>>On 28/07/14 17:16, Tyler Palsulich wrote:
>>> +1
>>>
>>> OSX 10.9.3, Java 1.7
>>>
>>> Tyler
>>>
>>>
>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B.
>>>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>> Windows 7, Java 1.7
>>>>
>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>> 10,413 docs.  There were several improvements in text extraction for PDFs
>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>>
>>>> There was one regression:
>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>
>>>> Stacktrace:
>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>>  at java.lang.String.checkBounds(String.java:371)
>>>>  at java.lang.String.<init>(String.java:415)
>>>>  at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>>  at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>>  at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>>  at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>>  at
>

Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-30 Thread Tyler Palsulich
Hi All,

After the recent NPE that Chris found (
https://issues.apache.org/jira/browse/TIKA-1378), we should roll an RC#2.

Tyler


On Wed, Jul 30, 2014 at 10:55 AM, Nick Burch  wrote:

> On Mon, 28 Jul 2014, Mattmann, Chris A (3980) wrote:
>
>> A candidate for the Tika 1.6 release is available at:
>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>
>
> Should original-tika-app-1.6.jar be in there? IIRC we decided in the 1.5
> release that it shouldn't be
>
>
>
>  Please vote on releasing this package as Apache Tika 1.6.
>> The vote is open for the next 72 hours and passes if a majority of at
>> least three +1 Tika PMC votes are cast.
>>
>
> Otherwise I'm +1
>
> Nick
>


Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-30 Thread Nick Burch

On Mon, 28 Jul 2014, Mattmann, Chris A (3980) wrote:

A candidate for the Tika 1.6 release is available at:
http://people.apache.org/~mattmann/apache-tika-1.6/rc1/


Should original-tika-app-1.6.jar be in there? IIRC we decided in the 1.5 
release that it shouldn't be




Please vote on releasing this package as Apache Tika 1.6.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.


Otherwise I'm +1

Nick
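
The pass condition quoted above ("a majority of at least three +1 Tika PMC votes") can be written down as a small check. This is only a sketch of one reading of that sentence: it assumes that only binding PMC votes are passed in, that +0 votes count for neither side, and that "majority" means more +1 than -1 votes. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class VoteTally {

    // Each entry is one binding vote: +1, 0 (for a +0/-0), or -1.
    // Assumed rule: at least three +1 votes, and more +1 than -1 votes.
    static boolean passes(List<Integer> bindingVotes) {
        long plusOnes = bindingVotes.stream().filter(v -> v > 0).count();
        long minusOnes = bindingVotes.stream().filter(v -> v < 0).count();
        return plusOnes >= 3 && plusOnes > minusOnes;
    }

    public static void main(String[] args) {
        // Hypothetical tally: three +1 votes and one +0.
        System.out.println(passes(Arrays.asList(1, 1, 1, 0)));
    }
}
```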


Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-29 Thread Sergey Beryozkin

Hi
On 29/07/14 13:14, Nick Burch wrote:

On Mon, 28 Jul 2014, Sergey Beryozkin wrote:

This is not an issue that should block the release, I was careful not
to vote with a minus one. I've become a bit impatient, but no one
really blocks me from completing this pure documentation effort
myself, I was hoping that someone would do it first :-).


Given that this is a documentation / website enhancement, I don't see
any reason why we couldn't post the details for 1.6 (and even perhaps
1.5!) to the site in a few weeks time, irrespective of when the 1.6
release goes out :)

Yes, you are right,

Cheers, Sergey


Cheers
Nick





RE: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-29 Thread Nick Burch

On Mon, 28 Jul 2014, Allison, Timothy B. wrote:

There was one regression:
http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx

Stacktrace:
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
    at java.lang.String.checkBounds(String.java:371)
    at java.lang.String.<init>(String.java:415)
    at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
    at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)


Any chance you could raise a POI bug for this? We're probably going to do 
the next POI beta release within a week, so if you hurry it might even get 
fixed in that... :)


Nick
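
As background for the bug report, the exception in the trace above is the kind the JDK throws when a String is built from a byte array with a negative length. The sketch below is not POI's code; it only illustrates, under the assumption of a record with a 4-byte little-endian length prefix, how a corrupt length field can surface as StringIndexOutOfBoundsException and how a bounds check turns it into a clearer error. All names here are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class NegativeLengthDemo {

    // Builds a record: 4-byte little-endian length field followed by the text bytes.
    // The length is passed separately so a corrupt value can be simulated.
    static byte[] record(int lengthField, String text) {
        byte[] body = text.getBytes(StandardCharsets.ISO_8859_1);
        return ByteBuffer.allocate(4 + body.length)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(lengthField)
                .put(body)
                .array();
    }

    // Trusts the length field; a negative value makes the String constructor throw.
    static String readUnchecked(byte[] data) {
        int length = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt();
        return new String(data, 4, length, StandardCharsets.ISO_8859_1);
    }

    // Validates the length field against the available bytes before decoding.
    static String readChecked(byte[] data) {
        int length = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt();
        if (length < 0 || length > data.length - 4) {
            throw new IllegalArgumentException("Implausible length field: " + length);
        }
        return new String(data, 4, length, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Same negative value as in the trace, used here purely as sample input.
        byte[] corrupt = record(-369073454, "hello");
        try {
            readUnchecked(corrupt);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unchecked read failed: " + e.getMessage());
        }
    }
}
```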


Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-29 Thread Nick Burch

On Mon, 28 Jul 2014, Sergey Beryozkin wrote:
This is not an issue that should block the release, I was careful not to 
vote with a minus one. I've become a bit impatient, but no one really 
blocks me from completing this pure documentation effort myself, I was 
hoping that someone would do it first :-).


Given that this is a documentation / website enhancement, I don't see any 
reason why we couldn't post the details for 1.6 (and even perhaps 1.5!) to 
the site in a few weeks time, irrespective of when the 1.6 release goes 
out :)


Cheers
Nick


Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-28 Thread Mattmann, Chris A (3980)
Thank you Sergey! OK I will proceed. Thanks for your contributions
to Tika and yes we'll get there.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Sergey Beryozkin 
Reply-To: "dev@tika.apache.org" 
Date: Monday, July 28, 2014 3:16 PM
To: "dev@tika.apache.org" 
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>Hi Chris,
>
>This is not an issue that should block the release, I was careful not to
>vote with a minus one. I've become a bit impatient, but no one really
>blocks me from completing this pure documentation effort myself, I was
>hoping that someone would do it first :-).
>
>Please go ahead with the release as planned, thanks for offering the
>chance to delay the release, but I can not go for it, we'll get there as
>far as the documentation is concerned :-)
>
>Thanks, Sergey
>
>On 28/07/14 21:45, Mattmann, Chris A (3980) wrote:
>> Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>> thread for a few weeks about getting 1.6 out. Do you have a patch right
>> now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>> to get it in. If you don't have a patch yet, would you mind terribly if
>> we pushed out 1.6, which already today has a ton of great updates, then
>> shortly thereafter rolled a 1.7 (or did so when you finished with
>> TIKA-1367)?
>>
>> Cheers,
>> Chris
>>
>>
>> ++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++
>>
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Sergey Beryozkin 
>> Reply-To: "dev@tika.apache.org" 
>> Date: Monday, July 28, 2014 11:38 AM
>> To: "dev@tika.apache.org" 
>> Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1
>>
>>> +0 given that it appears that the tika-parsers dependencies
>>> documentation issue has been pushed away. I'm getting confused why.
>>>
>>> Thanks. Sergey
>>>
>>> [1] https://issues.apache.org/jira/browse/TIKA-1367
>>>
>>> On 28/07/14 17:16, Tyler Palsulich wrote:
>>>> +1
>>>>
>>>> OSX 10.9.3, Java 1.7
>>>>
>>>> Tyler
>>>>
>>>>
>>>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B.
>>>> 
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>>>> Windows 7, Java 1.7
>>>>>
>>>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>>>> 10,413 docs.  There were several improvements in text extraction for PDFs
>>>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>>>
>>>>> There was one regression:
>>>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>>>
>>>>> Stacktrace:
>>>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>>>   at java.lang.String.checkBounds(String.java:371)
>>>>>   at java.lang.String.<init>(String.java:415)
>>>>>   at
>>>>>
>>>>> 
>>>>>org.apache.poi.util.Str

Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-28 Thread Sergey Beryozkin

Hi Chris,

This is not an issue that should block the release; I was careful not to
vote with a minus one. I've become a bit impatient, but nothing is really
stopping me from completing this pure documentation effort myself. I was
just hoping that someone would do it first :-).

Please go ahead with the release as planned. Thanks for offering to delay
the release, but I can't accept that; we'll get there as far as the
documentation is concerned :-)


Thanks, Sergey

On 28/07/14 21:45, Mattmann, Chris A (3980) wrote:

Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
thread for a few weeks about getting 1.6 out. Do you have a patch right
now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
to get it in. If you don't have a patch yet, would you mind terribly if
we pushed out 1.6, which already today has a ton of great updates, then
shortly thereafter rolled a 1.7 (or did so when you finished with
TIKA-1367)?

Cheers,
Chris


++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-----Original Message-----
From: Sergey Beryozkin 
Reply-To: "dev@tika.apache.org" 
Date: Monday, July 28, 2014 11:38 AM
To: "dev@tika.apache.org" 
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1


+0 given that it appears that the tika-parsers dependencies
documentation issue has been pushed away. I'm getting confused why.

Thanks. Sergey

[1] https://issues.apache.org/jira/browse/TIKA-1367

On 28/07/14 17:16, Tyler Palsulich wrote:

+1

OSX 10.9.3, Java 1.7

Tyler


On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B.

wrote:


+1

Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
Windows 7, Java 1.7

I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
(all formats) plus all available msoffice-x files in govdocs1, yielding
10,413 docs.  There were several improvements in text extraction for PDFs
(mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).

There was one regression:
http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx

Stacktrace:
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
  at java.lang.String.checkBounds(String.java:371)
  at java.lang.String.<init>(String.java:415)
  at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
  at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
  at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
  at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
  at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
  at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
  at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
  at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
  at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)


-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Monday, July 28, 2014 12:22 AM
To: dev@tika.apache.org
Cc: u...@tika.apache.org
Subject: [VOTE] Apache Tika 1.6 release candidate #1

Hi Folks,

A candidate for the Tika 1.6 release is available at:

http://people.apache.org/~mattmann/apache-tika-1.6/rc1/


The release candidate is a zip archive of the sources in:

  http://svn.apache.org/repos/asf/tika/tags/1.6/

The SHA1 checksum of the archive is
076ad343be56a540a4c8e395746fa4fda5b5b6d3.

A Maven staging repository is available at:

https://repository.apache.org/content/repositories/orgapachetika-1003/


Please vote on releasing this package as Apache Tika 1.6.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

  [ ] +1 Release this package as Apache Tika 1.6
  [ ] -1 Do not release this package because...

Thank you!

Cheers,
Chris

P.S. Here is my +1!














Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-28 Thread Mattmann, Chris A (3980)
Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
thread for a few weeks about getting 1.6 out. Do you have a patch right
now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
to get it in. If you don't have a patch yet, would you mind terribly if
we pushed out 1.6, which already today has a ton of great updates, then
shortly thereafter rolled a 1.7 (or did so when you finished with
TIKA-1367)?

Cheers,
Chris


++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-----Original Message-----
From: Sergey Beryozkin 
Reply-To: "dev@tika.apache.org" 
Date: Monday, July 28, 2014 11:38 AM
To: "dev@tika.apache.org" 
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>+0 given that it appears that the tika-parsers dependencies
>documentation issue has been pushed away. I'm getting confused why.
>
>Thanks. Sergey
>
>[1] https://issues.apache.org/jira/browse/TIKA-1367
>
>On 28/07/14 17:16, Tyler Palsulich wrote:
>> +1
>>
>> OSX 10.9.3, Java 1.7
>>
>> Tyler
>>
>>
>> On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B.
>>
>> wrote:
>>
>>> +1
>>>
>>> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
>>> Windows 7, Java 1.7
>>>
>>> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
>>> (all formats) plus all available msoffice-x files in govdocs1, yielding
>>> 10,413 docs.  There were several improvements in text extraction for PDFs
>>> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>>>
>>> There was one regression:
>>> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>>>
>>> Stacktrace:
>>> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>>>  at java.lang.String.checkBounds(String.java:371)
>>>  at java.lang.String.<init>(String.java:415)
>>>  at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>>>  at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
>>>  at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
>>>  at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
>>>  at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
>>>  at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>>>  at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
>>>  at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>>>  at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>>>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>>>
>>>
>>> -----Original Message-----
>>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
>>> Sent: Monday, July 28, 2014 12:22 AM
>>> To: dev@tika.apache.org
>>> Cc: u...@tika.apache.org
>>> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>>>
>>> Hi Folks,
>>>
>>> A candidate for the Tika 1.6 release is available at:
>>>
>>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>>>
>>>
>>> The release candidate is a zip archive of the sources in:
>>>
>>>  http://svn.apache.org/repos/asf/tika/tags/1.6/
>>>
>>> The SHA1 checksum of the archive is
>>> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>>>
>>> A Maven staging repository is available at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachetika-1003/
>>>
>>>
>>> Please vote on releasing this package as Apache Tika 1.6.
>>> The vote is open for the next 72 hours and passes if a majority of at
>>> least three +1 Tika PMC votes are cast.
>>>
>>>  [ ] +1 Release this package as Apache Tika 1.6
>>>  [ ] -1 Do not release this package because...
>>>
>>> Thank you!
>>>
>>> Cheers,
>>> Chris
>>>
>>> P.S. Here is my +1!
>>>
>>>
>>>
>>>
>>>
>>>
>>



Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-28 Thread Sergey Beryozkin
+0, given that the tika-parsers dependencies documentation issue [1]
appears to have been pushed back. I'm confused as to why.


Thanks. Sergey

[1] https://issues.apache.org/jira/browse/TIKA-1367

On 28/07/14 17:16, Tyler Palsulich wrote:

+1

OSX 10.9.3, Java 1.7

Tyler


On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B. 
wrote:


+1

Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
Windows 7, Java 1.7

I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
(all formats) plus all available msoffice-x files in govdocs1, yielding
10,413 docs.  There were several improvements in text extraction for PDFs
(mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).

There was one regression:
http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx

Stacktrace:
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
 at java.lang.String.checkBounds(String.java:371)
 at java.lang.String.<init>(String.java:415)
 at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
 at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
 at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
 at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
 at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
 at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
 at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
 at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
 at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)


-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Monday, July 28, 2014 12:22 AM
To: dev@tika.apache.org
Cc: u...@tika.apache.org
Subject: [VOTE] Apache Tika 1.6 release candidate #1

Hi Folks,

A candidate for the Tika 1.6 release is available at:

http://people.apache.org/~mattmann/apache-tika-1.6/rc1/


The release candidate is a zip archive of the sources in:

 http://svn.apache.org/repos/asf/tika/tags/1.6/

The SHA1 checksum of the archive is
076ad343be56a540a4c8e395746fa4fda5b5b6d3.

A Maven staging repository is available at:

https://repository.apache.org/content/repositories/orgapachetika-1003/


Please vote on releasing this package as Apache Tika 1.6.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

 [ ] +1 Release this package as Apache Tika 1.6
 [ ] -1 Do not release this package because...

Thank you!

Cheers,
Chris

P.S. Here is my +1!










Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-28 Thread Tyler Palsulich
+1

OSX 10.9.3, Java 1.7

Tyler


On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B. 
wrote:

> +1
>
> Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
> Windows 7, Java 1.7
>
> I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
> (all formats) plus all available msoffice-x files in govdocs1, yielding
> 10,413 docs.  There were several improvements in text extraction for PDFs
> (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).
>
> There was one regression:
> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>
> Stacktrace:
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
> at java.lang.String.checkBounds(String.java:371)
> at java.lang.String.<init>(String.java:415)
> at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
> at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
> at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
> at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
>
>
> -----Original Message-----
> From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
> Sent: Monday, July 28, 2014 12:22 AM
> To: dev@tika.apache.org
> Cc: u...@tika.apache.org
> Subject: [VOTE] Apache Tika 1.6 release candidate #1
>
> Hi Folks,
>
> A candidate for the Tika 1.6 release is available at:
>
> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>
>
> The release candidate is a zip archive of the sources in:
>
> http://svn.apache.org/repos/asf/tika/tags/1.6/
>
> The SHA1 checksum of the archive is
> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>
> A Maven staging repository is available at:
>
> https://repository.apache.org/content/repositories/orgapachetika-1003/
>
>
> Please vote on releasing this package as Apache Tika 1.6.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.6
> [ ] -1 Do not release this package because...
>
> Thank you!
>
> Cheers,
> Chris
>
> P.S. Here is my +1!
>
>
>
>
>
>


RE: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-28 Thread Allison, Timothy B.
+1

Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
Windows 7, Java 1.7

I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
(all formats) plus all available msoffice-x files in govdocs1, yielding
10,413 docs.  There were several improvements in text extraction for PDFs
(mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).

There was one regression:
http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx 

Stacktrace:
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
at java.lang.String.checkBounds(String.java:371)
at java.lang.String.<init>(String.java:415)
at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
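
A note on the trace above: the two topmost frames show the String constructor's bounds check rejecting a negative offset or length. One plausible reading, which is only an assumption here, is that a length field in the embedded OLE data was corrupt or misread. The following is a minimal sketch, not POI or Tika code, that reproduces this class of failure with the JDK alone and shows one way a caller might validate the values before decoding. The class and method names (NegativeLengthSketch, decodeChecked, uncheckedThrows) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class NegativeLengthSketch {

    // Hypothetical helper: validate offset/length before decoding the bytes,
    // so that a bad value fails with a descriptive message.
    public static String decodeChecked(byte[] buf, int offset, int length) {
        if (offset < 0 || length < 0 || offset > buf.length - length) {
            throw new IllegalArgumentException(
                "Bad offset/length " + offset + "/" + length
                + " for a buffer of " + buf.length + " bytes");
        }
        return new String(buf, offset, length, StandardCharsets.ISO_8859_1);
    }

    // Returns true if the plain String constructor rejects the arguments.
    public static boolean uncheckedThrows(byte[] buf, int offset, int length) {
        try {
            new String(buf, offset, length, StandardCharsets.ISO_8859_1);
            return false;
        } catch (IndexOutOfBoundsException e) {
            // StringIndexOutOfBoundsException is a subclass of this type.
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] buf = "abcd".getBytes(StandardCharsets.ISO_8859_1);
        // The negative value is taken from the trace; the buffer is made up.
        System.out.println(uncheckedThrows(buf, 0, -369073454));
        System.out.println(decodeChecked(buf, 1, 2));
    }
}
```

Whether a guard like this belongs in the extractor or in the library that reads the embedded object is a design question this sketch does not try to answer.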


-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Monday, July 28, 2014 12:22 AM
To: dev@tika.apache.org
Cc: u...@tika.apache.org
Subject: [VOTE] Apache Tika 1.6 release candidate #1

Hi Folks,

A candidate for the Tika 1.6 release is available at:

http://people.apache.org/~mattmann/apache-tika-1.6/rc1/


The release candidate is a zip archive of the sources in:

http://svn.apache.org/repos/asf/tika/tags/1.6/

The SHA1 checksum of the archive is
076ad343be56a540a4c8e395746fa4fda5b5b6d3.

A Maven staging repository is available at:

https://repository.apache.org/content/repositories/orgapachetika-1003/


Please vote on releasing this package as Apache Tika 1.6.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.6
[ ] -1 Do not release this package because...

Thank you!

Cheers,
Chris

P.S. Here is my +1!
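
For anyone checking the release candidate, the SHA1 checksum quoted in the vote email can be compared against a locally downloaded archive. The following is a minimal sketch using only the JDK; the class name Sha1CheckSketch is hypothetical, and the archive path and expected checksum are passed as arguments rather than assumed. It reads the whole file into memory, which is acceptable for a sketch but not for very large files.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1CheckSketch {

    // Hex-encode a digest in lowercase, the form used in the vote email.
    public static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Compute the SHA-1 digest of the given bytes as a hex string.
    public static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            return toHex(md.digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive
        // args[1]: expected checksum copied from the vote email
        String actual = sha1Hex(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(actual.equalsIgnoreCase(args[1])
            ? "OK" : "MISMATCH: " + actual);
    }
}
```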







Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-27 Thread Oleg Tikhonov
[x] +1 Release this package as Apache Tika 1.6.

Tested on the following systems:
1. Microsoft Windows 7 Enterprise, SP 1, x64-based PC
2. Linux ubuntu 3.11.0-24-generic #42-Ubuntu SMP x86_64 GNU/Linux

Thanks,
Oleg



On Mon, Jul 28, 2014 at 7:22 AM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Folks,
>
> A candidate for the Tika 1.6 release is available at:
>
> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>
>
> The release candidate is a zip archive of the sources in:
>
> http://svn.apache.org/repos/asf/tika/tags/1.6/
>
> The SHA1 checksum of the archive is
> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>
> A Maven staging repository is available at:
>
> https://repository.apache.org/content/repositories/orgapachetika-1003/
>
>
> Please vote on releasing this package as Apache Tika 1.6.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.6
> [ ] -1 Do not release this package because...
>
> Thank you!
>
> Cheers,
> Chris
>
> P.S. Here is my +1!
>
>
>
>
>
>
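
The passing condition in the vote email ("a majority of at least three +1 Tika PMC votes") can be sketched as a small tally. This is one interpretation, stated as an assumption: only binding (PMC) votes count, at least three of them must be +1, and binding +1 votes must outnumber binding -1 votes. The class VoteTallySketch and the ballot in main are hypothetical and do not reflect who in this thread holds a binding vote or how the vote ended.

```java
import java.util.Arrays;
import java.util.List;

public class VoteTallySketch {

    public static final class Vote {
        final int value;        // +1, 0 or -1
        final boolean binding;  // true if cast by a PMC member

        public Vote(int value, boolean binding) {
            this.value = value;
            this.binding = binding;
        }
    }

    // Assumed rule: at least three binding +1 votes, and more binding +1
    // votes than binding -1 votes. Non-binding votes are ignored.
    public static boolean passes(List<Vote> votes) {
        int plus = 0;
        int minus = 0;
        for (Vote v : votes) {
            if (!v.binding) {
                continue;
            }
            if (v.value > 0) {
                plus++;
            } else if (v.value < 0) {
                minus++;
            }
        }
        return plus >= 3 && plus > minus;
    }

    public static void main(String[] args) {
        // Hypothetical ballot for illustration only.
        List<Vote> ballot = Arrays.asList(
            new Vote(1, true), new Vote(1, true), new Vote(1, true),
            new Vote(0, true), new Vote(1, false));
        System.out.println(passes(ballot));
    }
}
```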