Re: [VOTE] Apache Tika 1.6 release candidate #1
Ok thanks

On Aug 31, 2014, at 1:35 PM, "Tyler Palsulich" wrote:

>> Commit it to trunk and then yes
>
> Already in there (thanks, Nick!).
Re: [VOTE] Apache Tika 1.6 release candidate #1
> Commit it to trunk and then yes

Already in there (thanks, Nick!).
Re: [VOTE] Apache Tika 1.6 release candidate #1
Commit it to trunk and then yes

> On Aug 31, 2014, at 1:11 PM, "Tyler Palsulich" wrote:
>
> Can we get TIKA-1404 in 1.6? Simple, but significant, fix.
>
> Tyler
Re: [VOTE] Apache Tika 1.6 release candidate #1
Can we get TIKA-1404 in 1.6? Simple, but significant, fix.

Tyler

On Aug 31, 2014 3:54 PM, "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov> wrote:

> Ugh, sorry. Maven release plugin issues, going to have to clean some
> stuff up here. Don't mind me folks.
Re: [VOTE] Apache Tika 1.6 release candidate #1
Ugh, sorry. Maven release plugin issues, going to have to clean some stuff up here. Don't mind me folks.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: , Chris Mattmann
Reply-To: "dev@tika.apache.org"
Date: Sunday, August 31, 2014 12:37 PM
To: "dev@tika.apache.org"
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>OK RC #2 coming up shortly, just brought the branch up to date in
>r1621623. Also cleaned up JIRA.
>
>Here goes..
Re: [VOTE] Apache Tika 1.6 release candidate #1
OK RC #2 coming up shortly, just brought the branch up to date in r1621623. Also cleaned up JIRA.

Here goes..

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: , Chris Mattmann
Date: Thursday, July 31, 2014 11:29 AM
To: "dev@tika.apache.org"
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>Guys, based on all the comments here, I am going to roll another
>RC #2 to address:
>
>- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
>- Dave's Lingo24 API plugin for translate
>- Nick's POI updates
>
>I'll roll another RC #2 probably on Monday.
>
>Thanks!
>
>Cheers,
>Chris
>
>P.S. When I do, I'll diff trunk against the branch and then roll any
>trunk updates post branch to 1.6 into the new 1.6 RC #2.
RE: [VOTE] Apache Tika 1.6 release candidate #1
Nick,

Just to be clear -- that wasn't a veiled complaint that you hadn't cut the 3.11-beta! I really just have not had a chance to start the run with my local build of poi-trunk.

Thank you, as always!

Best,

Tim

-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Thursday, July 31, 2014 3:06 PM
To: dev@tika.apache.org
Subject: RE: [VOTE] Apache Tika 1.6 release candidate #1

On Thu, 31 Jul 2014, Allison, Timothy B. wrote:
> On a related note, I did some digging on the one regression I found in
> the pptx, and that will be solved if we wait for POI 3.11 beta 1. I
> haven't yet had a chance to rerun on the random sample with the updated
> POI...

I'm currently on a train to France, but fingers crossed I'll be able to upload the POI 3.11 beta 1 artifacts for you to test with before I run out of English mobile phone signal...

Nick
RE: [VOTE] Apache Tika 1.6 release candidate #1
On Thu, 31 Jul 2014, Allison, Timothy B. wrote:
> On a related note, I did some digging on the one regression I found in
> the pptx, and that will be solved if we wait for POI 3.11 beta 1. I
> haven't yet had a chance to rerun on the random sample with the updated
> POI...

I'm currently on a train to France, but fingers crossed I'll be able to upload the POI 3.11 beta 1 artifacts for you to test with before I run out of English mobile phone signal...

Nick
RE: [VOTE] Apache Tika 1.6 release candidate #1
All,

On a related note, I did some digging on the one regression I found in the pptx, and that will be solved if we wait for POI 3.11 beta 1. I haven't yet had a chance to rerun on the random sample with the updated POI...

Best,

Tim

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Thursday, July 31, 2014 2:30 PM
To: dev@tika.apache.org
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

Guys, based on all the comments here, I am going to roll another RC #2 to address:

- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
- Dave's Lingo24 API plugin for translate
- Nick's POI updates

I'll roll another RC #2 probably on Monday.

Thanks!

Cheers,
Chris

P.S. When I do, I'll diff trunk against the branch and then roll any trunk updates post branch to 1.6 into the new 1.6 RC #2.
Re: [VOTE] Apache Tika 1.6 release candidate #1
Another quick thought on the artifacts in http://people.apache.org/~mattmann/apache-tika-1.6/rc1/ - as well as needing to ditch original-tika-app.jar, shouldn't we have the Tika Server standalone jar in there too as another released + easily downloadable jar?

Thanks
Nick

On 28/07/14 05:22, Mattmann, Chris A (3980) wrote:
> Hi Folks,
>
> A candidate for the Tika 1.6 release is available at:
> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/tika/tags/1.6/
>
> The SHA1 checksum of the archive is
> 076ad343be56a540a4c8e395746fa4fda5b5b6d3.
>
> A Maven staging repository is available at:
> https://repository.apache.org/content/repositories/orgapachetika-1003/
>
> Please vote on releasing this package as Apache Tika 1.6.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.6
> [ ] -1 Do not release this package because...
>
> Thank you!
>
> Cheers,
> Chris
>
> P.S. Here is my +1!
Re: [VOTE] Apache Tika 1.6 release candidate #1
Guys, based on all the comments here, I am going to roll another RC #2 to address:

- Tyler's comment about getting the MicrosoftTranslator fix incorporated.
- Dave's Lingo24 API plugin for translate
- Nick's POI updates

I'll roll another RC #2 probably on Monday.

Thanks!

Cheers,
Chris

P.S. When I do, I'll diff trunk against the branch and then roll any trunk updates post branch to 1.6 into the new 1.6 RC #2.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: , Chris Mattmann
Reply-To: "dev@tika.apache.org"
Date: Monday, July 28, 2014 11:45 AM
To: "dev@tika.apache.org"
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS
>thread for a few weeks about getting 1.6 out. Do you have a patch right
>now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2
>to get it in. If you don't have a patch yet, would you mind terribly if
>we pushed out 1.6, which already today has a ton of great updates, then
>shortly thereafter rolled a 1.7 (or did so when you finished with
>TIKA-1367)?
>
>Cheers,
>Chris
Re: [VOTE] Apache Tika 1.6 release candidate #1
Hi All,

After the recent NPE that Chris found (https://issues.apache.org/jira/browse/TIKA-1378), we should roll an RC#2.

Tyler

On Wed, Jul 30, 2014 at 10:55 AM, Nick Burch wrote:

> On Mon, 28 Jul 2014, Mattmann, Chris A (3980) wrote:
>
>> A candidate for the Tika 1.6 release is available at:
>> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
>
> Should original-tika-app-1.6.jar be in there? IIRC we decided in the 1.5
> release that it shouldn't be
>
>> Please vote on releasing this package as Apache Tika 1.6.
>> The vote is open for the next 72 hours and passes if a majority of at
>> least three +1 Tika PMC votes are cast.
>
> Otherwise I'm +1
>
> Nick
Re: [VOTE] Apache Tika 1.6 release candidate #1
On Mon, 28 Jul 2014, Mattmann, Chris A (3980) wrote:
> A candidate for the Tika 1.6 release is available at:
> http://people.apache.org/~mattmann/apache-tika-1.6/rc1/

Should original-tika-app-1.6.jar be in there? IIRC we decided in the 1.5 release that it shouldn't be

> Please vote on releasing this package as Apache Tika 1.6.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.

Otherwise I'm +1

Nick
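For readers unfamiliar with the passing rule quoted above, here is a minimal sketch of how a tally could be checked. It assumes the usual reading of "a majority of at least three +1 Tika PMC votes": at least three binding +1 votes and more binding +1 than binding -1 votes, with +0 votes not counted either way. The class and method names are hypothetical and are not part of Tika or any Apache tooling.

```java
/** Hypothetical helper illustrating the release-vote rule quoted in the thread. */
final class ReleaseVote {

    /**
     * Assumed rule: the vote passes when at least three binding (PMC) +1 votes
     * are cast and the binding +1 votes outnumber the binding -1 votes.
     * Votes of +0 are neither counted for nor against.
     */
    static boolean passes(int bindingPlusOnes, int bindingMinusOnes) {
        if (bindingPlusOnes < 0 || bindingMinusOnes < 0) {
            throw new IllegalArgumentException("vote counts must be non-negative");
        }
        return bindingPlusOnes >= 3 && bindingPlusOnes > bindingMinusOnes;
    }

    public static void main(String[] args) {
        // Illustrative counts only, not the actual tally of this thread.
        System.out.println(passes(3, 0));
        System.out.println(passes(2, 0));
    }
}
```

Under this reading a +0, such as the one cast in this thread over TIKA-1367, does not block the release.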
Re: [VOTE] Apache Tika 1.6 release candidate #1
Hi

On 29/07/14 13:14, Nick Burch wrote:
> On Mon, 28 Jul 2014, Sergey Beryozkin wrote:
>> This is not an issue that should block the release, I was careful not
>> to vote with a minus one. I've become a bit impatient, but no one
>> really blocks me from completing this pure documentation effort
>> myself, I was hoping that someone would do it first :-).
>
> Given that this is a documentation / website enhancement, I don't see
> any reason why we couldn't post the details for 1.6 (and even perhaps
> 1.5!) to the site in a few weeks time, irrespective of when the 1.6
> release goes out :)

Yes, you are right,

Cheers, Sergey

> Cheers
> Nick
RE: [VOTE] Apache Tika 1.6 release candidate #1
On Mon, 28 Jul 2014, Allison, Timothy B. wrote:
> There was one regression:
> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>
> Stacktrace:
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>     at java.lang.String.checkBounds(String.java:371)
>     at java.lang.String.<init>(String.java:415)
>     at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>     at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)

Any chance you could raise a POI bug for this? We're probably going to do the next POI beta release within a week, so if you hurry it might even get fixed in that... :)

Nick
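As an illustration of the kind of failure in the stack trace above, here is a sketch of one plausible way a negative value can reach the String constructor: a length field read from untrusted file bytes and used without validation. This is an assumption for illustration only. It is not POI's code, not the confirmed cause of this regression, and not the fix that went into POI; `LengthGuard` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical example: validate a length prefix before building a String from it. */
final class LengthGuard {

    /** Reads a signed 32-bit little-endian int starting at the given offset. */
    static int readIntLE(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    /**
     * Decodes a 4-byte little-endian length followed by that many ISO-8859-1 bytes.
     * A corrupt or unexpected record can yield a negative or oversized length, so
     * the value is checked here rather than left for String's bounds check to reject.
     */
    static String readPrefixedString(byte[] data, int offset) {
        if (offset < 0 || offset > data.length - 4) {
            throw new IllegalArgumentException("no room for a length field at offset " + offset);
        }
        int length = readIntLE(data, offset);
        int start = offset + 4;
        if (length < 0 || length > data.length - start) {
            throw new IllegalArgumentException("implausible string length: " + length);
        }
        return new String(data, start, length, StandardCharsets.ISO_8859_1);
    }
}
```

Failing early with a descriptive message makes it easier for a caller to treat the embedded object as unreadable and continue with the rest of the document, though whether that is appropriate depends on the parser.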
Re: [VOTE] Apache Tika 1.6 release candidate #1
On Mon, 28 Jul 2014, Sergey Beryozkin wrote:
> This is not an issue that should block the release, I was careful not to
> vote with a minus one. I've become a bit impatient, but no one really
> blocks me from completing this pure documentation effort myself, I was
> hoping that someone would do it first :-).

Given that this is a documentation / website enhancement, I don't see any reason why we couldn't post the details for 1.6 (and even perhaps 1.5!) to the site in a few weeks time, irrespective of when the 1.6 release goes out :)

Cheers
Nick
Re: [VOTE] Apache Tika 1.6 release candidate #1
Thank you Sergey! OK I will proceed. Thanks for your contributions to Tika and yes we'll get there

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Sergey Beryozkin
Reply-To: "dev@tika.apache.org"
Date: Monday, July 28, 2014 3:16 PM
To: "dev@tika.apache.org"
Subject: Re: [VOTE] Apache Tika 1.6 release candidate #1

>Hi Chris,
>
>This is not an issue that should block the release, I was careful not to
>vote with a minus one. I've become a bit impatient, but no one really
>blocks me from completing this pure documentation effort myself, I was
>hoping that someone would do it first :-).
>
>Please go ahead with the release as planned, thanks for offering the
>chance to delay the release, but I can not go for it, we'll get there as
>far as the documentation is concerned :-)
>
>Thanks, Sergey
Re: [VOTE] Apache Tika 1.6 release candidate #1
Hi Chris,

This is not an issue that should block the release, I was careful not to vote with a minus one. I've become a bit impatient, but no one really blocks me from completing this pure documentation effort myself, I was hoping that someone would do it first :-).

Please go ahead with the release as planned, thanks for offering the chance to delay the release, but I can not go for it, we'll get there as far as the documentation is concerned :-)

Thanks, Sergey
Re: [VOTE] Apache Tika 1.6 release candidate #1
Thanks Sergey - I pushed to 1.7 since we have been having a DISCUSS thread for a few weeks about getting 1.6 out.

Do you have a patch right now for TIKA-1367? If so I'm happy to incorporate it and roll an RC #2 to get it in. If you don't have a patch yet, would you mind terribly if we pushed out 1.6, which already today has a ton of great updates, then shortly thereafter rolled a 1.7 (or did so when you finished with TIKA-1367)?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: [VOTE] Apache Tika 1.6 release candidate #1
+0 given that it appears that the tika-parsers dependencies documentation issue has been pushed away. I'm getting confused why.

Thanks. Sergey

[1] https://issues.apache.org/jira/browse/TIKA-1367
Re: [VOTE] Apache Tika 1.6 release candidate #1
+1

OSX 10.9.3, Java 1.7

Tyler
RE: [VOTE] Apache Tika 1.6 release candidate #1
+1

Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
Windows 7, Java 1.7

I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs (all formats) plus all available msoffice-x files in govdocs1, yielding 10,413 docs. There were several improvements in text extraction for PDFs (mostly spacing) and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).

There was one regression:
http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx

Stacktrace:
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
    at java.lang.String.checkBounds(String.java:371)
    at java.lang.String.<init>(String.java:415)
    at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
    at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
    at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
    at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Monday, July 28, 2014 12:22 AM
To: dev@tika.apache.org
Cc: u...@tika.apache.org
Subject: [VOTE] Apache Tika 1.6 release candidate #1

Hi Folks,

A candidate for the Tika 1.6 release is available at:

http://people.apache.org/~mattmann/apache-tika-1.6/rc1/

The release candidate is a zip archive of the sources in:

http://svn.apache.org/repos/asf/tika/tags/1.6/

The SHA1 checksum of the archive is 076ad343be56a540a4c8e395746fa4fda5b5b6d3.

A Maven staging repository is available at:

https://repository.apache.org/content/repositories/orgapachetika-1003/

Please vote on releasing this package as Apache Tika 1.6. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.6
[ ] -1 Do not release this package because...

Thank you!

Cheers,
Chris

P.S. Here is my +1!
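For readers puzzling over the negative index in the stack trace: a value like -369073454 is what you get when a 4-byte length field read from a damaged embedded object has its top bit set and is then passed to a String constructor. The following is only a sketch of that failure mode and of a defensive bounds check. LengthPrefixedReader and readString are hypothetical names made up for illustration, this is not the POI Ole10Native code, and the sample bytes were chosen to decode to the same number as in the trace (the actual bytes in the pptx are not known from this thread).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not taken from POI or Tika.
final class LengthPrefixedReader {

    /**
     * Reads a 4-byte little-endian length at offset, then that many bytes
     * as ISO-8859-1 text. Lengths that are negative or that run past the
     * end of the buffer are rejected before any String is constructed.
     */
    static String readString(byte[] data, int offset) {
        if (offset < 0 || data.length - offset < 4) {
            throw new IllegalArgumentException("no room for a length field at offset " + offset);
        }
        int len = ByteBuffer.wrap(data, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        int start = offset + 4;
        if (len < 0 || len > data.length - start) {
            throw new IllegalArgumentException("implausible string length " + len);
        }
        return new String(data, start, len, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Well-formed input: length 2 followed by "hi".
        byte[] ok = {2, 0, 0, 0, 'h', 'i'};
        System.out.println(readString(ok, 0));

        // Damaged input: the length bytes decode to a negative int.
        byte[] corrupt = {(byte) 0xD2, 0x62, 0x00, (byte) 0xEA, 'x'};
        try {
            readString(corrupt, 0);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Without the check, the negative length would reach the String constructor and surface as a StringIndexOutOfBoundsException, as in the trace above. Whether the real fix belongs in POI or in Tika's embedded-object handling is not something this sketch tries to answer.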
Re: [VOTE] Apache Tika 1.6 release candidate #1
[x] +1 Release this package as Apache Tika 1.6.

Tested on the following systems:
1. Microsoft Windows 7 Enterprise, SP 1, x64-based PC
2. Linux ubuntu 3.11.0-24-generic #42-Ubuntu SMP x86_64 GNU/Linux

Thanks,
Oleg
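For anyone else testing the RC, the SHA1 checksum quoted in the vote mail can be checked with nothing but the JDK. This is a minimal sketch, not part of the Tika release tooling: Sha1Check is a made-up class name, and the archive path is passed as an argument because the file name is not given in this thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not part of Tika.
final class Sha1Check {

    /** Streams the input through SHA-1 and returns the digest as lowercase hex. */
    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive (assumed, not from the thread)
        // args[1]: expected checksum, e.g. the value from the vote mail
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha1Hex(in);
            System.out.println(actual.equalsIgnoreCase(args[1])
                    ? "checksum OK"
                    : "MISMATCH: " + actual);
        }
    }
}
```

Run it with the path to the downloaded zip as the first argument and 076ad343be56a540a4c8e395746fa4fda5b5b6d3 as the second. A matching checksum only shows the download is intact; checking the release signature is a separate step.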