Oh sorry, I thought I had sent this to the dev list. Forwarding...

Luis

---------- Forwarded message ----------
From: Allison, Timothy B. <talli...@mitre.org>
Date: 2017-12-07 14:10 GMT-02:00
Subject: RE: Tika 1.17?
To: "lfcnas...@gmail.com" <lfcnas...@gmail.com>

Agreed. Thank you! Do you mind sharing this with the list?

*From:* Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
*Sent:* Thursday, December 7, 2017 10:26 AM
*To:* Allison, Timothy B. <talli...@mitre.org>
*Subject:* RE: Tika 1.17?

Hi Tim,

I don't think it is a blocker, maybe a minor regression, given that we are much better off with 20x more fixed exceptions. I sent it just to make us aware. There are ~40 new exceptions with pdf and 20x more fixed ones, so my vote is to go for 1.17!

Luis

On Dec 7, 2017 11:47 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

Thank you, Luís!

Given where POI is in its dev cycle, should we go for a release of 1.17 now and then push for a 1.17.1 as soon as POI fixes this? Should we revert to 3.17-beta1? (Wait, we can't do this because of a bug that prevents parsing of pptx in Solr.) Or is this grave enough to wait a few months before we release 1.17?

I found a zip/mime detection issue that we need to fix at the Tika level, but that fix is trivial.

-----Original Message-----
From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
Sent: Wednesday, December 6, 2017 9:30 AM
To: dev@tika.apache.org
Subject: Re: Tika 1.17?

Hi Tim,

I've had a brief look at the exceptions folder. It seems we are much better with ppt (4677 fixed exceptions) and pdf (7798), but there are 208 new exceptions with ppt. I did not check the files to see if they are corrupted, but some common tokens were lost.

Below is the most common new stack trace:

org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 1000 on class class org.apache.poi.hslf.record.Document : java.lang.reflect.InvocationTargetException
Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 1010 on class class org.apache.poi.hslf.record.Environment : java.lang.reflect.InvocationTargetException
Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 2005 on class class org.apache.poi.hslf.record.FontCollection : java.lang.reflect.InvocationTargetException
Cause was : java.lang.IllegalArgumentException: typeface can't be null nor empty
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:186)
    at org.apache.poi.hslf.record.Record.buildRecordAtOffset(Record.java:104)
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read(HSLFSlideShowImpl.java:279)
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords(HSLFSlideShowImpl.java:260)
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>(HSLFSlideShowImpl.java:166)
    at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>(HSLFSlideShow.java:181)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:78)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:179)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
    at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
    at org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:406)
    at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:104)
    at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:181)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:115)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedConstructorAccessor283.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:182)
    ... 25 more
Caused by: org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 1010 on class class org.apache.poi.hslf.record.Environment : java.lang.reflect.InvocationTargetException
Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 2005 on class class org.apache.poi.hslf.record.FontCollection : java.lang.reflect.InvocationTargetException
Cause was : java.lang.IllegalArgumentException: typeface can't be null nor empty
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:186)
    at org.apache.poi.hslf.record.Record.findChildRecords(Record.java:129)
    at org.apache.poi.hslf.record.Document.<init>(Document.java:133)
    ... 29 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedConstructorAccessor285.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:182)
    ... 31 more
Caused by: org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the class for type with id 2005 on class class org.apache.poi.hslf.record.FontCollection : java.lang.reflect.InvocationTargetException
Cause was : java.lang.IllegalArgumentException: typeface can't be null nor empty
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:186)
    at org.apache.poi.hslf.record.Record.findChildRecords(Record.java:129)
    at org.apache.poi.hslf.record.Environment.<init>(Environment.java:54)
    ... 35 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedConstructorAccessor286.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:182)
    ... 37 more
Caused by: java.lang.IllegalArgumentException: typeface can't be null nor empty
    at org.apache.poi.hslf.usermodel.HSLFFontInfo.setTypeface(HSLFFontInfo.java:129)
    at org.apache.poi.hslf.usermodel.HSLFFontInfo.<init>(HSLFFontInfo.java:74)
    at org.apache.poi.hslf.record.FontCollection.<init>(FontCollection.java:47)
    ... 41 more

2017-12-05 21:44 GMT-02:00 Allison, Timothy B. <talli...@mitre.org>:

> Reports are here:
>
> http://162.242.228.174/reports/reports_Tika1_16V1_17.zip
>
> I haven't had a chance to look. Tomorrow...
>
> Let me know what you find.
>
> -----Original Message-----
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Wednesday, November 29, 2017 1:08 PM
> To: dev@tika.apache.org
> Subject: RE: Tika 1.17?
>
> +1
>
> -----Original Message-----
> From: Chris Mattmann [mailto:mattm...@apache.org]
> Sent: Wednesday, November 29, 2017 12:57 PM
> To: dev@tika.apache.org
> Subject: Re: Tika 1.17?
>
> Thanks so much for fixing this. It worked during MEMEX and then I think has since fallen out of date and perhaps I committed Zarana’s code wrong or something. Will be great to get this working!
>
> On 11/29/17, 9:54 AM, "David Meikle" <loo...@gmail.com> wrote:
>
> I am thinking TIKA-2385. I've got a resized image that I can commit tonight that should close this one off.
>
> Cheers,
> Dave
>
> On 29 Nov 2017 14:42, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> Many thanks to Bob for help on TIKA-2502!
>
> Anything else we want to put into 1.17 before I run the regression tests?
>
> -----Original Message-----
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Monday, November 13, 2017 1:42 PM
> To: dev@tika.apache.org
> Subject: RE: Tika 1.17?
>
> Y. You're right. Thank you!
>
> I think I've been avoiding that because there were some regressions in metadata-extractor last I looked at this. Let's hope those are gone in 2.10.1.
>
> -----Original Message-----
> From: Tyler Bui-Palsulich [mailto:tpalsul...@apache.org]
> Sent: Sunday, November 12, 2017 2:54 PM
> To: dev@tika.apache.org
> Subject: RE: Tika 1.17?
>
> TIKA-2486 might be worth blocking on since there is a CVE.
>
> Tyler
>
> On Nov 6, 2017 5:26 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> > Y. I'm happy enough to wait a few more days. I wasn't able to kick off the regression tests last week. Should I wait for the new parsers to run the regression tests?
> >
> > -----Original Message-----
> > From: David Meikle [mailto:loo...@gmail.com]
> > Sent: Friday, November 3, 2017 7:42 PM
> > To: dev@tika.apache.org
> > Subject: Re: Tika 1.17?
> >
> > Sounds good. I have a couple of new parsers I would like to slot in but have not had a chance in the last few months. Will go for it over the weekend, if that works for you, Tim.
> >
> > Cheers,
> > Dave
> >
> > On 3 November 2017 at 15:19, Mattmann, Chris A (3010) <chris.a.mattm...@jpl.nasa.gov> wrote:
> >
> > > Let’s make it so (
> > >
> > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > Chris Mattmann, Ph.D.
> > > Principal Data Scientist, Engineering Administrative Office (3010)
> > > Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
> > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > Office: 180-503E, Mailstop: 180-503
> > > Email: chris.a.mattm...@nasa.gov
> > > WWW: http://sunset.usc.edu/~mattmann/
> > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > Director, Information Retrieval and Data Science Group (IRDS)
> > > Adjunct Associate Professor, Computer Science Department
> > > University of Southern California, Los Angeles, CA 90089 USA
> > > WWW: http://irds.usc.edu/
> > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >
> > > On 11/3/17, 7:35 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
> > >
> > > All,
> > >
> > > PDFBox 2.0.8 is now integrated. I want to fix TIKA-2490 before we release 1.17. Are there other issues that are blockers or that you'd like to fix before 1.17 (TIKA-2471, maybe?)?
> > >
> > > I plan to run initial large-scale regression tests shortly for rfc822 and mbox because of TIKA-2478. I'll run the full regression tests before cutting the RC, but I want to focus on those for now. Other requests?
> > >
> > > Cheers,
> > >
> > > Tim