Re: checkstyle failures

2023-08-13 Thread Luís Filipe Nassif
Not sure, but maybe we could relax the mandatory rules a bit? This would
make contributions from external collaborators easier... Also for committers
who don't contribute too often; at least this causes some difficulties for me
too...
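
For illustration, the kind of pom.xml tweak being discussed might look like the
sketch below (hypothetical values; the real Tika checkstyle setup may differ).
Turning on console output makes checkstyle print the offending file and line,
and a suppressions file is one way to relax mandatory rules:

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-checkstyle-plugin</artifactId>
      <configuration>
        <!-- ruleset and suppressions locations are illustrative -->
        <configLocation>checkstyle.xml</configLocation>
        <suppressionsLocation>checkstyle-suppressions.xml</suppressionsLocation>
        <!-- report violations with file/line on the console -->
        <consoleOutput>true</consoleOutput>
      </configuration>
    </plugin>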



On Sun, Aug 13, 2023, 01:03 Tilman Hausherr wrote:

> The checkstyle failures are weird. Here's a fix of a failure:
>
>
> https://github.com/apache/tika/commit/50a03d85835405338d47e29fb453c8bc274eac79
>
>
> It makes the code worse IMHO. What's even more difficult is that
> checkstyle doesn't report where the error is. There is a report in the
> "site" directory but that one has hundreds of failures.
>
> I tried to make changes in the config in the pom.xml but failed. I
> didn't want to spend more time on this so I committed my changes slowly
> (which is a habit of mine anyway) to see which file would fail.
>
> Tilman
>
>


Re: next releases -- 2.4.1 and 1.28.4

2022-06-12 Thread Luís Filipe Nassif
+1 from me

On Tue, Jun 7, 2022, 12:09 Tim Allison wrote:

> All,
>
> Any objections to starting the release processes for 1.x and 2.x in the
> next few days?  Any blockers?  Anything we should wait for?
>
> Thank you, all.
>
> Cheers,
>
> Tim
>


Re: Re: [VOTE] Release Apache Tika 1.28.2 Candidate #2

2022-05-02 Thread Luís Filipe Nassif
Just got a successful build on Windows 10 JDK 11 after fixing line endings
of testVCalendar.vcs

+1

On Mon, May 2, 2022 at 11:33 AM Tim Allison wrote:

> > Maybe we should configure that explicitly in .gitattributes in the repo.
>
> +1 go for it!
>
> On Mon, May 2, 2022 at 10:11 AM Luís Filipe Nassif 
> wrote:
> >
> > You're right Tim! Those test files are using \r\n. I'll set EOL to \n,
> > clone from scratch and try again. Maybe we should configure that
> explicitly
> > in .gitattributes in the repo.
> >
> > On Mon, May 2, 2022 at 10:44 AM Tim Allison wrote:
> >
> > > Hahahaha...doh.  I've gotten those kinds of errors before on Windows
> > > when I hadn't set git global EOL to "\n".  There's a chance that git
> > > has converted the line endings in those test files to \r\n.  Can you
> > > open the test files in a hex editor and see what the line endings look
> > > like?
> > >
> > > We need to improve documentation on line endings on Windows.
> > >
> > > On Mon, May 2, 2022 at 9:40 AM Luís Filipe Nassif wrote:
> > > >
> > > > Just got these build failures on Windows 10 JDK 11:
> > > >
> > > > [ERROR] Failures:
> > > > [ERROR]   TextAndCSVParserTest.testSubclassingMimeTypesRemain:217
> > > > expected:<...-vcalendar; charset=[ISO-8859-1]> but
> was:<...-vcalendar;
> > > > charset=[windows-1252]>
> > > > [ERROR]   TXTParserTest.testSubclassingMimeTypesRemain:299
> > > > expected:<...-vcalendar; charset=[ISO-8859-1]> but
> was:<...-vcalendar;
> > > > charset=[windows-1252]>
> > > >
> > > >
> > > > On Mon, May 2, 2022 at 8:15 AM Tim Allison <talli...@apache.org> wrote:
> > > >
> > > > > Thank you, Tilman!
> > > > >
> > > > > I'll give it a few more hours in case anyone wants to -1 it.
> > > > >
> > > > > On Mon, May 2, 2022 at 6:20 AM Tilman Hausherr <
> thaush...@t-online.de>
> > > > > wrote:
> > > > > >
> > > > > >  Go with what we have
> > > > > > Tilman
> > > > > >
> > > > > >
> > > > > >
> > > > > > --- Original Message ---
> > > > > > From: Tim Allison
> > > > > > Subject: Re: [VOTE] Release Apache Tika 1.28.2 Candidate #2
> > > > > > Date: 02 May 2022, 12:15
> > > > > > To: 
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > I confirmed my memory of TikaServerIntegrationTest. The difference is
> > > > > > that in 1.x, we were still using threads to start the server and then
> > > > > > hoping that thread.interrupt() actually shut the process down in a
> > > > > > reasonable amount of time. In 2.x, we're starting processes and then
> > > > > > force quitting those after each test. In my experience, this got rid
> > > > > > of the flaky tests in 2.x
> > > > > >
> > > > > > We didn't change any of the underlying server logic in 1.x. This
> is a
> > > > > > problem of flaky tests.
> > > > > >
> > > > > > Given that we have the 1.x branch through September, I'll
> convert the
> > > > > > thread based tests to process based tests shortly. The question
> is:
> > > > > > should we roll an rc3 or go with what we have.
> > > > > >
> > > > > > Thank you, all.
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Tim
> > > > > >
> > > > > > On Sat, Apr 30, 2022 at 9:10 AM Tim Allison <talli...@apache.org> wrote:
> > > > > > >
> > > > > > > That's exactly my understanding as well. That test is not
> flaky in
> > > 2x
> > > > > > > because of modifications I made to the integration tests in 2x.
> > > > > > >
> > > > > > > In 2x, the Solr tests can be flaky, and there's an open issue
> for
> > > > > > > that. I don't like flaky tests. Sorry.
> > 

Re: Re: [VOTE] Release Apache Tika 1.28.2 Candidate #2

2022-05-02 Thread Luís Filipe Nassif
You're right Tim! Those test files are using \r\n. I'll set EOL to \n,
clone from scratch and try again. Maybe we should configure that explicitly
in .gitattributes in the repo.
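
For the record, the kind of configuration being discussed is sketched below
(the glob and the global git settings are illustrative; the exact patterns
would need to match the affected test files):

    # .gitattributes -- keep LF in the checkout for the affected test files
    *.vcs text eol=lf

    # and/or the per-developer global settings Tim mentions
    git config --global core.autocrlf false
    git config --global core.eol lf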

On Mon, May 2, 2022 at 10:44 AM Tim Allison wrote:

> Hahahaha...doh.  I've gotten those kinds of errors before on Windows
> when I hadn't set git global EOL to "\n".  There's a chance that git
> has converted the line endings in those test files to \r\n.  Can you
> open the test files in a hex editor and see what the line endings look
> like?
>
> We need to improve documentation on line endings on Windows.
>
> On Mon, May 2, 2022 at 9:40 AM Luís Filipe Nassif 
> wrote:
> >
> > Just got these build failures on Windows 10 JDK 11:
> >
> > [ERROR] Failures:
> > [ERROR]   TextAndCSVParserTest.testSubclassingMimeTypesRemain:217
> > expected:<...-vcalendar; charset=[ISO-8859-1]> but was:<...-vcalendar;
> > charset=[windows-1252]>
> > [ERROR]   TXTParserTest.testSubclassingMimeTypesRemain:299
> > expected:<...-vcalendar; charset=[ISO-8859-1]> but was:<...-vcalendar;
> > charset=[windows-1252]>
> >
> >
> > On Mon, May 2, 2022 at 8:15 AM Tim Allison wrote:
> >
> > > Thank you, Tilman!
> > >
> > > I'll give it a few more hours in case anyone wants to -1 it.
> > >
> > > On Mon, May 2, 2022 at 6:20 AM Tilman Hausherr 
> > > wrote:
> > > >
> > > >  Go with what we have
> > > > Tilman
> > > >
> > > >
> > > >
> > > > --- Original Message ---
> > > > From: Tim Allison
> > > > Subject: Re: [VOTE] Release Apache Tika 1.28.2 Candidate #2
> > > > Date: 02 May 2022, 12:15
> > > > To: 
> > > >
> > > >
> > > >
> > > >
> > > > I confirmed my memory of TikaServerIntegrationTest. The difference is
> > > > that in 1.x, we were still using threads to start the server and then
> > > > hoping that thread.interrupt() actually shut the process down in a
> > > > reasonable amount of time. In 2.x, we're starting processes and then
> > > > force quitting those after each test. In my experience, this got rid
> > > > of the flaky tests in 2.x
> > > >
> > > > We didn't change any of the underlying server logic in 1.x. This is a
> > > > problem of flaky tests.
> > > >
> > > > Given that we have the 1.x branch through September, I'll convert the
> > > > thread based tests to process based tests shortly. The question is:
> > > > should we roll an rc3 or go with what we have.
> > > >
> > > > Thank you, all.
> > > >
> > > > Best,
> > > >
> > > > Tim
> > > >
> > > > On Sat, Apr 30, 2022 at 9:10 AM Tim Allison <talli...@apache.org> wrote:
> > > > >
> > > > > That's exactly my understanding as well. That test is not flaky in
> 2x
> > > > > because of modifications I made to the integration tests in 2x.
> > > > >
> > > > > In 2x, the Solr tests can be flaky, and there's an open issue for
> > > > > that. I don't like flaky tests. Sorry.
> > > > >
> > > > > On Sat, Apr 30, 2022 at 6:28 AM Tilman Hausherr <thaush...@t-online.de> wrote:
> > > > > >
> > > > > > Now I had successful builds with jdk11 and jdk18. I think this is
> > > more
> > > > a
> > > > > > problem with the test than with the software. I remember when I
> > > started
> > > > > > with tika I wanted to make it possible to have tika1 and 2 work
> in
> > > > > > parallel and never managed to do it and then moved on to other
> > > things.
> > > > > > Something about unreliability of the server starting and
> stopping.
> > > > > >
> > > > > > Tilman
> > > > > >
> > > > > > On 30.04.2022 at 11:29, Tilman Hausherr wrote:
> > > > > > > [ERROR] Tests run: 12, Failures: 0, Errors: 1, Skipped: 2, Time
> > > > > > > elapsed: 102.543 s <<< FAILURE! - in
> > > > > > > org.apache.tika.server.TikaServerIntegrationTest
> > > > > > > [ERROR] org.apache.tika.server.TikaServerIntegrationTest.testSameServerIdAfterOOM
> > > > > > > Time elapsed: 5.694 s <<< ERROR!
> > > > > > > java.lang.IllegalStateException: Not a JSON Object: null
> > > > > > > at org.apache.tika.server.TikaServerIntegrationTest.getServerId(TikaServerIntegrationTest.java:280)
> > > > > > > at org.apache.tika.server.TikaServerIntegrationTest.testSameServerIdAfterOOM(TikaServerIntegrationTest.java:208)
> > > > > > >
> > > > > > > W10, jdk11, maven 3.8.5
> > > > > > >
> > > > > > >
> > > > > >
> > >
>


Re: Re: [VOTE] Release Apache Tika 1.28.2 Candidate #2

2022-05-02 Thread Luís Filipe Nassif
Just got these build failures on Windows 10 JDK 11:

[ERROR] Failures:
[ERROR]   TextAndCSVParserTest.testSubclassingMimeTypesRemain:217
expected:<...-vcalendar; charset=[ISO-8859-1]> but was:<...-vcalendar;
charset=[windows-1252]>
[ERROR]   TXTParserTest.testSubclassingMimeTypesRemain:299
expected:<...-vcalendar; charset=[ISO-8859-1]> but was:<...-vcalendar;
charset=[windows-1252]>


On Mon, May 2, 2022 at 8:15 AM Tim Allison wrote:

> Thank you, Tilman!
>
> I'll give it a few more hours in case anyone wants to -1 it.
>
> On Mon, May 2, 2022 at 6:20 AM Tilman Hausherr 
> wrote:
> >
> >  Go with what we have
> > Tilman
> >
> >
> >
> > --- Original Message ---
> > From: Tim Allison
> > Subject: Re: [VOTE] Release Apache Tika 1.28.2 Candidate #2
> > Date: 02 May 2022, 12:15
> > To: 
> >
> >
> >
> >
> > I confirmed my memory of TikaServerIntegrationTest. The difference is
> > that in 1.x, we were still using threads to start the server and then
> > hoping that thread.interrupt() actually shut the process down in a
> > reasonable amount of time. In 2.x, we're starting processes and then
> > force quitting those after each test. In my experience, this got rid
> > of the flaky tests in 2.x
> >
> > We didn't change any of the underlying server logic in 1.x. This is a
> > problem of flaky tests.
> >
> > Given that we have the 1.x branch through September, I'll convert the
> > thread based tests to process based tests shortly. The question is:
> > should we roll an rc3 or go with what we have.
> >
> > Thank you, all.
> >
> > Best,
> >
> > Tim
> >
> > On Sat, Apr 30, 2022 at 9:10 AM Tim Allison wrote:
> > >
> > > That's exactly my understanding as well. That test is not flaky in 2x
> > > because of modifications I made to the integration tests in 2x.
> > >
> > > In 2x, the Solr tests can be flaky, and there's an open issue for
> > > that. I don't like flaky tests. Sorry.
> > >
> > > On Sat, Apr 30, 2022 at 6:28 AM Tilman Hausherr wrote:
> > > >
> > > > Now I had successful builds with jdk11 and jdk18. I think this is
> more
> > a
> > > > problem with the test than with the software. I remember when I
> started
> > > > with tika I wanted to make it possible to have tika1 and 2 work in
> > > > parallel and never managed to do it and then moved on to other
> things.
> > > > Something about unreliability of the server starting and stopping.
> > > >
> > > > Tilman
> > > >
> > > > On 30.04.2022 at 11:29, Tilman Hausherr wrote:
> > > > > [ERROR] Tests run: 12, Failures: 0, Errors: 1, Skipped: 2, Time
> > > > > elapsed: 102.543 s <<< FAILURE! - in
> > > > > org.apache.tika.server.TikaServerIntegrationTest
> > > > > [ERROR] org.apache.tika.server.TikaServerIntegrationTest.testSameServerIdAfterOOM
> > > > > Time elapsed: 5.694 s <<< ERROR!
> > > > > java.lang.IllegalStateException: Not a JSON Object: null
> > > > > at org.apache.tika.server.TikaServerIntegrationTest.getServerId(TikaServerIntegrationTest.java:280)
> > > > > at org.apache.tika.server.TikaServerIntegrationTest.testSameServerIdAfterOOM(TikaServerIntegrationTest.java:208)
> > > > >
> > > > > W10, jdk11, maven 3.8.5
> > > > >
> > > > >
> > > >
>


Re: [VOTE] Release Apache Tika 2.4.0 Candidate #1

2022-05-02 Thread Luís Filipe Nassif
Hello,

+1. Just basic stuff, built on Windows 10, Liberica JDK 11.0.13 x64.

Thank you, Tim!

On Sat, Apr 30, 2022 at 5:27 AM David Meikle wrote:

> Hi
>
> On Fri, 29 Apr 2022 at 00:23, Tim Allison  wrote:
>
>>
>> The SHA-512 checksum of the archive is
>>
>> aff68637527fa4fa1ec21678ef2771a1dcd5eb3944bc1b1171c59459274295b903e093dc63ade0b6532bf137834d32bcb9cdf0d6a32efca187b9d6b8ac64f690.
>>
>> In addition, a staged maven repository is available here:
>>
>> https://repository.apache.org/content/repositories/orgapachetika-1085/org/apache/tika
>>
>> Please vote on releasing this package as Apache Tika 2.4.0.
>> The vote is open for the next 72 hours and passes if a majority of at
>> least three +1 Tika PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Tika 2.4.0
>> [ ] -1 Do not release this package because...
>>
>
> +1 - built on both Windows 11 (Java 11) and Ubuntu 22.04 (Java 11 and Java
> 17).
>
> Cheers,
> Dave
>


Re: [DISCUSS] support for Java 8?

2022-03-25 Thread Luís Filipe Nassif
sensible -> sensitive

On Fri, Mar 25, 2022, 21:15 Luís Filipe Nassif wrote:

> We are moving to java 11 because it's required by Lucene 9, that has some
> features we are interested in.
>
> We use TIKA as a library, using ForkParser to protect against catastrophic
> errors. And we are receiving a lot of illegal reflective access warnings
> because of some Tika dependencies, although they are just warnings. I don't
> expect them to be solved in dependencies in a short period of time...
>
> This is a sensible upgrade, we would need to pay attention to regression
> tests, things like TIKA-3596 could happen...
>
> I'm +0, except if java 11 has something we need.
>
> Cheers
>
> On Fri, Mar 25, 2022, 15:41 Konstantin Gribov wrote:
>
>> Hi, folks.
>>
>> I'm +1 to moving to jdk11. Even some distros dropping 1.8 lately so I
>> don't
>> have strong points against letting 11 be minimal version.
>>
>> We use mix of 11 as target and 11+17 as runtime with 1.8 for some legacy
>> applications like Nexus OSS.
>>
>> I'm interested to know how much of our downstream users run Tika as a
>> library and not in isolated context like tika-server/tika-pipes.
>>
>> On Fri, Mar 25, 2022 at 20:52, Eric Pugh wrote:
>>
>> > If Java 11 makes life easier, then shipping it for Tika 2 would make
>> sense
>> > to me.   Java 8 is well….old….
>> >
>> > > On Mar 25, 2022, at 12:04 PM, Tilman Hausherr 
>> > wrote:
>> > >
>> > > Weak +1 for keeping java 8 because it's long term supported by Oracle.
>> > > Tilman
>> > >
>> > > On 25.03.2022 at 15:46, Tim Allison wrote:
>> > >> All,
>> > >>   I'm somewhat interested in moving to require Java 11 to clean up
>> > >> some dependency stuff.  This is not a burning need.
>> > >>  I wanted to get a sense from our community. Do we still need to
>> > >> support 8?  If so, for how long?
>> > >>
>> > >>   Cheers,
>> > >>
>> > >> Tim
>> > >
>> > >
>> >
>> > ___
>> > Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
>> > http://www.opensourceconnections.com <
>> > http://www.opensourceconnections.com/> | My Free/Busy <
>> > http://tinyurl.com/eric-cal>
>> > Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
>> >
>> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw
>> >
>> >
>> > This e-mail and all contents, including attachments, is considered to be
>> > Company Confidential unless explicitly stated otherwise, regardless of
>> > whether attachments are marked as such.
>> >
>> >
>>
>


Re: [DISCUSS] support for Java 8?

2022-03-25 Thread Luís Filipe Nassif
We are moving to java 11 because it's required by Lucene 9, that has some
features we are interested in.

We use TIKA as a library, using ForkParser to protect against catastrophic
errors. And we are receiving a lot of illegal reflective access warnings
because of some Tika dependencies, although they are just warnings. I don't
expect them to be solved in dependencies in a short period of time...

This is a sensible upgrade, we would need to pay attention to regression
tests, things like TIKA-3596 could happen...

I'm +0, except if java 11 has something we need.

Cheers

On Fri, Mar 25, 2022, 15:41 Konstantin Gribov wrote:

> Hi, folks.
>
> I'm +1 to moving to jdk11. Even some distros dropping 1.8 lately so I don't
> have strong points against letting 11 be minimal version.
>
> We use mix of 11 as target and 11+17 as runtime with 1.8 for some legacy
> applications like Nexus OSS.
>
> I'm interested to know how much of our downstream users run Tika as a
> library and not in isolated context like tika-server/tika-pipes.
>
> On Fri, Mar 25, 2022 at 20:52, Eric Pugh wrote:
>
> > If Java 11 makes life easier, then shipping it for Tika 2 would make
> sense
> > to me.   Java 8 is well….old….
> >
> > > On Mar 25, 2022, at 12:04 PM, Tilman Hausherr 
> > wrote:
> > >
> > > Weak +1 for keeping java 8 because it's long term supported by Oracle.
> > > Tilman
> > >
> > > On 25.03.2022 at 15:46, Tim Allison wrote:
> > >> All,
> > >>   I'm somewhat interested in moving to require Java 11 to clean up
> > >> some dependency stuff.  This is not a burning need.
> > >>  I wanted to get a sense from our community. Do we still need to
> > >> support 8?  If so, for how long?
> > >>
> > >>   Cheers,
> > >>
> > >> Tim
> > >
> > >
> >
> > ___
> > Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> > http://www.opensourceconnections.com <
> > http://www.opensourceconnections.com/> | My Free/Busy <
> > http://tinyurl.com/eric-cal>
> > Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
> >
> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw
> >
> >
> > This e-mail and all contents, including attachments, is considered to be
> > Company Confidential unless explicitly stated otherwise, regardless of
> > whether attachments are marked as such.
> >
> >
>


Re: [DISCUSS] upgrading log4j to to log4j2 in Tika's 1.x branch

2021-12-15 Thread Luís Filipe Nassif
Great, Thank you, Tim!

On Wed, Dec 15, 2021 at 4:50 PM Tim Allison wrote:

> I've merged Lewis's edits to the README and added the EOL.  Let's do
> what both Konstantin and Nick recommend: README, notifications to
> user/dev lists x months out and include EOL in all release messages?
>
> Please let me know/edit the README if there are other improvements we
> should make.
>
> Thank you, all!
>
> Cheers,
>
>  Tim
>
> On Wed, Dec 15, 2021 at 1:20 PM Konstantin Gribov 
> wrote:
> >
> > My +1 to EOL on September 30, 2022 with effective backport submission
> > freeze 3 months before that.
> >
> > I think it would be better if we mention the EOL timeline at least in 3
> > places: in each release announcement, in README and on the site (on the
> > main page or in release news articles). Different downstream users look
> at
> > different sources, so more visibility seems to be a good idea to me. I
> saw
> > a lot of projects still using log4j 1.2.x in the wild and have a feeling
> > that it's partially due to lack of visibility about its EOL.
> >
> > Also we can send a message to announce@a.o (if it's not discouraged by
> ASF
> > policies, I don't recall if somebody did something similar before),
> > user@tika.a.o and dev@tika.a.o 6 and 3 months before EOL date.
> >
> > --
> > Best regards,
> > Konstantin Gribov.
> >
> >
> > On Wed, Dec 15, 2021 at 9:00 PM Nick Burch  wrote:
> >
> > > On Wed, 15 Dec 2021, Tim Allison wrote:
> > > > Sounds good, Nick.  Unless there are objections, I'll add an EOL
> > > > September 30, 2022 for the 1.x branch on our github README and maybe
> our
> > > > site somewhere?
> > >
> > > Maybe just mention it in the news section at the end any 1.x fix
> releases?
> > >
> > > Nick
> > >
>


Re: [DISCUSS] upgrading log4j to to log4j2 in Tika's 1.x branch

2021-12-14 Thread Luís Filipe Nassif
Sorry about the additional work, Tim. I thought upgrading from log4j-1.x to
2.x on Tika-1.x maybe could not be that hard and didn't know about breaking
changes.

Related to Eric's email, would we support Tika-1.x security updates for
some while (that was my intent with the proposal above)? Was this already
discussed?

Best regards,
Luis Filipe



On Mon, Dec 13, 2021 at 5:23 PM Tim Allison wrote:

> Yes.  That was the reasoning behind my -0.  I don't think this will
> destroy our resources, but yes, please do migrate to 2.x asap.
>
>
> On Mon, Dec 13, 2021 at 3:13 PM Eric Pugh
>  wrote:
> >
> > Isn’t the goal of Tika 2 to mean that we no longer work on Tika 1?
>  Does the Tika community have enough developer bandwidth to continue to
> maintain Tika 1 while also pushing forward on Tika 2?
> >
> > I worry that we’ll fall into that situation where people just end up
> using Tika 1 for forever, especially if there are new updates to it that
> are happening, which then encourages folks not to move to Tika 2.
> >
> >
> >
> >
> > > On Dec 13, 2021, at 2:49 PM, Tim Allison  wrote:
> > >
> > > Sounds like 2 +1 to my -0. :D  I'll start working on this now.
> > >
> > > On Mon, Dec 13, 2021 at 2:09 PM Nicholas DiPiazza
> > >  wrote:
> > >>
> > >> I prefer upgrade to log4j2
> > >>
> > >> On Mon, Dec 13, 2021, 12:05 PM Tim Allison 
> wrote:
> > >>
> > >>> All,
> > >>>  I'm currently in the process of building the rc1 for Tika 2.x. On
> > >>> TIKA-3616, Luís Filipe Nassif asked if we could upgrade log4j to
> > >>> log4j2 in the 1.x branch.  I think we avoided that because it would
> be
> > >>> a breaking change(?).  There are security vulns in log4j and it hit
> > >>> EOL
> > >>> in August 2015.
> > >>>  Should we upgrade the Tika 1.x branch for log4j2?
> > >>>
> > >>>  Best,
> > >>>
> > >>>   Tim
> > >>>
> > >>>
> > >>> [1]
> > >>>
> https://issues.apache.org/jira/browse/TIKA-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17457595#comment-17457595
> > >>>
> >
> > ___
> > Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com <
> http://www.opensourceconnections.com/> | My Free/Busy <
> http://tinyurl.com/eric-cal>
> > Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw
> >
> > This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless of
> whether attachments are marked as such.
> >
>


Re: new committer: Nicholas DiPiazza

2021-06-03 Thread Luís Filipe Nassif
Welcome on board, Nicholas. Great work!

Best regards,
Luis Filipe Nassif

On Thu, Jun 3, 2021 at 4:00 PM Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:

> Hi Everyone!
>
> Happy to be one of the committers for Tika!
>
> My name is Nicholas DiPiazza - I reside in Madison, Wisconsin USA. My name
> is Sicilian in origin, and I look Italian... but I'm actually 50% Irish!
>
> I started doing Tika contributions through my work at Lucidworks (
> https://lucidworks.com) building connectors that grab content from various
> data sources such as SharePoint, Google Drive, OneDrive, Alfresco, etc.,
> parsing it using Apache Tika, and eventually indexing it into Solr.
>
> I primarily do back-end Java work but also do work in various languages and
> frameworks. Most recently I have been doing a lot in Scala and Spark.
>
> I have been having a lot of fun making Tika work at a massive scale inside
> Kube containers! I scraped together a homemade version of the Tika Pipes
> project in Tika 1.x to meet some needs I had, and then later collaborated
> with Tim Allison to get this into Tika 2.x. Super stoked to get this in a
> major version of Tika.
>
> Other stuff about me:
>
> I play drums for a metal band called Wake and Prevail
> https://www.reverbnation.com/wakeandprevail although the Covid situation
> has put music on hold indefinitely, I still jam to albums in my basement
> regularly.
>
> I play Starcraft 2 in my spare time, but am stuck in the Diamond League as
> I don't want to hurt my fingers/wrists getting my APM any higher.
>
> I prefer Ubuntu, Windows then Mac in that order. My Mac is actually in a
> box back from when I moved and I have managed not to need it for several
> months now.
>
>
> Looking forward to doing even more contributions throughout the next couple
> years, in particular improving our DWG support and improving the OneNote
> parsing. And hoping to create Tika Pipes tutorials hopefully to help get
> lots of people using that feature so we can get lots of contributions to
> improve it.
>
> Thanks!
> -Nichiolas
>
> On Thu, Jun 3, 2021 at 1:18 PM Tim Allison  wrote:
>
> > The Project Management Committee (PMC) for Apache Tika
> > has invited Nicholas DiPiazza to become a committer and we are pleased
> > to announce that he has accepted.
> >
> > Nicholas has made numerous contributions including the OneNoteParser,
> > and, more recently, the Solr pipes modules.  We look forward to continued
> > collaboration to make Tika more robust and scaleable.
> >
> > Being a committer enables easier contribution to the
> > project since there is no need to go via the patch
> > submission process. This should enable better productivity.
> > Being a PMC member enables assistance with the management
> > and to guide the direction of the project.
> >
> > Welcome aboard, Nicholas!  Please share a bit about yourself.
> >
> > Cheers,
> >
> >Tim
> >
>


Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-26 Thread Luís Filipe Nassif
Yes, tika-server is the choice for the long run, as discussed in the recent
user's list thread. I hope I will have time in the future to migrate to it and
get rid of the jar hell problems for good...

On Thu, Nov 26, 2020 at 2:32 PM Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:

> I created a tika fork example I want to add to the documentation as well:
> https://github.com/nddipiazza/tika-fork-parser-example
>
> When we submit your fixes, we should update this example with
> multi-threading.
>
> On Thu, Nov 26, 2020 at 11:28 AM Nicholas DiPiazza <
> nicholas.dipia...@gmail.com> wrote:
>
>> Hey Luis,
>>
>> It is related because after your fixes I might be able to take some
>> significant performance advantage by switching to fork parser.
>> I would make great use of an example of someone else who has set up a
>> ForkParser multi-thread able processing program that can gracefully handle
>> the huge onslaught that is my use case.
>> But at this point, I doubt I'll switch from Tika Server anyways because I
>> invested some time creating a wrapper around it and it is performing very
>> well.
>>
>> On Wed, Nov 25, 2020 at 8:23 PM Luís Filipe Nassif 
>> wrote:
>>
>>> Not what you asked but related :)
>>>
>>> Luis
>>>
>>> On Wed, Nov 25, 2020, 23:20 Luís Filipe Nassif wrote:
>>>
>>> > I've done some few improvements in ForkParser performance in an
>>> internal
>>> > fork. Will try to contribute upstream...
>>> >
>>> > On Mon, Nov 23, 2020, 12:05 Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:
>>> >
>>> >> I am attempting to Tika parse dozens of millions of office documents.
>>> >> Pdfs,
>>> >> docs, excels, xmls, etc. Wide assortment of types.
>>> >>
>>> >> Throughput is very important. I need to be able parse these files in a
>>> >> reasonable amount of time, but at the same time, accuracy is also
>>> pretty
>>> >> important. I hope to have less than 10% of the documents parsed fail.
>>> (And
>>> >> by fail I mean fail due to tika stability, like a timeout while
>>> parsing. I
>>> >> do not mean fail due to the document itself).
>>> >>
>>> >> My question - how to configure Tika Server in a containerized
>>> environment
>>> >> to maximize throughput?
>>> >>
>>> >> My environment:
>>> >>
>>> >>- I am using Openshift.
>>> >>- Each tika parsing pod has *CPU: 2 cores to 2 cores*, and Memory:
>>> *8
>>> >>GiB to 10 GiB*.
>>> >>- I have 10 tika parsing pod replicas.
>>> >>
>>> >> On each pod, I run a java program where I have 8 parse threads.
>>> >>
>>> >> Each thread:
>>> >>
>>> >>- Starts a single tika server process (in spawn child mode)
>>> >>   - Tika server arguments: -s -spawnChild -maxChildStartupMillis
>>> >> 12
>>> >>   -pingPulseMillis 500 -pingTimeoutMillis 3 -taskPulseMillis
>>> 500
>>> >>   -taskTimeoutMillis 12 -JXmx512m -enableUnsecureFeatures
>>> >> -enableFileUrl
>>> >>- The thread will now continuously grab a file from the
>>> files-to-fetch
>>> >>queue and will send it to the tika server, stopping when there are
>>> no
>>> >> more
>>> >>files to parse.
>>> >>
>>> >> Each of these files are stored locally on the pod in a buffer, so the
>>> >> local
>>> >> file optimization is used:
>>> >>
>>> >> The Tika web service it is using is:
>>> >>
>>> >> Endpoint: `/rmeta/text`
>>> >> Method: `PUT`
>>> >> Headers:
>>> >>     - writeLimit = 3200
>>> >>     - maxEmbeddedResources = 0
>>> >>     - fileUrl = file:///path/to/file
>>> >>
>>> >> Files are no greater than 100Mb, the maximum number of bytes tika text
>>> >> will
>>> >> be (writeLimit) 32Mb.
>>> >>
>>> >> Each pod is parsing about 370,000 documents per day. I've been messing
>>> >> with
>>> >> a ton of different attempts at settings.
>>> >>
>>> >> I previously tried to use the actual Tika "ForkParser" but the
>>> performance
>>> >> was far worse than spawning tika servers. So that is why I am using
>>> Tika
>>> >> Server.
>>> >>
>>> >> I don't hate the performance results of this but I feel like I'd
>>> >> better
>>> >> reach out and make sure there isn't someone out there who sanity
>>> checks my
>>> >> numbers and is like "woah that's awful performance, you should be
>>> getting
>>> >> xyz like me!"
>>> >>
>>> >> Anyone have any similar things you are doing? If so, what settings
>>> did you
>>> >> end up settling on?
>>> >>
>>> >> Also, I'm wondering if Apache Http Client would be causing any
>>> overhead
>>> >> here when I am calling to my Tika Server /rmeta/text endpoint. I am
>>> using
>>> >> a
>>> >> shared connection pool. Would there be any benefit in say using a
>>> unique
>>> >> HttpClients.createDefault() for each thread instead of sharing a
>>> >> connection
>>> >> pool between the threads?
>>> >>
>>> >>
>>> >> Cross posted question here as well
>>> >>
>>> >>
>>> https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
>>> >>
>>> >
>>>
>>


Re: [ANNOUNCE] Welcome Peter Lee as Tika PMC member and committer

2020-11-26 Thread Luís Filipe Nassif
Thank you, Peter, for all your contributions and welcome!

On Wed, Nov 25, 2020 at 11:21 PM Chris Mattmann wrote:

> Welcome Peter! 
>
>
>
>
>
>
>
> *From: *Peter Lee 
> *Reply-To: *
> *Date: *Wednesday, November 25, 2020 at 6:08 PM
> *To: *"dev@tika.apache.org" , "talli...@apache.org" <
> talli...@apache.org>
> *Cc: *"u...@tika.apache.org" 
> *Subject: *Re: [ANNOUNCE] Welcome Peter Lee as Tika PMC member and
> committer
>
>
>
> Many thanks to you, Tim. :)
>
>
>
> Hi, all
>
>
>
> I'm Peter Lee and I was a Apache Commons committer. I'm familiar with many
> archivers and compressors. Feel free to ask me if you have some problems in
> compression.
>
>
>
> I'm honored to be part of Tika. Tika is great and it helped me a lot.
> Besides, Tika is a great community and it has helped a lot of users. I hope
> I can help Tika a little bit.
>
>
>
> Once again, thank you all for making such a great community!
>
>
>
> cheers,
>
> Lee
>
> On 11 25 2020, at 9:27, Tim Allison  wrote:
>
> All,
>
>
>
> The Tika PMC has elected to add Peter Lee to our ranks.
>
>
>
> Lee,
>
> Please introduce yourself, and welcome aboard!
>
>
>
> Cheers,
>
>
>
> Tim
>
>


Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-25 Thread Luís Filipe Nassif
Not what you asked but related :)

Luis

On Wed, Nov 25, 2020, 23:20 Luís Filipe Nassif wrote:

> I've done some few improvements in ForkParser performance in an internal
> fork. Will try to contribute upstream...
>
> On Mon, Nov 23, 2020, 12:05 Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:
>
>> I am attempting to Tika parse dozens of millions of office documents.
>> Pdfs,
>> docs, excels, xmls, etc. Wide assortment of types.
>>
>> Throughput is very important. I need to be able parse these files in a
>> reasonable amount of time, but at the same time, accuracy is also pretty
>> important. I hope to have less than 10% of the documents parsed fail. (And
>> by fail I mean fail due to tika stability, like a timeout while parsing. I
>> do not mean fail due to the document itself).
>>
>> My question - how to configure Tika Server in a containerized environment
>> to maximize throughput?
>>
>> My environment:
>>
>>- I am using Openshift.
>>- Each tika parsing pod has *CPU: 2 cores to 2 cores*, and Memory: *8
>>GiB to 10 GiB*.
>>- I have 10 tika parsing pod replicas.
>>
>> On each pod, I run a java program where I have 8 parse threads.
>>
>> Each thread:
>>
>>- Starts a single tika server process (in spawn child mode)
>>   - Tika server arguments: -s -spawnChild -maxChildStartupMillis
>> 12
>>   -pingPulseMillis 500 -pingTimeoutMillis 3 -taskPulseMillis 500
>>   -taskTimeoutMillis 12 -JXmx512m -enableUnsecureFeatures
>> -enableFileUrl
>>- The thread will now continuously grab a file from the files-to-fetch
>>queue and will send it to the tika server, stopping when there are no
>> more
>>files to parse.
>>
>> Each of these files are stored locally on the pod in a buffer, so the
>> local
>> file optimization is used:
>>
>> The Tika web service it is using is:
>>
>> Endpoint: `/rmeta/text`
>> Method: `PUT`
>> Headers:
>>     - writeLimit = 3200
>>     - maxEmbeddedResources = 0
>>     - fileUrl = file:///path/to/file
>>
>> Files are no greater than 100Mb, the maximum number of bytes tika text
>> will
>> be (writeLimit) 32Mb.
>>
>> Each pod is parsing about 370,000 documents per day. I've been messing
>> with
>> a ton of different attempts at settings.
>>
>> I previously tried to use the actual Tika "ForkParser" but the performance
>> was far worse than spawning tika servers. So that is why I am using Tika
>> Server.
>>
>> I don't hate the performance results of this but I feel like I'd
>> better
>> reach out and make sure there isn't someone out there who sanity checks my
>> numbers and is like "woah that's awful performance, you should be getting
>> xyz like me!"
>>
>> Anyone have any similar things you are doing? If so, what settings did you
>> end up settling on?
>>
>> Also, I'm wondering if Apache Http Client would be causing any overhead
>> here when I am calling to my Tika Server /rmeta/text endpoint. I am using
>> a
>> shared connection pool. Would there be any benefit in say using a unique
>> HttpClients.createDefault() for each thread instead of sharing a
>> connection
>> pool between the threads?
>>
>>
>> Cross posted question here as well
>>
>> https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
>>
>


Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-25 Thread Luís Filipe Nassif
I've done some few improvements in ForkParser performance in an internal
fork. Will try to contribute upstream...

On Mon, Nov 23, 2020, 12:05 Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:

> I am attempting to Tika parse dozens of millions of office documents. Pdfs,
> docs, excels, xmls, etc. Wide assortment of types.
>
> Throughput is very important. I need to be able parse these files in a
> reasonable amount of time, but at the same time, accuracy is also pretty
> important. I hope to have less than 10% of the documents parsed fail. (And
> by fail I mean fail due to tika stability, like a timeout while parsing. I
> do not mean fail due to the document itself).
>
> My question - how to configure Tika Server in a containerized environment
> to maximize throughput?
>
> My environment:
>
>- I am using Openshift.
>- Each tika parsing pod has *CPU: 2 cores to 2 cores*, and Memory: *8
>GiB to 10 GiB*.
>- I have 10 tika parsing pod replicas.
>
> On each pod, I run a java program where I have 8 parse threads.
>
> Each thread:
>
>- Starts a single tika server process (in spawn child mode)
>   - Tika server arguments: -s -spawnChild -maxChildStartupMillis 12
>   -pingPulseMillis 500 -pingTimeoutMillis 3 -taskPulseMillis 500
>   -taskTimeoutMillis 12 -JXmx512m -enableUnsecureFeatures
> -enableFileUrl
>- The thread will now continuously grab a file from the files-to-fetch
>queue and will send it to the tika server, stopping when there are no
> more
>files to parse.
>
> Each of these files are stored locally on the pod in a buffer, so the local
> file optimization is used:
>
> The Tika web service it is using is:
>
> Endpoint: `/rmeta/text`
> Method: `PUT`
> Headers:
>     - writeLimit = 3200
>     - maxEmbeddedResources = 0
>     - fileUrl = file:///path/to/file
>
> Files are no greater than 100Mb, the maximum number of bytes tika text will
> be (writeLimit) 32Mb.
>
> Each pod is parsing about 370,000 documents per day. I've been messing with
> a ton of different attempts at settings.
>
> I previously tried to use the actual Tika "ForkParser" but the performance
> was far worse than spawning tika servers. So that is why I am using Tika
> Server.
>
> I don't hate the performance results of this but I feel like I'd better
> reach out and make sure there isn't someone out there who sanity checks my
> numbers and is like "woah that's awful performance, you should be getting
> xyz like me!"
>
> Anyone have any similar things you are doing? If so, what settings did you
> end up settling on?
>
> Also, I'm wondering if Apache Http Client would be causing any overhead
> here when I am calling to my Tika Server /rmeta/text endpoint. I am using a
> shared connection pool. Would there be any benefit in say using a unique
> HttpClients.createDefault() for each thread instead of sharing a connection
> pool between the threads?
>
>
> Cross posted question here as well
>
> https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
>
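
For reference, the /rmeta/text call described in the quoted message maps to a
request roughly like the sketch below (assuming a local tika-server on its
default port 9998, started with -enableUnsecureFeatures -enableFileUrl as in
the quoted setup; the header values are the ones quoted above):

    curl -X PUT http://localhost:9998/rmeta/text \
         -H "fileUrl: file:///path/to/file" \
         -H "maxEmbeddedResources: 0" \
         -H "writeLimit: 3200"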


Re: [ANNOUNCE] Welcome Tilman Hausherr as Tika PMC member and committer

2019-10-06 Thread Luís Filipe Nassif
Welcome, Tilman!

On Fri, Oct 4, 2019, 15:37 Tilman Hausherr wrote:

> On 04.10.2019 at 16:19, Tim Allison wrote:
> > All,
> >
> > The Tika PMC has elected to add Tilman Hausherr to our ranks.  Tilman,
> > please feel free to introduce yourself, and welcome aboard!
> >
> > Cheers,
> >
> >   Tim
>
> Hello everybody,
>
> Thanks for the honor. A bit about me: I'm from Germany (coincidentally,
> yesterday was our national holiday, 29 years of reunification), I'm 50+
> years old, studied CS at TU Berlin, still living in Berlin, and now
> working at an IT company where my main job is document capture /
> classification / processing for our clients. I have known tika mostly
> from the PDFBox issues. Because I know next to nothing about the tika
> code I'll probably focus on refactoring / build issues (e.g. version
> updates) / documentation.
>
> Best regards
>
> Tilman Hausherr
>
>


Re: 1.20?

2018-12-13 Thread Luís Filipe Nassif
Hi Tim,

Reading your great reports, I also saw some new exceptions with RAR files in
the likely-broken folder, but it seems Tika was able to extract some text from
them before. Do you know if those files are really broken, and why Tika
extracted text from them before?

Thank you,
Luis

On Thu, Dec 13, 2018 at 1:02 PM Tim Allison wrote:

> Reports are here:
>
> http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
>
> I'm going to revert the mp4 parser, and commit the few dependency
> upgrades I ran.
>
> The _major_ difference in content for ppt is explained by the
> duplication of header/footer info.  To confirm this, note that the
> values for "num_unique_tokens_a" and "num_unique_tokens_b" are
> identical for nearly all ppt->ppt, but there are far more tokens in
> "num_tokens_a" vs "num_tokens_b".
>
> I also see that we're losing content in x-java and x-groovy, etc., but
> that's because we're now suppressing the style markup that our parser
> was (incorrectly, IMHO, inserting) -- check the values in
> "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
> 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
> weight: 3 | family: 2
>
> In short, I think we're good to go.  Will roll rc1 later today or
> (more likely) tomorrow unless there are objections.
> On Mon, Dec 10, 2018 at 9:37 PM Tim Allison  wrote:
> >
> > Any blockers on 1.20?  I'm going to kick off the regression tests
> shortly.
> > On Fri, Nov 30, 2018 at 7:39 PM  wrote:
> > >
> > > Hi,
> > > On Wed, 21 Nov 2018 at 13:00, Tim Allison  wrote:
> > >
> > > > Dave,
> > > >   Should I try to get the Docker plugin working again?
> > > >
> > >
> > > That would be great. I think I may have went down the wrong path
> building
> > > an image at package time, as there doesn't seem to be an easy way to
> > > publish it as an Apache labelled org on Dockerhub unless it builds from
> > > source.
> > >
> > > I have some time over the weekend, so could update to where I got to
> and
> > > see what you think.
> > >
> > > Cheers,
> > > Dave
>


Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

2018-05-29 Thread Luís Filipe Nassif
Hi Tim,

Could you clarify the pros and cons between ForkParser (after your
refactoring) and TikaServer? Maybe we should send those to users list and
wiki...

Thanks

2018-05-29 16:27 GMT-03:00 Tim Allison :

> Ken,
>   Once TIKA-2653 is done and 1.19(?) is released, I'll propose switching
> ERH to the ForkParser.  There's also an open ticket for using tika-server.
> I think users should have both options.
>
> On Tue, May 29, 2018 at 3:25 PM, Tim Allison  wrote:
>
>> 1: CORRECTION: the ForkParser by itself (without my mods) will protect
>> against ooms, permanent hangs, and native lib crashing.  My proposed mods (on
>> TIKA-2653) only move the parser dependencies out of Solr's dependencies.
>>
>> 2: note: Also, note the discussion on where to place this information.
>> Cassandra Targett advocates putting this guidance in the main users' guide.
>>
>> On Tue, May 29, 2018 at 3:22 PM, Tim Allison  wrote:
>>
>>> Y, my mods to the ForkParser should make it more robust, and will help
>>> with OOMs, permanent hangs and native lib crashing.  But those changes are
>>> still in the works...
>>>
>>> On Tue, May 29, 2018 at 3:18 PM, Luís Filipe Nassif >> > wrote:
>>>
>>>> Hi Ken,
>>>>
>>>> Threads will not help with OutOfMemoryErrors or crashes caused by native
>>>> libs. ForkParser can help, after the refactoring started by Tim to
>>>> handle
>>>> some of its limitations. See TIKA-2653
>>>>
>>>> 2018-05-29 16:11 GMT-03:00 Ken Krugler :
>>>>
>>>> > Thanks for the ref, Tim.
>>>> >
>>>> > I’m curious why SolrCell doesn’t fire up threads when parsing docs
>>>> with
>>>> > Tika (or use the fork parser), to mitigate issues with hangs &
>>>> crashes?
>>>> >
>>>> > — Ken
>>>> >
>>>> > > On May 29, 2018, at 11:54 AM, Tim Allison 
>>>> wrote:
>>>> > >
>>>> > > All,
>>>> > >
>>>> > >  Over the weekend, Shawn Heisey very kindly drafted a wikipage
>>>> about the
>>>> > > challenges of using Solr's ExtractingRequestHandler and the
>>>> guidance to
>>>> > > avoid it in production.
>>>> > >
>>>> > >   I completely agree with this point, and I think that Shawn did a
>>>> very
>>>> > > nice job of capturing some of the challenges.  If you have any
>>>> feedback
>>>> > or
>>>> > > would like to make edits, see:
>>>> > >
>>>> > > https://wiki.apache.org/solr/RecommendCustomIndexingWithTika
>>>> > >
>>>> > >   Cheers,
>>>> > >
>>>> > > Tim
>>>> >
>>>> > 
>>>> > http://about.me/kkrugler
>>>> > +1 530-210-6378
>>>> >
>>>> >
>>>>
>>>
>>>
>>
>


Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

2018-05-29 Thread Luís Filipe Nassif
Related to this, do we have any guidance to help java users choosing
between ForkParser or TikaServer?

2018-05-29 16:18 GMT-03:00 Luís Filipe Nassif :

> Hi Ken,
>
> Threads will not help with OutOfMemoryErrors or crashes caused by native
> libs. ForkParser can help, after the refactoring started by Tim to handle
> some of its limitations. See TIKA-2653
>
> 2018-05-29 16:11 GMT-03:00 Ken Krugler :
>
>> Thanks for the ref, Tim.
>>
>> I’m curious why SolrCell doesn’t fire up threads when parsing docs with
>> Tika (or use the fork parser), to mitigate issues with hangs & crashes?
>>
>> — Ken
>>
>> > On May 29, 2018, at 11:54 AM, Tim Allison  wrote:
>> >
>> > All,
>> >
>> >  Over the weekend, Shawn Heisey very kindly drafted a wikipage about the
>> > challenges of using Solr's ExtractingRequestHandler and the guidance to
>> > avoid it in production.
>> >
>> >   I completely agree with this point, and I think that Shawn did a very
>> > nice job of capturing some of the challenges.  If you have any feedback
>> or
>> > would like to make edits, see:
>> >
>> > https://wiki.apache.org/solr/RecommendCustomIndexingWithTika
>> >
>> >   Cheers,
>> >
>> > Tim
>>
>> 
>> http://about.me/kkrugler
>> +1 530-210-6378
>>
>>
>


Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

2018-05-29 Thread Luís Filipe Nassif
Hi Ken,

Threads will not help with OutOfMemoryErrors or crashes caused by native
libs. ForkParser can help, after the refactoring started by Tim to handle
some of its limitations. See TIKA-2653
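
For anyone weighing the options, a minimal sketch of running parses through
ForkParser so that OOMs or native crashes stay in the forked JVMs (a sketch
only, assuming Tika 1.x on the classpath; pool size and the unlimited write
limit are illustrative choices):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.fork.ForkParser;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class ForkParserSketch {
        public static void main(String[] args) throws Exception {
            // parsing happens in separate JVMs managed by ForkParser
            ForkParser parser = new ForkParser(
                    ForkParserSketch.class.getClassLoader(), new AutoDetectParser());
            parser.setPoolSize(4); // number of forked JVMs kept alive
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                BodyContentHandler handler = new BodyContentHandler(-1);
                parser.parse(in, handler, new Metadata(), new ParseContext());
                System.out.println(handler.toString());
            } finally {
                parser.close();
            }
        }
    }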

2018-05-29 16:11 GMT-03:00 Ken Krugler :

> Thanks for the ref, Tim.
>
> I’m curious why SolrCell doesn’t fire up threads when parsing docs with
> Tika (or use the fork parser), to mitigate issues with hangs & crashes?
>
> — Ken
>
> > On May 29, 2018, at 11:54 AM, Tim Allison  wrote:
> >
> > All,
> >
> >  Over the weekend, Shawn Heisey very kindly drafted a wikipage about the
> > challenges of using Solr's ExtractingRequestHandler and the guidance to
> > avoid it in production.
> >
> >   I completely agree with this point, and I think that Shawn did a very
> > nice job of capturing some of the challenges.  If you have any feedback
> or
> > would like to make edits, see:
> >
> > https://wiki.apache.org/solr/RecommendCustomIndexingWithTika
> >
> >   Cheers,
> >
> > Tim
>
> 
> http://about.me/kkrugler
> +1 530-210-6378
>
>


Re: Tika 1.18?

2018-03-07 Thread Luís Filipe Nassif
I thought about logging any custom-mimetype override applied, so the user
will be warned about that. Maybe additionally creating a specific attribute in
the mimetype definition xml to indicate that it must override the default one
instead of aborting. As for multiple conflicting custom mimes from different
(external) projects, Tika currently aborts, and that is already a problem now.

So I think it needs additional discussion and should not be addressed in
the next release. Will copy/paste this discussion in the jira issue.

But I would like to see the detection of MTS videos fixed; it conflicts with
another existing mime glob. Any workaround for this specific case? If yes, I
can open a different ticket.
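
For context, the kind of custom-mimetypes.xml entry being discussed would look
roughly like the sketch below (the *.mts glob and the video/mp2t type are
illustrative only, and the override idea is shown just as a comment because no
such attribute exists in Tika today):

    <?xml version="1.0" encoding="UTF-8"?>
    <mime-info>
      <mime-type type="video/mp2t">
        <!-- today this glob clashes with a built-in pattern, so Tika aborts;
             the proposal above is some attribute on the glob (or a logged
             warning) that lets it override the default mapping instead -->
        <glob pattern="*.mts"/>
      </mime-type>
    </mime-info>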



On Mar 2, 2018, 18:23, "Nick Burch" <apa...@gagravarr.org> wrote:

On Fri, 2 Mar 2018, Luís Filipe Nassif wrote:

> If I make no progress on TIKA-1466 until 3/9, you can start the release
> process without it. But do you devs agree with the proposed change: allow
> overriding of glob patterns in custom-mimetypes.xml?
>

What happens if you have two different custom files which both claim the
same glob?

We have historically been a bit stricter about built-in types overriding,
in part to avoid people doing silly things by mistake, and in part to push
people a bit more towards contributing fixes/enhancements for built-in
types. I think the latter is less of a thing today, as we've a lot more
covered as standard, so it's just the former we need to worry about.

How do we help people know when they have conflicting overrides (possibly
from different projects), help them sensibly merge or turn off Tika
provided magic+definitions, and to alert them to when their copied +
customised version probably wants updating following a tika upgrade giving
a newer definition? Do a better job of those than we currently do now, then
I'm very happy to +1 it :)

Nick


Re: Tika 1.18?

2018-03-01 Thread Luís Filipe Nassif
I think we should work around TIKA-2591, and I would like to work
on TIKA-1466 (what do you think?) and fix TIKA-2568.

Cheers,
Luis



2018-03-01 13:24 GMT-03:00 Chris Mattmann :

> Same: makes perfect sense to me and let's do it ( I just updated (finally)
> Tika Python down
> stream to be based on the 1.16 Tika, I guess I should get it based on 1.17
> soon too (
>
> https://github.com/chrismattmann/tika-python/blob/master/tika/__init__.py#L17
>
> Cheers,
> Chris
>
> On 3/1/18, 5:16 AM, "Nick Burch"  wrote:
>
> On Thu, 1 Mar 2018, Allison, Timothy B. wrote:
> > There have been some important bug fixes, a few new capabilities, and
> > the upgrading of dependencies because of CVEs.  There are a bunch of
> > mime tickets from Andreas Meier that I’d like to get into 1.18.  Is
> > there anything else that is critical?
>
> I've had a busy few weeks, so haven't yet had a chance to try out my
> proposed multi-parser stuff for 2.x. I'll hopefully take a look next
> week,
> assuming even the fastest review cycle and everyone loving it, I can't
> see
> us being ready to all sign-off on those "2.x breaking changes" until
> probably April.
>
> Given that, doing an interim 1.x release soon makes sense to me!
>
> Nick
>
>
>


Re: Not-yet-broken breaking changes for Tika 2?

2018-02-07 Thread Luís Filipe Nassif
Mine too, but I know it is important for many use cases. Maybe adding to
XHtmlContentHandler some tracking of open tags and a new method to close
them?
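
Something along those lines could be sketched as a decorator around the
downstream handler rather than a change to XHTMLContentHandler itself (purely
illustrative; only ContentHandlerDecorator is existing Tika API here, the rest
is hypothetical):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import org.apache.tika.sax.ContentHandlerDecorator;
    import org.xml.sax.Attributes;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    // Hypothetical sketch: remember open elements so they can be closed
    // after a parser throws mid-document.
    public class TagClosingHandler extends ContentHandlerDecorator {
        private final Deque<String[]> open = new ArrayDeque<>();

        public TagClosingHandler(ContentHandler handler) {
            super(handler);
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            super.startElement(uri, localName, qName, atts);
            open.push(new String[] {uri, localName, qName});
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            super.endElement(uri, localName, qName);
            if (!open.isEmpty()) {
                open.pop();
            }
        }

        // Close whatever is still open, e.g. from a catch block.
        public void closeOpenTags() throws SAXException {
            while (!open.isEmpty()) {
                String[] e = open.pop();
                super.endElement(e[0], e[1], e[2]);
            }
        }
    }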

2018-02-07 12:59 GMT-02:00 Allison, Timothy B. :

> Do we worry about properly closing tags on an exception?
>
> 
> 
> 
> kaboom
> From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
> Sent: Monday, February 5, 2018 5:34 PM
> To: dev@tika.apache.org
> Subject: Re: Not-yet-broken breaking changes for Tika 2?
>
> From a forensic use case it is better just saying we are trying another
> parser and not resetting the content handler, because the first parser can
> extract relevant content before the exception.
>
> To not spool everything to temp files to re-read the stream, I think we
> can create an optional setinputstreamfactory() method in TikaInputStream,
> so the user can implement an InputStreamFactory interface with a
> getInputStream method, if he does not want to pay a performance hit with
> temp files for everything.
>
> Luis
>
> Em 5 de fev de 2018 4:52 PM, "Chris Mattmann" 
> escreveu:
>
> I think we should just say, OK now we're trying  a different parser
>
>
>
> On 2/5/18, 9:51 AM, "Allison, Timothy B."  wrote:
>
> To my mind, the real challenge is what to do with content that should
> be ignored...
>
> If the strategy is back-off-on-exception (try the DOCX parser, but if
> there's an exception, use the Zip parser), what do we do with the sax
> elements that have already been written?  Do we need a new handler type
> that has a reset() method?
>
> Or do we just say, hey, now we're trying a different parser...
>
>
> -Original Message-
> From: Mattmann, Chris A (1761) [mailto:chris.a.mattm...@jpl.nasa.gov]
> Sent: Monday, February 5, 2018 12:29 PM
> To: dev@tika.apache.org
> Subject: Re: Not-yet-broken breaking changes for Tika 2?
>
> Our solution is just to run the parser 2xyes I get it will induce
> overhead, but as a start, why not?
> In short just run through the stream 2x
>
> 
> ++
> Chris Mattmann, Ph.D.
> Associate Chief Technology and Innovation Officer, OCIO Manager,
> Advanced IT Research and Open Source Projects Office (1761) Manager, NSF
> and Open Source Programs and Applications Office (8212) NASA Jet Propulsion
> Laboratory Pasadena, CA 91109 USA
> Office: 180-503E, Mailstop: 180-502
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> 
> ++
> Director, Information Retrieval and Data Science Group (IRDS) Adjunct
> Associate Professor, Computer Science Department University of Southern
> California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> 
> ++
>
>
> On 2/5/18, 9:25 AM, "Nick Burch"  wrote:
>
> On Mon, 5 Feb 2018, Chris Mattmann wrote:
> > Let's have a go at implementing it! You know my thoughts (make
> it like
> > OODT ;) )\
>
> I'm still keen to hear how we can do the text content like OODT!
>
> I have tried to copy the OODT model for the proposed metadata case
> though
> :)
>
> Nick
>
> > On 2/5/18, 8:37 AM, "Nick Burch"  wrote:
> >
> >Ping - anyone got any thoughts on the proposed metadata parser
> stuff, and
> >any ideas on the content part?
> >
> >On Tue, 2 Jan 2018, Nick Burch wrote:
> >> On Thu, 26 Oct 2017, Chris Mattmann wrote:
> >>> On collision, the precedence order defines what key takes
> precedence and
> >>> _overwrites_ the other. Overwrite is but one option (you
> could save *all*
> >>> the values it’s a multi-valued key structure so…)
> >>
> >> OK, I think that's fine. I've had a go at updating the wiki
> for the metadata
> >> case:
> >> https://wiki.apache.org/tika/CompositeParserDiscussion#
> Supplementary.2FAdditive
> >> And example Tika Config settings for it
> >> https://wiki.apache.org/tika/CompositeParserDiscussion#
> line-20
> >> If people are happy with how that sounds/looks, I can have a
> stab at
> >> implementing it, as I *think* it's quite easy
> >>
> >>
> >> However... that still leaves the Context (XHTML SAX events)
> case to solve!
> >>
> >> Anyone have any ideas on how we can append to or
> cancel/reset the Content
> >> Handler series of SAX events when we move onto a second+
> parser for a file?
> >>
> >> Thanks
> >> Nick
> >>
> >>> On 10/26/17, 

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Luís Filipe Nassif
From a forensic use case, it is better to just say we are trying another
parser and not reset the content handler, because the first parser can
extract relevant content before the exception.

To not spool everything to temp files to re-read the stream, I think we can
create an optional setInputStreamFactory() method in TikaInputStream, so
the user can implement an InputStreamFactory interface with a
getInputStream method, if he does not want to pay a performance hit with
temp files for everything.
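
In code, the proposal would look something like the sketch below (hypothetical;
neither the interface nor the setter exists in Tika at this point, and the
names are just the ones suggested above):

    // Hypothetical sketch of the proposal above -- not existing Tika API.
    public interface InputStreamFactory {
        java.io.InputStream getInputStream() throws java.io.IOException;
    }

    // And on TikaInputStream, something like:
    //     public void setInputStreamFactory(InputStreamFactory factory)
    // so a second parser attempt can re-open the stream from its source
    // instead of spooling every stream to a temp file up front.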

Luis

On Feb 5, 2018 4:52 PM, "Chris Mattmann" wrote:

I think we should just say, OK now we're trying  a different parser



On 2/5/18, 9:51 AM, "Allison, Timothy B."  wrote:

To my mind, the real challenge is what to do with content that should
be ignored...

If the strategy is back-off-on-exception (try the DOCX parser, but if
there's an exception, use the Zip parser), what do we do with the sax
elements that have already been written?  Do we need a new handler type
that has a reset() method?

Or do we just say, hey, now we're trying a different parser...


-Original Message-
From: Mattmann, Chris A (1761) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Monday, February 5, 2018 12:29 PM
To: dev@tika.apache.org
Subject: Re: Not-yet-broken breaking changes for Tika 2?

Our solution is just to run the parser 2xyes I get it will induce
overhead, but as a start, why not?
In short just run through the stream 2x


++
Chris Mattmann, Ph.D.
Associate Chief Technology and Innovation Officer, OCIO Manager,
Advanced IT Research and Open Source Projects Office (1761) Manager, NSF
and Open Source Programs and Applications Office (8212) NASA Jet Propulsion
Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-502
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/

++
Director, Information Retrieval and Data Science Group (IRDS) Adjunct
Associate Professor, Computer Science Department University of Southern
California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/

++


On 2/5/18, 9:25 AM, "Nick Burch"  wrote:

On Mon, 5 Feb 2018, Chris Mattmann wrote:
> Let's have a go at implementing it! You know my thoughts (make it
like
> OODT ;) )\

I'm still keen to hear how we can do the text content like OODT!

I have tried to copy the OODT model for the proposed metadata case
though
:)

Nick

> On 2/5/18, 8:37 AM, "Nick Burch"  wrote:
>
>Ping - anyone got any thoughts on the proposed metadata parser
stuff, and
>any ideas on the content part?
>
>On Tue, 2 Jan 2018, Nick Burch wrote:
>> On Thu, 26 Oct 2017, Chris Mattmann wrote:
>>> On collision, the precedence order defines what key takes
precedence and
>>> _overwrites_ the other. Overwrite is but one option (you
could save *all*
>>> the values it’s a multi-valued key structure so…)
>>
>> OK, I think that's fine. I've had a go at updating the wiki
for the metadata
>> case:
>> https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
>> And example Tika Config settings for it
>> https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
>> If people are happy with how that sounds/looks, I can have a
stab at
>> implementing it, as I *think* it's quite easy
>>
>>
>> However... that still leaves the Context (XHTML SAX events)
case to solve!
>>
>> Anyone have any ideas on how we can append to or
cancel/reset the Content
>> Handler series of SAX events when we move onto a second+
parser for a file?
>>
>> Thanks
>> Nick
>>
>>> On 10/26/17, 9:43 AM, "Nick Burch" 
wrote:
>>>
>>>On Thu, 26 Oct 2017, Chris Mattmann wrote:
>>>> My general approach to conflicting metadata is simply
to define
>>>> precedence orders.
>>>>
>>>> For example here is one documented from OODT:
>>>>
>>>>
>>> https://cwiki.apache.org/confluence/display/OODT/
Understanding+CAS-PGE+Metadata+Precendence
>>>>
>>>> We can do similar things with Tika, e.g.,
>>>>
>>>> [CoreMetadata.PROPERTIES]
>>>> [ImageParser.METADATA]
>>> 

Re: [VOTE] Release Apache Tika 1.17 Candidate #2

2017-12-12 Thread Luís Filipe Nassif
All seems ok after integrating in our system and testing with our limited
regression corpus.

Luis

2017-12-11 13:13 GMT-02:00 Luís Filipe Nassif <lfcnas...@gmail.com>:

> Built on Windows 10 Pro with jdk 1.8.0_152 x64, all tests passed. So +1
> from me.
>
> PS: Running regression test on our 1M forensic test corpus...
>
> Luis
>
> 2017-12-08 22:43 GMT-02:00 Tim Allison <talli...@apache.org>:
>
>>
>>
>> On Friday, December 8, 2017, 7:43:05 PM EST, Tim Allison <
>> tallison_apa...@yahoo.com> wrote:
>>
>>
>> A candidate for the Tika 1.17 release is available at:
>>   https://dist.apache.org/repos/dist/dev/tika/
>>
>> The release candidate is a zip archive of the sources in:
>>   https://github.com/apache/tika/tree/1.17-rc2/
>>
>> The SHA1 checksum of the archive is
>>   c6a267956e82365c3a2b456819205763921f2f9d.
>>
>> In addition, a staged maven repository is available here:
>>   https://repository.apache.org/content/repositories/orgapache
>> tika-1028/org/apache/tika/
>>
>> Please vote on releasing this package as Apache Tika 1.17.
>> The vote is open for the next 72 hours and passes if a majority of at
>> least three +1 Tika PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Tika 1.17
>> [ ] -1 Do not release this package because...
>>
>> +1 for me
>>
>
>


Re: [VOTE] Release Apache Tika 1.17 Candidate #2

2017-12-11 Thread Luís Filipe Nassif
Built on Windows 10 Pro with jdk 1.8.0_152 x64, all tests passed. So +1
from me.

PS: Running regression test on our 1M forensic test corpus...

Luis

2017-12-08 22:43 GMT-02:00 Tim Allison :

>
>
> On Friday, December 8, 2017, 7:43:05 PM EST, Tim Allison <
> tallison_apa...@yahoo.com> wrote:
>
>
> A candidate for the Tika 1.17 release is available at:
>   https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
>   https://github.com/apache/tika/tree/1.17-rc2/
>
> The SHA1 checksum of the archive is
>   c6a267956e82365c3a2b456819205763921f2f9d.
>
> In addition, a staged maven repository is available here:
>   https://repository.apache.org/content/repositories/
> orgapachetika-1028/org/apache/tika/
>
> Please vote on releasing this package as Apache Tika 1.17.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.17
> [ ] -1 Do not release this package because...
>
> +1 for me
>


Re: Tika 1.17?

2017-12-08 Thread Luís Filipe Nassif
Yes, Tim, I saw all these reporting artifacts; I agree they are good things.

2017-12-08 14:32 GMT-02:00 Allison, Timothy B. <talli...@mitre.org>:

> Thank you, Luís.  I’ve finally had a chance to take a look.  As exceptions
> go, the PPT is the most eye-opening.  I don’t know how I didn’t catch
> those…ugh.
>
>
>
> There are a bunch more exceptions for zerobyte file exceptions in
> attachments, but this is a good thing, because now we can figure out if
> those are corrupt files, missing dependencies or something else…just a
> reporting artifact.
>
>
>
> There are a bunch more exceptions for emf/wmf caused by “safelyAllocate”,
> which, I think, is a good thing.  After the release, I’ll want to look at
> those to see if we need improvements in emf/wmf parsing, or if we need to
> bump the maximum expected byte lengths in the calls to safelyAllocate, or
> if the files are just plain corrupt.
>
>
>
> After I fix TIKA-2483, I think I’ll be good to roll rc1 for 1.17.
>
>
>
> Anything else holding us back?
>
>
>
> *From:* Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
> *Sent:* Thursday, December 7, 2017 1:18 PM
> *To:* dev@tika.apache.org; Allison, Timothy B. <talli...@mitre.org>
> *Subject:* Fwd: Tika 1.17?
>
>
>
> Oh sorry, I thought I have sent to dev list, forwarding...
>
>
>
> Luis
>
>
>
> -- Forwarded message --
> From: *Allison, Timothy B.* <talli...@mitre.org>
> Date: 2017-12-07 14:10 GMT-02:00
> Subject: RE: Tika 1.17?
> To: "lfcnas...@gmail.com" <lfcnas...@gmail.com>
>
> Agreed.  Thank you!  Do you mind sharing this with the list?
>
>
>
> *From:* Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
> *Sent:* Thursday, December 7, 2017 10:26 AM
> *To:* Allison, Timothy B. <talli...@mitre.org>
> *Subject:* RE: Tika 1.17?
>
>
>
> Hi Tim,
>
>
>
> I don't think it is a blocker, maybe a minor regression, given we are much
> better with 20x more fixed exceptions. I sent it just to let us be aware.
> There are some few ~40 new exceptions with pdf, and 20x more fixed ones, so
> my vote is to go for 1.17!
>
>
>
> Luis
>
>
>
>
>
> On Dec 7, 2017 11:47 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> Thank you, Luís!  Given where POI is in its dev cycle, should we go for a
> release of 1.17 now and then push for a 1.17.1 as soon as POI fixes this?
> Should we revert to 3.17-beta1? (wait, we can't do this because of a bug
> that prevents parsing of pptx in Solr)
>
> Or is this grave enough to wait a few months before we release 1.17?
>
> I found a zip/mime detection issue that we need to fix at the Tika level,
> but that fix is trivial.
>
>
> -Original Message-
> From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
> Sent: Wednesday, December 6, 2017 9:30 AM
> To: dev@tika.apache.org
> Subject: Re: Tika 1.17?
>
> Hi Tim,
>
> I've had a briefly look at exceptions folder, seems we are much better
> with ppt (4677 fixed exceptions) and pdf (7798), but there are 208 new
> exceptions with ppt. I did not check the files to see if they are
> corrupted, but some common tokens were lost. Below the most common new
> stacktrace:
>
> org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the
> class for type with id 1000 on class class org.apache.poi.hslf.record.Document
> :
> java.lang.reflect.InvocationTargetException
> Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't
> instantiate the class for type with id 1010 on class class
> org.apache.poi.hslf.record.Environment :
> java.lang.reflect.InvocationTargetException
> Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't
> instantiate the class for type with id 2005 on class class
> org.apache.poi.hslf.record.FontCollection :
> java.lang.reflect.InvocationTargetException
> Cause was : java.lang.IllegalArgumentException: typeface can't be null
> nor empty at org.apache.poi.hslf.record.Record.createRecordForType(
> Record.java:186)
> at org.apache.poi.hslf.record.Record.buildRecordAtOffset(Record.java:104)
> at
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read(
> HSLFSlideShowImpl.java:279)
> at
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords(
> HSLFSlideShowImpl.java:260)
> at
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.(
> HSLFSlideShowImpl.java:166)
> at
> org.apache.poi.hslf.usermodel.HSLFSlideShow.(HSLFSlideShow.java:181)
> at
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(
> HSLFExtractor.java:78)
> at
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java

Fwd: Tika 1.17?

2017-12-07 Thread Luís Filipe Nassif
Oh sorry, I thought I had sent this to the dev list, forwarding...

Luis

-- Forwarded message --
From: Allison, Timothy B. <talli...@mitre.org>
Date: 2017-12-07 14:10 GMT-02:00
Subject: RE: Tika 1.17?
To: "lfcnas...@gmail.com" <lfcnas...@gmail.com>


Agreed.  Thank you!  Do you mind sharing this with the list?



*From:* Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
*Sent:* Thursday, December 7, 2017 10:26 AM
*To:* Allison, Timothy B. <talli...@mitre.org>
*Subject:* RE: Tika 1.17?



Hi Tim,



I don't think it is a blocker, maybe a minor regression, given we are much
better with 20x more fixed exceptions. I sent it just to let us be aware.
There are some few ~40 new exceptions with pdf, and 20x more fixed ones, so
my vote is to go for 1.17!



Luis





On Dec 7, 2017 11:47 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

Thank you, Luís!  Given where POI is in its dev cycle, should we go for a
release of 1.17 now and then push for a 1.17.1 as soon as POI fixes this?
Should we revert to 3.17-beta1? (wait, we can't do this because of a bug
that prevents parsing of pptx in Solr)

Or is this grave enough to wait a few months before we release 1.17?

I found a zip/mime detection issue that we need to fix at the Tika level,
but that fix is trivial.


-Original Message-
From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
Sent: Wednesday, December 6, 2017 9:30 AM
To: dev@tika.apache.org
Subject: Re: Tika 1.17?

Hi Tim,

I've had a brief look at the exceptions folder; it seems we are much better with
ppt (4677 fixed exceptions) and pdf (7798), but there are 208 new
exceptions with ppt. I did not check the files to see if they are
corrupted, but some common tokens were lost. Below is the most common new
stacktrace:

org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the
class for type with id 1000 on class class org.apache.poi.hslf.record.Document
:
java.lang.reflect.InvocationTargetException
Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't
instantiate the class for type with id 1010 on class class
org.apache.poi.hslf.record.Environment :
java.lang.reflect.InvocationTargetException
Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't
instantiate the class for type with id 2005 on class class
org.apache.poi.hslf.record.FontCollection :
java.lang.reflect.InvocationTargetException
Cause was : java.lang.IllegalArgumentException: typeface can't be null nor
empty at org.apache.poi.hslf.record.Record.createRecordForType(
Record.java:186)
at org.apache.poi.hslf.record.Record.buildRecordAtOffset(Record.java:104)
at
org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read(
HSLFSlideShowImpl.java:279)
at
org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords(
HSLFSlideShowImpl.java:260)
at
org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.(
HSLFSlideShowImpl.java:166)
at
org.apache.poi.hslf.usermodel.HSLFSlideShow.(HSLFSlideShow.java:181)
at
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:78)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:179)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
at
org.apache.tika.parser.RecursiveParserWrapper.parse(
RecursiveParserWrapper.java:158)
at
org.apache.tika.batch.FileResourceConsumer.parse(
FileResourceConsumer.java:406)
at
org.apache.tika.batch.fs.RecursiveParserWrapperFSConsum
er.processFileResource(RecursiveParserWrapperFSConsumer.java:104)
at
org.apache.tika.batch.FileResourceConsumer._processFileResource(
FileResourceConsumer.java:181)
at
org.apache.tika.batch.FileResourceConsumer.call(
FileResourceConsumer.java:115)
at
org.apache.tika.batch.FileResourceConsumer.call(
FileResourceConsumer.java:50)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedConstructorAccessor283.newInstance(Unknown Source)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:182)
... 25 more
Caused by: org.apache.poi.hslf.exceptions.H

Re: Tika 1.17?

2017-12-06 Thread Luís Filipe Nassif
Hi Tim,

I've had a brief look at the exceptions folder; it seems we are much better with
ppt (4677 fixed exceptions) and pdf (7798), but there are 208 new
exceptions with ppt. I did not check the files to see if they are
corrupted, but some common tokens were lost. Below is the most common new
stacktrace:

org.apache.poi.hslf.exceptions.HSLFException: Couldn't instantiate the
class for type with id 1000 on class class
org.apache.poi.hslf.record.Document :
java.lang.reflect.InvocationTargetException
Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't
instantiate the class for type with id 1010 on class class
org.apache.poi.hslf.record.Environment :
java.lang.reflect.InvocationTargetException
Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't
instantiate the class for type with id 2005 on class class
org.apache.poi.hslf.record.FontCollection :
java.lang.reflect.InvocationTargetException
Cause was : java.lang.IllegalArgumentException: typeface can't be null nor
empty
at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:186)
at org.apache.poi.hslf.record.Record.buildRecordAtOffset(Record.java:104)
at
org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read(HSLFSlideShowImpl.java:279)
at
org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords(HSLFSlideShowImpl.java:260)
at
org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>(HSLFSlideShowImpl.java:166)
at
org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>(HSLFSlideShow.java:181)
at
org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:78)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:179)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
at
org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
at
org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:406)
at
org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:104)
at
org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:181)
at
org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:115)
at
org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedConstructorAccessor283.newInstance(Unknown Source)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:182)
... 25 more
Caused by: org.apache.poi.hslf.exceptions.HSLFException: Couldn't
instantiate the class for type with id 1010 on class class
org.apache.poi.hslf.record.Environment :
java.lang.reflect.InvocationTargetException
Cause was : org.apache.poi.hslf.exceptions.HSLFException: Couldn't
instantiate the class for type with id 2005 on class class
org.apache.poi.hslf.record.FontCollection :
java.lang.reflect.InvocationTargetException
Cause was : java.lang.IllegalArgumentException: typeface can't be null nor
empty
at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:186)
at org.apache.poi.hslf.record.Record.findChildRecords(Record.java:129)
at org.apache.poi.hslf.record.Document.<init>(Document.java:133)
... 29 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedConstructorAccessor285.newInstance(Unknown Source)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:182)
... 31 more
Caused by: org.apache.poi.hslf.exceptions.HSLFException: Couldn't
instantiate the class for type with id 2005 on class class
org.apache.poi.hslf.record.FontCollection :
java.lang.reflect.InvocationTargetException
Cause was : java.lang.IllegalArgumentException: typeface can't be null nor
empty
at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:186)
at org.apache.poi.hslf.record.Record.findChildRecords(Record.java:129)
at 

Re: [ANNOUNCE] Welcome Madhav Sharan as Tika Committer and PMC Member

2017-09-04 Thread Luís Filipe Nassif
Very welcome Madhav!

Luis

On Sep 2, 2017 2:45 AM, "Madhav Sharan"  wrote:

> Thanks a lot, everyone! So glad to be here.
>
> About me - I am a software engineer recently graduated from USC. I am
> interested in the understanding data corpuses and building
> applications that arise from massive data sets. In spare time I like to
> read books, play sports go on a hike etc..
>
> My affair with Tika - I started using it for one of my research projects
> with Chris at USC IRDS. I have contributed in a couple of areas like
> GeoTopicParser, Extending object recognizer on videos, floating author age
> prediction inside Tika etc..
>
> I hope to find more areas I could help in Tika and will surely love to know
> more about the friendly community around this project.
>
> GitHub - https://github.com/smadha
>
> Cheers!!
> --
> Madhav Sharan
> Email - msha...@usc.edu   goyal.mad...@gmail.com
>
>
>
> On Fri, Sep 1, 2017 at 12:43 PM, Thejan Wijesinghe <
> thejan.k.wijesin...@gmail.com> wrote:
>
> > Congratulations Madhav Sharan ☺
> >
> > On Fri, Sep 1, 2017 at 6:58 AM, Tyler Bui-Palsulich <
> tpalsul...@apache.org
> > >
> > wrote:
> >
> > > Welcome, Madhav!
> > >
> > > Tyler
> > >
> > > On Aug 31, 2017 1:22 PM, "Allison, Timothy B." 
> > wrote:
> > >
> > > > W00t!  Welcome, Madhav!
> > > >
> > > > -Original Message-
> > > > From: Chris Mattmann [mailto:mattm...@apache.org]
> > > > Sent: Thursday, August 31, 2017 3:52 PM
> > > > To: dev@tika.apache.org
> > > > Subject: Re: [ANNOUNCE] Welcome Madhav Sharan as Tika Committer and
> PMC
> > > > Member
> > > >
> > > > Welcome Madhav!
> > > >
> > > > Cheers,
> > > > Chris
> > > >
> > > >
> > > >
> > > >
> > > > On 8/31/17, 12:29 PM, "loo...@gmail.com on behalf of Dave Meikle" <
> > > > loo...@gmail.com on behalf of dmei...@apache.org> wrote:
> > > >
> > > > Hello Everyone,
> > > >
> > > > Please join me in welcoming Madhav Sharan as a PMC Members and
> > > > Committer to
> > > > the project!
> > > >
> > > > Welcome to the team, Madhav. Feel free to say a bit about
> > yourselves
> > > > and
> > > > how you got involved in Tika.
> > > >
> > > > Cheers,
> > > > Dave
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
>


Re: [VOTE] Release Apache Tika 1.16 Candidate #1

2017-07-10 Thread Luís Filipe Nassif
I don't think it is needed. Built on Win7, jdk1.8.0_131. Tests passed with
and without tesseract 3.05.

+1 from me.

Regards,
Luis

2017-07-10 14:10 GMT-03:00 Allison, Timothy B. <talli...@mitre.org>:

> Is this worth a re-spin?
>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Monday, July 10, 2017 10:26 AM
> To: lfcnas...@gmail.com
> Cc: dev@tika.apache.org
> Subject: RE: [VOTE] Release Apache Tika 1.16 Candidate #1
>
> Y. I need to fix that unit test.  Thank you!
>
> https://issues.apache.org/jira/browse/TIKA-2426
>
> From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
> Sent: Monday, July 10, 2017 9:29 AM
> To: u...@tika.apache.org
> Cc: dev@tika.apache.org; Tim Allison <talli...@apache.org>
> Subject: Re: [VOTE] Release Apache Tika 1.16 Candidate #1
>
> OK, that is a Locale issue, working around...
>
> 2017-07-10 10:24 GMT-03:00 Luís Filipe Nassif <lfcnas...@gmail.com>:
> I got the following failure on Window7, jdk1.8.0_131, in 
> OOXMLParserTest.testXLSBVarious:1537.
> Any ideas?
>
> Failed tests:
>   OOXMLParserTest.testXLSBVarious:1537->TikaTest.assertContains:102
> 13.1211231321 not found in:
>  String This is a string
>  integer 13
>  float 13,1211231321
>  percent 20%
>  float 2 13,12
>  formulaFloat 0,5
> [... rest of the extracted XHTML omitted ...]

Re: [VOTE] Release Apache Tika 1.16 Candidate #1

2017-07-10 Thread Luís Filipe Nassif
OK, that is a Locale issue, working around...
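
For anyone hitting the same thing: the mismatch is just the JVM's default
locale, the same cell value renders with a decimal comma under pt_BR. A small
illustration (not part of the Tika test itself):

import java.text.NumberFormat;
import java.util.Locale;

public class DecimalSeparatorDemo {
    public static void main(String[] args) {
        double value = 13.1211231321;
        NumberFormat us = NumberFormat.getInstance(Locale.US);
        NumberFormat br = NumberFormat.getInstance(new Locale("pt", "BR"));
        us.setMaximumFractionDigits(10);
        br.setMaximumFractionDigits(10);
        // Prints 13.1211231321 for en_US but 13,1211231321 for pt_BR,
        // which is why an assertion on "13.1211231321" fails on a pt_BR JVM.
        System.out.println(us.format(value));
        System.out.println(br.format(value));
    }
}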

2017-07-10 10:24 GMT-03:00 Luís Filipe Nassif <lfcnas...@gmail.com>:

> I got the following failure on Window7, jdk1.8.0_131, in 
> OOXMLParserTest.testXLSBVarious:1537.
> Any ideas?
>
> Failed tests:
>   OOXMLParserTest.testXLSBVarious:1537->TikaTest.assertContains:102
> 13.1211231321 not found in:
>  String This is a string
>  integer 13
>  float 13,1211231321
>  percent 20%
>  float 2 13,12
>  formulaFloat 0,5
> [... rest of the extracted XHTML omitted ...]
>
> 2017-07-10 10:17 GMT-03:00 JB Data <jbdat...@gmail.com>:
>
>> +1.
>> No regression in my 1.15 env <http://jbigdata.fr/jbigdata/ged-02.html>.
>> Test docx chart extraction (TIKA-2254): OK.
>>
>> @*JB*Δ <http://jbigdata.fr>
>>
>>
>> 2017-07-08 22:29 GMT+02:00 Chris Mattmann <mattm...@apache.org>:
>>
>>> +1 from me SIGS and CHECKSUMS look good.
>>>
>>> Thanks Tim!
>>>
>>> Cheers,
>>> Chris
>>>
>>> LMC-053601:apache-tika-1.16-rc1 mattmann$ for type in "" \-app \-eval
>>> \-server; do $HOME/bin/stage_apache_rc tika$type 1.16
>>> https://dist.apache.org/repos/dist/dev/tika/; done
>>>   % Total% Received % Xferd  Average Speed   TimeTime Time
>>> Current
>>>  Dload  Upload   Total   SpentLeft
>>> Speed
>>> 100 53.5M  100 53.5M0 0  3992k  0  0:00:13  0:00:13 --:--:--
>>> 5122k
>>>   % Total% Received % Xferd  Average Speed   TimeTime Time
>>> Current
>>>  Dload  Upload   Total   SpentLeft
>>> Speed
>>> 100   836  100   8360 0   1092  0 --:--:-- --:--:--
>>> --:--:--  1092
>>>   % Total% Received % Xferd  Average Speed   TimeTime Time
>>> Current
>>>  Dload  Upload   Total   SpentLeft
>>> Speed
>>> 10034  100340 0 96  0 --:--:-- --:--:--
>>> --:--:--96
>>>   % Total% Received % Xferd  Average Speed   TimeTime Time
>>> Current
>>>  Dload  Upload   Total   SpentLeft
>>> Speed
>>> 100 41.6M  100 41.6M0 0  6578k  0  

Re: [VOTE] Release Apache Tika 1.16 Candidate #1

2017-07-10 Thread Luís Filipe Nassif
I got the following failure on Windows 7, jdk1.8.0_131, in
OOXMLParserTest.testXLSBVarious:1537. Any ideas?

Failed tests:
  OOXMLParserTest.testXLSBVarious:1537->TikaTest.assertContains:102
13.1211231321 not found in:
<html xmlns="http://www.w3.org/1999/xhtml">

mySheet1
 String This is a string
 integer 13
 float 13,1211231321
 currency $   0,003,,03.00
 percent 20%
 float 2 13,12
 long int 123456789012345
 longer int 1,23456789012345E+15 
Allison, Timothy B.: Allison, Timothy B.:
test comment2

 fraction 1/4
 date 3/9/17
 comment contents
Allison, Timothy B.: Allison, Timothy B.:
test comment

 hyperlink tika_link
 formula 4 2
 formulaErr ERROR
 formulaFloat 0,5 March April
 customFormat146/1963 merchant1
1 3
 customFormat2   3/128 merchant2 2
4
 text test
 
Allison, Timothy B.: Allison, Timothy B.:
comment1

 
Allison, Timothy B.: Allison, Timothy B.:
comment2

 
Allison, Timothy B.: Allison, Timothy B.:
comment3

 the 
Allison, Timothy B.: Allison, Timothy B.:
comment4 (end of row)

 the 
Allison, Timothy B.: Allison, Timothy B.:
comment5 between cells
 quick
 comment6
Allison, Timothy B.: Allison, Timothy B.:
comment6 actually in cell

 
Allison, Timothy B.: Allison, Timothy B.:
comment7 end of file

 
Allison, Timothy B.: Allison, Timothy B.:
comment8 end of file

OddLeftHeader OddCenterHeader OddRightHeader
EvenLeftHeader EvenCenterHeader EvenRightHeader

FirstPageLeftHeader FirstPageCenterHeader FirstPageRightHeader
OddLeftFooter OddCenterFooter OddRightFooter
EvenLeftFooter EvenCenterFooter EvenRightFooter
FirstPageLeftFooter FirstPageCenterFooter FirstPageRightFooter
test textbox

http://lucene.apache.org/
myChartTitle

merchant1 March April 1 3 merchant2 March April 2 4 



test WordArt
myChartTitle

merchant1 March April 1 3 merchant2 March April 2 4 



myChartTitle

merchant1 March April 1 3 merchant2 March April 2 4 



http://tika.apache.org/;>http://tika.apache.org/


2017-07-10 10:17 GMT-03:00 JB Data :

> +1.
> No regression in my 1.15 env <http://jbigdata.fr/jbigdata/ged-02.html>.
> Test docx chart extraction (TIKA-2254): OK.
>
> @*JB*Δ 
>
>
> 2017-07-08 22:29 GMT+02:00 Chris Mattmann :
>
>> +1 from me SIGS and CHECKSUMS look good.
>>
>> Thanks Tim!
>>
>> Cheers,
>> Chris
>>
>> LMC-053601:apache-tika-1.16-rc1 mattmann$ for type in "" \-app \-eval
>> \-server; do $HOME/bin/stage_apache_rc tika$type 1.16
>> https://dist.apache.org/repos/dist/dev/tika/; done
>>   % Total% Received % Xferd  Average Speed   TimeTime Time
>> Current
>>  Dload  Upload   Total   SpentLeft
>> Speed
>> 100 53.5M  100 53.5M0 0  3992k  0  0:00:13  0:00:13 --:--:--
>> 5122k
>>   % Total% Received % Xferd  Average Speed   TimeTime Time
>> Current
>>  Dload  Upload   Total   SpentLeft
>> Speed
>> 100   836  100   8360 0   1092  0 --:--:-- --:--:-- --:--:--
>> 1092
>>   % Total% Received % Xferd  Average Speed   TimeTime Time
>> Current
>>  Dload  Upload   Total   SpentLeft
>> Speed
>> 10034  100340 0 96  0 --:--:-- --:--:-- --:--:--
>>   96
>>   % Total% Received % Xferd  Average Speed   TimeTime Time
>> Current
>>  Dload  Upload   Total   SpentLeft
>> Speed
>> 100 41.6M  100 41.6M0 0  6578k  0  0:00:06  0:00:06 --:--:--
>> 8297k
>>   % Total% Received % Xferd  Average Speed   TimeTime Time
>> Current
>>  Dload  Upload   Total   SpentLeft
>> Speed
>> 100   836  100   8360 0   1012  0 --:--:-- --:--:-- --:--:--
>> 1012
>>   % Total% Received % Xferd  Average Speed   TimeTime Time
>> Current
>>  Dload  Upload   Total   SpentLeft
>> Speed
>> 10034  100340 0 46  0 --:--:-- --:--:-- --:--:--
>>   46
>>   % Total% Received % Xferd  Average Speed   TimeTime Time
>> Current
>>  Dload  Upload   Total   SpentLeft
>> Speed
>> 100 56.4M  100 56.4M0 0  3950k  0  0:00:14  0:00:14 --:--:--
>> 4742k
>>   % Total% Received % Xferd  Average Speed   TimeTime Time
>> Current
>>  Dload  Upload   Total   SpentLeft
>> Speed
>> 100   836  100   8360 0   1470  0 --:--:-- --:--:-- --:--:--
>> 1469
>>   % Total% Received % Xferd  Average Speed   TimeTime Time
>> Current
>>  Dload  Upload   Total   SpentLeft
>> Speed
>> 10034  100340 0 65  0 --:--:-- --:--:-- --:--:--
>>   65
>> LMC-053601:apache-tika-1.16-rc1 mattmann$ $HOME/bin/stage_apache_rc tika
>> 1.16-src https://dist.apache.org/repos/dist/dev/tika/
>>   % Total% Received % Xferd  Average Speed   TimeTime 

Re: Tika 1.15.1? -> 1.16

2017-07-05 Thread Luís Filipe Nassif
Hi Tim,

Taking a quick look at Nick's fix for TIKA-2419, it seems conservative to me,
restricted to corrupted xml, so I think there is no need to rerun the
regression tests.

So +1 from me, ++1 with age detection :)

2017-07-05 22:35 GMT-03:00 Allison, Timothy B. <talli...@mitre.org>:

> All,
>   I'm waiting to get some resolution on TIKA-2399.  The regression tests
> came back with nothing surprising.  I fixed the npe that they uncovered in
> the new ppt macro extraction code.
>   Will I need to rerun with the updates to mime detection that Nick just
> made?  Or are we good enough to go once we figure out what we can do w
> TIKA-2399?
>
>   Onward.
>
>Cheers,
>  Tim
>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Monday, July 3, 2017 2:35 PM
> To: dev@tika.apache.org
> Subject: RE: Tika 1.15.1? -> 1.16
>
> Sounds good. I'll kick off regression tests now, with a goal of creating
> 1.16-rc1 on Wednesday 14:00 UTC?
>
> -Original Message-
> From: Mattmann, Chris A (3010) [mailto:chris.a.mattm...@jpl.nasa.gov]
> Sent: Monday, July 3, 2017 2:24 PM
> To: dev@tika.apache.org
> Subject: Re: Tika 1.15.1? -> 1.16
>
> Hey Tim, if I don’t get it done by today, push 1.16 and we’ll put Age
> Detection in 1.17.
>
> ++
> Chris Mattmann, Ph.D.
> Principal Data Scientist, Engineering Administrative Office (3010)
> Manager, NSF & Open Source Projects Formulation and Development Offices
> (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 180-503E, Mailstop: 180-503
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS) Adjunct
> Associate Professor, Computer Science Department University of Southern
> California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
>
> On 7/3/17, 7:17 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> All,
>   I think we're now solidly at 1.16.  Anyone still strongly in favor
> of 1.15.1?
>
> Chris,
>   Will age detection be ready soon, or should we push that to 1.17?
>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Friday, June 30, 2017 7:01 AM
> To: dev@tika.apache.org; lfcnas...@gmail.com
> Subject: RE: Tika 1.15.1? -> 1.16
>
> Y, I was thinking that I may have already pushed us over this
> threshold with the * below.  1.16 it is then?
>
> Chris, let us know when the age detection is good to go or if 1.17 is
> a better target.
>
>
>   * Allow extraction of scripts as embedded "MACRO". Users
> must turn this on via TikaConfig (TIKA-2391).
>
>   * Allow users to turn off extraction of headers and footers
> from .doc, .docx, .xls, .xlsx, .xlsb (TIKA-2362)
>
>   * Extract text from charts in .docx, .pptx, .xlsx and .xlsb
> (TIKA-2254).
>
>   * Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb
> (TIKA-1945).
>
>   * Enable base32 encoding of digests and enable BouncyCastle
> implementations
> of digest algorithms (TIKA-2386).
>
> -Original Message-
> From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
> Sent: Thursday, June 29, 2017 4:12 PM
> To: dev@tika.apache.org
> Subject: Re: Tika 1.15.1?
>
> Agreed.
>
> Luis
>
>
> 2017-06-29 15:45 GMT-03:00 Bob Paulin <b...@bobpaulin.com>:
>
> > If we're adding features does it make sense just to bump to 1.16
> > rather than 1.15.1?  Traditionally point releases would be bug fixes
> only [1].
> >
> >
> > - Bob
> >
> > [1] http://semver.org/
> > On 6/29/2017 1:18 PM, Allison, Timothy B. wrote:
> > > K.
> > >
> > > -Original Message-
> > > From: Mattmann, Chris A (3010)
> > > [mailto:chris.a.mattm...@jpl.nasa.gov]
> > > Sent: Thursday, June 29, 2017 1:59 PM
> > > To: dev@tika.apache.org
> > > Subject: Re: Tika 1.15.1?
> > >
> > > Hey Tim, I’d like to try and get in:
> > >
> > > https://issues.apache.org/jira/browse/TIKA-1988
> > >
> > > today for 15.1. I am working on integrating it now and adding some
> > > docs
&

Re: Tika 1.15.1?

2017-06-29 Thread Luís Filipe Nassif
erday!
> >
> >
> >
> >
> > On 6/16/17, 11:40 AM, "Allison, Timothy B." <talli...@mitre.org>
> wrote:
> >
> > All,
> >
> > I'm hoping to wrap up the TEIParser next week (I'm thinking
> about modifying code to handle DOM)...and this should rid us of org.json
> licensing issues.  Run a release for 1.15.1 probably the following week?
> >
> > Anything else we want to get in to 1.15.1?
> >
> > Chris, I'm not sure where you are on the SentimentParser.
> If there will be a quick fix, great; otherwise, we should be ok with the
> added exclusions (TIKA-2397) and if we rename the class in Tika so that we
> don't have a conflict over oat.parsers.SentimentParser (TIKA-2368).
> >
> > Cheers,
> >
> >           Tim
> >
> > -Original Message-
> > From: Tyler Bui-Palsulich [mailto:tbpalsul...@gmail.com]
> > Sent: Friday, June 2, 2017 8:39 PM
> > To: dev@tika.apache.org
> > Subject: Re: Tika 1.16?
> >
> > +1 to 1.15.1.
> >
> > It would also be nice to be able to have "cheap" security
> releases as they come up.
> >
> > Tyler
> >
> > On Jun 2, 2017 6:12 AM, "Bob Paulin" <b...@bobpaulin.com>
> wrote:
> >
> > > Would be breaking a bit from the current release numbering
> but I'd
> > > fully support moving to semantic versioning.  +1 to a
> 1.15.1
> > >
> > > - Bob
> > >
> > >
> > > On 6/2/2017 8:06 AM, Luís Filipe Nassif wrote:
> > > > Maybe 1.15.1?
> > > >
> > > > On Jun 1, 2017 10:03 AM, "Bob Paulin" <b...@bobpaulin.com> wrote:
> > > >
> > > >> +1
> > > >>
> > > >>
> > > >> On 6/1/2017 6:50 AM, Allison, Timothy B. wrote:
> > > >>> Given the broken OSGi and the org.json issues with
> 1.15, does it
> > > >>> make
> > > >> sense to aim for 1.16 fairly soon, say 3-4 weeks?
> > > >>> Cheers,
> > > >>>
> > > >>>   Tim
> > > >>>
> > > >>>
> > > >>
> > > >>
> > >
> > >
> > >
> >
> >
> >
> >
> >
> >
> >
> >
>
>
>


Re: Tika 1.16?

2017-06-02 Thread Luís Filipe Nassif
Maybe 1.15.1?

On Jun 1, 2017 10:03 AM, "Bob Paulin"  wrote:

> +1
>
>
> On 6/1/2017 6:50 AM, Allison, Timothy B. wrote:
> > Given the broken OSGi and the org.json issues with 1.15, does it make
> sense to aim for 1.16 fairly soon, say 3-4 weeks?
> >
> > Cheers,
> >
> >   Tim
> >
> >
>
>
>


Re: [ANNOUNCE] Apache Tika 1.15 released

2017-06-02 Thread Luís Filipe Nassif
Late to the party...

Great work Tim!

Thank you for all your huge work with Tika!

On May 30, 2017 3:10 PM, "Tim Allison"  wrote:

> The Apache Tika project is pleased to announce the release of Apache Tika
> 1.15. The release contents have been pushed out to the main Apache release 
> site and to the
> Maven Central sync, so the releases should be available as soon as the 
> mirrors get the syncs.
>
> Apache Tika is a toolkit for detecting and extracting metadata and
> structured text content from various documents using existing parser 
> libraries.
>
> Apache Tika 1.15 contains a number of improvements and bug fixes. Details
> can be found in the changes 
> file:http://www.apache.org/dist/tika/CHANGES-1.15.txt
>
> Apache Tika is available in source form from the following download 
> page:http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.15-src.zip
>
> Apache Tika is also available in binary form or for use using Maven 2 from
> the Central Repository:http://repo1.maven.org/maven2/org/apache/tika/
>
> In the initial 48 hours, the release may not be available on all mirrors.
> When downloading
> from a mirror site, please remember to verify the downloads using
> signatures found on the
> Apache site:https://people.apache.org/keys/group/tika.asc
>
> For more information on Apache Tika, visit the project home 
> page:http://tika.apache.org/
>
> -- Tim Allison, on behalf of the Apache Tika community
>
>
>
>
>


Change Scope of Jai-ImageIO-Core dependency

2017-04-21 Thread Luís Filipe Nassif
Hi devs,

It looks like jai-imageio-core from GitHub
(https://github.com/jai-imageio/jai-imageio-core), on which we depend with
test scope, is Apache-license compatible.

Note that it is a fork of the original JAI project referenced by PDFBox. The
GitHub fork has extracted jpeg2000 and other problematic code into a separate
project.

I propose removing the test scope from the jai-imageio-core dependency, so we
will provide support for tiff and other formats (except jpeg2000) out of the box.

Best,
Luis


Re: 1.15?

2017-04-19 Thread Luís Filipe Nassif
+1 from me, there are so many fixes and improvements!

Best,
Luis

On Apr 18, 2017 03:13, "Oleg Tikhonov"  wrote:

> +1 for the release.
>
> On Mon, Apr 17, 2017 at 8:39 PM, David Meikle  wrote:
>
> > +1 from me too.
> >
> > Cheers,
> > Dave
> >
> > On 13 April 2017 at 13:08, Konstantin Gribov  wrote:
> >
> > > Preliminary +1 from me, I'll take a closer look this weekend
> > >
> > > Thu, Apr 13, 2017, 0:00 Allison, Timothy B.:
> > >
> > > > All,
> > > >   POI is voting on rc1 of the next release.  Once that's released and
> > > > integrated into Tika, let's start the release process for Tika 1.15,
> > end
> > > of
> > > > next week, middle of following?  Any blockers?
> > > >
> > > >  Cheers,
> > > >
> > > >  Tim
> > > >
> > > >
> > > > --
> > >
> > > Best regards,
> > > Konstantin Gribov
> > >
> >
>


Re: Improving Tika OCR

2017-04-17 Thread Luís Filipe Nassif
Hi Kranthi,

That is an interesting comparison! But I think Tesseract 4.0 is still
alpha? And do you know the VGG software license?

Best,
Luis

On Apr 17, 2017 8:46 AM, "Kranthi Kiran G V" <kkran...@student.nitw.ac.in> wrote:

Hello Tim Allison,

I am currently working on improving Tika's OCR capabilities.
After a suggestion from Thamme Gowda (@thammegowda), I started working on a
comparison of Tesseract 4.0's neural network subsystem and Visual Geometry
Group's (VGG) models.

It would be great if you could provide the dataset to test the OCR, as you
mentioned in one of the issues.

I would be comparing their running time for evaluation, accuracy, memory
consumed and invariance to lighting, orientation, etc. And then I would be
integrating the appropriate models into Tika's OCR.

Thank you,
Kranthi Kiran GV,
CS 3/4 Undergrad,
NIT Warangal


Re: [COMPRESS] zip-bomb prevention for Z?

2017-04-13 Thread Luís Filipe Nassif
I have reported a similar issue to them (see COMPRESS-382); maybe those
issues should be handled on the Compress side, if I understood the API
contract correctly.

Luis


On Apr 13, 2017 3:36 PM, "Allison, Timothy B."  wrote:

On TIKA-1631 [1], users have observed that a corrupt Z file can cause an
OOM at Internal_.InternalLZWStream.initializeTable.  Should we try to
protect against this at the Tika level, or should we open an issue on
commons-compress's JIRA?

A second question, we're creating a stream with the CompressorStreamFactory
when all we want to do is detect.  Is there a recommended way to detect the
type of compressor without creating a stream?

Thank you!

Best,

 Tim

[1] https://issues.apache.org/jira/browse/TIKA-1631
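
On the detection question: if I recall correctly, newer Commons Compress
releases (1.14 and later) added a static CompressorStreamFactory.detect(InputStream)
that sniffs the magic bytes and returns the compressor name without wrapping
the stream; it needs a mark-supported stream. A rough sketch (file name is
illustrative):

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.compress.compressors.CompressorException;
import org.apache.commons.compress.compressors.CompressorStreamFactory;

public class CompressorDetect {
    public static void main(String[] args) throws IOException {
        // detect() reads the signature bytes and resets, so the stream must support mark()
        try (InputStream in = new BufferedInputStream(
                Files.newInputStream(Paths.get("sample.Z")))) {
            String name = CompressorStreamFactory.detect(in);
            System.out.println("Detected compressor: " + name);   // e.g. "z", "gz", "bzip2"
        } catch (CompressorException e) {
            System.out.println("Not a supported compressor format");
        }
    }
}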


Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Luís Filipe Nassif
Hi Thejan,

Before the first version of TesseractOCRParser was committed, I tried to use
Tess4j; that was 4 years ago. Unfortunately, at that time I ran into problems
like permanent hangs with tesseract/Tess4j and, even worse, JVM crashes caused
by bugs in native code (pointers to crazy addresses) when processing corrupted
images. So I changed strategy and took the Runtime.exec route to execute
tesseract out of process, to get rid of those JVM crashes.

That was a long time ago; maybe those problems have gone away with current
tesseract and Tess4j. But for now I recommend committing your changes as a
new parser instead of changing the default TesseractOCRParser, until the new
code has been tested against millions of images from the wild with tika-batch
and proven stable enough to be the default OCR parser of Tika.

Best,
Luis
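
To make the out-of-process strategy above concrete, a minimal sketch (not the
actual TesseractOCRParser code; the binary name, file names and timeout are
assumptions): run the tesseract executable with ProcessBuilder and enforce a
timeout, so a native hang or crash only affects the child process:

import java.io.File;
import java.util.concurrent.TimeUnit;

public class ExternalTesseract {
    public static void main(String[] args) throws Exception {
        File image = new File("page.png");        // illustrative input
        File outputBase = new File("page");       // tesseract appends ".txt"

        ProcessBuilder pb = new ProcessBuilder(
                "tesseract", image.getAbsolutePath(), outputBase.getAbsolutePath());
        pb.redirectErrorStream(true);
        pb.redirectOutput(ProcessBuilder.Redirect.to(new File("tesseract.log")));

        Process process = pb.start();
        // A hang or crash in the native code only hits this child process, not the JVM;
        // after the timeout we simply kill it and move on to the next file.
        if (!process.waitFor(120, TimeUnit.SECONDS)) {
            process.destroyForcibly();
        }
    }
}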

On Mar 7, 2017 9:58 AM, "Thejan Wijesinghe" <thejan.k.wijesin...@gmail.com> wrote:

> Hi Nick,
>
> I thought the same thing. I will try to keep the public method signatures
> unchanged and will send updates on my progress.
>
> On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch  wrote:
>
> > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> >
> >> I have already use the Tess4j API to rewrite the TesseractOCRParser
> class,
> >> Although It successfully extracts content from most of the file types,
> it
> >> fails some particular unit tests in the TesseractOCRParserTest class. I
> >> can
> >> solve that. However, I want to know whether I can rewrite the entire
> >> TesseractOCRParser class from the ground up, but if I do that there will
> >> be
> >> many broken links in the internals of TIKA because as I witnessed, most
> of
> >> the classes use TesseractOCRParser class indirectly.
> >>
> >
> > If you can, try to keep the public methods unchanged. That way, other
> > callers to the class will be unaffected by your re-write of the internal
> > logic
> >
> > Nick
> >
>


Re: Testing an ingest framework that uses Apache Tika

2017-02-16 Thread Luís Filipe Nassif
Excellent, Tim! Thank you for all your great work on Apache Tika!

2017-02-16 11:23 GMT-02:00 Konstantin Gribov :

> Tim,
>
> it's an awesome feature for downstream projects' integration tests. Thanks
> for implementing it!
>
> Thu, Feb 16, 2017 at 16:17, Allison, Timothy B.:
>
> > All,
> >
> > I finally got around to documenting Apache Tika's MockParser[1].  As of
> > Tika 1.15 (unreleased), add tika-core-tests.jar to your class path, and
> you
> > can simulate:
> >
> > 1. Regular catchable exceptions
> > 2. OOMs
> > 3. Permanent hangs
> >
> > This will allow you to determine if your ingest framework is robust
> > against these issues.
> >
> > As always, we fix Tika when we can, but if history is any indicator,
> > you'll want to make sure your ingest code can handle these issues if you
> > are handling millions/billions of files from the wild.
> >
> > Cheers,
> >
> > Tim
> >
> >
> > [1] https://wiki.apache.org/tika/MockParser
> >
> --
>
> Best regards,
> Konstantin Gribov
>
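
For downstream ingest code, a minimal sketch of the kind of defensive wrapper
these mock documents let you exercise (illustrative only; "mock.xml" stands in
for one of the mock test files, and the timeout is an assumption): run each
parse in a worker thread with a timeout and treat exceptions and timeouts as
per-document failures. A truly hung or OOMing parse usually needs a separate
child process, but the shape is the same:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.*;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class RobustIngest {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> result = pool.submit(() -> {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            try (InputStream in = Files.newInputStream(Paths.get("mock.xml"))) {
                parser.parse(in, handler, metadata);
            }
            return handler.toString();
        });
        try {
            System.out.println(result.get(60, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            // a permanently hung parser may ignore interrupts; production frameworks
            // often isolate parses in a child process instead
            result.cancel(true);
        } catch (ExecutionException e) {
            System.err.println("Parse failed: " + e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}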


RE: Tika 1.14?

2016-08-12 Thread Luís Filipe Nassif
I think waiting for PDFBox 2.0.3 would be great. There are some regression
fixes in it.

Regards,
Luis

On Aug 12, 2016 08:24, "Allison, Timothy B."  wrote:

> >> I know it's been a little bit since we talked about 2.0.  We had
> discussed holding off while some API changes that were under
> consideration.  Has any progress been made on this?
>
> > I think we're still trying to come up with a plan for how to allow
> multiple parsers to report text for one
>
> And
> > I believe we've also still got the issue of structured metadata
> outstanding.
>
> Y, I agree on both for 2.0.  Anything else that we need to get into 1.14?
> Sounds like there's a chance we'll be able to get PDFBox 2.0.3 into 1.14...
>
> Thamme, any interest in wrapping up TIKA-1508/TIKA-1680/TIKA-1986 in the
> next few weeks?
>
> Cheers,
>
>   Tim
>
>