Re: AutoDetectParser not working after upgrading from 2.7.0 to 2.8.0+

Gerardo Hernandez Mon, 08 Apr 2024 23:18:11 -0700

Hi Tim,

I think the issue might be related to the assembly step. This is how I'm 
reproducing it (files attached to the first email as TikaUpgrade.rar):


mvn clean compile assembly:single
java -cp "target\tika-1.0.0-jar-with-dependencies.jar" com.company.tikatest.TikaTest path/to/lorem.txt

Is there anything odd or missing in the assembly goal that could cause the 
tika-parsers-standard-package dependency not to pull in all the parsers properly?
I also found this report https://issues.apache.org/jira/browse/TIKA-4038 which 
I think is related, but I'm not 100% sure.
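
A minimal sketch for checking the assembled jar, assuming the parsers are 
discovered through service registration files named 
META-INF/services/org.apache.tika.parser.Parser. The class name 
ServiceFileCheck is hypothetical, and whether the assembly step really drops 
or overwrites these files is only a guess at this point. It lists every such 
file visible on the classpath and counts the provider entries in each:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical diagnostic class, not part of Tika.
class ServiceFileCheck {

    // Extract provider class names from the text of one service file,
    // skipping blank lines and '#' comments.
    static List<String> parseProviders(String content) {
        List<String> providers = new ArrayList<>();
        for (String line : content.split("\\R")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            if (!line.isEmpty()) {
                providers.add(line);
            }
        }
        return providers;
    }

    // Read a classpath resource fully as UTF-8 text.
    static String readAll(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed resource name; pass a different one as the first argument.
        String resource = args.length > 0
                ? args[0]
                : "META-INF/services/org.apache.tika.parser.Parser";
        Enumeration<URL> urls =
                ServiceFileCheck.class.getClassLoader().getResources(resource);
        int files = 0;
        int total = 0;
        while (urls.hasMoreElements()) {
            URL url = urls.nextElement();
            List<String> providers = parseProviders(readAll(url));
            System.out.println(url + " -> " + providers.size() + " entries");
            files++;
            total += providers.size();
        }
        System.out.println(files + " service file(s), " + total + " entries in total");
    }
}
```

Running it once against the jar-with-dependencies and once against the 
unassembled classpath (for example from the IDE) should show whether the two 
see the same number of entries. A single file with few entries in the 
assembled jar would point at the assembly step.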

Best regards,
Gerardo
________________________________
From: Tim Allison <[email protected]>
Sent: Thursday, April 4, 2024 06:22 AM
To: [email protected] <[email protected]>
Subject: Re: AutoDetectParser not working after upgrading from 2.7.0 to 2.8.0+

I'm on ubuntu. That's the 2.7.0 pom, obv. I just bumped the versions,
reloaded and ran to see different numbers of parsers in 2.7.0 vs
2.8.0+2.9.0.

On Thu, Apr 4, 2024 at 8:20 AM Tim Allison <[email protected]> wrote:
>
> I'm attaching the pom. I can't remember if attachments get stripped.
> If they do, I'll copy+paste.
>
> Apache Maven 3.8.7 (b89d5959fcde851dcb1c8946a785a163f14e1e29)
> temurin-17-jdk-amd64
>
> On Thu, Apr 4, 2024 at 3:16 AM Gerardo Hernandez
> <[email protected]> wrote:
> >
> > Hi Tim,
> >
> > Did you use the exact same pom I shared, or a custom one? If the latter, 
> > could you please share it so I can check whether something is missing in mine?
> >
> > Also, what jdk/maven versions are you using?
> >
> > Tilman, I get the expected string when printing 
> > System.out.println(org.apache.tika.parser.pdf.PDFParser.PASSWORD); on both 
> > 2.7.0 and 2.8.0+
> >
> > Thanks, and regards,
> > Gerardo
> > ________________________________
> > From: Tim Allison <[email protected]>
> > Sent: Wednesday, April 3, 2024 06:43 AM
> > To: [email protected] <[email protected]>
> > Subject: Re: AutoDetectParser not working after upgrading from 2.7.0 to 
> > 2.8.0+
> >
> > Y, I'm not able to repro this problem with 2.8.0 or higher. I'm seeing
> > 239 parsers (probably diff from Tilman because of installed external
> > parsers?).
> >
> > On Wed, Apr 3, 2024 at 5:09 AM Tilman Hausherr <[email protected]> 
> > wrote:
> > >
> > > On 03.04.2024 08:55, Gerardo Hernandez wrote:
> > > > On 2.7.0, I get a list of 203 parsers, and the file is parsed
> > > > successfully:
> > >
> > > I get 227 parsers with 2.9.2. My pom.xml is somewhat different. The main
> > > part is
> > >
> > >
> > >      <dependencies>
> > >          <dependency>
> > >              <groupId>org.apache.tika</groupId>
> > >              <artifactId>tika-core</artifactId>
> > >              <version>${tika.version}</version>
> > >          </dependency>
> > >          <dependency>
> > >              <groupId>org.apache.tika</groupId>
> > >              <artifactId>tika-parsers-standard-package</artifactId>
> > >              <version>${tika.version}</version>
> > >          </dependency>
> > >          <dependency>
> > >              <groupId>org.slf4j</groupId>
> > >              <artifactId>slf4j-simple</artifactId>
> > >              <version>${slf4j.version}</version>
> > >          </dependency>
> > >          <dependency>
> > >              <groupId>org.bouncycastle</groupId>
> > >              <artifactId>bcprov-jdk18on</artifactId>
> > >              <version>${bouncycastle.version}</version>
> > >          </dependency>
> > >      </dependencies>
> > >
> > > What happens if you add this on top of your code?
> > >
> > > System.out.println(org.apache.tika.parser.pdf.PDFParser.PASSWORD);
> > >
> > > it should output "org.apache.tika.parser.pdf.password". This is to test
> > > if the PDF parser is in your class path.
> > >
> > > Tilman
> > >
