Re: Script tag contents not always reported in ContentHandler

2024-05-30 Thread Tim Allison
Markus, I'm sorry for my delay. We're migrating to jsoup in 3.x. I realize that 3.x isn't out yet, but I wanted to give you a heads up. To extract scripts in 3.x, you'd do something like this:

Re: Setting limits on text extraction for compressed files with Tika Server

2024-05-24 Thread Tim Allison
I'm not sure which endpoint you're using, but search for "writeLimit" on this page: https://cwiki.apache.org/confluence/display/TIKA/TikaServer As you probably know, many file formats are actually compressed: PDF, docx, etc. There is no way to know ahead of time for many file formats what the

multi-arch support for tika-docker!

2024-05-21 Thread Tim Allison
All, Many thanks to the many community members who helped figure this out and get it out the door! As of tika-docker 2.9.2.1, we now have multi-arch support (and on noble!). Let us know if there are any surprises. Thank you, again! Cheers, Tim Ref:

Re: Unexpected behavior when inspecting mp4 files with different ISO

2024-04-29 Thread Tim Allison
during the parse in the above section. On Mon, Apr 29, 2024 at 10:28 AM Tim Allison wrote: > I agree with Nick. > > You can better understand the magic based algorithms we're using for > detection by searching for mp4 and quicktime in this file: > https://github.com/apache/tika

Re: Unexpected behavior when inspecting mp4 files with different ISO

2024-04-29 Thread Tim Allison
I agree with Nick. You can better understand the magic based algorithms we're using for detection by searching for mp4 and quicktime in this file: https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml A middle ground is to have the MP4

Re: Extracting XML comments

2024-04-18 Thread Tim Allison
Hi Claude, I'd recommend a custom XMLParser for this, perhaps subclass DcXMLParser? We could also parameterize this in the DcXMLParser if a committer had a chance to add that feature or review a PR from you. Best, Tim On Thu, Apr 18, 2024 at 7:33 AM Claude Warren wrote: > It seems

Re: AutoDetectParser not working after upgrading from 2.7.0 to 2.8.0+

2024-04-04 Thread Tim Allison
I'm on ubuntu. That's the 2.7.0 pom, obv. I just bumped the versions, reloaded and ran to see different numbers of parsers in 2.7.0 vs 2.8.0+2.9.0. On Thu, Apr 4, 2024 at 8:20 AM Tim Allison wrote: > > I'm attaching the pom. I can't remember if attachments get stripped. > If they do,

Re: AutoDetectParser not working after upgrading from 2.7.0 to 2.8.0+

2024-04-04 Thread Tim Allison
er.PASSWORD); on both > 2.7.0 and 2.8.0+ > > Thanks, and regards, > Gerardo > ____ > From: Tim Allison > Sent: Wednesday, April 3, 2024 06:43 AM > To: user@tika.apache.org > Subject: Re: AutoDetectParser not working after upgrading from 2.7.0 to 2.

Re: AutoDetectParser not working after upgrading from 2.7.0 to 2.8.0+

2024-04-03 Thread Tim Allison
Y, I'm not able to repro this problem with 2.8.0 or higher. I'm seeing 239 parsers (probably diff from Tilman because of installed external parsers?). On Wed, Apr 3, 2024 at 5:09 AM Tilman Hausherr wrote: > > On 03.04.2024 08:55, Gerardo Hernandez wrote: > > On 2.7.0, I get a list of 203

Re: [ANNOUNCE] Apache Tika 2.9.2 released

2024-04-02 Thread Tim Allison
I also released our docker images for 2.9.2.0. How do we update helm? On Tue, Apr 2, 2024 at 2:31 PM Tim Allison wrote: > > The Apache Tika project is pleased to announce the release of Apache > Tika 2.9.2. The release contents have been pushed out to the main > Apache

[ANNOUNCE] Apache Tika 2.9.2 released

2024-04-02 Thread Tim Allison
, visit the project home page: https://tika.apache.org/ -- Tim Allison, on behalf of the Apache Tika community

[RESULT][VOTE] Release Apache Tika 2.9.2 Candidate #2

2024-04-02 Thread Tim Allison
The vote has passed with 3 PMC +1s and no -1s. +1s Oleg Tikhonov Tilman Hausherr Tim Allison I'll release the artifacts shortly and update the website. Thank you, all! Best, Tim On Tue, Apr 2, 2024 at 12:08 AM Oleg Tikhonov wrote: > +1, > Thanks. > > On Mon, 1 Apr 2

Re: 2.9.1 release?

2024-04-01 Thread Tim Allison
n, 16 Oct 2023 at 11:46, Tom Conlon wrote: > >> Hi, >> Would it be possible for the issue "Fix tika as a service" >> https://issues.apache.org/jira/browse/TIKA-4152 >> to be reviewed before release? >> >> Thanks >> Tom >> >> On M

Re: tika-helm now on artifacthub.io

2024-03-30 Thread Tim Allison
W00t! Thank you Lewis! On Sat, Mar 30, 2024 at 3:57 PM lewis john mcgibbney wrote: > Hi user@, dev@, > > For those running Tika on Kubernetes, you can now conveniently find the > Helm Chart via artifacthub.io > > https://artifacthub.io/packages/helm/apache-tika/tika > > I’ll build in a little

[VOTE] Release Apache Tika 2.9.2 Candidate #2

2024-03-26 Thread Tim Allison
A candidate for the Tika 2.9.2 release is available at: https://dist.apache.org/repos/dist/dev/tika/2.9.2 The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/2.9.2-rc2/ The SHA-512 checksum of the archive is
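
A release vote like this one asks reviewers to check the downloaded archive against the published SHA-512 checksum. The following is a minimal sketch of that check using only the JDK; the class name, method names, and command-line arguments are illustrative and not part of the Tika release process.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Sha512Sketch {

    // Returns the lowercase hex SHA-512 digest of the given bytes.
    static String sha512Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-512");
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Compares the computed digest with a published one,
    // ignoring case and surrounding whitespace.
    static boolean matches(byte[] data, String published) throws NoSuchAlgorithmException {
        return sha512Hex(data).equalsIgnoreCase(published.trim());
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: args[0] is the downloaded archive,
        // args[1] is the checksum string copied from the vote email.
        // Reads the whole file into memory, which is fine for a sketch.
        byte[] archive = Files.readAllBytes(Path.of(args[0]));
        System.out.println(matches(archive, args[1]) ? "checksum OK" : "checksum MISMATCH");
    }
}
```

For very large archives, streaming the file through `MessageDigest.update` would avoid loading it all at once.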

Re: Meta output format of tika server /unpack/all

2024-03-21 Thread Tim Allison
with embedded raw bytes and the rmeta content. Not sure what to call that endpoint. Recommendations? On Thu, Mar 21, 2024 at 6:10 PM Tim Allison wrote: > If rmeta/text is not returning text extracted from embedded files that’s a > bug. > > I don’t think /rmeta/all is a thing. > > On

Re: Meta output format of tika server /unpack/all

2024-03-21 Thread Tim Allison
If rmeta/text is not returning text extracted from embedded files that’s a bug. I don’t think /rmeta/all is a thing. On Thu, Mar 21, 2024 at 5:21 PM Zig Zag wrote: > Thanks Josh, that's correct but rmeta/text allows you to control this but > it only returns one level of text (not documents

Re: About Tika 2.9.2 release date

2024-03-21 Thread Tim Allison
Doh! 'Tis the season. PDFBox has started their release cycle. Let's wait for PDFBox 2.0.31. On Thu, Mar 21, 2024 at 11:54 AM Tim Allison wrote: > All, > > I kicked off the regression tests for 2.9.2. I'm aiming for tomorrow for > an rc1. Again, let me know if there are any block

Re: About Tika 2.9.2 release date

2024-03-21 Thread Tim Allison
All, I kicked off the regression tests for 2.9.2. I'm aiming for tomorrow for an rc1. Again, let me know if there are any blockers or other things we need to get into 2.9.2. Thank you! Best, Tim On Wed, Mar 20, 2024 at 2:00 PM Tim Allison wrote: > Fellow devs and community, >

Re: Running tika-server and I need this check to NOT happen

2024-03-21 Thread Tim Allison
t" tika configs just to get the version number in tika-server. That issue fixes that problem. This fix may improve the loading speed of tika-server, too. :D On Wed, Mar 20, 2024 at 3:54 PM Tim Allison wrote: > Looking at TikaConfig, it looks like the "excluded" parsers are

Re: Tika-parser not able to parse specific content

2024-03-21 Thread Tim Allison
If you know that you're only parsing text files, you could configure only the TextAndCSVParser and specify that it processes "application/octet-stream". This should force every file to be processed by that parser. Something like this? application/octet-stream Or you could tell

Re: Memory exception

2024-03-21 Thread Tim Allison
Hi Chetan, Need more info... An eml file contains 465MB of XML and an MP4? How big is the mp4? Are you getting the same behavior with {{java -jar tika-app.jar -J -t big.eml}}? From the stacktrace, can you tell where the final straw is in memory allocation? Are you able to share the file with me

Re: Running tika-server and I need this check to NOT happen

2024-03-21 Thread Tim Allison
We should also fix this: https://issues.apache.org/jira/browse/TIKA-4216 On Thu, Mar 21, 2024 at 9:13 AM Tim Allison wrote: > This is one problem: https://issues.apache.org/jira/browse/TIKA-4215 > > On Wed, Mar 20, 2024 at 3:25 PM Josh Burchard > wrote: > >> Hi all, >

Re: Running tika-server and I need this check to NOT happen

2024-03-21 Thread Tim Allison
This is one problem: https://issues.apache.org/jira/browse/TIKA-4215 On Wed, Mar 20, 2024 at 3:25 PM Josh Burchard wrote: > Hi all, > > I've got Tika 2.9.1 server running on Linux and Tika is checking for the > presence of ImageMagick. I tried disabling the TesseractOCR parser in my > xml

Re: Running tika-server and I need this check to NOT happen

2024-03-20 Thread Tim Allison
Looking at TikaConfig, it looks like the "excluded" parsers are actually loaded and initialized, but they are not added to the composite parser if they're on the exclude list. We should try to avoid loading them at all if they are excluded. IIRC, this is a bit complex in TikaConfig. Let me take a

Re: Tika-parser not able to parse specific content

2024-03-20 Thread Tim Allison
I'm wondering if we can tighten the detection to include a newline after the P2, etc. It looks like we require a new line for some of those file format variants. Let me do some research, unless anyone happens to know. On Mon, Mar 18, 2024 at 4:40 PM Kashif Khan wrote: > Hi, > I tried

Re: About Tika 2.9.2 release date

2024-03-20 Thread Tim Allison
Fellow devs and community, I'd like to fix TIKA-4211 before the next release. It has been a while since our last 2.x release. What do you think about aiming for starting the voting process early next week? Any other blockers? On Tue, Mar 19, 2024 at 7:49 PM Shu Peng wrote: > Dear Tika Team, >

Re: Questions on rmeta and pipes

2024-03-12 Thread Tim Allison
ank you!. Will definitely provide feedback. > > While this get into 3.0 officially is there something I can prototype with > /rmeta to help me get my other stuff working - any suggestions on approach > or a draft PR for the official feature would be very helpful > > On Tue, Mar 12

Re: Questions on rmeta and pipes

2024-03-12 Thread Tim Allison
Stay tuned! Coming soon: https://issues.apache.org/jira/browse/TIKA-4207 I think I'll be wiring this into the /pipes and /async endpoints. The json request will specify that you want bytes AND text+metadata. There will be two options: a) you specify two emitters: one for json and one for raw

Re: Replacing full tika-app.jar to directly using tiki-core / and parsers

2024-03-08 Thread Tim Allison
Hi Brian, A few thoughts: 1) tika-app is basically tika-core + tika-parsers-standard-package. Which components are you trying to avoid? tika-serialization and jackson? boilerpipecontenthandler and some of its dependencies? I ask, because we could factor out a tika-app-core with no parsers in

Re: Question about error when upgradking to 2.9.1 from 2.2.0

2024-02-08 Thread Tim Allison
How are you managing dependencies? Are you bringing in tika-app? Are you using tika-parsers-standard-package? I agree with Tilman that there's likely an older version of pdfbox on your classpath. On Thu, Feb 8, 2024 at 10:22 AM Tilman Hausherr wrote: > No, that one is fine. Could it be that a

Re: Replacing tika server default parser with my custom default parser

2024-02-05 Thread Tim Allison
. :D Best, Tim [0] https://cwiki.apache.org/confluence/display/TIKA/ModifyingContentWithHandlersAndMetadataFilters#ModifyingContentWithHandlersAndMetadataFilters-4.AutoDetectParserConfig On Sun, Feb 4, 2024 at 2:27 PM Tim Allison wrote: > W00t! Let us know when you have m

Re: Replacing tika server default parser with my custom default parser

2024-02-04 Thread Tim Allison
W00t! Let us know when you have more questions! On Fri, Feb 2, 2024 at 10:27 AM Slava G wrote: > Thanks a lot !! > > On Thu, Feb 1, 2024 at 6:36 PM Tim Allison wrote: > >> https://tika.apache.org/3.0.0-BETA/parser_guide.html >> >> You should be able to add

Re: Replacing tika server default parser with my custom default parser

2024-02-01 Thread Tim Allison
https://tika.apache.org/3.0.0-BETA/parser_guide.html You should be able to add your parser in a services file, and the way the class loading sorting works, non-tika parsers should have a higher priority automatically. If that doesn't work, we can update the documentation to show what that would

3.0.0 release and deprecation planning for 2.x

2024-01-31 Thread Tim Allison
All, I'd like to run a final html eval comparing Tika 2.x (tagsoup) and 3.x (jsoup) on TIKA-4185. Other than that, are there any blockers or things that we need to get into 3.0.0 before we make the first release? If there aren't any blockers, I can aim for a 3.0.0-rc1 probably towards the

Re: Parser removes file content and treats it as Metadata

2024-01-25 Thread Tim Allison
"message/rfc822"}] On Thu, Jan 25, 2024 at 2:11 PM Gerardo Hernandez wrote: > Hi Ken, > > Unfortunately enforcing Tika to use TXTParser does not solve our problem > at all, I mean it would work for very simple emails, but we also want to be > able to parse emails with embedded r

Re: Issue with HttpFetcher in Tika - URISyntaxException due to Unencoded Characters in URL

2024-01-04 Thread Tim Allison
That's not good. Thank you for sharing this with us: https://issues.apache.org/jira/browse/TIKA-4178 On Fri, Dec 22, 2023 at 11:18 AM João Domingues wrote: > Dear Tika Team, > > I am writing to report an issue encountered while using Apache Tika's > HttpFetcher functionality, specifically
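
The thread title points at a `URISyntaxException` caused by unencoded characters in a URL. As a general illustration of that failure mode (not of how HttpFetcher works internally or how TIKA-4178 was resolved), the sketch below shows the JDK rejecting a raw URL that contains a space, and one way to build an encoded URI from components. The host and path are made up.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriEncodingSketch {

    // True if the single-argument URI constructor rejects the raw string.
    static boolean rejectedAsIs(String raw) {
        try {
            new URI(raw);
            return false;
        } catch (URISyntaxException e) {
            return true;
        }
    }

    // Builds a URI from components. The multi-argument constructor
    // percent-encodes characters that are illegal in the path.
    static URI encode(String scheme, String host, String path) throws URISyntaxException {
        return new URI(scheme, host, path, null);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical URL with a space in the path.
        System.out.println(rejectedAsIs("https://example.com/my file.pdf"));
        System.out.println(encode("https", "example.com", "/my file.pdf").toASCIIString());
    }
}
```

Note that the multi-argument constructor also encodes any literal `%`, so a path that is already percent-encoded would be encoded twice; callers need to know which form they hold.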

Re: HTML Parsing Changes in 3.0.0-BETA

2024-01-02 Thread Tim Allison
Please, please let us know of any other problems you find! And, thank you, again. On Tue, Jan 2, 2024 at 8:58 AM Tim Allison wrote: > > yakivy opened a PR for exactly this: https://github.com/apache/tika/pull/1522 > > Once the checks pass, I'll merge that, and we should be good to

Re: HTML Parsing Changes in 3.0.0-BETA

2024-01-02 Thread Tim Allison
yakivy opened a PR for exactly this: https://github.com/apache/tika/pull/1522 Once the checks pass, I'll merge that, and we should be good to go. Thank you so much for letting us know of this bug. On Mon, Dec 18, 2023 at 7:57 AM Andreas Hubold wrote: > > Hi, > > I'm currently testing the

Re: [ANNOUNCE] Apache Tika 3.0.0-BETA released

2023-12-15 Thread Tim Allison
er-image-module and tika-parser-pdf-module? > > Cheers, > Stephen. > > On 13/12/2023 14:40, Tim Allison wrote: > > The Apache Tika project is pleased to announce the release of Apache > > Tika 3.0.0-BETA. The release contents have been pushed out to the main > > Apache r

[ANNOUNCE] Apache Tika 3.0.0-BETA released

2023-12-13 Thread Tim Allison
ted CVEs: CVE-2023-6481/CVE-2023-6378. NOTE: This release requires Java 11. We plan to support the 2.x branch (which requires Java 8) for six months after the release of 3.0.0. -- Tim Allison, on behalf of the Apache Tika community

[RESULT][VOTE] Release Apache Tika 3.0.0-BETA Candidate #1

2023-12-13 Thread Tim Allison
The vote has passed with three PMC +1s and no -1s. +1s Konstantin Gribov Tilman Hausherr Tim Allison Thank you all. I'll try to push the artifacts and update the website shortly. Best, Tim On Mon, Dec 11, 2023 at 3:45 PM Tim Allison wrote: > Thank you, Konstantin! > > On Mo

Re: [VOTE] Release Apache Tika 3.0.0-BETA Candidate #1

2023-12-11 Thread Tim Allison
ype=maven=ch.qos.logback%2Flogback-core_source=ossindex-client_medium=integration_content=1.8.1 > [2]: https://logback.qos.ch/news.html#1.3.14 > [3]: https://logback.qos.ch/manual/receivers.html > > -- > Best regards, > Konstantin Gribov. > > > On Mon, Dec 11, 2023

Re: [VOTE] Release Apache Tika 3.0.0-BETA Candidate #1

2023-12-11 Thread Tim Allison
All, We have two +1s. We need another +1 for the release. If a fellow dev has the time to vote, please do! Thank you. Best, Tim On Wed, Dec 6, 2023 at 3:17 PM Tim Allison wrote: > Oops, I forgot to include my +1 for this RC1 for 3.0.0-BETA. Would another > fellow dev be w

Re: tika-server too many open files

2023-12-07 Thread Tim Allison
What version of Tika? Are you running it in Docker or uncontained? On Wed, Dec 6, 2023 at 12:31 PM Mark Kerzner SHMsoft, Inc. < mark.kerz...@shmsoft.com> wrote: > Hi, > > I get this error: > > /tmp/apache-tika-server-forked-tmp-13502607376096852844: Too many open > files > > Can you please help?

Re: [VOTE] Release Apache Tika 3.0.0-BETA Candidate #1

2023-12-06 Thread Tim Allison
man > > > > On 01.12.2023 18:25, Tim Allison wrote: > > A candidate for the Tika 3.0.0-BETA release is available at: > > https://dist.apache.org/repos/dist/dev/tika/3.0.0-BETA > > > > The release candidate is a zip archive of the sources in: > > https://gi

[VOTE] Release Apache Tika 3.0.0-BETA Candidate #1

2023-12-01 Thread Tim Allison
A candidate for the Tika 3.0.0-BETA release is available at: https://dist.apache.org/repos/dist/dev/tika/3.0.0-BETA The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/3.0.0-BETA-rc1/ The SHA-512 checksum of the archive is

Re: [EXTERNAL] Re: Tika parser not parsing email content

2023-11-06 Thread Tim Allison
se/TIKA-4153. > How do you think I can proceed with the parsing of the document, is there > a latest version I can download? Where exactly I can find this version to > download? > > Thanks, > Kashif > > > On Wed, Oct 11, 2023 at 1:46 AM Tim Allison wrote: > >>

Re: How to process metadata returned by the tika server?

2023-11-03 Thread Tim Allison
A heavier-weight option is to use the tika-serialization module (which uses Jackson databind) and do something like this: https://github.com/apache/tika/blob/main/tika-server/tika-server-standard/src/test/java/org/apache/tika/server/standard/RecursiveMetadataResourceTest.java#L89 On Fri, Nov 3,

Re: Tika OOM issue

2023-10-26 Thread Tim Allison
Request limits can fairly easily be implemented outside of Tika, but > resource isolation is not, so having a solution for that as well would be > very nice. > Isolation as in pipes? One file per forked jvm at a time? > This SO (https://stackoverflow.com/questions/52495429/setting-xxmaxram) >>

Virtual Apache Tika meetup in celebration of World Digital Preservation Day

2023-10-26 Thread Tim Allison
I'm throwing an intro to Tika hands-on workshop on Nov 2 to celebrate World Digital Preservation Day: https://www.meetup.com/apache-tika-community/events/296969821/ Please let me know if anyone on the list would like to organize or speak at a Tika-based meetup going forward. Cheers,

Re: Tika OOM issue

2023-10-26 Thread Tim Allison
>return 503 if > requests Oops, 429, I'd guess? On Thu, Oct 26, 2023 at 9:33 AM Tim Allison wrote: > > > The flags -XX:MaxRAMPercentage or -XX:MaxRAM should do the trick, but am > > investigating if they are enforced fast enough before the system OOM kicks >

Re: Tika OOM issue

2023-10-26 Thread Tim Allison
> The flags -XX:MaxRAMPercentage or -XX:MaxRAM should do the trick, but am > investigating if they are enforced fast enough before the system OOM kicks > in. So far I would say that is not the case. This SO (https://stackoverflow.com/questions/52495429/setting-xxmaxram) suggests that maxRAM is
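
When experimenting with flags such as `-XX:MaxRAMPercentage`, it can help to confirm what heap ceiling the JVM actually settled on inside the container. The sketch below only reports that value; it does not show whether the limit is enforced before the kernel OOM killer acts, which is the open question in this thread. The class name is illustrative.

```java
class HeapLimitSketch {

    // Maximum heap the JVM will attempt to use, in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MiB): " + maxHeapMiB());
        System.out.println("Available processors: " + Runtime.getRuntime().availableProcessors());
    }
}
```

Running it with and without the flag, for example `java -XX:MaxRAMPercentage=50 HeapLimitSketch.java`, shows whether the setting changes the reported ceiling in a given environment.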

Re: Tika OOM issue

2023-10-24 Thread Tim Allison
Sorry for my delay. > My preliminary conclusion is that the jvm is not able to enforce these flags > 100% of the time quickly enough before the cgroup limits kick in and the > kernel oom kicks in. Did anyone else experience this? Y, that's my guess as well. There's a chance that some parsers

Re: [ANNOUNCE] Apache Tika 2.9.1 released

2023-10-20 Thread Tim Allison
I released the docker image, just now, too. Lewis or anyone else who has knowledge + time to release the helm chart, go forth! On Fri, Oct 20, 2023 at 2:38 PM Tim Allison wrote: > > The Apache Tika project is pleased to announce the release of Apache > Tika 2.9.1. The release contents

[ANNOUNCE] Apache Tika 2.9.1 released

2023-10-20 Thread Tim Allison
, visit the project home page: https://tika.apache.org/ -- Tim Allison, on behalf of the Apache Tika community

Re: Release timing for 2.9.1 and 3.0.0-beta?

2023-10-20 Thread Tim Allison
Thank you, Keith. On Tue, Oct 17, 2023 at 11:17 PM Keith Bennett wrote: > Hi, Tim. You have given so, so much to this project. As far as I'm > concerned, you *never* need to say you're sorry. ;) > > - Keith > > > On Oct 17, 2023, at 2:59 AM, Tim Allison wrote: >

[RESULT][VOTE] Release Apache Tika 2.9.1 Candidate #1

2023-10-20 Thread Tim Allison
The vote has passed with 3 +1s and no -1s. +1s Tim Allison Tilman Hausherr Oleg Tikhonov Thank you, all! I'll release the artifacts and update the website shortly. On Thu, Oct 19, 2023 at 6:52 AM Tim Allison wrote: > My belated +1. Reports are here: > https://corpora.tika.apache.or

[VOTE] Release Apache Tika 2.9.1 Candidate #1

2023-10-17 Thread Tim Allison
A candidate for the Tika 2.9.1 release is available at: https://dist.apache.org/repos/dist/dev/tika/2.9.1 The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/2.9.1-rc1 The SHA-512 checksum of the archive is

Release timing for 2.9.1 and 3.0.0-beta?

2023-10-16 Thread Tim Allison
All, We detected and fixed an area for improvement in the version of POI that we just upgraded to (https://bz.apache.org/bugzilla/show_bug.cgi?id=67767). I should have caught this in earlier regression tests before the release of POI, but I clearly botched that comparison run. I'm sorry. My

Re: 2.9.1 release?

2023-10-16 Thread Tim Allison
at, Oct 14, 2023 at 7:16 AM Tim Allison wrote: > Looks like we have a bunch of new > "org.apache.poi.util.RecordFormatException: Tried to allocate an array of > length 10,xxx,xxx, but the maximum length for this record type is

Re: 2.9.1 release?

2023-10-16 Thread Tim Allison
w nothing. > > Tilman > > On 14.10.2023 13:16, Tim Allison wrote: > > Looks like we have a bunch of new > "org.apache.poi.util.RecordFormatException: Tried to allocate an array of > length 10,xxx,xxx, but the maximum length for this record type is > 10,000,000."

Re: 2.9.1 release?

2023-10-14 Thread Tim Allison
regression tests didn't pick this up. The changes in rfc822 detection have also had some effects. The few handfuls that I've reviewed are actually positive changes. I'll review systematically on Monday. On Sat, Oct 14, 2023 at 6:35 AM Tim Allison wrote: > Reports are

Re: 2.9.1 release?

2023-10-14 Thread Tim Allison
Reports are here: https://corpora.tika.apache.org/base/reports/tika-2.9.1-reports.tgz I haven't had a chance to look at them yet. :( Will take a look early Monday (ET). On Wed, Oct 11, 2023 at 10:24 AM Tim Allison wrote: > Unless there are objections, I'll kick off the 2.9.1 regression te

Re: Converting Tika 1.x (as library) to Tika 2.x (as server) with custom classes

2023-10-13 Thread Tim Allison
The custom parsers and contenthandler can be configured via Tika-config. We don’t yet have a way to configure AbstractRecursive… or the DocumentSelector. Note that 3.x beta should be out soon. Aside from requiring Java 11, there aren’t big changes in 3.x. I’ll dig up examples when I’m back to a

Re: [EXTERNAL] 2.9.1 release?

2023-10-11 Thread Tim Allison
> - Original message - > From: "Tim Allison" > To: "Tika User" , "" < > d...@tika.apache.org> > Cc: > Subject: [EXTERNAL] 2.9.1 release? > Date: Wed, Oct 11, 2023 10:25 AM > > Unless there are objections, I'll kick o

2.9.1 release?

2023-10-11 Thread Tim Allison
Unless there are objections, I'll kick off the 2.9.1 regression tests shortly. I just cherry-picked TIKA-4153 into 2.x...will be interesting to see how that works. Best, Tim On Tue, Oct 10, 2023 at 1:37 PM Tim Allison wrote: > All, > Nandita's email didn't go through fo

Re: [EXTERNAL] Re: Tika parser not parsing email content

2023-10-10 Thread Tim Allison
y when we index their file attachments. Is there a Jira > item where I can read about the reason behind its current implementation? > -Josh/HCL > > > > > From:"Tim Allison" > To:user@tika.apache.org > Date:10/10/2023 12:47 PM > Subje

Requesting Tika Server release: commons-compress vulnerability

2023-10-10 Thread Tim Allison
gwf-w82q-5jrr> This is due to use of Tika Server 2.9.0 (Apache Tika – Apache Tika 1.27 <https://tika.apache.org/2.9.0/index.html>), which has commons-compress as a dependency. I saw that Tim Allison recently updated this* commons-compress* version in the Github mirror repo: TIKA-412

Re: Tika parser not parsing email content

2023-10-10 Thread Tim Allison
I can confirm this is still happening in our main/3.x branch. As you probably guessed, the issue is that the file is identified as an email and then parsed as if it were one. If you know that all you have are plain text files, you might consider using the TextAndCSVParser or just the TXTParser.

Re: Error after intall: "Failed to start LSB: Controls Apache Tika as a Service"

2023-10-06 Thread Tim Allison
I opened: https://issues.apache.org/jira/browse/TIKA-4152 to track this. On Thu, Oct 5, 2023 at 5:08 PM Tom Conlon wrote: > Forgot to add that on the same machine Solr installs and runs fine > Started LSB: Controls Apache Solr as a Service. > > plus have installed tika ok in the past but the

Re: Error after intall: "Failed to start LSB: Controls Apache Tika as a Service"

2023-10-06 Thread Tim Allison
I regret I haven't had a chance to look at this. We got a similar email a month ago: https://lists.apache.org/thread/mnf3pxlmvdy456v4s2b8r7mv3khl3msk Which versions of Tika last worked for you? Was it a 2.x, or did we break something in the 2.x branch? On Thu, Oct 5, 2023 at 5:08 PM Tom Conlon

Re: Tika server mode - /async task tracking

2023-10-05 Thread Tim Allison
Will definitely explore > the Pipes reporter and the unit tests, if I have something worthy > documenting in the end I will give you a shout on here. > > Many thanks, > Georgi > > On Tue, 3 Oct 2023, 20:38 Tim Allison, wrote: > >> I'm sorry for my delay. >> >>

Re: Tika server mode - /async task tracking

2023-10-03 Thread Tim Allison
I'm sorry for my delay. At some point, I was thinking about implementing: /async/ but I gave up. The problem was that I didn't want to have to tie caching/storing status info into tika-server or the async processor -- so I created a configurable PipesReporter class...see below. If you set up

Re: [External] Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-14 Thread Tim Allison
Y, totally get it. How about shortening the EOL of Tika 2.x (and Java 8) to 6 months after the Tika 3.x/Java 11 release? On Thu, Sep 14, 2023 at 5:41 AM Sandeep Kulkarni wrote: > As a long time user of Tika, I would like to suggest Java 11 should be > supported for 3.x. Java 17 is still quite

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-13 Thread Tim Allison
We seem to have consensus on Java 11 for 3.x and keep Java 8 for 2.x for one more year. I've started the branches and started making some changes in this direction. Is it worth pushing this modernization further or faster, with either: a) Jump to Java 17 now and keep Java 8 in 2.x for one more

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-13 Thread Tim Allison
>> >> +1 from our side, we moved to java 11 last year. >> >> Best, >> Luis >> >> >> Em ter, 12 de set de 2023 19:01, Ken Krugler >> escreveu: >> >>> +1 >>> >>> On Sep 12, 2023, at 7:56 AM, Tim Allison wrote: >

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-12 Thread Tim Allison
fixes only" 6 months after the first release of 3.0.0? In 3.x, we'd require Java 11 and jakarta. We wouldn't make many other major changes. On Tue, Sep 12, 2023 at 10:49 AM Tim Allison wrote: > >If Tika users will be happy to move on and drop Java 8 and/or javax. > Please drop t

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-12 Thread Tim Allison
>If Tika users will be happy to move on and drop Java 8 and/or javax. Please drop them :))) Fellow devs and broader Tika community, are we ok with EOL'ing Tika 2.x and dropping support for Java 8 and javax in September 2024?

[ANNOUNCE] Apache Tika 2.9.0 released

2023-08-28 Thread Tim Allison
, visit the project home page: https://tika.apache.org/ -- Tim Allison, on behalf of the Apache Tika community

[RESULT][VOTE] Release Apache Tika 2.9.0 Candidate #1

2023-08-28 Thread Tim Allison
This vote passes with 4 binding +1s, 1 non-binding +1 and no -1s. Binding +1s: Tim Allison Konstantin Gribov Tilman Hausherr Oleg Tikhonov Non-binding +1 Julien Nioche I'll release the artifacts and update the website shortly. Thank you, all! Best, Tim On Mon, Aug 28, 2023 at 10:13 AM

Re: [VOTE] Release Apache Tika 2.9.0 Candidate #1

2023-08-28 Thread Tim Allison
> Julien > > On Wed, 23 Aug 2023 at 15:50, Tim Allison wrote: > >> A candidate for the Tika 2.9.0 release is available at: >> https://dist.apache.org/repos/dist/dev/tika/2.9.0 >> >> The release candidate is a zip archive of the sources in: >> https://gi

[VOTE] Release Apache Tika 2.9.0 Candidate #1

2023-08-23 Thread Tim Allison
A candidate for the Tika 2.9.0 release is available at: https://dist.apache.org/repos/dist/dev/tika/2.9.0 The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/2.9.0-rc1/ The SHA-512 checksum of the archive is

Re: Parser modifying file's access time

2023-08-23 Thread Tim Allison
I think this is a Linux issue, not a Tika issue, e.g.: https://superuser.com/questions/464290/why-is-cat-not-changing-the-access-time On Mon, Aug 21, 2023 at 10:03 PM Gerardo Hernandez wrote: > Hi, > > I’d like to understand why I’m experiencing the following behavior and if > it’s expected (as

Re: setMaxContentLength Behavior Differs Across Parsers?

2023-08-14 Thread Tim Allison
Is it possible that this is due to extra whitespace in the PDF? On Sun, Jul 30, 2023 at 2:17 PM Keith Bennett wrote: > Hi, all. I am finally getting around to updating the "rika" Ruby gem for > interacting with Tika in JRuby, and encountered something weird. When I > test parsing a text file

Re: Content-Type Metadata

2023-08-14 Thread Tim Allison
Yes. Let us know if you find otherwise! On Mon, Aug 14, 2023 at 11:16 AM Keith Bennett wrote: > Tim, thank you so much for responding. Can I rely on Content-Type to > always be populated by a parse? > > - Keith > > > On Mon, Aug 14, 2023 at 10:09 PM Tim Allison wrote:

Re: Using Tika with another OCR engine

2023-08-14 Thread Tim Allison
Concur with Nick. And, y, I'd frankly copy the TesseractOCRParser into a new module, rename it and modify it to call your OCR engine, build the jar and add the dependency to your tika bin directory (if you're using Docker?). On Thu, Aug 10, 2023 at 3:45 AM Cristian Zamfir wrote: > Hi Nick, > >

Re: Content-Type Metadata

2023-08-14 Thread Tim Allison
Content-Type may be more reliable/specific because for some file types, the parser updates the file type during the parse. For example the PDF parser updates application/pdf -> application/illustrator (or similar?) if the parser determines that the file is a PDF-based Adobe Illustrator file. The

Re: Tika 2.7.0 Java version supports

2023-07-25 Thread Tim Allison
We use 17 in our docker images and one of our ci/cd pipelines uses 18. Are you having problems? On Tue, Jul 25, 2023 at 6:00 AM Slava G wrote: > Hi, > Silly question, what is the highest Java version that Tika 2.7.0 supports ? > > Thanks >

Re: Tika memory usage using watchdog

2023-07-25 Thread Tim Allison
that's configured. On Tue, Jul 25, 2023 at 7:43 AM Cristian Zamfir wrote: > Hello, > > On 21 Jul 2023 at 23:51:54, Cristian Zamfir wrote: > >> Hi Tim! >> >> Sorry for the lack of details, adding now. >> >> On 21 Jul 2023 at 18:56:02, Tim Allison wrote:

Re: Tika memory usage using watchdog

2023-07-21 Thread Tim Allison
Sorry, I'm not sure I understand precisely what's going on. First, are you running tika-server, tika-app, tika-async, or running Tika programmatically? I'm guessing tika-server because you've containerized it, but I've containerized tika-async...so...? :D If tika-server, are you sending

Re: post request with shift_jis encoding and filename hint

2023-07-18 Thread Tim Allison
:24 PM vijaya Panchak wrote: > The headers were missing in second part for attachment call. > Is that by mistake as you may be getting invalid response to parsing error > at the server side? > > On Jul 18, 2023, at 8:55 AM, Tim Allison wrote: > > Hi Medea, > I'm sor

Re: post request with shift_jis encoding and filename hint

2023-07-18 Thread Tim Allison
g project I don't want to > implement such a logic to add a content-type. I could add a default > content-type like application/octet-stream, but I don't want to guess the > type before I send the file to Tika. > Is there a workaround or something like that? Or is this a bug in T

Re: Text extraction using /tika for container document only

2023-07-14 Thread Tim Allison
I updated our documentation on our wiki: https://cwiki.apache.org/confluence/display/TIKA/TikaServer It looks like we had already documented the maxEmbeddedResources for /rmeta, but the documentation now covers X-Tika-Skip-Embedded as well. On Fri, Jul 14, 2023 at 11:57 AM Tim Allison wrote

Re: Text extraction using /tika for container document only

2023-07-14 Thread Tim Allison
Two follow ups... 1) TIKA-3227 was Dave Meikle's addition to skip embedded for /tika. Add a header X-Tika-Skip-Embedded with value 'true'. 2) You can get just the text content with /rmeta via /rmeta/text On Thu, Jul 13, 2023 at 4:30 PM Tim Allison wrote: > Sorry for my delay. > For /t
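
The two follow-ups above can be sketched as plain HTTP requests. This is a minimal illustration using the JDK's `java.net.http` API; it assumes a tika-server listening on `http://localhost:9998`, and the class and method names are made up for the example. The requests are only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class TikaServerRequestSketch {

    // Assumption: a local tika-server on port 9998.
    static final String BASE = "http://localhost:9998";

    // 1) /tika with X-Tika-Skip-Embedded: true, so only the container
    //    document is parsed and attachments are skipped.
    static HttpRequest containerOnly(byte[] document) {
        return HttpRequest.newBuilder(URI.create(BASE + "/tika"))
                .header("X-Tika-Skip-Embedded", "true")
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // 2) /rmeta/text, which returns the recursive metadata with
    //    plain-text content.
    static HttpRequest rmetaText(byte[] document) {
        return HttpRequest.newBuilder(URI.create(BASE + "/rmeta/text"))
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = containerOnly(new byte[0]);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

To actually send one of these, pass it to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` against a running server.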

Re: Text extraction using /tika for container document only

2023-07-13 Thread Tim Allison
Sorry for my delay. For /tika, I thought we had a way to tell it to parse only the primary document and skip the attachments, but I can't figure out how to do that quickly now. I'll look around some more. With /rmeta, try setting a header `maxEmbeddedResources:0` On Fri, Jul 7, 2023 at 5:06 AM

Re: Recognizing generic XML?

2023-07-13 Thread Tim Allison
I wasn't around on the project when the xml mime magic was developed. So, take this as personal opinion, not an official statement. :D The first item is intentional (xml data with no declaration). Text-based files are challenging, and looking for matching tags is beyond what our current

Re: Seeing errors in Apache-tika latest build with maven

2023-07-06 Thread Tim Allison
>The example on tika-app.jar just outputs the input file contents and the options that are listed are limited, What are you trying to do? What are your goals? On Thu, Jul 6, 2023 at 1:38 PM vijaya Panchak wrote: > With -DskipTests it works and generates jar files. > I will check that. > > I

Re: Seeing errors in Apache-tika latest build with maven

2023-07-06 Thread Tim Allison
You can turn off the ossindex checks by adding (we mention this in the "Building" section of https://github.com/apache/tika) -Dossindex.skip On Thu, Jul 6, 2023 at 12:02 PM vijaya Panchak wrote: > Sending again, > > On Jul 6, 2023, at 8:08 AM, vijaya Panchak > wrote: > > >> Hi,

Re: TikaConfig Threadsafe?

2023-06-29 Thread Tim Allison
It is threadsafe. Let us know if you find otherwise! On Thu, Jun 29, 2023 at 7:03 AM Darren wrote: > Hi, team: > > Is TikaConfig threadsafe? Every time we init it from local > tika-config.xml, it will costs times. >
