Thank you for raising this issue. Please re-request a jira account, and
we'll accept it. Sorry about that.
On Wed, Sep 25, 2024 at 11:06 AM Ruairidh Williamson <
ruairidh.william...@nextdlp.com> wrote:
> Hello,
>
> We are using tika to extract text from XPS files and have hit an issue
> where whi
ve to integrate it. :D
On Sat, Sep 14, 2024 at 6:58 AM Tim Allison wrote:
> Try the RecursiveParserWrapper. With tika-app: java -jar tika-app.jar -J
> -t myZip.zip or the /rmeta endpoint with tika-server. Or, coming soon, the
> grpc server will effectively give that output.
>
> On
Try the RecursiveParserWrapper. With tika-app: java -jar tika-app.jar -J -t
myZip.zip or the /rmeta endpoint with tika-server. Or, coming soon, the
grpc server will effectively give that output.
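For the tika-server route, a minimal sketch of the /rmeta call using only the Python standard library. It assumes a tika-server instance is listening on localhost:9998 and uses the /rmeta/text variant to request plain-text content. The helper names, the base URL and the header choices are illustrative assumptions, not something taken from this thread.

```python
import urllib.request
from pathlib import Path

# Assumption: tika-server is running locally on port 9998.
TIKA_SERVER = "http://localhost:9998"


def build_rmeta_request(data: bytes, base_url: str = TIKA_SERVER) -> urllib.request.Request:
    """Build a PUT request that sends raw file bytes to the /rmeta/text endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/rmeta/text",
        data=data,
        method="PUT",
        headers={
            # Ask for the JSON list of metadata objects (container plus embedded files).
            "Accept": "application/json",
            # Generic type so the server does its own detection (an assumption).
            "Content-Type": "application/octet-stream",
        },
    )


def fetch_rmeta(path: str, base_url: str = TIKA_SERVER) -> bytes:
    """Send the file at `path` to the server and return the raw JSON response body."""
    request = build_rmeta_request(Path(path).read_bytes(), base_url)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling `fetch_rmeta("myZip.zip")` against a running server should return one JSON array covering the zip and the files inside it. The request builder is kept separate so it can be inspected without a server.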
On Fri, Sep 13, 2024 at 10:01 AM David Pilato wrote:
> Hey team,
>
>
> I'm wondering if there is a wa
I agree with Tilman.
If there's a more modern package/model you'd want to use whether server
based or commandline, it is fairly straightforward to add a new parser to
handle your needs.
On Fri, Jul 26, 2024 at 11:50 PM Tilman Hausherr
wrote:
> I don't think so, the closest we have is DL4J but I
0.0.
-- Tim Allison, on behalf of the Apache Tika community
I released the artifacts and built the docker images. I'll work on the
site and announcement tomorrow.
On Mon, Jul 15, 2024 at 1:50 PM Tim Allison wrote:
>
> The vote has passed with 3 PMC +1s, 2 non-binding +1s and no -1s.
>
> +1s (binding)
> Tim Allison
> Nicholas DiP
The vote has passed with 3 PMC +1s, 2 non-binding +1s and no -1s.
+1s (binding)
Tim Allison
Nicholas DiPiazza
Tilman Hausherr
+1s (non-binding)
Kiran Bachu
Gary Gregory
I'll release the artifacts shortly and update the website.
Thank you, all!
Best,
Tim
On Fri, Jul 12, 2024
A candidate for the Tika 3.0.0-BETA2 release is available at:
https://dist.apache.org/repos/dist/dev/tika/3.0.0-BETA2
The release candidate is a zip archive of the sources in:
https://github.com/apache/tika/tree/3.0.0-BETA2-rc1/
The SHA-512 checksum of the archive is
8a4142f61110f196c550146637994
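For anyone checking a release candidate, a minimal sketch of verifying a downloaded archive against a published SHA-512 value. The function names are hypothetical, and the expected value has to be copied in full from the vote email, since the checksum above is cut off.

```python
import hashlib


def sha512_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large source archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path: str, published_hex: str) -> bool:
    """Compare a file's digest with a published value, ignoring case and whitespace."""
    expected = "".join(published_hex.split()).lower()
    return sha512_of_file(path) == expected
```

Stripping whitespace before comparing is a convenience, because checksums pasted from emails are often wrapped across lines.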
I regret that those endpoints do not have a reliable way to link them.
I recently integrated something that does work, but it requires the
tika-pipes framework, which you can use via tika-server.
It will output .json files and a subdirectory of binary files, and there is
a key in the json file th
I regret that we haven't had contributions on the tika as a service scripts
since 1.x. We could really use help.
On Tue, Jun 11, 2024 at 3:37 AM JB Data31 wrote:
>
> No real explanation of this problem, but indeed the service is not really
> installed.
>
>
> *$ service --status-all | grep tika$*
Markus,
I'm sorry for my delay. We're migrating to jsoup in 3.x. I realize that
3.x isn't out yet, but I wanted to give you a heads up.
To extract scripts in 3.x, you'd do something like this:
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-mod
I'm not sure which endpoint you're using, but search for "writeLimit" on
this page: https://cwiki.apache.org/confluence/display/TIKA/TikaServer
As you probably know, many file formats are actually compressed: PDF, docx,
etc. There is no way to know ahead of time for many file formats what the
amou
All,
Many thanks to the many community members who helped figure this out and
get it out the door! As of tika-docker 2.9.2.1, we now have multi-arch
support (and on noble!).
Let us know if there are any surprises. Thank you, again!
Cheers,
Tim
Ref: https://hub.docker.com/r
during the parse in the above section.
On Mon, Apr 29, 2024 at 10:28 AM Tim Allison wrote:
> I agree with Nick.
>
> You can better understand the magic based algorithms we're using for
> detection by searching for mp4 and quicktime in this file:
> https://github.com/apache
I agree with Nick.
You can better understand the magic based algorithms we're using for
detection by searching for mp4 and quicktime in this file:
https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
A middle ground is to have the MP4 parse
Hi Claude,
I'd recommend a custom XMLParser for this, perhaps subclass DcXMLParser? We
could also parameterize this in the DcXMLParser if a committer had a chance
to add that feature or review a PR from you.
Best,
Tim
On Thu, Apr 18, 2024 at 7:33 AM Claude Warren wrote:
> It seems th
I'm on ubuntu. That's the 2.7.0 pom, obv. I just bumped the versions,
reloaded and ran to see different numbers of parsers in 2.7.0 vs
2.8.0+2.9.0.
On Thu, Apr 4, 2024 at 8:20 AM Tim Allison wrote:
>
> I'm attaching the pom. I can't remember if attachments get strippe
pdf.PDFParser.PASSWORD); on both
> 2.7.0 and 2.8.0+
>
> Thanks, and regards,
> Gerardo
>
> From: Tim Allison
> Sent: Wednesday, April 3, 2024 06:43 AM
> To: user@tika.apache.org
> Subject: Re: AutoDetectParser not working after upgrading f
Y, I'm not able to repro this problem with 2.8.0 or higher. I'm seeing
239 parsers (probably diff from Tilman because of installed external
parsers?).
On Wed, Apr 3, 2024 at 5:09 AM Tilman Hausherr wrote:
>
> On 03.04.2024 08:55, Gerardo Hernandez wrote:
> > On 2.7.0, I get a list of 203 parsers,
I also released our docker images for 2.9.2.0.
How do we update helm?
On Tue, Apr 2, 2024 at 2:31 PM Tim Allison wrote:
>
> The Apache Tika project is pleased to announce the release of Apache
> Tika 2.9.2. The release contents have been pushed out to the main
> Apache release sit
, visit the project home page:
https://tika.apache.org/
-- Tim Allison, on behalf of the Apache Tika community
The vote has passed with 3 PMC +1s and no -1s.
+1s
Oleg Tikhonov
Tilman Hausherr
Tim Allison
I'll release the artifacts shortly and update the website.
Thank you, all!
Best,
Tim
On Tue, Apr 2, 2024 at 12:08 AM Oleg Tikhonov
wrote:
> +1,
> Thanks.
>
> On Mon, 1
n, 16 Oct 2023 at 11:46, Tom Conlon wrote:
>
>> Hi,
>> Would it be possible for the issue "Fix tika as a service"
>> https://issues.apache.org/jira/browse/TIKA-4152
>> to be reviewed before release?
>>
>> Thanks
>> Tom
>>
>> On M
W00t! Thank you Lewis!
On Sat, Mar 30, 2024 at 3:57 PM lewis john mcgibbney
wrote:
> Hi user@, dev@,
>
> For those running Tika on Kubernetes, you can now conveniently find the
> Helm Chart via artifacthub.io
>
> https://artifacthub.io/packages/helm/apache-tika/tika
>
> I’ll build in a little mo
A candidate for the Tika 2.9.2 release is available at:
https://dist.apache.org/repos/dist/dev/tika/2.9.2
The release candidate is a zip archive of the sources in:
https://github.com/apache/tika/tree/2.9.2-rc2/
The SHA-512 checksum of the archive is
5ac7b981aa89d44e177dfb457d6f6b73dd54d43641da31e
embedded raw bytes and the rmeta content.
Not sure what to call that endpoint. Recommendations?
On Thu, Mar 21, 2024 at 6:10 PM Tim Allison wrote:
> If rmeta/text is not returning text extracted from embedded files that’s a
> bug.
>
> I don’t think /rmeta/all is a thing.
>
> On Th
If rmeta/text is not returning text extracted from embedded files that’s a
bug.
I don’t think /rmeta/all is a thing.
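One way to check whether text from embedded files is coming back is to walk the JSON array that /rmeta returns, one object per document. This is a sketch under assumptions: the key names `X-TIKA:content` and `X-TIKA:embedded_resource_path` are what I would expect in the output but are not quoted in this thread, and the function name and the `<container>` label are made up for illustration.

```python
import json

# Assumed metadata keys in the /rmeta output (not quoted in this thread).
CONTENT_KEY = "X-TIKA:content"
PATH_KEY = "X-TIKA:embedded_resource_path"


def summarize_rmeta(raw_json: str) -> list:
    """Return (path, content type, text) tuples, one per document in the rmeta output."""
    summary = []
    for entry in json.loads(raw_json):
        # The container document is expected to have no embedded path.
        path = entry.get(PATH_KEY) or "<container>"
        content_type = entry.get("Content-Type", "")
        text = (entry.get(CONTENT_KEY) or "").strip()
        summary.append((path, content_type, text))
    return summary
```

If every entry after the first comes back with empty text, the embedded content is not being extracted, which would point to the bug described above.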
On Thu, Mar 21, 2024 at 5:21 PM Zig Zag wrote:
> Thanks Josh, that's correct but rmeta/text allows you to control this but
> it only returns one level of text (not documents embed
Doh! 'Tis the season. PDFBox has started their release cycle. Let's wait
for PDFBox 2.0.31.
On Thu, Mar 21, 2024 at 11:54 AM Tim Allison wrote:
> All,
>
> I kicked off the regression tests for 2.9.2. I'm aiming for tomorrow for
> an rc1. Again, let me know if ther
All,
I kicked off the regression tests for 2.9.2. I'm aiming for tomorrow for an
rc1. Again, let me know if there are any blockers or other things we need
to get into 2.9.2. Thank you!
Best,
Tim
On Wed, Mar 20, 2024 at 2:00 PM Tim Allison wrote:
> Fellow devs and community,
efault" tika configs just
to get the version number in tika-server. That issue fixes that problem.
This fix may improve the loading speed of tika-server, too. :D
On Wed, Mar 20, 2024 at 3:54 PM Tim Allison wrote:
> Looking at TikaConfig, it looks like the "excluded" parsers are
If you know that you're only parsing text files, you could configure only
the TextAndCSVParser and specify that it processes "application/octet-stream".
This should force every file to be processed by that parser. Something like this?
application/octet-stream
Or you could tell tik
Hi Chetan,
Need more info... An eml file contains 465MB of XML and an MP4? How big is
the mp4? Are you getting the same behavior with {{java -jar tika-app.jar -J
-t big.eml}}? From the stacktrace, can you tell where the final straw is in
memory allocation? Are you able to share the file with me of
We should also fix this: https://issues.apache.org/jira/browse/TIKA-4216
On Thu, Mar 21, 2024 at 9:13 AM Tim Allison wrote:
> This is one problem: https://issues.apache.org/jira/browse/TIKA-4215
>
> On Wed, Mar 20, 2024 at 3:25 PM Josh Burchard
> wrote:
>
>> Hi all,
>
This is one problem: https://issues.apache.org/jira/browse/TIKA-4215
On Wed, Mar 20, 2024 at 3:25 PM Josh Burchard wrote:
> Hi all,
>
> I've got Tika 2.9.1 server running on Linux and Tika is checking for the
> presence of ImageMagick. I tried disabling the TesseractOCR parser in my
> xml config
Looking at TikaConfig, it looks like the "excluded" parsers are actually
loaded and initialized, but they are not added to the composite parser if
they're on the exclude list.
We should try to avoid loading them at all if they are excluded. IIRC, this
is a bit complex in TikaConfig. Let me take a
I'm wondering if we can tighten the detection to include a newline after
the P2, etc. It looks like we require a newline for some of those file
format variants. Let me do some research, unless anyone happens to know.
On Mon, Mar 18, 2024 at 4:40 PM Kashif Khan
wrote:
> Hi,
> I tried configurin
Fellow devs and community,
I'd like to fix TIKA-4211 before the next release. It has been a while
since our last 2.x release. What do you think about aiming for starting the
voting process early next week? Any other blockers?
On Tue, Mar 19, 2024 at 7:49 PM Shu Peng wrote:
> Dear Tika Team,
>
ank you!. Will definitely provide feedback.
>
> While this gets into 3.0 officially, is there something I can prototype with
> /rmeta to help me get my other stuff working - any suggestions on approach
> or a draft PR for the official feature would be very helpful
>
> On Tue, Mar 12
Stay tuned! Coming soon: https://issues.apache.org/jira/browse/TIKA-4207
I think I'll be wiring this into the /pipes and /async endpoints. The json
request will specify that you want bytes AND text+metadata.
There will be two options:
a) you specify two emitters: one for json and one for raw byte
Hi Brian,
A few thoughts:
1) tika-app is basically tika-core + tika-parsers-standard-package. Which
components are you trying to avoid? tika-serialization and jackson?
boilerpipecontenthandler and some of its dependencies? I ask, because we
could factor out a tika-app-core with no parsers in Tik
How are you managing dependencies? Are you bringing in tika-app? Are you
using tika-parsers-standard-package?
I agree with Tilman that there's likely an older version of pdfbox on your
classpath.
On Thu, Feb 8, 2024 at 10:22 AM Tilman Hausherr
wrote:
> No, that one is fine. Could it be that a l
. :D
Best,
Tim
[0]
https://cwiki.apache.org/confluence/display/TIKA/ModifyingContentWithHandlersAndMetadataFilters#ModifyingContentWithHandlersAndMetadataFilters-4.AutoDetectParserConfig
On Sun, Feb 4, 2024 at 2:27 PM Tim Allison wrote:
> W00t! Let us know when you have m
W00t! Let us know when you have more questions!
On Fri, Feb 2, 2024 at 10:27 AM Slava G wrote:
> Thanks a lot !!
>
> On Thu, Feb 1, 2024 at 6:36 PM Tim Allison wrote:
>
>> https://tika.apache.org/3.0.0-BETA/parser_guide.html
>>
>> You should be able to add your
https://tika.apache.org/3.0.0-BETA/parser_guide.html
You should be able to add your parser in a services file, and the way the
class loading sorting works, non-tika parsers should have a higher priority
automatically. If that doesn't work, we can update the documentation to
show what that would lo
All,
I'd like to run a final html eval comparing Tika 2.x (tagsoup) and 3.x
(jsoup) on TIKA-4185. Other than that, are there any blockers or things
that we need to get into 3.0.0 before we make the first release?
If there aren't any blockers, I can aim for a 3.0.0-rc1 probably towards
the beginni
Content-Type":"message/rfc822"}]
On Thu, Jan 25, 2024 at 2:11 PM Gerardo Hernandez
wrote:
> Hi Ken,
>
> Unfortunately enforcing Tika to use TXTParser does not solve our problem
> at all, I mean it would work for very simple emails, but we also want to be
> able to par
That's not good. Thank you for sharing this with us:
https://issues.apache.org/jira/browse/TIKA-4178
On Fri, Dec 22, 2023 at 11:18 AM João Domingues
wrote:
> Dear Tika Team,
>
> I am writing to report an issue encountered while using Apache Tika's
> HttpFetcher functionality, specifically when
Please, please let us know of any other problems you find! And, thank
you, again.
On Tue, Jan 2, 2024 at 8:58 AM Tim Allison wrote:
>
> yakivy opened a PR for exactly this: https://github.com/apache/tika/pull/1522
>
> Once the checks pass, I'll merge that, and we should be go
yakivy opened a PR for exactly this: https://github.com/apache/tika/pull/1522
Once the checks pass, I'll merge that, and we should be good to go.
Thank you so much for letting us know of this bug.
On Mon, Dec 18, 2023 at 7:57 AM Andreas Hubold
wrote:
>
> Hi,
>
> I'm currently testing the upgrade
-parser-image-module and tika-parser-pdf-module?
>
> Cheers,
> Stephen.
>
> On 13/12/2023 14:40, Tim Allison wrote:
> > The Apache Tika project is pleased to announce the release of Apache
> > Tika 3.0.0-BETA. The release contents have been pushed out to the main
> > Apa
ted CVEs: CVE-2023-6481/CVE-2023-6378.
NOTE: This release requires Java 11. We plan to support the
2.x branch (which requires Java 8) for six months after the
release of 3.0.0.
-- Tim Allison, on behalf of the Apache Tika community
The vote has passed with three PMC +1s and no -1s.
+1s
Konstantin Gribov
Tilman Hausherr
Tim Allison
Thank you all. I'll try to push the artifacts and update the website
shortly.
Best,
Tim
On Mon, Dec 11, 2023 at 3:45 PM Tim Allison wrote:
> Thank you, Konstantin!
>
>
nt-type=maven&component-name=ch.qos.logback%2Flogback-core&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
> [2]: https://logback.qos.ch/news.html#1.3.14
> [3]: https://logback.qos.ch/manual/receivers.html
>
> --
> Best regards,
> Konstantin Gribov.
All,
We have two +1s. We need another +1 for the release. If a fellow dev has
the time to vote, please do! Thank you.
Best,
Tim
On Wed, Dec 6, 2023 at 3:17 PM Tim Allison wrote:
> Oops, I forgot to include my +1 for this RC1 for 3.0.0-BETA. Would another
> fellow dev be will
What version of Tika? Are you running it in Docker or uncontained?
On Wed, Dec 6, 2023 at 12:31 PM Mark Kerzner SHMsoft, Inc. <
mark.kerz...@shmsoft.com> wrote:
> Hi,
>
> I get this error:
>
> /tmp/apache-tika-server-forked-tmp-13502607376096852844: Too many open
> files
>
> Can you please help?
man
>
>
>
> On 01.12.2023 18:25, Tim Allison wrote:
> > A candidate for the Tika 3.0.0-BETA release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/3.0.0-BETA
> >
> > The release candidate is a zip archive of the sources in:
> > https://gi
A candidate for the Tika 3.0.0-BETA release is available at:
https://dist.apache.org/repos/dist/dev/tika/3.0.0-BETA
The release candidate is a zip archive of the sources in:
https://github.com/apache/tika/tree/3.0.0-BETA-rc1/
The SHA-512 checksum of the archive is
6a98e19f73e0ccf9c902cf869fb50c0c
/browse/TIKA-4153.
> How do you think I can proceed with the parsing of the document, is there
> a latest version I can download? Where exactly I can find this version to
> download?
>
> Thanks,
> Kashif
>
>
> On Wed, Oct 11, 2023 at 1:46 AM Tim Allison wrote:
>
>
A heavier-weight option is to use the tika-serialization module (which uses
Jackson databind) and do something like this:
https://github.com/apache/tika/blob/main/tika-server/tika-server-standard/src/test/java/org/apache/tika/server/standard/RecursiveMetadataResourceTest.java#L89
On Fri, Nov 3, 2
Request limits can fairly easily be implemented outside of Tika, but
> resource isolation is not, so having a solution for that as well would be
> very nice.
>
Isolation as in pipes? One file per forked jvm at a time?
> This SO (https://stackoverflow.com/questions/52495429/setting-xxmaxram)
>>
>>
I'm throwing an intro to Tika hands-on workshop on Nov 2 to celebrate World
Digital Preservation Day:
https://www.meetup.com/apache-tika-community/events/296969821/
Please let me know if anyone on the list would like to organize or speak at
a Tika-based meetup going forward.
Cheers,
Tim
>return 503 if > requests
Oops, 429, I'd guess?
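To make the 429 idea concrete, here is a minimal sketch of a request limit enforced outside of Tika, for example in a small proxy in front of tika-server. The class and method names are hypothetical, and the `handler` callable stands in for whatever forwards the request to Tika.

```python
import threading


class RequestLimiter:
    """Reject work with HTTP 429 once too many requests are in flight."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_handle(self, handler):
        """Run `handler` if a slot is free; otherwise return a 429 immediately."""
        # Non-blocking acquire: a full limiter answers right away instead of queueing.
        if not self._slots.acquire(blocking=False):
            return 429, "Too Many Requests"
        try:
            return 200, handler()
        finally:
            # Always free the slot, even if the handler raised.
            self._slots.release()
```

Answering immediately instead of queueing keeps memory use bounded on the proxy side. It does nothing for resource isolation inside the parsing process, which is a separate problem.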
On Thu, Oct 26, 2023 at 9:33 AM Tim Allison wrote:
>
> > The flags -XX:MaxRAMPercentage or -XX:MaxRAM should do the trick, but am
> > investigating if they are enforced fast enough before the system OOM kicks
> >
> The flags -XX:MaxRAMPercentage or -XX:MaxRAM should do the trick, but am
> investigating if they are enforced fast enough before the system OOM kicks
> in. So far I would say that is not the case.
This SO (https://stackoverflow.com/questions/52495429/setting-xxmaxram)
suggests that maxRAM is
Sorry for my delay.
> My preliminary conclusion is that the jvm is not able to enforce these flags
> 100% of the time quickly enough before the cgroup limits kick in and the
> kernel oom kicks in. Did anyone else experience this?
Y, that's my guess as well. There's a chance that some parsers
I released the docker image, just now, too.
Lewis or anyone else who has knowledge + time to release the helm
chart, go forth!
On Fri, Oct 20, 2023 at 2:38 PM Tim Allison wrote:
>
> The Apache Tika project is pleased to announce the release of Apache
> Tika 2.9.1. The release contents
, visit the project home page:
https://tika.apache.org/
-- Tim Allison, on behalf of the Apache Tika community
Thank you, Keith.
On Tue, Oct 17, 2023 at 11:17 PM Keith Bennett
wrote:
> Hi, Tim. You have given so, so much to this project. As far as I'm
> concerned, you *never* need to say you're sorry. ;)
>
> - Keith
>
>
> On Oct 17, 2023, at 2:59 AM, Tim Allison wrote:
The vote has passed with 3 +1s and no -1s.
+1s
Tim Allison
Tilman Hausherr
Oleg Tikhonov
Thank you, all! I'll release the artifacts and update the website shortly.
On Thu, Oct 19, 2023 at 6:52 AM Tim Allison wrote:
> My belated +1. Reports are here:
> https://corpora.tika.apac
A candidate for the Tika 2.9.1 release is available at:
https://dist.apache.org/repos/dist/dev/tika/2.9.1
The release candidate is a zip archive of the sources in:
https://github.com/apache/tika/tree/2.9.1-rc1
The SHA-512 checksum of the archive is
ba13a0d22994ca84cccd9ad2931e099051870d46a5a34402
All,
We detected and fixed an area for improvement in the version of POI that
we just upgraded to (https://bz.apache.org/bugzilla/show_bug.cgi?id=67767).
I should have caught this in earlier regression tests before the release of
POI, but I clearly botched that comparison run. I'm sorry.
My g
at, Oct 14, 2023 at 7:16 AM Tim Allison wrote:
> Looks like we have a bunch of new
> "org.apache.poi.util.RecordFormatException: Tried to allocate an array of
> length 10,xxx,xxx, but the maximum length for this record type is
>
hat value. Also "error" is now nothing.
>
> Tilman
>
> On 14.10.2023 13:16, Tim Allison wrote:
>
> Looks like we have a bunch of new
> "org.apache.poi.util.RecordFormatException: Tried to allocate an array of
>
the regression tests didn't pick this up.
The changes in rfc822 detection have also had some effects. The few
handfuls that I've reviewed are actually positive changes. I'll review
systematically on Monday.
On Sat, Oct 14, 2023 at 6:35 AM Tim Allison wrote
Reports are here:
https://corpora.tika.apache.org/base/reports/tika-2.9.1-reports.tgz
I haven't had a chance to look at them yet. :( Will take a look early
Monday (ET).
On Wed, Oct 11, 2023 at 10:24 AM Tim Allison wrote:
> Unless there are objections, I'll kick off the 2.9.1 reg
The custom parsers and contenthandler can be configured via Tika-config. We
don’t yet have a way to configure AbstractRecursive… or the
DocumentSelector.
Note that 3.x beta should be out soon.
Aside from requiring Java 11, there aren’t big changes in 3.x.
I’ll dig up examples when I’m back to a
>
>
>
> - Original message -
> From: "Tim Allison"
> To: "Tika User" , "" <
> d...@tika.apache.org>
> Cc:
> Subject: [EXTERNAL] 2.9.1 release?
> Date: Wed, Oct 11, 2023 10:25 AM
>
> Unless there are objections, I'll k
Unless there are objections, I'll kick off the 2.9.1 regression tests
shortly. I just cherry-picked TIKA-4153 into 2.x...will be interesting to
see how that works.
Best,
Tim
On Tue, Oct 10, 2023 at 1:37 PM Tim Allison wrote:
> All,
> Nandita's email didn'
> this loss of fidelity when we index their file attachments. Is there a Jira
> item where I can read about the reason behind its current implementation?
> -Josh/HCL
>
>
>
>
> From:"Tim Allison"
> To:user@tika.apache.org
> Date:10/10
ies/GHSA-cgwf-w82q-5jrr>
This is due to use of Tika Server 2.9.0 (Apache Tika – Apache Tika 1.27
<https://tika.apache.org/2.9.0/index.html>), which has commons-compress as
a dependency. I saw that Tim Allison recently updated this
*commons-compress* version in the Github mirror repo
I can confirm this is still happening in our main/3.x branch. As you
probably guessed, the issue is that the file is identified as an email and
then parsed as if it were one. If you know that all you have are plain
text files, you might consider using the TextAndCSVParser or just the
TXTParser.
O
I opened: https://issues.apache.org/jira/browse/TIKA-4152 to track this.
On Thu, Oct 5, 2023 at 5:08 PM Tom Conlon wrote:
> Forgot to add that on the same machine Solr installs and runs fine
> Started LSB: Controls Apache Solr as a Service.
>
> plus have installed tika ok in the past but the las
I regret I haven't had a chance to look at this. We got a similar email a
month ago: https://lists.apache.org/thread/mnf3pxlmvdy456v4s2b8r7mv3khl3msk
Which versions of Tika last worked for you? Was it a 2.x, or did we break
something in the 2.x branch?
On Thu, Oct 5, 2023 at 5:08 PM Tom Conlon
Will definitely explore
> the Pipes reporter and the unit tests, if I have something worthy
> documenting in the end I will give you a shout on here.
>
> Many thanks,
> Georgi
>
> On Tue, 3 Oct 2023, 20:38 Tim Allison, wrote:
>
>> I'm sorry for my delay.
>>
>
I'm sorry for my delay.
At some point, I was thinking about implementing: /async/ but I
gave up. The problem was that I didn't want to have to tie caching/storing
status info into tika-server or the async processor -- so I created a
configurable PipesReporter class...see below.
If you set up log
Y, totally get it.
How about shortening the EOL of Tika 2.x (and Java 8) to 6 months after the
Tika 3.x/Java 11 release?
On Thu, Sep 14, 2023 at 5:41 AM Sandeep Kulkarni
wrote:
> As a long time user of Tika, I would like to suggest Java 11 should be
> supported for 3.x. Java 17 is still quite n
We seem to have consensus on Java 11 for 3.x and keep Java 8 for 2.x for
one more year. I've started the branches and started making some changes
in this direction.
Is it worth pushing this modernization further or faster, with either:
a) Jump to Java 17 now and keep Java 8 in 2.x for one more ye
ote:
>>
>> +1 from our side, we moved to java 11 last year.
>>
>> Best,
>> Luis
>>
>>
>> Em ter, 12 de set de 2023 19:01, Ken Krugler
>> escreveu:
>>
>>> +1
>>>
>>> On Sep 12, 2023, at 7:56 AM, Tim Allison wrote
fixes only" 6 months after the first
release of 3.0.0?
In 3.x, we'd require Java 11 and jakarta. We wouldn't make many other
major changes.
On Tue, Sep 12, 2023 at 10:49 AM Tim Allison wrote:
> >If Tika users will be happy to move on and drop Java 8 and/or javax.
> Ple
>If Tika users will be happy to move on and drop Java 8 and/or javax.
Please drop them :)))
Fellow devs and broader Tika community, are we ok with EOL'ing Tika 2.x and
dropping support for Java 8 and javax in September 2024?
, visit the project home page:
https://tika.apache.org/
-- Tim Allison, on behalf of the Apache Tika community
This vote passes with 4 binding +1s, 1 non-binding +1 and no -1s.
Binding +1s:
Tim Allison
Konstantin Gribov
Tilman Hausherr
Oleg Tikhonov
Non-binding +1
Julien Nioche
I'll release the artifacts and update the website shortly. Thank you, all!
Best,
Tim
On Mon, Aug 28, 2023 at 10:
elease
>
> Julien
>
> On Wed, 23 Aug 2023 at 15:50, Tim Allison wrote:
>
>> A candidate for the Tika 2.9.0 release is available at:
>> https://dist.apache.org/repos/dist/dev/tika/2.9.0
>>
>> The release candidate is a zip archive of the sources in:
>> htt
A candidate for the Tika 2.9.0 release is available at:
https://dist.apache.org/repos/dist/dev/tika/2.9.0
The release candidate is a zip archive of the sources in:
https://github.com/apache/tika/tree/2.9.0-rc1/
The SHA-512 checksum of the archive is
4b54172163a2e86b805e7077b11d21902dc2137a849eb0d
I think this is a linux issue, not a Tika issue, e.g.:
https://superuser.com/questions/464290/why-is-cat-not-changing-the-access-time
On Mon, Aug 21, 2023 at 10:03 PM Gerardo Hernandez
wrote:
> Hi,
>
> I’d like to understand why I’m experiencing the following behavior and if
> it’s expected (as I
Is it possible that this is due to extra whitespace in the PDF?
On Sun, Jul 30, 2023 at 2:17 PM Keith Bennett
wrote:
> Hi, all. I am finally getting around to updating the "rika" Ruby gem for
> interacting with Tika in JRuby, and encountered something weird. When I
> test parsing a text file wit
Yes. Let us know if you find otherwise!
On Mon, Aug 14, 2023 at 11:16 AM Keith Bennett
wrote:
> Tim, thank you so much for responding. Can I rely on Content-Type to
> always be populated by a parse?
>
> - Keith
>
>
> On Mon, Aug 14, 2023 at 10:09 PM Tim Allison wrote:
>
Concur with Nick. And, y, I'd frankly copy the TesseractOCRParser into a
new module, rename it and modify it to call your OCR engine, build the jar
and add the dependency to your tika bin directory (if you're using Docker?).
On Thu, Aug 10, 2023 at 3:45 AM Cristian Zamfir
wrote:
> Hi Nick,
>
>
Content-Type may be more reliable/specific because for some file types, the
parser updates the file type during the parse. For example the PDF parser
updates application/pdf -> application/illustrator (or similar?) if the
parser determines that the file is a PDF-based Adobe Illustrator file. The
We use 17 in our docker images and one of our ci/cd pipelines uses 18. Are
you having problems?
On Tue, Jul 25, 2023 at 6:00 AM Slava G wrote:
> Hi,
> Silly question, what is the highest Java version that Tika 2.7.0 supports ?
>
> Thanks
>
es on the
-Xmx that's configured.
On Tue, Jul 25, 2023 at 7:43 AM Cristian Zamfir
wrote:
> Hello,
>
> On 21 Jul 2023 at 23:51:54, Cristian Zamfir wrote:
>
>> Hi Tim!
>>
>> Sorry for the lack of details, adding now.
>>
>> On 21 Jul 2023 at 18:56:02, T