Re: Enabling community around tika-helm
Thank you, Lewis! Add committers?

On Sat, Nov 9, 2024 at 4:08 PM lewis john mcgibbney wrote:
> Hi dev@,
>
> Over the last number of years, tika-helm [0] has been doing pretty well
> however I see a problem.
>
> I had an issue (now fixed) and was not seeing any email activity so
> essentially incoming contributions went ignored. I am addressing that this
> weekend and will push a release however I want to ask the dev@ community
> for input on enabling community building which would remove me (lewismc) as
> a bottleneck/single point of failure.
>
> The contributions I refer to above did not come from existing Tika
> Committers.
>
> Does anyone have suggestions?
>
> Thank you
> lewismc
>
> [0] https://github.com/apache/tika-helm
>
>
> Lewis J. McGibbney Ph.D
>
[jira] [Updated] (TIKA-4345) Allow body-only content extraction for msg and other email formats
[ https://issues.apache.org/jira/browse/TIKA-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4345: -- Description: At least in the OutlookExtractor, we're writing some of the headers into the content stream. For some use cases, it would be helpful to extract only the body content into the content stream. Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers that need to be modified. We're not writing the from/to etc in the RFC822Parser into the content stream. I propose that this be a non-breaking/opt-in option in 3.x, and then the default in 4.x. In thinking about this more, I think we should get rid of injection of the header info into the content in msg files in 4.x. If users want it, we can add it back and do it correctly -- in .eml, outlook and pst. It is weird that we currently have it only in msg. So, for 3.x, I propose that we allow users to turn this off in msg files. For 4.x, we just won't do it...unless someone opens a ticket. Let me know what you think/if there are any objections. was: At least in the OutlookExtractor, we're writing some of the headers into the content stream. For some use cases, it would be helpful to extract only the body content into the content stream. Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers that need to be modified. We're not writing the from/to etc in the RFC822Parser into the content stream. I propose that this be a non-breaking/opt-in option in 3.x, and then the default in 4.x. > Allow body-only content extraction for msg and other email formats > -- > > Key: TIKA-4345 > URL: https://issues.apache.org/jira/browse/TIKA-4345 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > > At least in the OutlookExtractor, we're writing some of the headers into the > content stream. For some use cases, it would be helpful to extract only the > body content into the content stream.
> Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers > that need to be modified. We're not writing the from/to etc in the > RFC822Parser into the content stream. > I propose that this be a non-breaking/opt-in option in 3.x, and then the > default in 4.x. > In thinking about this more, I think we should get rid of injection of the > header info into the content in msg files in 4.x. If users want it, we can > add it back and do it correctly -- in .eml, outlook and pst. It is weird that > we currently have it only in msg. > So, for 3.x, I propose that we allow users to turn this off in msg files. For > 4.x, we just won't do it...unless someone opens a ticket. > Let me know what you think/if there are any objections. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4345) Allow body-only content extraction for msg and other email formats
[ https://issues.apache.org/jira/browse/TIKA-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4345: -- Description: At least in the OutlookExtractor, we're writing some of the headers into the content stream. For some use cases, it would be helpful to extract only the body content into the content stream. Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers that need to be modified. We're not writing the from/to etc in the RFC822Parser into the content stream. I propose that this be a non-breaking/opt-in option in 3.x, and then the default in 4.x. was: At least in the OutlookExtractor, we're writing some of the headers into the content stream. For some use cases, it would be helpful to extract only the body content into the content stream. Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers that need to be modified. We're not writing the from/to etc in the RFC822Parser into the content stream. I propose that this be a non-breaking/opt-in option. > Allow body-only content extraction for msg and other email formats > -- > > Key: TIKA-4345 > URL: https://issues.apache.org/jira/browse/TIKA-4345 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > > At least in the OutlookExtractor, we're writing some of the headers into the > content stream. For some use cases, it would be helpful to extract only the > body content into the content stream. > Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers > that need to be modified. We're not writing the from/to etc in the > RFC822Parser into the content stream. > I propose that this be a non-breaking/opt-in option in 3.x, and then the > default in 4.x. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4345) Allow body-only content extraction for msg and other email formats
Tim Allison created TIKA-4345: - Summary: Allow body-only content extraction for msg and other email formats Key: TIKA-4345 URL: https://issues.apache.org/jira/browse/TIKA-4345 Project: Tika Issue Type: Task Reporter: Tim Allison At least in the OutlookParser, we're writing some of the headers into the content stream. For some use cases, it would be helpful to extract only the body content into the content stream. I propose that this be a non-breaking/opt-in option. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4345) Allow body-only content extraction for msg and other email formats
[ https://issues.apache.org/jira/browse/TIKA-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4345: -- Description: At least in the OutlookExtractor, we're writing some of the headers into the content stream. For some use cases, it would be helpful to extract only the body content into the content stream. Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers that need to be modified. We're not writing the from/to etc in the RFC822Parser into the content stream. I propose that this be a non-breaking/opt-in option. was: At least in the OutlookParser, we're writing some of the headers into the content stream. For some use cases, it would be helpful to extract only the body content into the content stream. I propose that this be a non-breaking/opt-in option. > Allow body-only content extraction for msg and other email formats > -- > > Key: TIKA-4345 > URL: https://issues.apache.org/jira/browse/TIKA-4345 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > > At least in the OutlookExtractor, we're writing some of the headers into the > content stream. For some use cases, it would be helpful to extract only the > body content into the content stream. > Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers > that need to be modified. We're not writing the from/to etc in the > RFC822Parser into the content stream. > I propose that this be a non-breaking/opt-in option. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4344) Add wrapper for magika detector
[ https://issues.apache.org/jira/browse/TIKA-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4344. --- Fix Version/s: 4.0.0 3.1.0 Resolution: Fixed > Add wrapper for magika detector > --- > > Key: TIKA-4344 > URL: https://issues.apache.org/jira/browse/TIKA-4344 > Project: Tika > Issue Type: Task > Reporter: Tim Allison >Priority: Minor > Fix For: 4.0.0, 3.1.0 > > > https://github.com/google/magika > See also: https://www.youtube.com/watch?v=PBbld8xB2Bo -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4344) Add wrapper for magika detector
Tim Allison created TIKA-4344: - Summary: Add wrapper for magika detector Key: TIKA-4344 URL: https://issues.apache.org/jira/browse/TIKA-4344 Project: Tika Issue Type: Task Reporter: Tim Allison https://github.com/google/magika See also: https://www.youtube.com/watch?v=PBbld8xB2Bo -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4337) Improvements to recent xps mods
[ https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4337. --- Fix Version/s: 4.0.0 3.1.0 Resolution: Fixed > Improvements to recent xps mods > --- > > Key: TIKA-4337 > URL: https://issues.apache.org/jira/browse/TIKA-4337 > Project: Tika > Issue Type: Task > Reporter: Tim Allison >Priority: Minor > Fix For: 4.0.0, 3.1.0 > > Attachments: xps-reports.tgz > > > I pulled 249 xps files out of the latest commoncrawl crawl and compared > 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few > number format exceptions where a comma-delimited string is parsed as if it > were an integer. > Reports are attached. See esp. new_exceptions_in_b_details.xlsx and > content_diffs_no_exceptions.xlsx. > The source files are available here: > https://corpora.tika.apache.org/base/share/xps.tgz -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4337) Improvements to recent xps mods
[ https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17894629#comment-17894629 ] Tim Allison commented on TIKA-4337: --- Y, I completely agree about the "opportunistic improvement." I think this could be an area for future work, but it is not applicable broadly. The licenses for those files are definitely not Apache 2.0 compliant... so we can't include them directly in our unit tests. :( However, I could put them in our regression corpus, and we'd see changes whenever we run large scale regression testing before a release. This is not ideal, but is the best we can do. Do any fellow devs ([~tilman] [~nick] ?) know if we could try to download the files as part of the build process and then incorporate local copies into unit tests? I know PDFBox downloads some files for unit tests, but I don't know what their licensing is... Or does this go against the spirit of the Apache license? > Improvements to recent xps mods > --- > > Key: TIKA-4337 > URL: https://issues.apache.org/jira/browse/TIKA-4337 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > Attachments: xps-reports.tgz > > > I pulled 249 xps files out of the latest commoncrawl crawl and compared > 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few > number format exceptions where a comma-delimited string is parsed as if it > were an integer. > Reports are attached. See esp. new_exceptions_in_b_details.xlsx and > content_diffs_no_exceptions.xlsx. > The source files are available here: > https://corpora.tika.apache.org/base/share/xps.tgz -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4343) Remove agepredictor in 4.x
Tim Allison created TIKA-4343: - Summary: Remove agepredictor in 4.x Key: TIKA-4343 URL: https://issues.apache.org/jira/browse/TIKA-4343 Project: Tika Issue Type: Task Reporter: Tim Allison -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4341) Fix deserialization of MetadataListFilter
[ https://issues.apache.org/jira/browse/TIKA-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4341. --- Fix Version/s: 4.0.0 3.1.0 Resolution: Fixed > Fix deserialization of MetadataListFilter > - > > Key: TIKA-4341 > URL: https://issues.apache.org/jira/browse/TIKA-4341 > Project: Tika > Issue Type: Bug > Reporter: Tim Allison >Priority: Trivial > Fix For: 4.0.0, 3.1.0 > > > MetadataListFilter in its {{load(_)}} should expect children of type > MetadataListFilter, not MetadataFilter. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4342) Remove tika-batch from tika-eval's FileProfiler
Tim Allison created TIKA-4342: - Summary: Remove tika-batch from tika-eval's FileProfiler Key: TIKA-4342 URL: https://issues.apache.org/jira/browse/TIKA-4342 Project: Tika Issue Type: Sub-task Reporter: Tim Allison FileProfiler is the simplest handler. Let's start there. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4340) Remove tika-batch from tika-app
[ https://issues.apache.org/jira/browse/TIKA-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4340. --- Resolution: Fixed > Remove tika-batch from tika-app > --- > > Key: TIKA-4340 > URL: https://issues.apache.org/jira/browse/TIKA-4340 > Project: Tika > Issue Type: Sub-task > Reporter: Tim Allison >Priority: Major > > Remove tika-batch option from tika-app and support translating basic > commandline args into a call to tika-pipes. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4341) Fix deserialization of MetadataListFilter
Tim Allison created TIKA-4341: - Summary: Fix deserialization of MetadataListFilter Key: TIKA-4341 URL: https://issues.apache.org/jira/browse/TIKA-4341 Project: Tika Issue Type: Bug Reporter: Tim Allison MetadataListFilter in its {{load(_)}} should expect children of type MetadataListFilter, not MetadataFilter. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Apache Tika in the ASF project spotlight
https://news.apache.org/foundation/entry/asf-project-spotlight-apache-tika
[jira] [Commented] (TIKA-4314) CompositeParser returns only one parser per content type
[ https://issues.apache.org/jira/browse/TIKA-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893842#comment-17893842 ] Tim Allison commented on TIKA-4314: --- Sorry for dropping the ball on this. The SupplementingParser is definitely the way to go with this. I looked at it this morning and we haven't wired up the serialization/configuration so that you can easily specify which component parsers go into that Parser. If we did that, then we would be able to configure {{o.a.t.p.external2.ExternalParser}}s for each commandline you wanted. If you don't need to configure this via xml, e.g. you're running Tika programmatically, this should not be too hard. > CompositeParser returns only one parser per content type > > > Key: TIKA-4314 > URL: https://issues.apache.org/jira/browse/TIKA-4314 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 2.9.2 >Reporter: Leszek Sliwko >Priority: Major > Attachments: duration-test-2.avi, geolocation-test-1.jpg, > geolocation-test-2.jpg > > > External parsers can have many supported content types, but information is > lost in CompositeParser: > > public Map<MediaType, Parser> getParsers(ParseContext context) { > Map<MediaType, Parser> map = new HashMap<>(); > for (Parser parser : parsers) { > for (MediaType type : parser.getSupportedTypes(context)) > { map.put(registry.normalize(type), parser); } > } > return map; > } > > To recreate - parse any avi file (content type: video/x-msvideo). Only the > exiftool will be picked up and the ffmpeg parser won't be executed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4314) CompositeParser returns only one parser per content type
[ https://issues.apache.org/jira/browse/TIKA-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893880#comment-17893880 ] Tim Allison commented on TIKA-4314: --- Got it. Thank you. I like what you've done. There are a few challenges with this route. The default legacy ExternalParser that is loaded by TikaConfig or by default is a CompositeParser. If we change the behavior of the CompositeParser, that will have unintended consequences on other combinations of parsers. The basic design in Tika is one parser per file type. Another issue is that this relies on the legacy ExternalParser which wraps a number of external parsers. We're moving towards the more robust and flexible {{o.a.t.p.external2.ExternalParser}}. So, I don't think we want to have such a major change hinging on something that will be deprecated in 3.x and removed by 4.x (maybe? depending on community discussions/feedback). I think it would be much better to use the SupplementingParser, and have it wrap the ExternalParsers that you want. If we head in this direction, what will it take to get this working for you? Are you able to configure your parsers programmatically, or are you using tika-server or something else where you need to configure the parsers via tika-config.xml? 
> CompositeParser returns only one parser per content type > > > Key: TIKA-4314 > URL: https://issues.apache.org/jira/browse/TIKA-4314 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 2.9.2 >Reporter: Leszek Sliwko >Priority: Major > Attachments: CompositeParser.java, duration-test-2.avi, > geolocation-test-1.jpg, geolocation-test-2.jpg > > > External parsers can have many supported content types, but information is > lost in CompositeParser: > > public Map<MediaType, Parser> getParsers(ParseContext context) { > Map<MediaType, Parser> map = new HashMap<>(); > for (Parser parser : parsers) { > for (MediaType type : parser.getSupportedTypes(context)) > { map.put(registry.normalize(type), parser); } > } > return map; > } > > To recreate - parse any avi file (content type: video/x-msvideo). Only the > exiftool will be picked up and the ffmpeg parser won't be executed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
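The behaviour reported in TIKA-4314 can be reproduced without Tika at all, since it comes down to `Map.put` replacing an earlier value for the same key. The following is a minimal sketch using only the JDK; `NamedParser`, `lastWins` and `keepAll` are hypothetical names invented for illustration, and the registration order (ffmpeg before exiftool) is an assumption chosen to match the outcome described in the issue. It is not Tika's code and not a proposed patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ParserMapSketch {

    /** Hypothetical stand-in for a parser: a name plus the media types it claims. */
    static final class NamedParser {
        final String name;
        final List<String> supportedTypes;

        NamedParser(String name, List<String> supportedTypes) {
            this.name = name;
            this.supportedTypes = supportedTypes;
        }
    }

    /** Same shape as the quoted loop: one slot per type, so a later parser replaces an earlier one. */
    static Map<String, String> lastWins(List<NamedParser> parsers) {
        Map<String, String> map = new HashMap<>();
        for (NamedParser parser : parsers) {
            for (String type : parser.supportedTypes) {
                map.put(type, parser.name); // overwrites any parser registered earlier for this type
            }
        }
        return map;
    }

    /** For comparison only: keeps every parser that claims a type, in registration order. */
    static Map<String, List<String>> keepAll(List<NamedParser> parsers) {
        Map<String, List<String>> map = new LinkedHashMap<>();
        for (NamedParser parser : parsers) {
            for (String type : parser.supportedTypes) {
                map.computeIfAbsent(type, k -> new ArrayList<>()).add(parser.name);
            }
        }
        return map;
    }

    public static void main(String[] args) {
        List<NamedParser> parsers = List.of(
                new NamedParser("ffmpeg", List.of("video/x-msvideo")),
                new NamedParser("exiftool", List.of("video/x-msvideo", "image/jpeg")));
        System.out.println(lastWins(parsers).get("video/x-msvideo"));
        System.out.println(keepAll(parsers).get("video/x-msvideo"));
    }
}
```

The second method only shows what "keep everything" would look like as a data structure. As noted in the comment above, the one-parser-per-type design is intentional, which is why wrapping the external parsers in a SupplementingParser is suggested instead of changing the map.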
[jira] [Created] (TIKA-4340) Remove tika-batch from tika-app
Tim Allison created TIKA-4340: - Summary: Remove tika-batch from tika-app Key: TIKA-4340 URL: https://issues.apache.org/jira/browse/TIKA-4340 Project: Tika Issue Type: Sub-task Reporter: Tim Allison Remove tika-batch option from tika-app and support translating basic commandline args into a call to tika-pipes. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4322) Create branch_3x and update main to 4.0.0-SNAPSHOT and Java 17
[ https://issues.apache.org/jira/browse/TIKA-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893625#comment-17893625 ] Tim Allison commented on TIKA-4322: --- I just updated the Jenkins main-jdk17 to push to the snapshot repo. > Create branch_3x and update main to 4.0.0-SNAPSHOT and Java 17 > -- > > Key: TIKA-4322 > URL: https://issues.apache.org/jira/browse/TIKA-4322 > Project: Tika > Issue Type: Task > Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4337) Improvements to recent xps mods
[ https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893603#comment-17893603 ] Tim Allison commented on TIKA-4337: --- This is relevant for text extraction. I am NOT suggesting that you implement this, though. It looks like xps also has structure/tags/structure info like PDF does to try to group text pieces logically. If you look at: 10b9b1c63da0c725f74256f22bbd4956a64b35cea3edc6ab6a43eeb7710888d6, there's a Structure directory under Document, and in the Fragments subdir, there are lists of which text runs should be in the same paragraph. {code:xml} {code} > Improvements to recent xps mods > --- > > Key: TIKA-4337 > URL: https://issues.apache.org/jira/browse/TIKA-4337 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > Attachments: xps-reports.tgz > > > I pulled 249 xps files out of the latest commoncrawl crawl and compared > 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few > number format exceptions where a comma-delimited string is parsed as if it > were an integer. > Reports are attached. See esp. new_exceptions_in_b_details.xlsx and > content_diffs_no_exceptions.xlsx. > The source files are available here: > https://corpora.tika.apache.org/base/share/xps.tgz -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4337) Improvements to recent xps mods
[ https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893594#comment-17893594 ] Tim Allison commented on TIKA-4337: --- Again, unrelated to the text extraction work -- this is just something... It looks like xps "stripes" images just like some PDF creators do/used to do. See: a28ab64ba223643c6a30d542deb543e2ea3acac911f04a7784c9d3d9f583df01 > Improvements to recent xps mods > --- > > Key: TIKA-4337 > URL: https://issues.apache.org/jira/browse/TIKA-4337 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > Attachments: xps-reports.tgz > > > I pulled 249 xps files out of the latest commoncrawl crawl and compared > 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few > number format exceptions where a comma-delimited string is parsed as if it > were an integer. > Reports are attached. See esp. new_exceptions_in_b_details.xlsx and > content_diffs_no_exceptions.xlsx. > The source files are available here: > https://corpora.tika.apache.org/base/share/xps.tgz -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4337) Improvements to recent xps mods
[ https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893589#comment-17893589 ] Tim Allison commented on TIKA-4337: --- [~ruairidh-next], this is totally unrelated to the text extraction code that you're working on, but if you come across attachments in any of these files, please let me know. I don't even know if xps allows it (it must?!), and I plan to do my own analysis on the recent batch of files. > Improvements to recent xps mods > --- > > Key: TIKA-4337 > URL: https://issues.apache.org/jira/browse/TIKA-4337 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > Attachments: xps-reports.tgz > > > I pulled 249 xps files out of the latest commoncrawl crawl and compared > 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few > number format exceptions where a comma-delimited string is parsed as if it > were an integer. > Reports are attached. See esp. new_exceptions_in_b_details.xlsx and > content_diffs_no_exceptions.xlsx. > The source files are available here: > https://corpora.tika.apache.org/base/share/xps.tgz -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4337) Improvements to recent xps mods
[ https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893518#comment-17893518 ] Tim Allison commented on TIKA-4337: --- Fantastic. Thank you! > Improvements to recent xps mods > --- > > Key: TIKA-4337 > URL: https://issues.apache.org/jira/browse/TIKA-4337 > Project: Tika > Issue Type: Task > Reporter: Tim Allison >Priority: Minor > Attachments: xps-reports.tgz > > > I pulled 249 xps files out of the latest commoncrawl crawl and compared > 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few > number format exceptions where a comma-delimited string is parsed as if it > were an integer. > Reports are attached. See esp. new_exceptions_in_b_details.xlsx and > content_diffs_no_exceptions.xlsx. > The source files are available here: > https://corpora.tika.apache.org/base/share/xps.tgz -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4338) Remove use of EOL component TagSoup 1.2.1 from tika-parser-code-module
[ https://issues.apache.org/jira/browse/TIKA-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892818#comment-17892818 ] Tim Allison commented on TIKA-4338: --- Thank you for opening this issue [~sandeep_kulkarni]. > Remove use of EOL component TagSoup 1.2.1 from tika-parser-code-module > -- > > Key: TIKA-4338 > URL: https://issues.apache.org/jira/browse/TIKA-4338 > Project: Tika > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Sandeep Kulkarni >Priority: Major > Fix For: 4.0.0, 3.1.0 > > > As per the release notes for Tika 3.0.0, TagSoup is mentioned as replaced > with JSoup. I had requested for its removal earlier in TIKA-4109. > So I integrated Tika 3.0.0 and found that TagSoup is still shown as one of > the dependency component of tika-parser-code-module. It seems to be only > removed from tika-parser-html-module. > So is it possible to completely get rid of TagSoup from Tika as it is EOL? > tika-parser-code-module has dependency of > *org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1.* -- This message was sent by Atlassian Jira (v8.20.10#820010)
Next 3.x should be 3.1.0?
All,

In looking already at the mods to the 3.x branch, I think the next release will be a minor revision, not a patch. In short, I think we should go with 3.1.0 for the next 3.x release. Let me know if you disagree.

Best,
Tim
[jira] [Resolved] (TIKA-4338) Remove use of EOL component TagSoup 1.2.1 from tika-parser-code-module
[ https://issues.apache.org/jira/browse/TIKA-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4338. --- Fix Version/s: 3.1.0 4.0.0 Assignee: (was: Tim Allison) Resolution: Fixed > Remove use of EOL component TagSoup 1.2.1 from tika-parser-code-module > -- > > Key: TIKA-4338 > URL: https://issues.apache.org/jira/browse/TIKA-4338 > Project: Tika > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Sandeep Kulkarni >Priority: Major > Fix For: 3.1.0, 4.0.0 > > > As per the release notes for Tika 3.0.0, TagSoup is mentioned as replaced > with JSoup. I had requested for its removal earlier in TIKA-4109. > So I integrated Tika 3.0.0 and found that TagSoup is still shown as one of > the dependency component of tika-parser-code-module. It seems to be only > removed from tika-parser-html-module. > So is it possible to completely get rid of TagSoup from Tika as it is EOL? > tika-parser-code-module has dependency of > *org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1.* -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (TIKA-4339) Wrong mimetype for font
[ https://issues.apache.org/jira/browse/TIKA-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-4339: - Assignee: Tim Allison > Wrong mimetype for font > --- > > Key: TIKA-4339 > URL: https://issues.apache.org/jira/browse/TIKA-4339 > Project: Tika > Issue Type: Bug > Components: mime >Reporter: Gustavo de Oliveira Silva > Assignee: Tim Allison >Priority: Minor > Attachments: suggestion.diff > > > The current font mimetypes are `application/x-font-otf` and > `application/x-font-ttf`. > They have been this way since 2009. But with RFC8081, IANA added the type `font/*` for > fonts. > IANA: [https://www.iana.org/assignments/media-types/media-types.xhtml#font] > RFC8081: [https://www.rfc-editor.org/rfc/rfc8081.html] > > The change is almost only `application/x-font-otf` to `font/otf` and > `application/x-font-ttf` to `font/ttf`. > The attached file is a suggestion. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4339) Wrong mimetype for font
[ https://issues.apache.org/jira/browse/TIKA-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892812#comment-17892812 ] Tim Allison commented on TIKA-4339: --- Thank you for opening this issue. A few questions... Do we leave the other fonts as is? There are a bunch of other font types. What do we do with those? I see woff in the spec, but I don't think we handle that. As a separate issue, should we add woff and woff2 detection? Is this a breaking enough change that we should keep it only in the 4.x branch and not make the change in the 3.x branch? > Wrong mimetype for font > --- > > Key: TIKA-4339 > URL: https://issues.apache.org/jira/browse/TIKA-4339 > Project: Tika > Issue Type: Bug > Components: mime >Reporter: Gustavo de Oliveira Silva >Assignee: Tim Allison >Priority: Minor > Attachments: suggestion.diff > > > The current font mimetypes are `application/x-font-otf` and > `application/x-font-ttf`. > They have been this way since 2009. But with RFC8081, IANA added the type `font/*` for > fonts. > IANA: [https://www.iana.org/assignments/media-types/media-types.xhtml#font] > RFC8081: [https://www.rfc-editor.org/rfc/rfc8081.html] > > The change is almost only `application/x-font-otf` to `font/otf` and > `application/x-font-ttf` to `font/ttf`. > The attached file is a suggestion. -- This message was sent by Atlassian Jira (v8.20.10#820010)
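On the question of whether the rename is breaking: one way a downstream consumer could cope, whichever branch the change lands in, is to normalise the legacy names to the RFC 8081 names on its own side. A rough sketch of that idea follows; `FontTypeSketch` and `normalize` are hypothetical names for illustration only, this is not how Tika's mime registry is configured, and only the two types named in the issue are covered.

```java
import java.util.Map;

final class FontTypeSketch {

    // Legacy names from the issue mapped to the font/* names registered by RFC 8081.
    private static final Map<String, String> LEGACY_TO_RFC8081 = Map.of(
            "application/x-font-otf", "font/otf",
            "application/x-font-ttf", "font/ttf");

    /** Returns the RFC 8081 name for a known legacy font type, or the input unchanged. */
    static String normalize(String mediaType) {
        return LEGACY_TO_RFC8081.getOrDefault(mediaType, mediaType);
    }

    public static void main(String[] args) {
        System.out.println(normalize("application/x-font-ttf"));
        System.out.println(normalize("text/plain"));
    }
}
```

Code that compares detected types against hard-coded strings is where a rename like this tends to bite, so a single normalisation point keeps old and new Tika versions interchangeable for such callers.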
[jira] [Assigned] (TIKA-4338) Remove use of EOL component TagSoup 1.2.1 from tika-parser-code-module
[ https://issues.apache.org/jira/browse/TIKA-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-4338: - Assignee: Tim Allison > Remove use of EOL component TagSoup 1.2.1 from tika-parser-code-module > -- > > Key: TIKA-4338 > URL: https://issues.apache.org/jira/browse/TIKA-4338 > Project: Tika > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Sandeep Kulkarni > Assignee: Tim Allison >Priority: Major > > As per the release notes for Tika 3.0.0, TagSoup is mentioned as replaced > with JSoup. I had requested for its removal earlier in TIKA-4109. > So I integrated Tika 3.0.0 and found that TagSoup is still shown as one of > the dependency component of tika-parser-code-module. It seems to be only > removed from tika-parser-html-module. > So is it possible to completely get rid of TagSoup from Tika as it is EOL? > tika-parser-code-module has dependency of > *org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1.* -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4337) Improvements to recent xps mods
[ https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892588#comment-17892588 ] Tim Allison edited comment on TIKA-4337 at 10/24/24 6:43 PM: - cc [~ruairidh-next] ... no good deed goes unpunished. :D You've done plenty. I can make these fixes. If you're curious, though, please do take a look. was (Author: talli...@mitre.org): cc [~ruairidh-next] ... no good deed goes unpunished. :D > Improvements to recent xps mods > --- > > Key: TIKA-4337 > URL: https://issues.apache.org/jira/browse/TIKA-4337 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > Attachments: xps-reports.tgz > > > I pulled 249 xps files out of the latest commoncrawl crawl and compared > 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few > number format exceptions where a comma-delimited string is parsed as if it > were an integer. > Reports are attached. See esp. new_exceptions_in_b_details.xlsx and > content_diffs_no_exceptions.xlsx. > The source files are available here: > https://corpora.tika.apache.org/base/share/xps.tgz -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4337) Improvements to recent xps mods
[ https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892588#comment-17892588 ] Tim Allison commented on TIKA-4337: --- cc [~ruairidh-next] ... no good deed goes unpunished. :D > Improvements to recent xps mods > --- > > Key: TIKA-4337 > URL: https://issues.apache.org/jira/browse/TIKA-4337 > Project: Tika > Issue Type: Task > Reporter: Tim Allison >Priority: Minor > Attachments: xps-reports.tgz > > > I pulled 249 xps files out of the latest commoncrawl crawl and compared > 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few > number format exceptions where a comma-delimited string is parsed as if it > were an integer. > Reports are attached. See esp. new_exceptions_in_b_details.xlsx and > content_diffs_no_exceptions.xlsx. > The source files are available here: > https://corpora.tika.apache.org/base/share/xps.tgz -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4337) Improvements to recent xps mods
Tim Allison created TIKA-4337: - Summary: Improvements to recent xps mods Key: TIKA-4337 URL: https://issues.apache.org/jira/browse/TIKA-4337 Project: Tika Issue Type: Task Reporter: Tim Allison Attachments: xps-reports.tgz I pulled 249 xps files out of the latest commoncrawl crawl and compared 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few number format exceptions where a comma-delimited string is parsed as if it were an integer. Reports are attached. See esp. new_exceptions_in_b_details.xlsx and content_diffs_no_exceptions.xlsx. The source files are available here: https://corpora.tika.apache.org/base/share/xps.tgz -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4336) 'application/json' is not instance of 'text/plain'
[ https://issues.apache.org/jira/browse/TIKA-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4336. --- Fix Version/s: 3.0.1 4.0.0 Resolution: Fixed Thank you [~valfirst]! > 'application/json' is not instance of 'text/plain' > -- > > Key: TIKA-4336 > URL: https://issues.apache.org/jira/browse/TIKA-4336 > Project: Tika > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Valery Yatsynovich >Priority: Major > Fix For: 3.0.1, 4.0.0 > > > {{MediaTypeRegistry.getDefaultRegistry().isInstanceOf("application/json", > MediaType.TEXT_PLAIN)}} > => true on version {{2.9.2}} > => false on version {{3.0.0}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4315) XPS file parser does not emit whitespace as expected
[ https://issues.apache.org/jira/browse/TIKA-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4315. --- Fix Version/s: 2.9.3 3.0.1 4.0.0 Resolution: Fixed Thank you [~ruairidh-next]! > XPS file parser does not emit whitespace as expected > > > Key: TIKA-4315 > URL: https://issues.apache.org/jira/browse/TIKA-4315 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.1, 2.9.2 >Reporter: Ruairidh Williamson >Priority: Major > Fix For: 2.9.3, 3.0.1, 4.0.0 > > Attachments: testXLSX.xps > > > We are using tika to extract text from XPS files and have hit an issue where > whitespace is not emitted where we would expect. See the attached example > file where opening the file it visually has a large gap between "x" and > "abcde1234f" but when extracted by tika it calls `characters` with "x" and > then `characters` on "abcde1234f". We would expect a `ignorableWhitespace` in > between those calls but we don't get one. > I have a pull request that fixes the issue which I will submit. -- This message was sent by Atlassian Jira (v8.20.10#820010)
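The whitespace symptom described in TIKA-4315 can be illustrated with a minimal sketch. It assumes a consumer that simply appends every SAX text event to a buffer; the class and method names (`WhitespaceDemo`, `TextCollector`, `emit`) are hypothetical and are not part of Tika or of the actual fix. It only shows why two back-to-back `characters` calls run the tokens together unless an `ignorableWhitespace` event arrives between them.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; this is not Tika's XPS parser code.
class WhitespaceDemo {

    /** Appends every text event to a buffer, as a naive SAX consumer might. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Replays the two text runs from the report, optionally separated by whitespace. */
    static String emit(boolean withWhitespace) {
        TextCollector handler = new TextCollector();
        char[] first = "x".toCharArray();
        char[] second = "abcde1234f".toCharArray();
        char[] gap = " ".toCharArray();

        handler.characters(first, 0, first.length);
        if (withWhitespace) {
            // The event the reporter expected between the two runs.
            handler.ignorableWhitespace(gap, 0, gap.length);
        }
        handler.characters(second, 0, second.length);
        return handler.text.toString();
    }
}
```

Without the whitespace event, downstream consumers see the two runs as a single token, which matters for tokenization and search.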
[jira] [Commented] (TIKA-4336) 'application/json' is not instance of 'text/plain'
[ https://issues.apache.org/jira/browse/TIKA-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892119#comment-17892119 ] Tim Allison commented on TIKA-4336: --- TIKA-4119 changed {{application/javascript}} -> {{text/javascript}}. json is a subclass of {{application/javascript}} so the lookup is now missing the path to {{text/plain}}. We can modify the subclass of json to {{text/javascript}} and we should be good to go. If you have time, a PR with a new unit test based on your code above would help move this more quickly. Thank you! > 'application/json' is not instance of 'text/plain' > -- > > Key: TIKA-4336 > URL: https://issues.apache.org/jira/browse/TIKA-4336 > Project: Tika > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Valery Yatsynovich >Priority: Major > > {{MediaTypeRegistry.getDefaultRegistry().isInstanceOf("application/json", > MediaType.TEXT_PLAIN)}} > => true on version {{2.9.2}} > => false on version {{3.0.0}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
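The explanation above can be sketched as a toy model of a supertype lookup. This is an assumption-laden illustration that uses a plain map of child to parent types; it is not how `MediaTypeRegistry` is implemented, and `TypeHierarchySketch`, `registry` and `isInstanceOf` here are hypothetical names. It shows why a parent entry that no longer resolves breaks the path from `application/json` to `text/plain`, and why pointing json at `text/javascript` restores it.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a media type hierarchy; not Tika's MediaTypeRegistry.
class TypeHierarchySketch {

    /** Builds a child -> parent map in which json's parent is configurable. */
    static Map<String, String> registry(String jsonParent) {
        Map<String, String> parents = new HashMap<>();
        // After the rename, only text/javascript is registered with a parent.
        parents.put("text/javascript", "text/plain");
        parents.put("application/json", jsonParent);
        return parents;
    }

    /** Walks the parent chain from type, looking for ancestor. */
    static boolean isInstanceOf(Map<String, String> parents, String type, String ancestor) {
        String current = type;
        // Bound the walk so a cyclic map cannot loop forever.
        for (int hops = 0; current != null && hops <= parents.size(); hops++) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parents.get(current);
        }
        return false;
    }
}
```

With `registry("application/javascript")` the walk stops at a type that has no parent entry, so the check against `text/plain` fails; with `registry("text/javascript")` it succeeds.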
[jira] [Resolved] (TIKA-4330) Add a MetadataListFilter
[ https://issues.apache.org/jira/browse/TIKA-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4330. --- Fix Version/s: 4.0.0 3.0.1 Resolution: Fixed > Add a MetadataListFilter > > > Key: TIKA-4330 > URL: https://issues.apache.org/jira/browse/TIKA-4330 > Project: Tika > Issue Type: Task > Reporter: Tim Allison >Priority: Minor > Fix For: 4.0.0, 3.0.1 > > > We currently have MetadataFilters that operate on a single metadata instance. > There are some use cases where a filter on one metadata instance in the list > needs access to other information in the other metadata objects in the list. > The simplest use case for this would be to populate an "attachment_count" > metadata field in the parent's metadata object. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4335) Refactor tika-server to avoid a shaded/fat jar in 4.x
Tim Allison created TIKA-4335: - Summary: Refactor tika-server to avoid a shaded/fat jar in 4.x Key: TIKA-4335 URL: https://issues.apache.org/jira/browse/TIKA-4335 Project: Tika Issue Type: Task Reporter: Tim Allison -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4333) Remove tika-batch from 4.x/main
Tim Allison created TIKA-4333: - Summary: Remove tika-batch from 4.x/main Key: TIKA-4333 URL: https://issues.apache.org/jira/browse/TIKA-4333 Project: Tika Issue Type: Task Components: tika-batch Reporter: Tim Allison Move batch processing in tika-app and tika-eval to tika-pipes. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4334) Move tika pipes components in tika-core to tika-pipes-core in 4.x
Tim Allison created TIKA-4334: - Summary: Move tika pipes components in tika-core to tika-pipes-core in 4.x Key: TIKA-4334 URL: https://issues.apache.org/jira/browse/TIKA-4334 Project: Tika Issue Type: Task Reporter: Tim Allison -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4332) Consider removing dotnet module in 4.x/main
Tim Allison created TIKA-4332: - Summary: Consider removing dotnet module in 4.x/main Key: TIKA-4332 URL: https://issues.apache.org/jira/browse/TIKA-4332 Project: Tika Issue Type: Task Reporter: Tim Allison The dotnet module hasn't been updated since 1.11. Unless there are objections, let's remove it in 4.x. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4331) Bump tika-docker to ubuntu:oracular?
Tim Allison created TIKA-4331: - Summary: Bump tika-docker to ubuntu:oracular? Key: TIKA-4331 URL: https://issues.apache.org/jira/browse/TIKA-4331 Project: Tika Issue Type: Task Components: tika-docker Reporter: Tim Allison Should we bump the base image to oracular? I don't know enough about the diffs from noble. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4330) Add a MetadataListFilter
Tim Allison created TIKA-4330: - Summary: Add a MetadataListFilter Key: TIKA-4330 URL: https://issues.apache.org/jira/browse/TIKA-4330 Project: Tika Issue Type: Task Reporter: Tim Allison We currently have MetadataFilters that operate on a single metadata instance. There are some use cases where a filter on one metadata instance in the list needs access to other information in the other metadata objects in the list. The simplest use case for this would be to populate an "attachment_count" metadata field in the parent's metadata object. -- This message was sent by Atlassian Jira (v8.20.10#820010)
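The "attachment_count" use case can be sketched with plain collections. This is a minimal sketch under two assumptions: each metadata object is modeled as a `Map<String, String>` rather than Tika's `Metadata` class, and the first element of the list is the parent document while the remaining elements are its attachments. The class and method names are hypothetical and do not reflect the interface that was eventually added.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; plain maps stand in for Tika's Metadata objects.
class AttachmentCountSketch {

    /**
     * Writes the number of attachments into the parent's metadata.
     * Assumes index 0 is the parent and every other entry is an attachment.
     */
    static void addAttachmentCount(List<Map<String, String>> metadataList) {
        if (metadataList.isEmpty()) {
            return; // nothing to annotate
        }
        int attachments = metadataList.size() - 1;
        metadataList.get(0).put("attachment_count", Integer.toString(attachments));
    }
}
```

A filter that sees one metadata object at a time cannot compute this value, which is the motivation for a list-level filter.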
[jira] [Resolved] (TIKA-4329) Release tika-3.0.0's docker image
[ https://issues.apache.org/jira/browse/TIKA-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4329. --- Fix Version/s: 3.0.0 Resolution: Fixed > Release tika-3.0.0's docker image > - > > Key: TIKA-4329 > URL: https://issues.apache.org/jira/browse/TIKA-4329 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > Fix For: 3.0.0 > > > Also bump jre to 21? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4329) Release tika-3.0.0's docker image
Tim Allison created TIKA-4329: - Summary: Release tika-3.0.0's docker image Key: TIKA-4329 URL: https://issues.apache.org/jira/browse/TIKA-4329 Project: Tika Issue Type: Task Reporter: Tim Allison Also bump jre to 21? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4326) General updates for 3.0.1
[ https://issues.apache.org/jira/browse/TIKA-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891536#comment-17891536 ] Tim Allison edited comment on TIKA-4326 at 10/21/24 1:10 PM: - Thank you [~tilman]. I just added jdk23 github action builds to branch_3x and main. was (Author: talli...@mitre.org): Thank you [~tilman]. I just added jdk23 builds to branch_3x and main. > General updates for 3.0.1 > - > > Key: TIKA-4326 > URL: https://issues.apache.org/jira/browse/TIKA-4326 > Project: Tika > Issue Type: Task >Reporter: Tilman Hausherr >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4326) General updates for 3.0.1
[ https://issues.apache.org/jira/browse/TIKA-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891536#comment-17891536 ] Tim Allison commented on TIKA-4326: --- Thank you [~tilman]. I just added jdk23 builds to branch_3x and main. > General updates for 3.0.1 > - > > Key: TIKA-4326 > URL: https://issues.apache.org/jira/browse/TIKA-4326 > Project: Tika > Issue Type: Task >Reporter: Tilman Hausherr >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4328) Update or remove tika-deployment snaps
Tim Allison created TIKA-4328: - Summary: Update or remove tika-deployment snaps Key: TIKA-4328 URL: https://issues.apache.org/jira/browse/TIKA-4328 Project: Tika Issue Type: Task Reporter: Tim Allison These haven't been updated since 2.0.0-SNAPSHOT, apparently. I doubt anyone is using them. We should either update them or remove them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4322) Create branch_3x and update main to 4.0.0-SNAPSHOT and Java 17
[ https://issues.apache.org/jira/browse/TIKA-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891194#comment-17891194 ] Tim Allison commented on TIKA-4322: --- I just updated Jenkins a bit. I removed all the {{branch_1x}}, and I deleted {{-jdk11}} from main. I added {{tika-branch_3x-*}} jdks. There will likely be surprises, but this should be a decent start. > Create branch_3x and update main to 4.0.0-SNAPSHOT and Java 17 > -- > > Key: TIKA-4322 > URL: https://issues.apache.org/jira/browse/TIKA-4322 > Project: Tika > Issue Type: Task > Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4322) Create branch_3x and update main to 4.0.0-SNAPSHOT and Java 17
[ https://issues.apache.org/jira/browse/TIKA-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891194#comment-17891194 ] Tim Allison edited comment on TIKA-4322 at 10/19/24 8:17 PM: - I just updated Jenkins a bit. I removed all the {{branch_1x}}, and I deleted {{-jdk11}} from main. I left in the disabled {{tika-branch_2x-jdk8}} as a reminder that that doesn't work. We can delete it in 6 months...if all goes well. :D I added {{tika-branch_3x-*}} jdks. There will likely be surprises, but this should be a decent start. was (Author: talli...@mitre.org): I just updated Jenkins a bit. I removed all the {{branch_1x}}, and I deleted {{-jdk11}} from main. I added {{tika-branch_3x-*}} jdks. There will likely be surprises, but this should be a decent start. > Create branch_3x and update main to 4.0.0-SNAPSHOT and Java 17 > -- > > Key: TIKA-4322 > URL: https://issues.apache.org/jira/browse/TIKA-4322 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4325) Consider removing some unsupported modules in 4.x
[ https://issues.apache.org/jira/browse/TIKA-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4325: -- Summary: Consider removing some unsupported modules in 4.x (was: Consider removing some unsupported modules) > Consider removing some unsupported modules in 4.x > - > > Key: TIKA-4325 > URL: https://issues.apache.org/jira/browse/TIKA-4325 > Project: Tika > Issue Type: Task > Reporter: Tim Allison >Priority: Minor > > I propose removing tika-age-recogniser and tika-dl in 4.x. They'll still be > available in 3.x for at least a year. Any objections? > Are there other modules that we'd like to remove? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4247) HttpFetcher - add ability to send request headers
[ https://issues.apache.org/jira/browse/TIKA-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4247: -- Fix Version/s: 3.0.1 (was: 3.0.0) > HttpFetcher - add ability to send request headers > - > > Key: TIKA-4247 > URL: https://issues.apache.org/jira/browse/TIKA-4247 > Project: Tika > Issue Type: New Feature >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.1 > > > add ability to send request headers -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4247) HttpFetcher - add ability to send request headers
[ https://issues.apache.org/jira/browse/TIKA-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4247. --- Resolution: Fixed I think this was fixed by including the ParseContext in the FetchEmitTuple? Apologies if I've confused issues. Please re-open if I'm wrong. > HttpFetcher - add ability to send request headers > - > > Key: TIKA-4247 > URL: https://issues.apache.org/jira/browse/TIKA-4247 > Project: Tika > Issue Type: New Feature >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > add ability to send request headers -- This message was sent by Atlassian Jira (v8.20.10#820010)
[ANNOUNCE] Apache Tika 3.0.0 released
The Apache Tika project is pleased to announce the release of Apache Tika 3.0.0. The release contents have been pushed out to the main Apache release site and to the Maven Central sync. Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Apache Tika 3.0.0 includes numerous bug fixes and dependency upgrades. The biggest change in the 3.x branch is that it requires >= Java 11. Details can be found in the changes file: https://www.apache.org/dist/tika/3.0.0/CHANGES-3.0.0.txt Apache Tika is available on the download page: https://tika.apache.org/download.html Apache Tika will be available shortly in binary form or for use using Maven 2 from the Central Repository: https://repo1.maven.org/maven2/org/apache/tika/ When downloading, please remember to verify the downloads using signatures found: https://www.apache.org/dist/tika/KEYS For more information on Apache Tika, visit the project home page: https://tika.apache.org/ NOTE: This release requires Java 11. We plan to support the 2.x branch (which requires Java 8) for six months after the release of 3.0.0. See: https://cwiki.apache.org/confluence/display/TIKA/Tika+Roadmap+--+2.x%2C+3.x+and+Beyond -- Tim Allison, on behalf of the Apache Tika community
[jira] [Updated] (TIKA-1907) Big Pdf parsing to text - Out of memory
[ https://issues.apache.org/jira/browse/TIKA-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1907: -- Fix Version/s: 3.0.1 (was: 3.0.0) > Big Pdf parsing to text - Out of memory > --- > > Key: TIKA-1907 > URL: https://issues.apache.org/jira/browse/TIKA-1907 > Project: Tika > Issue Type: Bug >Affects Versions: 1.12 >Reporter: Nicolas Daniels >Priority: Major > Fix For: 3.0.1 > > > Linked to PDFBox issue: [https://issues.apache.org/jira/browse/PDFBOX-3284] > I'm duplicating it here to make sure it will be fixed in Tika as well. Maybe > PDFBox is not the appropriate lib to use in such a case. > Trying to read the same PDF using Tika leads to the same problem: > {code:title=Test.java|borderStyle=solid} > @Test > public void testParsePdf_Content_Memory() throws Exception { > InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf"); > try { > FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt")); > BodyContentHandler handler = new BodyContentHandler(fileWriter); > Metadata metadata = new Metadata(); > new PDFParser().parse(inputStream, handler, metadata, new ParseContext()); > fileWriter.close(); > } finally { > inputStream.close(); > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4323) Consider removing 1.x info from site?
Tim Allison created TIKA-4323: - Summary: Consider removing 1.x info from site? Key: TIKA-4323 URL: https://issues.apache.org/jira/browse/TIKA-4323 Project: Tika Issue Type: Task Reporter: Tim Allison Do we need to keep the javadocs etc for the 1.x branch which has been EOL'd since September 2022? Unless there are objections, I'll slim down our site to include only 2.x and above. I certainly understand if we need to keep them around for archival purposes. Let me know. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4247) HttpFetcher - add ability to send request headers
[ https://issues.apache.org/jira/browse/TIKA-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4247: -- Fix Version/s: 3.0.0 (was: 3.0.1) > HttpFetcher - add ability to send request headers > - > > Key: TIKA-4247 > URL: https://issues.apache.org/jira/browse/TIKA-4247 > Project: Tika > Issue Type: New Feature >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > add ability to send request headers -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4298) Failed to detect charset for zip entry with short non-Unicode file name
[ https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891179#comment-17891179 ] Tim Allison commented on TIKA-4298: --- This was resolved in 3.0.0 and should be closed? > Failed to detect charset for zip entry with short non-Unicode file name > --- > > Key: TIKA-4298 > URL: https://issues.apache.org/jira/browse/TIKA-4298 > Project: Tika > Issue Type: Bug > Components: detector >Reporter: Mingchun Zhao >Priority: Major > Fix For: 2.9.3, 3.0.1 > > Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip > > > The Japanese file names extracted from a zip file > [^testZipEntryNameCharsetShiftSJIS.zip] were garbled. The charset of the file > name is Shift_JIS, but the detect() method within the PackageParser class was > not able to detect the charset properly. > {code:java} > $ ls -1 testZipEntryNameCharsetShiftSJIS > shiba.png > 文章1.txt > 文章2.txt > {code} > {code:java} > $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip > xmlns="http://www.w3.org/1999/xhtml";> > > > content="org.apache.tika.parser.pkg.PackageParser"/> > > > > > > > > > shiba.png > > > ���1.txt > あいうえお > かきくけこ > > > ���2.txt > さしすせそ > たちつてと > > % {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
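The garbling reported in TIKA-4298 can be reproduced in miniature. The sketch below assumes the raw entry-name bytes are Shift_JIS and that the consumer wrongly decodes them as UTF-8; it does not touch Tika's detector or `PackageParser`, and `ZipNameCharsetSketch` is a hypothetical name. The file name is written with Unicode escapes so the example compiles regardless of source encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of mis-decoded zip entry names; not Tika code.
class ZipNameCharsetSketch {

    // The name from the report, written with escapes (two kanji followed by "1.txt").
    static final String ORIGINAL_NAME = "\u6587\u7AE01.txt";

    /** Raw bytes as a Shift_JIS zip tool would have stored them. */
    static byte[] rawEntryName() {
        return ORIGINAL_NAME.getBytes(Charset.forName("Shift_JIS"));
    }

    /** Decodes the raw entry name with the given charset. */
    static String decode(Charset charset) {
        return new String(rawEntryName(), charset);
    }

    public static void main(String[] args) {
        // Wrong charset: the kanji bytes are not valid UTF-8 and become U+FFFD.
        System.out.println(decode(StandardCharsets.UTF_8));
        // Correct charset: the original name round-trips.
        System.out.println(decode(Charset.forName("Shift_JIS")));
    }
}
```

Short names give a statistical charset detector very few bytes to work with, which is consistent with the issue title's emphasis on short non-Unicode file names.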
[jira] [Updated] (TIKA-4318) Fix javadoc aggregate in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4318: -- Fix Version/s: 3.0.1 (was: 3.0.0) > Fix javadoc aggregate in 3.x > > > Key: TIKA-4318 > URL: https://issues.apache.org/jira/browse/TIKA-4318 > Project: Tika > Issue Type: Task > Reporter: Tim Allison > Assignee: Tim Allison >Priority: Minor > Fix For: 3.0.1 > > > When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc > aggregate}}. > Let's see if we can get this to work for the 3.0.0 release? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4325) Consider removing some unsupported modules
Tim Allison created TIKA-4325: - Summary: Consider removing some unsupported modules Key: TIKA-4325 URL: https://issues.apache.org/jira/browse/TIKA-4325 Project: Tika Issue Type: Task Reporter: Tim Allison I propose removing tika-age-recogniser and tika-dl in 4.x. They'll still be available in 3.x for at least a year. Any objections? Are there other modules that we'd like to remove? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4324) Update dependencies in main that require Java 17
Tim Allison created TIKA-4324: - Summary: Update dependencies in main that require Java 17 Key: TIKA-4324 URL: https://issues.apache.org/jira/browse/TIKA-4324 Project: Tika Issue Type: Task Reporter: Tim Allison There are a handful of dependencies whose most recent versions require java 17. We can now make those updates in {{main}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4298) Failed to detect charset for zip entry with short non-Unicode file name
[ https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4298: -- Fix Version/s: 3.0.1 (was: 3.0.0) > Failed to detect charset for zip entry with short non-Unicode file name > --- > > Key: TIKA-4298 > URL: https://issues.apache.org/jira/browse/TIKA-4298 > Project: Tika > Issue Type: Bug > Components: detector >Reporter: Mingchun Zhao >Priority: Major > Fix For: 2.9.3, 3.0.1 > > Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip > > > The Japanese file names extracted from a zip file > [^testZipEntryNameCharsetShiftSJIS.zip] were garbled. The charset of the file > name is Shift_JIS, but the detect() method within the PackageParser class was > not able to detect the charset properly. > {code:java} > $ ls -1 testZipEntryNameCharsetShiftSJIS > shiba.png > 文章1.txt > 文章2.txt > {code} > {code:java} > $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip > xmlns="http://www.w3.org/1999/xhtml";> > > > content="org.apache.tika.parser.pkg.PackageParser"/> > > > > > > > > > shiba.png > > > ���1.txt > あいうえお > かきくけこ > > > ���2.txt > さしすせそ > たちつてと > > % {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4322) Create branch_3x and update main to 4.0.0-SNAPSHOT and Java 17
Tim Allison created TIKA-4322: - Summary: Create branch_3x and update main to 4.0.0-SNAPSHOT and Java 17 Key: TIKA-4322 URL: https://issues.apache.org/jira/browse/TIKA-4322 Project: Tika Issue Type: Task Reporter: Tim Allison -- This message was sent by Atlassian Jira (v8.20.10#820010)
[RESULT] [VOTE] Release Apache Tika 3.0.0 Candidate #1
The vote has passed with 4 binding +1s and no -1s. +1s Nicholas DiPiazza Oleg Tikhonov Tilman Hausherr Tim Allison I'll update the website and release the artifacts in the next few days. Thank you, all! Best, Tim On Wed, Oct 16, 2024 at 7:24 AM Tim Allison wrote: > > A candidate for the Tika 3.0.0 release is available at: > https://dist.apache.org/repos/dist/dev/tika/3.0.0 > > The release candidate is a zip archive of the sources in: > https://github.com/apache/tika/tree/3.0.0-rc1/ > > The SHA-512 checksum of the archive is > c5eb92bc895d96492b2d2577d14df6187e46ab7c8a9f64aaf19d4f140f07caf1223d073c2cbb47b5519bb952eee50f39563004b8ad49906f45dffc9b6df74350. > > In addition, a staged maven repository is available here: > https://repository.apache.org/content/repositories/orgapachetika-1107/org/apache/tika > > Please vote on releasing this package as Apache Tika 3.0.0. > The vote is open for the next 72 hours and passes if a majority of at > least three +1 Tika PMC votes are cast. > > [ ] +1 Release this package as Apache Tika 3.0.0 > [ ] -1 Do not release this package because... > > > Here's my +1. > > Thank you, all! > > Best, > > Tim
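Voters who want to check the published SHA-512 value locally could do so along these lines. This is a minimal sketch using only the JDK's `MessageDigest`; the class name `Sha512Sketch`, its methods, and any file path passed to `matches` are hypothetical, and it reads the whole archive into memory for brevity. It complements, and does not replace, verifying the release signatures against the project's KEYS file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for checking a downloaded archive against a published checksum.
class Sha512Sketch {

    /** Returns the lowercase hex SHA-512 digest of the given bytes. */
    static String sha512Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-512");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-512 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Compares a local archive's digest with the expected hex string. */
    static boolean matches(Path archive, String expectedHex) throws IOException {
        return sha512Hex(Files.readAllBytes(archive)).equalsIgnoreCase(expectedHex.trim());
    }
}
```

For very large archives, streaming the file through `MessageDigest.update` would avoid holding it all in memory.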
[jira] [Updated] (TIKA-4309) ExecutableParser: support MachO
[ https://issues.apache.org/jira/browse/TIKA-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4309: -- Fix Version/s: 3.0.1 > ExecutableParser: support MachO > --- > > Key: TIKA-4309 > URL: https://issues.apache.org/jira/browse/TIKA-4309 > Project: Tika > Issue Type: New Feature >Reporter: Alexey Pelykh >Priority: Major > Fix For: 3.0.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4321) Clean up Solr integration tests
Tim Allison created TIKA-4321: - Summary: Clean up Solr integration tests Key: TIKA-4321 URL: https://issues.apache.org/jira/browse/TIKA-4321 Project: Tika Issue Type: Task Reporter: Tim Allison Solr currently supports 9.x and 8.x. There's a vote on the Lucene side to stop support for Lucene 8.x soon. I think we can get rid of our unit tests for 6.x and 7.x. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4319) Wrong exit code upon successful start of Tika server
[ https://issues.apache.org/jira/browse/TIKA-4319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890181#comment-17890181 ] Tim Allison commented on TIKA-4319: --- Thank you for opening this. I'm not that familiar with the service scripts. If you can recommend a PR, that'd help. Maybe [~epugh] might have some time to review? > Wrong exit code upon successful start of Tika server > > > Key: TIKA-4319 > URL: https://issues.apache.org/jira/browse/TIKA-4319 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 2.9.2 > Environment: Tested on a Debian 12 VM with the following kernel: > 6.1.0-11-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.38-4 (2023-08-08) > x86_64 GNU/Linux >Reporter: Corrado Fiore >Priority: Trivial > Labels: easyfix > > I was trying to create a systemd unit file for Tika and I noticed that the > server will return an exit code of `1` instead of `0`. This makes it > confusing b/c systemd will report that the service crashed (whereas it > started correctly). > h3. Steps to reproduce the problem > {{{color:#00875a}user@test-instance{color}:~$ sudo su -c > "TIKA_INCLUDE=\"/etc/default/tika.in.sh\" /opt/tika/bin/tika start" - tika}} > {{Default server /opt/tika/}} > {{Waiting up to 180 seconds to see Tika running on port 9998 [-] }} > {{Started Tika server on port 9998 (pid=50039}} > {{50001). Happy extracting!}} > {{{color:#00875a}user@test-instance{color}:~$ echo $?}} > {{1}} > h3. Expected behaviour > A command that executes successfully should exit with an exit code 0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4320) Modernize opensearch integration tests
[ https://issues.apache.org/jira/browse/TIKA-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4320. --- Fix Version/s: 3.0.1 Resolution: Fixed > Modernize opensearch integration tests > -- > > Key: TIKA-4320 > URL: https://issues.apache.org/jira/browse/TIKA-4320 > Project: Tika > Issue Type: Task > Reporter: Tim Allison >Priority: Minor > Fix For: 3.0.1 > > > We should remove the Elasticsearch 7.x unit tests and the OpenSearch 1.x > tests. And, we can now use OpenSearch's {{testcontainers}} module. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4320) Modernize opensearch integration tests
Tim Allison created TIKA-4320: - Summary: Modernize opensearch integration tests Key: TIKA-4320 URL: https://issues.apache.org/jira/browse/TIKA-4320 Project: Tika Issue Type: Task Reporter: Tim Allison We should remove the Elasticsearch 7.x unit tests and the OpenSearch 1.x tests. And, we can now use OpenSearch's {{testcontainers}} module. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4170) Tika to extract Apple Key files
[ https://issues.apache.org/jira/browse/TIKA-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890103#comment-17890103 ] Tim Allison commented on TIKA-4170: --- Unrelated, but I just noticed that LibreOffice can open the example keynote files, which means that we _could_ write a bridge to LibreOffice or OpenOffice to extract text from these?! > Tika to extract Apple Key files > --- > > Key: TIKA-4170 > URL: https://issues.apache.org/jira/browse/TIKA-4170 > Project: Tika > Issue Type: Bug >Reporter: Tika User >Priority: Major > Attachments: Apple_key_file.zip, keynotecreated-2.9.3-SNAPSHOT.zip, > keynotecreated.zip > > > We are trying Tika to extract Apple Key files. The testing data is attached. > Could you please check why Tika can't extract the Apple Key files from > Tika-2.9.0? > The below testing result is for your reference. Thank you. > > Tika version --> Have child documents after extracting? > 2.4.1 --> YES > 2.6.0 --> YES > 2.7.0 --> YES > 2.8.0 --> YES > 2.9.0 --> NO > 2.9.1 --> NO -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4280. --- Fix Version/s: 3.0.0 Resolution: Fixed 3.0.0 rc1 is under vote now. > Tasks for the 3.0.0 release > --- > > Key: TIKA-4280 > URL: https://issues.apache.org/jira/browse/TIKA-4280 > Project: Tika > Issue Type: Task > Reporter: Tim Allison >Priority: Major > Fix For: 3.0.0 > > Attachments: 2PSMEFJEYU7EPAZXQQDD6OL2WOQLBJRY.zip > > > I'm too lazy to open separate tickets. Please do so if desired. > Some items: > * Before releasing the real 3.0.0 we need to remove any "-M" dependencies > * Decide about the ffmpeg issue and the hdf5 issue > * Run the regression tests vs 2.9.x > * Convert tika-grpc to use the dependency plugin instead of the shade plugin > * Turn javadocs back on. I got errors during the deploy process because > javadoc needed the auto-generated code ("cannot find symbol > DeleteFetcherRequest"). We need to enable javadocs for the rest of the > project. > * TIKA-4290 Tilman question > Other things? Thank you [~tilman] for the first two! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4170) Tika to extract Apple Key files
[ https://issues.apache.org/jira/browse/TIKA-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890096#comment-17890096 ] Tim Allison commented on TIKA-4170: --- I see similar behavior in 2.9.3-SNAPSHOT. > Tika to extract Apple Key files > --- > > Key: TIKA-4170 > URL: https://issues.apache.org/jira/browse/TIKA-4170 > Project: Tika > Issue Type: Bug >Reporter: Tika User >Priority: Major > Attachments: Apple_key_file.zip, keynotecreated-2.9.3-SNAPSHOT.zip, > keynotecreated.zip > > > We are trying Tika to extract Apple Key files. The testing data is attached. > Could you please check why Tika can't extract the Apple Key files from > Tika-2.9.0? > The below testing result is for your reference. Thank you. > > Tika version --> Have child documents after extracting? > 2.4.1 --> YES > 2.6.0 --> YES > 2.7.0 --> YES > 2.8.0 --> YES > 2.9.0 --> NO > 2.9.1 --> NO -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4170) Tika to extract Apple Key files
[ https://issues.apache.org/jira/browse/TIKA-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4170: -- Attachment: keynotecreated-2.9.3-SNAPSHOT.zip > Tika to extract Apple Key files > --- > > Key: TIKA-4170 > URL: https://issues.apache.org/jira/browse/TIKA-4170 > Project: Tika > Issue Type: Bug >Reporter: Tika User >Priority: Major > Attachments: Apple_key_file.zip, keynotecreated-2.9.3-SNAPSHOT.zip, > keynotecreated.zip > > > We are trying Tika to extract Apple Key files. The testing data is attached. > Could you please check why Tika can't extract the Apple Key files from > Tika-2.9.0? > The below testing result is for your reference. Thank you. > > Tika version --> Have child documents after extracting? > 2.4.1 --> YES > 2.6.0 --> YES > 2.7.0 --> YES > 2.8.0 --> YES > 2.9.0 --> NO > 2.9.1 --> NO -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4170) Tika to extract Apple Key files
[ https://issues.apache.org/jira/browse/TIKA-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890091#comment-17890091 ] Tim Allison commented on TIKA-4170: --- I'm attaching the output for Tika 3.x's tika-app {{java -jar tika-app-3.0.1-SNAPSHOT.jar -J -t keynotecreated.key > keynotecreated.json}}. With tesseract installed, Tika is extracting the attachments (well, images) and running those with tesseract. How are you calling Tika? Which attachments are not processed in which files? > Tika to extract Apple Key files > --- > > Key: TIKA-4170 > URL: https://issues.apache.org/jira/browse/TIKA-4170 > Project: Tika > Issue Type: Bug >Reporter: Tika User >Priority: Major > Attachments: Apple_key_file.zip, keynotecreated.zip > > > We are trying Tika to extract Apple Key files. The testing data is attached. > Could you please check why Tika can't extract the Apple Key files from > Tika-2.9.0? > The below testing result is for your reference. Thank you. > > Tika version --> Have child documents after extracting? > 2.4.1 --> YES > 2.6.0 --> YES > 2.7.0 --> YES > 2.8.0 --> YES > 2.9.0 --> NO > 2.9.1 --> NO -- This message was sent by Atlassian Jira (v8.20.10#820010)
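The ticket's version table asks whether each Tika version yields child documents, and the comment above produces a JSON file with tika-app's {{-J}} option. A minimal sketch for counting child documents in such a file follows. It assumes the output is a JSON array with one metadata object per document, with the container file first and embedded documents after it. The key name {{X-TIKA:embedded_resource_path}} and the helper names are assumptions for illustration and are not taken from the ticket.

```python
import json

# Assumed key under which Tika's recursive JSON output records an embedded
# document's path; adjust if your output uses a different key.
EMBEDDED_PATH_KEY = "X-TIKA:embedded_resource_path"


def summarize_children(metadata_list):
    """Return (child_count, child_paths) for a parsed -J metadata list."""
    # Assumption: the first entry describes the container file itself.
    children = metadata_list[1:]
    paths = [m.get(EMBEDDED_PATH_KEY, "<unknown>") for m in children]
    return len(children), paths


def summarize_file(json_path):
    """Load a JSON file written by tika-app -J and summarize its children."""
    with open(json_path, encoding="utf-8") as f:
        return summarize_children(json.load(f))
```

Running {{summarize_file("keynotecreated.json")}} against the output of each Tika version would give a comparable child count per version, which is the YES/NO column in the table.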
[jira] [Updated] (TIKA-4170) Tika to extract Apple Key files
[ https://issues.apache.org/jira/browse/TIKA-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4170: -- Attachment: keynotecreated.zip > Tika to extract Apple Key files > --- > > Key: TIKA-4170 > URL: https://issues.apache.org/jira/browse/TIKA-4170 > Project: Tika > Issue Type: Bug >Reporter: Tika User >Priority: Major > Attachments: Apple_key_file.zip, keynotecreated.zip > > > We are trying Tika to extract Apple Key files. The testing data is attached. > Could you please check why Tika can't extract the Apple Key files from > Tika-2.9.0? > The below testing result is for your reference. Thank you. > > Tika version --> Have child documents after extracting? > 2.4.1 --> YES > 2.6.0 --> YES > 2.7.0 --> YES > 2.8.0 --> YES > 2.9.0 --> NO > 2.9.1 --> NO -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4316) Goals for Tika 4.x
[ https://issues.apache.org/jira/browse/TIKA-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4316: -- Description: I proposed a tentative roadmap here: https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z Let's use this ticket to discuss some high level changes in 4.x Some thoughts: 1) Require Java 17 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies 3) Move tika-pipes to a separate module. Consider moving non-trivial implementations of tika-pipes components to a separate project? Consider using pf4j in tika-pipes and other components? 4) Remove unsupported dl4j and sentiment analysis and agepredictor modules and...? 5) Avoid fat jars where possible -- at least move tika-server to a lib/* pattern with the assembly plugin or pf4j instead of the shade plugin 6) Use an auto-correcting linter instead of checkstyle (cosium with google's style format?) 7) Remove the legacy external parser mechanism in favor of the external2 mechanism was: I proposed a tentative roadmap here: https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z Let's use this ticket to discuss some high level changes in 4.x Some thoughts: 1) Require Java 17 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies 3) Move tika-pipes to a separate module. Consider moving non-trivial implementations of tika-pipes components to a separate project? 4) Remove unsupported dl4j and sentiment analysis modules and...? 
5) Avoid fat jars where possible -- at least move tika-server to a lib/* pattern with the assembly plugin instead of the shade plugin 6) Use an auto-correcting linter instead of checkstyle 7) Remove the legacy external parser mechanism in favor of the external2 mechanism > Goals for Tika 4.x > -- > > Key: TIKA-4316 > URL: https://issues.apache.org/jira/browse/TIKA-4316 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I proposed a tentative roadmap here: > https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z > Let's use this ticket to discuss some high level changes in 4.x > Some thoughts: > 1) Require Java 17 > 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies > 3) Move tika-pipes to a separate module. Consider moving non-trivial > implementations of tika-pipes components to a separate project? Consider > using pf4j in tika-pipes and other components? > 4) Remove unsupported dl4j and sentiment analysis and agepredictor modules > and...? > 5) Avoid fat jars where possible -- at least move tika-server to a lib/* > pattern with the assembly plugin or pf4j instead of the shade plugin > 6) Use an auto-correcting linter instead of checkstyle (cosium with google's > style format?) > 7) Remove the legacy external parser mechanism in favor of the external2 > mechanism -- This message was sent by Atlassian Jira (v8.20.10#820010)
Tika Roadmap 2.x, 3.x and beyond
All, I created a wiki page for the proposed roadmap: https://cwiki.apache.org/confluence/display/TIKA/Tika+Roadmap+--+2.x%2C+3.x+and+Beyond If there are any objections or alternate proposals to my initial proposal, please discuss those objections on the user/dev lists. Thank you, all! Cheers, Tim
[jira] [Commented] (TIKA-4318) Fix javadoc aggregate in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890037#comment-17890037 ] Tim Allison commented on TIKA-4318: --- My initial memory on this issue was that the autogenerated classes in grpc were causing problems. I then stubbed my toes for a while trying to get everything working and couldn't find any problems with grpc... until I went to make the release. During the final step in {{release:perform}}, javadoc errors out on grpc with the autogenerated classes. I turned back on the "ignore javadocs" for grpc in 3.x. Once we make the release, we can maybe move the autogenerated classes to their own package in 4.x and then configure javadoc to ignore that package. If anyone has any better ideas, please share. This was not a great way to spend time. > Fix javadoc aggregate in 3.x > > > Key: TIKA-4318 > URL: https://issues.apache.org/jira/browse/TIKA-4318 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Fix For: 3.0.0 > > > When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc > aggregate}}. > Let's see if we can get this to work for the 3.0.0 release? -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: 3.0.0 release?
Yes, I completely agree about the surprises. I don't want to dump more work on you. :D On Tue, Oct 15, 2024 at 8:51 AM Tilman Hausherr wrote: > > On 08.10.2024 16:05, Tim Allison wrote: > > I realize that even dependency maintenance on three concurrent branches > > will be burdensome. Perhaps we fall back to "update dependencies before a > > release and before the regression tests" at least on the 2.x and 3.x > > branches? > > It wasn't burdensome for me, I've usually done this while watching TV. > Doing it only once before the release could mean nasty surprises at the > wrong moment. > > Tilman > > >
[VOTE] Release Apache Tika 3.0.0 Candidate #1
A candidate for the Tika 3.0.0 release is available at: https://dist.apache.org/repos/dist/dev/tika/3.0.0 The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/3.0.0-rc1/ The SHA-512 checksum of the archive is c5eb92bc895d96492b2d2577d14df6187e46ab7c8a9f64aaf19d4f140f07caf1223d073c2cbb47b5519bb952eee50f39563004b8ad49906f45dffc9b6df74350. In addition, a staged maven repository is available here: https://repository.apache.org/content/repositories/orgapachetika-1107/org/apache/tika Please vote on releasing this package as Apache Tika 3.0.0. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast. [ ] +1 Release this package as Apache Tika 3.0.0 [ ] -1 Do not release this package because... Here's my +1. Thank you, all! Best, Tim
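Voters typically check the downloaded source archive against the published SHA-512 before casting a vote. A minimal sketch of that check, using only the Python standard library, could look like the following. The function names are hypothetical, and the archive file name in the usage note is an assumption, since the message does not name the file.

```python
import hashlib
import hmac


def sha512_hex(path, chunk_size=1 << 20):
    """Return the lowercase hex SHA-512 digest of the file at path."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # Read in chunks so large release archives are not loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare the file's digest with the checksum quoted in the vote email."""
    return hmac.compare_digest(sha512_hex(path), published_hex.strip().lower())
```

For example, {{matches_published("tika-3.0.0-src.zip", "c5eb92bc...")}} (file name assumed, checksum abbreviated here) should return True when the full checksum from the message is supplied and the download is intact.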
[jira] [Updated] (TIKA-4316) Goals for Tika 4.x
[ https://issues.apache.org/jira/browse/TIKA-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4316: -- Description: I proposed a tentative roadmap here: https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z Let's use this ticket to discuss some high level changes in 4.x Some thoughts: 1) Require Java 17 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies 3) Move tika-pipes to a separate module. Consider moving non-trivial implementations of tika-pipes components to a separate project? 4) Remove unsupported dl4j and sentiment analysis modules and...? 5) Avoid fat jars where possible -- at least move tika-server to a lib/* pattern with the assembly plugin instead of the shade plugin 6) Use an auto-correcting linter instead of checkstyle 7) Remove the legacy external parser mechanism in favor of the external2 mechanism was: I proposed a tentative roadmap here: https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z Let's use this ticket to discuss some high level changes in 4.x Some thoughts: 1) Require Java 17 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies 3) Move tika-pipes to a separate module. Consider moving non-trivial implementations of tika-pipes components to a separate project? 4) Remove unsupported dl4j and sentiment analysis modules and...? 
5) Avoid fat jars where possible -- at least move tika-server to a lib/* pattern with the assembly plugin instead of the shade plugin 6) Use an auto-correcting linter instead of checkstyle > Goals for Tika 4.x > -- > > Key: TIKA-4316 > URL: https://issues.apache.org/jira/browse/TIKA-4316 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I proposed a tentative roadmap here: > https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z > Let's use this ticket to discuss some high level changes in 4.x > Some thoughts: > 1) Require Java 17 > 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies > 3) Move tika-pipes to a separate module. Consider moving non-trivial > implementations of tika-pipes components to a separate project? > 4) Remove unsupported dl4j and sentiment analysis modules and...? > 5) Avoid fat jars where possible -- at least move tika-server to a lib/* > pattern with the assembly plugin instead of the shade plugin > 6) Use an auto-correcting linter instead of checkstyle > 7) Remove the legacy external parser mechanism in favor of the external2 > mechanism -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4318) Fix javadoc aggregate in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4318: -- Description: When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc aggregate}}. Let's see if we can get this to work for the 3.0.0 release? was: When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc aggregate}}. My memory is that there was a problem with the autogenerated code in grpc. Let's see if we can get this to work for the 3.0.0 release? > Fix javadoc aggregate in 3.x > > > Key: TIKA-4318 > URL: https://issues.apache.org/jira/browse/TIKA-4318 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Fix For: 3.0.0 > > > When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc > aggregate}}. > Let's see if we can get this to work for the 3.0.0 release? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4318) Fix javadoc aggregate in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889690#comment-17889690 ] Tim Allison commented on TIKA-4318: --- Unless there's a better solution, I'll add back {{sourcepath}} into the repo, but I'll manually comment it out when I generate the javadocs for the release. There has to be a better option. :/ > Fix javadoc aggregate in 3.x > > > Key: TIKA-4318 > URL: https://issues.apache.org/jira/browse/TIKA-4318 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Fix For: 3.0.0 > > > When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc > aggregate}}. My memory is that there was a problem with the autogenerated > code in grpc. > Let's see if we can get this to work for the 3.0.0 release? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4318) Fix javadoc aggregate in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889684#comment-17889684 ] Tim Allison edited comment on TIKA-4318 at 10/15/24 2:04 PM: - Removing {{src/main/java}} causes ci/cd to fail. {noformat} Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:3.10.1:aggregate (default-cli) on project tika: An error has occurred in Javadoc report generation: 2024-10-15T13:41:51.8978543Z [ERROR] Exit code: 2 2024-10-15T13:41:51.8979146Z [ERROR] error: No source files for package org.apache.tika.io {noformat} Locally, the build works with Java 11 and 17, but there are no javadocs output in {{target/reports}} or anywhere in {{target}}. was (Author: talli...@mitre.org): Removing {{src/main/java}} causes ci/cd to fail. Locally, the build works with Java 11 and 17, but there are no javadocs output in {{target/reports}} or anywhere in {{target}}. > Fix javadoc aggregate in 3.x > > > Key: TIKA-4318 > URL: https://issues.apache.org/jira/browse/TIKA-4318 > Project: Tika > Issue Type: Task > Reporter: Tim Allison > Assignee: Tim Allison >Priority: Minor > Fix For: 3.0.0 > > > When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc > aggregate}}. My memory is that there was a problem with the autogenerated > code in grpc. > Let's see if we can get this to work for the 3.0.0 release? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4318) Fix javadoc aggregate in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889684#comment-17889684 ] Tim Allison commented on TIKA-4318: --- Removing {{src/main/java}} causes ci/cd to fail. Locally, the build works with Java 11 and 17, but there are no javadocs output in {{target/reports}} or anywhere in {{target}}. > Fix javadoc aggregate in 3.x > > > Key: TIKA-4318 > URL: https://issues.apache.org/jira/browse/TIKA-4318 > Project: Tika > Issue Type: Task > Reporter: Tim Allison > Assignee: Tim Allison >Priority: Minor > Fix For: 3.0.0 > > > When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc > aggregate}}. My memory is that there was a problem with the autogenerated > code in grpc. > Let's see if we can get this to work for the 3.0.0 release? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (TIKA-4318) Fix javadoc aggregate in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-4318: --- > Fix javadoc aggregate in 3.x > > > Key: TIKA-4318 > URL: https://issues.apache.org/jira/browse/TIKA-4318 > Project: Tika > Issue Type: Task > Reporter: Tim Allison > Assignee: Tim Allison >Priority: Minor > Fix For: 3.0.0 > > > When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc > aggregate}}. My memory is that there was a problem with the autogenerated > code in grpc. > Let's see if we can get this to work for the 3.0.0 release? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4318) Fix javadoc aggregate in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889653#comment-17889653 ] Tim Allison commented on TIKA-4318: --- I made a very small modification in main just now, and {{javadoc:aggregate}} appears to be working again without turning off javadocs for grpc and without modifying any of the code in the grpc module. > Fix javadoc aggregate in 3.x > > > Key: TIKA-4318 > URL: https://issues.apache.org/jira/browse/TIKA-4318 > Project: Tika > Issue Type: Task > Reporter: Tim Allison > Assignee: Tim Allison >Priority: Minor > Fix For: 3.0.0 > > > When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc > aggregate}}. My memory is that there was a problem with the autogenerated > code in grpc. > Let's see if we can get this to work for the 3.0.0 release? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4318) Fix javadoc aggregate in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4318: -- Description: When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc aggregate}}. My memory is that there was a problem with the autogenerated code in grpc. Let's see if we can get this to work for the 3.0.0 release? was: When I ran the 3.0.0-BETA2 release, I ran into problems with javadoc on the autogenerated code in grpc. It looks like we can exclude packages in javadoc if we move the autogenerated code to its own package. Let's see if we can get this to work for the 3.0.0 release? > Fix javadoc aggregate in 3.x > > > Key: TIKA-4318 > URL: https://issues.apache.org/jira/browse/TIKA-4318 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Fix For: 3.0.0 > > > When I ran the 3.0.0-BETA2 release, I ran into problems with {{javadoc > aggregate}}. My memory is that there was a problem with the autogenerated > code in grpc. > Let's see if we can get this to work for the 3.0.0 release? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4318) Move auto-generated code to separate package
[ https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4318. --- Fix Version/s: 3.0.0 Assignee: Tim Allison Resolution: Fixed > Move auto-generated code to separate package > > > Key: TIKA-4318 > URL: https://issues.apache.org/jira/browse/TIKA-4318 > Project: Tika > Issue Type: Task > Reporter: Tim Allison > Assignee: Tim Allison >Priority: Minor > Fix For: 3.0.0 > > > When I ran the 3.0.0-BETA2 release, I ran into problems with javadoc on the > autogenerated code in grpc. It looks like we can exclude packages in javadoc > if we move the autogenerated code to its own package. > Let's see if we can get this to work for the 3.0.0 release? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4318) Fix javadoc aggregate in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4318: -- Summary: Fix javadoc aggregate in 3.x (was: Move auto-generated code to separate package) > Fix javadoc aggregate in 3.x > > > Key: TIKA-4318 > URL: https://issues.apache.org/jira/browse/TIKA-4318 > Project: Tika > Issue Type: Task > Reporter: Tim Allison > Assignee: Tim Allison >Priority: Minor > Fix For: 3.0.0 > > > When I ran the 3.0.0-BETA2 release, I ran into problems with javadoc on the > autogenerated code in grpc. It looks like we can exclude packages in javadoc > if we move the autogenerated code to its own package. > Let's see if we can get this to work for the 3.0.0 release? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4318) Move auto-generated code to separate package
Tim Allison created TIKA-4318: - Summary: Move auto-generated code to separate package Key: TIKA-4318 URL: https://issues.apache.org/jira/browse/TIKA-4318 Project: Tika Issue Type: Task Reporter: Tim Allison When I ran the 3.0.0-BETA2 release, I ran into problems with javadoc on the autogenerated code in grpc. It looks like we can exclude packages in javadoc if we move the autogenerated code to its own package. Let's see if we can get this to work for the 3.0.0 release? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4316) Goals for Tika 4.x
[ https://issues.apache.org/jira/browse/TIKA-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4316: -- Description: I proposed a tentative roadmap here: https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z Let's use this ticket to discuss some high level changes in 4.x Some thoughts: 1) Require Java 17 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies 3) Move tika-pipes to a separate module. Consider moving non-trivial implementations of tika-pipes components to a separate project? 4) Remove unsupported dl4j and sentiment analysis modules and...? 5) Avoid fat jars where possible -- at least move tika-server to a lib/* pattern with the assembly plugin instead of the shade plugin 6) Use an auto-correcting linter instead of checkstyle was: I proposed a tentative roadmap here: https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z Let's use this ticket to discuss some high level changes in 4.x Some thoughts: 1) Require Java 17 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies 3) Move tika-pipes to a separate module. Consider moving non-trivial implementations of tika-pipes components to a separate project? 4) Remove unsupported dl4j and sentiment analysis modules and...? 5) Avoid fat jars where possible -- at least move tika-server to a lib/* pattern with the assembly plugin instead of the shade plugin > Goals for Tika 4.x > -- > > Key: TIKA-4316 > URL: https://issues.apache.org/jira/browse/TIKA-4316 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I proposed a tentative roadmap here: > https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z > Let's use this ticket to discuss some high level changes in 4.x > Some thoughts: > 1) Require Java 17 > 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies > 3) Move tika-pipes to a separate module. 
Consider moving non-trivial > implementations of tika-pipes components to a separate project? > 4) Remove unsupported dl4j and sentiment analysis modules and...? > 5) Avoid fat jars where possible -- at least move tika-server to a lib/* > pattern with the assembly plugin instead of the shade plugin > 6) Use an auto-correcting linter instead of checkstyle -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: 3.0.0 release?
With help from Tilman, I think we're all set on the TextCSVParser regression. Any other blockers on 3.0.0? Any objections to the basic plan below? On Thu, Oct 10, 2024 at 7:03 AM Tim Allison wrote: > Regression results are available: > https://issues.apache.org/jira/browse/TIKA-4280?focusedCommentId=17888235&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17888235 > > I heard one off-list +1, and no -1s publicly or privately. Please let me > know if anyone has any misgivings about the following plan. > > On Tue, Oct 8, 2024 at 10:05 AM Tim Allison wrote: > >> All, >> >> In looking at my schedule and thinking about the project more broadly, >> I'm worried that moving pipes to its own module in Tika 3.x after we've had >> two beta releases might be a large step. What would you think about this >> timeline: >> >> Oct 2024 -- Release 3.0.0 (after regression tests and fixes and with >> warnings about upcoming changes in tika-pipes in 4.x) >> Oct 2024 -- move main branch to 4.x (and Java 17) and move tika-pipes to >> its own module or create a standalone tika-pipes project >> ??? -- start 4.0.0-BETA releases ASAP >> April 2025 (6 months from 3.x release) -- end support for 2.x (and >> thereby Java 8) >> April 2025 (6 months from Oct 2024 or earlier???) -- release 4.0.0 >> Oct 2025 (one year from 3.x release) -- end support for 3.x (and thereby >> Java 11) >> >> I realize that even dependency maintenance on three concurrent branches >> will be burdensome. Perhaps we fall back to "update dependencies before a >> release and before the regression tests" at least on the 2.x and 3.x >> branches? >> >> What do you all think? Many thanks! >> >> Best, >> >> Tim >> >
[jira] [Commented] (TIKA-4317) Abusive content on https://corpora.tika.apache.org/
[ https://issues.apache.org/jira/browse/TIKA-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889259#comment-17889259 ] Tim Allison commented on TIKA-4317: --- Deleted. Thank you for the report. > Abusive content on https://corpora.tika.apache.org/ > --- > > Key: TIKA-4317 > URL: https://issues.apache.org/jira/browse/TIKA-4317 > Project: Tika > Issue Type: Bug > Components: site >Reporter: Zoran Regvart >Assignee: Tim Allison >Priority: Major > > The Apache Camel team has been notified by Google of abusive content hosted > on https://corpora.tika.apache.org/, with the assumption that this is somehow > related to https://camel.apache.org. The scanning done by Google is against > the whole apache.org domain, so implication is that any abusive content found > on any domain within apache.org will be accredited and affect other domains > within apache.org. > Learn about abusive experiences here: > https://support.google.com/webtools/answer/7347327. > Singled out page from Google report (content & possibly security warning): > {code}https://corpora.tika.apache.org/base/docs/commoncrawl3/QK/QKKJTNDRIVLIPP7433IFC3EF3UVOSPIB{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4317) Abusive content on https://corpora.tika.apache.org/
[ https://issues.apache.org/jira/browse/TIKA-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4317. --- Resolution: Fixed > Abusive content on https://corpora.tika.apache.org/ > --- > > Key: TIKA-4317 > URL: https://issues.apache.org/jira/browse/TIKA-4317 > Project: Tika > Issue Type: Bug > Components: site >Reporter: Zoran Regvart > Assignee: Tim Allison >Priority: Major > > The Apache Camel team has been notified by Google of abusive content hosted > on https://corpora.tika.apache.org/, with the assumption that this is somehow > related to https://camel.apache.org. The scanning done by Google is against > the whole apache.org domain, so implication is that any abusive content found > on any domain within apache.org will be accredited and affect other domains > within apache.org. > Learn about abusive experiences here: > https://support.google.com/webtools/answer/7347327. > Singled out page from Google report (content & possibly security warning): > {code}https://corpora.tika.apache.org/base/docs/commoncrawl3/QK/QKKJTNDRIVLIPP7433IFC3EF3UVOSPIB{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: Is there a way to publish to docker.io/apache ?
Last I looked into this, I think infra granted Docker Hub push access to three (?) people per project. Maybe check with them to see if that still applies, and then see whether anyone who has karma might be willing to relinquish it? On Fri, Oct 11, 2024 at 12:22 PM Nicholas DiPiazza < nicholas.dipia...@gmail.com> wrote: > I have an image of Apache Tika Grpc that is on Dockerhub here: > > ndipiazza/tika-grpc:3.0.0-BETA2 > > I have some interest in putting that in an official > docker.io/apache/tika-grpc:3.0.0-BETA2 > > Is it possible to get a dockerhub account for my apache credentials? >
[jira] [Comment Edited] (TIKA-4316) Goals for Tika 4.x
[ https://issues.apache.org/jira/browse/TIKA-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888628#comment-17888628 ] Tim Allison edited comment on TIKA-4316 at 10/11/24 12:29 PM: -- All of this is obviously still open for discussion -- we still haven't even launched 3.x -- and I look forward to that discussion. was (Author: talli...@mitre.org): All of this is still open for discussion, and I look forward to discussion > Goals for Tika 4.x > -- > > Key: TIKA-4316 > URL: https://issues.apache.org/jira/browse/TIKA-4316 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I proposed a tentative roadmap here: > https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z > Let's use this ticket to discuss some high level changes in 4.x > Some thoughts: > 1) Require Java 17 > 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies > 3) Move tika-pipes to a separate module. Consider moving non-trivial > implementations of tika-pipes components to a separate project? > 4) Remove unsupported dl4j and sentiment analysis modules and...? > 5) Avoid fat jars where possible -- at least move tika-server to a lib/* > pattern with the assembly plugin instead of the shade plugin -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4316) Goals for Tika 4.x
[ https://issues.apache.org/jira/browse/TIKA-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888628#comment-17888628 ] Tim Allison commented on TIKA-4316: --- All of this is still open for discussion, and I look forward to discussion > Goals for Tika 4.x > -- > > Key: TIKA-4316 > URL: https://issues.apache.org/jira/browse/TIKA-4316 > Project: Tika > Issue Type: Task > Reporter: Tim Allison >Priority: Major > > I proposed a tentative roadmap here: > https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z > Let's use this ticket to discuss some high level changes in 4.x > Some thoughts: > 1) Require Java 17 > 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies > 3) Move tika-pipes to a separate module. Consider moving non-trivial > implementations of tika-pipes components to a separate project? > 4) Remove unsupported dl4j and sentiment analysis modules and...? > 5) Avoid fat jars where possible -- at least move tika-server to a lib/* > pattern with the assembly plugin instead of the shade plugin -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4316) Goals for Tika 4.x
[ https://issues.apache.org/jira/browse/TIKA-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4316: -- Summary: Goals for Tika 4.x (was: Goals for tika 4.x) > Goals for Tika 4.x > -- > > Key: TIKA-4316 > URL: https://issues.apache.org/jira/browse/TIKA-4316 > Project: Tika > Issue Type: Task > Reporter: Tim Allison >Priority: Major > > I proposed a tentative roadmap here: > https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z > Let's use this ticket to discuss some high level changes in 4.x > Some thoughts: > 1) Require Java 17 > 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies > 3) Move tika-pipes to a separate module. Consider moving non-trivial > implementations of tika-pipes components to a separate project? > 4) Remove unsupported dl4j and sentiment analysis modules and...? > 5) Avoid fat jars where possible -- at least move tika-server to a lib/* > pattern with the assembly plugin instead of the shade plugin -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4316) Goals for tika 4.x
[ https://issues.apache.org/jira/browse/TIKA-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4316: -- Description: I proposed a tentative roadmap here: https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z Let's use this ticket to discuss some high level changes in 4.x Some thoughts: 1) Require Java 17 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies 3) Move tika-pipes to a separate module. Consider moving non-trivial implementations of tika-pipes components to a separate project? 4) Remove unsupported dl4j and sentiment analysis modules and...? 5) Avoid fat jars where possible -- at least move tika-server to a lib/* pattern with the assembly plugin instead of the shade plugin was: I proposed a roadmap here: https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z Let's use this ticket to discuss some high level changes in 4.x > Goals for tika 4.x > -- > > Key: TIKA-4316 > URL: https://issues.apache.org/jira/browse/TIKA-4316 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I proposed a tentative roadmap here: > https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z > Let's use this ticket to discuss some high level changes in 4.x > Some thoughts: > 1) Require Java 17 > 2) Remove tika-batch in favor of tika-pipes with filesystem dependencies > 3) Move tika-pipes to a separate module. Consider moving non-trivial > implementations of tika-pipes components to a separate project? > 4) Remove unsupported dl4j and sentiment analysis modules and...? > 5) Avoid fat jars where possible -- at least move tika-server to a lib/* > pattern with the assembly plugin instead of the shade plugin -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4316) Goals for tika 4.x
Tim Allison created TIKA-4316: - Summary: Goals for tika 4.x Key: TIKA-4316 URL: https://issues.apache.org/jira/browse/TIKA-4316 Project: Tika Issue Type: Task Reporter: Tim Allison I proposed a roadmap here: https://lists.apache.org/thread/9yfzf6qwpc7c6qnlp4tdwsdrnjvv7r1z Let's use this ticket to discuss some high level changes in 4.x -- This message was sent by Atlassian Jira (v8.20.10#820010)