[jira] [Commented] (TIKA-4261) Add attachment type metadata filter
[ https://issues.apache.org/jira/browse/TIKA-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849394#comment-17849394 ] Hudson commented on TIKA-4261: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1638 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1638/]) TIKA-4261 -- add a clear by attachment type metadata filter (#1777) (github: [https://github.com/apache/tika/commit/c1f07222b147ce778eae5d9fef349a84939965e5]) * (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-4261-clear-by-embedded-type.xml * (edit) tika-core/src/test/java/org/apache/tika/metadata/filter/TestMetadataFilter.java * (add) tika-core/src/main/java/org/apache/tika/metadata/filter/ClearByAttachmentTypeMetadataFilter.java > Add attachment type metadata filter > --- > > Key: TIKA-4261 > URL: https://issues.apache.org/jira/browse/TIKA-4261 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > > For some users who are using the /rmeta endpoint or -J option in tika-app, > inlining ocr'd content, there is no need to include the metadata object for > the inlined image. Let's add a metadata filter to remove these metadata > objects. > The default behavior will be as before. Everything is included. Users need to > configure this to remove these inline objects. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849384#comment-17849384 ] ASF GitHub Bot commented on TIKA-4260: -- tballison commented on PR #1776: URL: https://github.com/apache/tika/pull/1776#issuecomment-2130252325 I'm now getting a clean build with `-DskipTests` lol... That's a step at least. The big TODO is to add serialization of the ParseContext in https://github.com/apache/tika/blob/main/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonFetchEmitTuple.java > Add parse context to the fetcher interface in 3.x > - > > Key: TIKA-4260 > URL: https://issues.apache.org/jira/browse/TIKA-4260 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] TIKA-4260 -- add ParseContext to fetchers and emitters [tika]
tballison commented on PR #1776: URL: https://github.com/apache/tika/pull/1776#issuecomment-2130252325 I'm now getting a clean build with `-DskipTests` lol... That's a step at least. The big TODO is to add serialization of the ParseContext in https://github.com/apache/tika/blob/main/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonFetchEmitTuple.java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TIKA-4261) Add attachment type metadata filter
[ https://issues.apache.org/jira/browse/TIKA-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849379#comment-17849379 ] ASF GitHub Bot commented on TIKA-4261: -- tballison merged PR #1777: URL: https://github.com/apache/tika/pull/1777 > Add attachment type metadata filter > --- > > Key: TIKA-4261 > URL: https://issues.apache.org/jira/browse/TIKA-4261 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > > For some users who are using the /rmeta endpoint or -J option in tika-app, > inlining ocr'd content, there is no need to include the metadata object for > the inlined image. Let's add a metadata filter to remove these metadata > objects. > The default behavior will be as before. Everything is included. Users need to > configure this to remove these inline objects. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] TIKA-4261 -- add attachment type metadata filter [tika]
tballison merged PR #1777: URL: https://github.com/apache/tika/pull/1777
[jira] [Commented] (TIKA-4261) Add attachment type metadata filter
[ https://issues.apache.org/jira/browse/TIKA-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849369#comment-17849369 ] ASF GitHub Bot commented on TIKA-4261: -- tballison opened a new pull request, #1777: URL: https://github.com/apache/tika/pull/1777 Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! Your help is appreciated! Before opening the pull request, please verify that * there is an open issue on the [Tika issue tracker](https://issues.apache.org/jira/projects/TIKA) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes. * the issue ID (`TIKA-`) - is referenced in the title of the pull request - and placed in front of your commit messages surrounded by square brackets (`[TIKA-] Issue or pull request title`) * commits are squashed into a single one (or few commits for larger changes) * Tika is successfully built and unit tests pass by running `mvn clean test` * there should be no conflicts when merging the pull request branch into the *recent* `main` branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled `main` branch * if you add new module that downstream users will depend upon add it to relevant group in `tika-bom/pom.xml`. We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Tika in general, please sign up for the [Tika mailing list](http://tika.apache.org/mail-lists.html). Thanks! > Add attachment type metadata filter > --- > > Key: TIKA-4261 > URL: https://issues.apache.org/jira/browse/TIKA-4261 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > > For some users who are using the /rmeta endpoint or -J option in tika-app, > inlining ocr'd content, there is no need to include the metadata object for > the inlined image. Let's add a metadata filter to remove these metadata > objects. > The default behavior will be as before. Everything is included. Users need to > configure this to remove these inline objects. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] TIKA-4261 -- add attachment type metadata filter [tika]
tballison opened a new pull request, #1777: URL: https://github.com/apache/tika/pull/1777
[jira] [Created] (TIKA-4261) Add attachment type metadata filter
Tim Allison created TIKA-4261: - Summary: Add attachment type metadata filter Key: TIKA-4261 URL: https://issues.apache.org/jira/browse/TIKA-4261 Project: Tika Issue Type: Task Reporter: Tim Allison For some users who are using the /rmeta endpoint or -J option in tika-app, inlining ocr'd content, there is no need to include the metadata object for the inlined image. Let's add a metadata filter to remove these metadata objects. The default behavior will be as before. Everything is included. Users need to configure this to remove these inline objects. -- This message was sent by Atlassian Jira (v8.20.10#820010)
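A minimal sketch of the filtering idea described in the issue, in plain Java with no Tika dependency. It models the /rmeta output as a list of string maps and drops the entries whose embedded resource type is configured for removal. The class name, the `embeddedResourceType` key, and the `INLINE` value are assumptions made for illustration; the filter actually added by the PR is ClearByAttachmentTypeMetadataFilter and may work differently (for example, by clearing a metadata object instead of dropping it from the list).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the filter proposed in TIKA-4261; not Tika's actual class.
final class AttachmentTypeFilterSketch {

    // Assumed metadata key for the embedded resource type; Tika's real key may differ.
    static final String EMBEDDED_RESOURCE_TYPE = "embeddedResourceType";

    private final Set<String> typesToClear;

    // An empty set reproduces the default behavior: everything is included.
    AttachmentTypeFilterSketch(Set<String> typesToClear) {
        this.typesToClear = Set.copyOf(typesToClear);
    }

    // Returns a new list without the metadata objects whose type is configured for removal.
    List<Map<String, String>> filter(List<Map<String, String>> metadataList) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> metadata : metadataList) {
            String type = metadata.get(EMBEDDED_RESOURCE_TYPE);
            // The container document has no embedded type, so it is always kept.
            if (type == null || !typesToClear.contains(type)) {
                kept.add(metadata);
            }
        }
        return kept;
    }
}
```

With an empty set the list comes back unchanged, which matches the "default behavior will be as before" note; a user would opt in by passing the types to remove, for example `Set.of("INLINE")`.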
[jira] [Commented] (TIKA-4259) Decouple xml parser stuff from ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849321#comment-17849321 ] Hudson commented on TIKA-4259: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1637 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1637/]) TIKA-4259 (#1775) (github: [https://github.com/apache/tika/commit/035682cdd9e993cd441f005f62a3b36f410c50b6]) * (edit) CHANGES.txt * (edit) tika-core/src/main/java/org/apache/tika/utils/XMLReaderUtils.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java * (edit) tika-core/src/main/java/org/apache/tika/parser/ParseContext.java * (edit) tika-core/src/test/java/org/apache/tika/parser/mock/MockParser.java > Decouple xml parser stuff from ParseContext > --- > > Key: TIKA-4259 > URL: https://issues.apache.org/jira/browse/TIKA-4259 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > ParseContext has some xmlreader convenience methods. We should move those to > XMLReaderUtils in 3.x to simplify ParseContext's api. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4259) Decouple xml parser stuff from ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4259. --- Fix Version/s: 3.0.0 Resolution: Fixed > Decouple xml parser stuff from ParseContext > --- > > Key: TIKA-4259 > URL: https://issues.apache.org/jira/browse/TIKA-4259 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > ParseContext has some xmlreader convenience methods. We should move those to > XMLReaderUtils in 3.x to simplify ParseContext's api. -- This message was sent by Atlassian Jira (v8.20.10#820010)
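To illustrate the kind of decoupling the issue describes, here is a small sketch of a static XML helper that lives outside any context object. It uses only JDK classes. The class and method names are hypothetical and are not the actual XMLReaderUtils API; the choice of features to enable is likewise an assumption.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical utility in the spirit of moving XML convenience methods
// out of a context class; not Tika's XMLReaderUtils.
final class XmlReaderUtilsSketch {

    private XmlReaderUtilsSketch() {
        // static helpers only
    }

    // Builds a namespace-aware XMLReader with the JDK's secure-processing feature enabled.
    static XMLReader newXMLReader() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return factory.newSAXParser().getXMLReader();
    }
}
```

Callers that previously asked the context for a reader would call the static helper instead, which keeps the context class focused on carrying per-parse objects.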
[jira] [Commented] (TIKA-4259) Decouple xml parser stuff from ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849299#comment-17849299 ] ASF GitHub Bot commented on TIKA-4259: -- tballison merged PR #1775: URL: https://github.com/apache/tika/pull/1775 > Decouple xml parser stuff from ParseContext > --- > > Key: TIKA-4259 > URL: https://issues.apache.org/jira/browse/TIKA-4259 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > > ParseContext has some xmlreader convenience methods. We should move those to > XMLReaderUtils in 3.x to simplify ParseContext's api. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] TIKA-4259 [tika]
tballison merged PR #1775: URL: https://github.com/apache/tika/pull/1775
[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849298#comment-17849298 ] Tim Allison commented on TIKA-4260: --- That PR currently only works on tika-core. More needs to be done before we can merge this if this is the direction we'd like to go. > Add parse context to the fetcher interface in 3.x > - > > Key: TIKA-4260 > URL: https://issues.apache.org/jira/browse/TIKA-4260 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849296#comment-17849296 ] ASF GitHub Bot commented on TIKA-4260: -- tballison opened a new pull request, #1776: URL: https://github.com/apache/tika/pull/1776 > Add parse context to the fetcher interface in 3.x > - > > Key: TIKA-4260 > URL: https://issues.apache.org/jira/browse/TIKA-4260 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849297#comment-17849297 ] ASF GitHub Bot commented on TIKA-4260: -- tballison commented on PR #1776: URL: https://github.com/apache/tika/pull/1776#issuecomment-2129532368 Current status -- only working in tika-core. Much more needs to be done throughout the repo to get this working. > Add parse context to the fetcher interface in 3.x > - > > Key: TIKA-4260 > URL: https://issues.apache.org/jira/browse/TIKA-4260 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] TIKA-4260 -- add ParseContext to fetchers and emitters [tika]
tballison commented on PR #1776: URL: https://github.com/apache/tika/pull/1776#issuecomment-2129532368 Current status -- only working in tika-core. Much more needs to be done throughout the repo to get this working.
[PR] TIKA-4260 -- add ParseContext to fetchers and emitters [tika]
tballison opened a new pull request, #1776: URL: https://github.com/apache/tika/pull/1776
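As a rough illustration of what "add ParseContext to fetchers" could look like, the sketch below passes a class-keyed context object into a fetch call so that per-request settings travel with the request. Everything here is a stand-in written for this example: `ContextSketch`, `FetcherSketch`, `FetchLimitSketch`, and the `fetchedLength` key are hypothetical and are not taken from the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Stand-in for a class-keyed parse context (hypothetical, not Tika's ParseContext).
final class ContextSketch {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    // Returns null when nothing was registered for the key.
    <T> T get(Class<T> key) {
        return key.cast(entries.get(key));
    }
}

// Hypothetical per-request setting carried in the context.
final class FetchLimitSketch {
    final int maxBytes;

    FetchLimitSketch(int maxBytes) {
        this.maxBytes = maxBytes;
    }
}

// Hypothetical fetcher interface that receives the context on every call.
interface FetcherSketch {
    InputStream fetch(String fetchKey, Map<String, String> metadata, ContextSketch context)
            throws IOException;
}

// In-memory implementation used only to make the sketch runnable.
final class InMemoryFetcherSketch implements FetcherSketch {
    private final Map<String, byte[]> store;

    InMemoryFetcherSketch(Map<String, byte[]> store) {
        this.store = store;
    }

    @Override
    public InputStream fetch(String fetchKey, Map<String, String> metadata, ContextSketch context)
            throws IOException {
        byte[] bytes = store.get(fetchKey);
        if (bytes == null) {
            throw new IOException("no such fetch key: " + fetchKey);
        }
        // Read an optional per-request limit from the context; absent means no limit.
        FetchLimitSketch limit = context.get(FetchLimitSketch.class);
        int length = limit == null ? bytes.length : Math.min(bytes.length, limit.maxBytes);
        metadata.put("fetchedLength", String.valueOf(length));
        return new ByteArrayInputStream(bytes, 0, length);
    }
}
```

The design point is that the fetcher itself stays stateless with respect to request options: a caller that wants different behavior for one request puts a setting in the context instead of reconfiguring the fetcher.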
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849288#comment-17849288 ] Tim Allison commented on TIKA-4243: --- [~ndipiazza], I added parseContext to fetchers and emitters on the TIKA-4260 branch. That might be a good start for serializing the ParseContext? All, let me know what you think. > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849103#comment-17849103 ] Tim Allison edited comment on TIKA-4243 at 5/24/24 1:00 PM: Proposed basic roadmap: 1. Add parseContext to fetchers and emitters (and pipesReporter?). 2. Serialize ParseContext as is... 3. Allow for serialization of current XConfigs, e.g. PDFParserConfig, etc. 4. Add creation of parsers with e.g. new PDFParser(ParseContext context). 5. Wire config stuff into tika-server, tika-pipes, tika-app. 6. Merge tika-grpc-server with new config options. This would require serialization of classes that users want to be able to configure + serialization. This would allow us to get rid of all of our custom serialization stuff for Tika 4.x. was (Author: talli...@mitre.org): Proposed basic roadmap: 1. Serialize ParseContext as is... 2. Allow for serialization of current XConfigs, e.g. PDFParserConfig, etc. 3. Add creation of parsers with e.g. new PDFParser(ParseContext context). 4. Wire config stuff into tika-server, tika-pipes, tika-app. 5. Merge tika-grpc-server with new config options. This would require serialization of classes that users want to be able to configure + serialization. This would allow us to get rid of all of our custom serialization stuff for Tika 4.x. > tika configuration overhaul > --- > > Key: TIKA-4243 > URL: https://issues.apache.org/jira/browse/TIKA-4243 > Project: Tika > Issue Type: New Feature > Components: config >Affects Versions: 3.0.0 >Reporter: Nicholas DiPiazza >Priority: Major > > In 3.0.0 when dealing with Tika, it would greatly help to have a Typed > Configuration schema. > In 3.x can we remove the old way of doing configs and replace with Json > Schema? > Json Schema can be converted to Pojos using a maven plugin > [https://github.com/joelittlejohn/jsonschema2pojo] > This automatically creates a Java Pojo model we can use for the configs. > This can allow for the legacy tika-config XML to be read and converted to the > new pojos easily using an XML mapper so that users don't have to use JSON > configurations yet if they do not want. > When complete, configurations can be set as XML, JSON or YAML > tika-config.xml > tika-config.json > tika-config.yaml > Replace all instances of tika config annotations that used the old syntax, > and replace with the Pojo model serialized from the xml/json/yaml. -- This message was sent by Atlassian Jira (v8.20.10#820010)
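The typed-configuration idea can be sketched without any mapper library: a config POJO that round-trips through a neutral key/value form, and a parser constructed from that POJO. An XML, JSON, or YAML front end would only need to produce the same key/value form. All names and fields below (`PdfConfigSketch`, `extractInlineImages`, `ocrDpi`, `PdfParserSketch`) and the default values are hypothetical and are not Tika's PDFParserConfig or PDFParser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical typed config; field names and defaults are illustrative only.
final class PdfConfigSketch {
    final boolean extractInlineImages;
    final int ocrDpi;

    PdfConfigSketch(boolean extractInlineImages, int ocrDpi) {
        this.extractInlineImages = extractInlineImages;
        this.ocrDpi = ocrDpi;
    }

    // Serializes to a neutral key/value form that any format-specific writer could emit.
    Map<String, String> toMap() {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("extractInlineImages", String.valueOf(extractInlineImages));
        values.put("ocrDpi", String.valueOf(ocrDpi));
        return values;
    }

    // Rebuilds the typed config; missing keys fall back to assumed defaults.
    static PdfConfigSketch fromMap(Map<String, String> values) {
        return new PdfConfigSketch(
                Boolean.parseBoolean(values.getOrDefault("extractInlineImages", "false")),
                Integer.parseInt(values.getOrDefault("ocrDpi", "300")));
    }
}

// Hypothetical parser that is created from a typed config instead of ad hoc setters.
final class PdfParserSketch {
    private final PdfConfigSketch config;

    PdfParserSketch(PdfConfigSketch config) {
        this.config = config;
    }

    String describe() {
        return "extractInlineImages=" + config.extractInlineImages + ", ocrDpi=" + config.ocrDpi;
    }
}
```

Keeping the typed object separate from the wire format is what would let legacy XML configs and newer JSON or YAML configs feed the same model.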