[jira] [Commented] (TIKA-4261) Add attachment type metadata filter

2024-05-24 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849394#comment-17849394
 ] 

Hudson commented on TIKA-4261:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1638 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1638/])
TIKA-4261 -- add a clear by attachment type metadata filter (#1777) (github: 
[https://github.com/apache/tika/commit/c1f07222b147ce778eae5d9fef349a84939965e5])
* (add) 
tika-core/src/test/resources/org/apache/tika/config/TIKA-4261-clear-by-embedded-type.xml
* (edit) 
tika-core/src/test/java/org/apache/tika/metadata/filter/TestMetadataFilter.java
* (add) 
tika-core/src/main/java/org/apache/tika/metadata/filter/ClearByAttachmentTypeMetadataFilter.java


> Add attachment type metadata filter
> ---
>
> Key: TIKA-4261
> URL: https://issues.apache.org/jira/browse/TIKA-4261
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> For some users who are using the /rmeta endpoint or -J option in tika-app, 
> inlining ocr'd content, there is no need to include the metadata object for 
> the inlined image. Let's add a metadata filter to remove these metadata 
> objects.
> The default behavior will be as before. Everything is included. Users need to 
> configure this to remove these inline objects.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849384#comment-17849384
 ] 

ASF GitHub Bot commented on TIKA-4260:
--

tballison commented on PR #1776:
URL: https://github.com/apache/tika/pull/1776#issuecomment-2130252325

   I'm now getting a clean build with `-DskipTests` lol... That's a step at 
least.
   
   The big TODO is to add serialization of the ParseContext in 
https://github.com/apache/tika/blob/main/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonFetchEmitTuple.java




> Add parse context to the fetcher interface in 3.x
> -
>
> Key: TIKA-4260
> URL: https://issues.apache.org/jira/browse/TIKA-4260
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>






Re: [PR] TIKA-4260 -- add ParseContext to fetchers and emitters [tika]

2024-05-24 Thread via GitHub


tballison commented on PR #1776:
URL: https://github.com/apache/tika/pull/1776#issuecomment-2130252325

   I'm now getting a clean build with `-DskipTests` lol... That's a step at 
least.
   
   The big TODO is to add serialization of the ParseContext in 
https://github.com/apache/tika/blob/main/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonFetchEmitTuple.java


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (TIKA-4261) Add attachment type metadata filter

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849379#comment-17849379
 ] 

ASF GitHub Bot commented on TIKA-4261:
--

tballison merged PR #1777:
URL: https://github.com/apache/tika/pull/1777




> Add attachment type metadata filter
> ---
>
> Key: TIKA-4261
> URL: https://issues.apache.org/jira/browse/TIKA-4261
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> For some users who are using the /rmeta endpoint or -J option in tika-app, 
> inlining ocr'd content, there is no need to include the metadata object for 
> the inlined image. Let's add a metadata filter to remove these metadata 
> objects.
> The default behavior will be as before. Everything is included. Users need to 
> configure this to remove these inline objects.





Re: [PR] TIKA-4261 -- add attachment type metadata filter [tika]

2024-05-24 Thread via GitHub


tballison merged PR #1777:
URL: https://github.com/apache/tika/pull/1777





[jira] [Commented] (TIKA-4261) Add attachment type metadata filter

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849369#comment-17849369
 ] 

ASF GitHub Bot commented on TIKA-4261:
--

tballison opened a new pull request, #1777:
URL: https://github.com/apache/tika/pull/1777

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Tika in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Add attachment type metadata filter
> ---
>
> Key: TIKA-4261
> URL: https://issues.apache.org/jira/browse/TIKA-4261
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> For some users who are using the /rmeta endpoint or -J option in tika-app, 
> inlining ocr'd content, there is no need to include the metadata object for 
> the inlined image. Let's add a metadata filter to remove these metadata 
> objects.
> The default behavior will be as before. Everything is included. Users need to 
> configure this to remove these inline objects.





[PR] TIKA-4261 -- add attachment type metadata filter [tika]

2024-05-24 Thread via GitHub


tballison opened a new pull request, #1777:
URL: https://github.com/apache/tika/pull/1777

   
   
   





[jira] [Created] (TIKA-4261) Add attachment type metadata filter

2024-05-24 Thread Tim Allison (Jira)
Tim Allison created TIKA-4261:
-

     Summary: Add attachment type metadata filter
         Key: TIKA-4261
         URL: https://issues.apache.org/jira/browse/TIKA-4261
     Project: Tika
  Issue Type: Task
    Reporter: Tim Allison


For some users who are using the /rmeta endpoint or -J option in tika-app, 
inlining ocr'd content, there is no need to include the metadata object for the 
inlined image. Let's add a metadata filter to remove these metadata objects.

The default behavior will be as before. Everything is included. Users need to 
configure this to remove these inline objects.
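
A minimal sketch of the idea described above, for illustration only. It does not use Tika's classes and is not the ClearByAttachmentTypeMetadataFilter from the commit. The class name, the helper method, and the key "X-TIKA:embeddedResourceType" are assumptions; check the key and the type values against your Tika version. Each metadata object is modelled as a plain Map, and the filter drops those objects whose embedded resource type is configured for removal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the filter; not Tika's actual implementation.
class AttachmentTypeFilterSketch {

    // Assumed key name for the embedded resource type.
    static final String TYPE_KEY = "X-TIKA:embeddedResourceType";

    private final Set<String> typesToDrop;

    AttachmentTypeFilterSketch(String... typesToDrop) {
        this.typesToDrop = new HashSet<>(Arrays.asList(typesToDrop));
    }

    // Returns a new list without the metadata objects whose type is configured
    // for removal. With no configured types, every object is kept.
    List<Map<String, String>> filter(List<Map<String, String>> metadataList) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> metadata : metadataList) {
            String type = metadata.get(TYPE_KEY);
            if (type == null || !typesToDrop.contains(type)) {
                kept.add(metadata);
            }
        }
        return kept;
    }

    // Builds one metadata object for a demo; a null type models the container document.
    static Map<String, String> metadata(String name, String type) {
        Map<String, String> m = new HashMap<>();
        m.put("resourceName", name);
        if (type != null) {
            m.put(TYPE_KEY, type);
        }
        return m;
    }
}
```

Constructing the sketch with no arguments keeps every object, which mirrors the "default behavior will be as before" statement. Constructing it with "INLINE" drops only the objects tagged as inline.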





[jira] [Commented] (TIKA-4259) Decouple xml parser stuff from ParseContext

2024-05-24 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849321#comment-17849321
 ] 

Hudson commented on TIKA-4259:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1637 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1637/])
TIKA-4259 (#1775) (github: 
[https://github.com/apache/tika/commit/035682cdd9e993cd441f005f62a3b36f410c50b6])
* (edit) CHANGES.txt
* (edit) tika-core/src/main/java/org/apache/tika/utils/XMLReaderUtils.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java
* (edit) tika-core/src/main/java/org/apache/tika/parser/ParseContext.java
* (edit) tika-core/src/test/java/org/apache/tika/parser/mock/MockParser.java


> Decouple xml parser stuff from ParseContext
> ---
>
> Key: TIKA-4259
> URL: https://issues.apache.org/jira/browse/TIKA-4259
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> ParseContext has some xmlreader convenience methods. We should move those to 
> XMLReaderUtils in 3.x to simplify ParseContext's api.
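
To illustrate the kind of move described in the issue, here is a rough sketch of XML reader creation living in a static utility class rather than on a context object. XmlReaderUtilsSketch and its methods are hypothetical names; Tika's XMLReaderUtils has its own API, which this thread does not show. Only standard JAXP and SAX calls are used.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical utility class; not Tika's XMLReaderUtils.
final class XmlReaderUtilsSketch {

    private XmlReaderUtilsSketch() {
    }

    // Builds a namespace-aware XMLReader with secure processing enabled.
    static XMLReader newXMLReader() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return factory.newSAXParser().getXMLReader();
    }

    // Collects the local names of all elements in document order.
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();
        try {
            XMLReader reader = newXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    names.add(localName);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return names;
    }
}
```

Callers depend on the utility class alone, so a context class that used to carry such convenience methods can shrink to holding per-parse objects.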





[jira] [Resolved] (TIKA-4259) Decouple xml parser stuff from ParseContext

2024-05-24 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4259.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Decouple xml parser stuff from ParseContext
> ---
>
> Key: TIKA-4259
> URL: https://issues.apache.org/jira/browse/TIKA-4259
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> ParseContext has some xmlreader convenience methods. We should move those to 
> XMLReaderUtils in 3.x to simplify ParseContext's api.





[jira] [Commented] (TIKA-4259) Decouple xml parser stuff from ParseContext

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849299#comment-17849299
 ] 

ASF GitHub Bot commented on TIKA-4259:
--

tballison merged PR #1775:
URL: https://github.com/apache/tika/pull/1775




> Decouple xml parser stuff from ParseContext
> ---
>
> Key: TIKA-4259
> URL: https://issues.apache.org/jira/browse/TIKA-4259
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> ParseContext has some xmlreader convenience methods. We should move those to 
> XMLReaderUtils in 3.x to simplify ParseContext's api.





Re: [PR] TIKA-4259 [tika]

2024-05-24 Thread via GitHub


tballison merged PR #1775:
URL: https://github.com/apache/tika/pull/1775





[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-05-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849298#comment-17849298
 ] 

Tim Allison commented on TIKA-4260:
---

That PR currently only works on tika-core. More needs to be done before we can 
merge this if this is the direction we'd like to go.

> Add parse context to the fetcher interface in 3.x
> -
>
> Key: TIKA-4260
> URL: https://issues.apache.org/jira/browse/TIKA-4260
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>






[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849296#comment-17849296
 ] 

ASF GitHub Bot commented on TIKA-4260:
--

tballison opened a new pull request, #1776:
URL: https://github.com/apache/tika/pull/1776

   
   
   




> Add parse context to the fetcher interface in 3.x
> -
>
> Key: TIKA-4260
> URL: https://issues.apache.org/jira/browse/TIKA-4260
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>






[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849297#comment-17849297
 ] 

ASF GitHub Bot commented on TIKA-4260:
--

tballison commented on PR #1776:
URL: https://github.com/apache/tika/pull/1776#issuecomment-2129532368

   Current status -- only working in tika-core. Much more needs to be done 
throughout the repo to get this working.

> Add parse context to the fetcher interface in 3.x
> -
>
> Key: TIKA-4260
> URL: https://issues.apache.org/jira/browse/TIKA-4260
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>






Re: [PR] TIKA-4260 -- add ParseContext to fetchers and emitters [tika]

2024-05-24 Thread via GitHub


tballison commented on PR #1776:
URL: https://github.com/apache/tika/pull/1776#issuecomment-2129532368

   Current status -- only working in tika-core. Much more needs to be done 
throughout the repo to get this working.
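
For readers unfamiliar with the idea, the following is a rough sketch of what passing a per-request context to a fetcher could look like. It is not the API from PR #1776, whose signatures are not shown in this thread. ContextSketch, FetchRange, FetcherSketch and InMemoryFetcher are all hypothetical names. The context is modelled as a small type-keyed map, and the fetcher reads an optional setting from it on each call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified per-request context keyed by type.
class ContextSketch {
    private final Map<String, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key.getName(), value);
    }

    <T> T get(Class<T> key) {
        return key.cast(entries.get(key.getName()));
    }
}

// Hypothetical per-request setting: byte range [start, end) to fetch.
class FetchRange {
    final int start;
    final int end;

    FetchRange(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

// Hypothetical fetcher interface: the context is an extra argument on each call.
interface FetcherSketch {
    InputStream fetch(String fetchKey, ContextSketch context);
}

// Serves bytes from memory and honors an optional FetchRange found in the context.
class InMemoryFetcher implements FetcherSketch {
    private final Map<String, byte[]> store = new HashMap<>();

    void put(String key, String content) {
        store.put(key, content.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public InputStream fetch(String fetchKey, ContextSketch context) {
        byte[] bytes = store.get(fetchKey);
        if (bytes == null) {
            throw new IllegalArgumentException("unknown fetch key: " + fetchKey);
        }
        FetchRange range = context.get(FetchRange.class);
        if (range == null) {
            return new ByteArrayInputStream(bytes);
        }
        return new ByteArrayInputStream(bytes, range.start, range.end - range.start);
    }

    // Demo helper: drains a stream into a UTF-8 string.
    static String readAll(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The point of the design is that per-request settings travel with the call instead of being fixed when the fetcher is configured, so one fetcher instance can serve requests with different options.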





[PR] TIKA-4260 -- add ParseContext to fetchers and emitters [tika]

2024-05-24 Thread via GitHub


tballison opened a new pull request, #1776:
URL: https://github.com/apache/tika/pull/1776

   
   
   





[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849288#comment-17849288
 ] 

Tim Allison commented on TIKA-4243:
---

[~ndipiazza], I added parseContext to fetchers and emitters on the TIKA-4260 
branch. That might be a good start for serializing the ParseContext?

All, let me know what you think.

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.
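
As a rough illustration of reading a legacy XML config into a typed object, here is a sketch that uses only the JDK's DOM parser. It does not use jsonschema2pojo or any XML mapper, and it is not a proposal for the actual schema. PdfConfigSketch, LegacyXmlConfigReader, the param names, and the default values are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical typed config object; field names are illustrative only.
final class PdfConfigSketch {
    final boolean sortByPosition;
    final String ocrStrategy;

    PdfConfigSketch(boolean sortByPosition, String ocrStrategy) {
        this.sortByPosition = sortByPosition;
        this.ocrStrategy = ocrStrategy;
    }
}

final class LegacyXmlConfigReader {

    private LegacyXmlConfigReader() {
    }

    // Reads <param name="...">value</param> elements into the typed object.
    // Unknown parameter names are rejected so that typos surface early.
    static PdfConfigSketch read(String xml) {
        boolean sortByPosition = false; // placeholder default
        String ocrStrategy = "auto";    // placeholder default
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList params = doc.getElementsByTagName("param");
            for (int i = 0; i < params.getLength(); i++) {
                Element param = (Element) params.item(i);
                String name = param.getAttribute("name");
                String value = param.getTextContent().trim();
                if ("sortByPosition".equals(name)) {
                    sortByPosition = Boolean.parseBoolean(value);
                } else if ("ocrStrategy".equals(name)) {
                    ocrStrategy = value;
                } else {
                    throw new IllegalArgumentException("unknown param: " + name);
                }
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not read config", e);
        }
        return new PdfConfigSketch(sortByPosition, ocrStrategy);
    }
}
```

Once the values live in a typed object like this, the same object could in principle be populated from JSON or YAML by a different reader, which is the flexibility the issue asks for.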



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-05-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849103#comment-17849103
 ] 

Tim Allison edited comment on TIKA-4243 at 5/24/24 1:00 PM:


Proposed basic roadmap:

Add parseContext to fetchers and emitters (and pipesReporter?)
Serialize ParseContext as is...
Allow for serialization of current XConfigs, e.g. PDFParserConfig, etc.
Add creation of parsers with e.g. new PDFParser(ParseContext context).
Wire config stuff into tika-server, tika-pipes, tika-app
Merge tika-grpc-server with new config options

This would require serialization of classes that users want to be able to 
configure + serialization.

This would allow us to get rid of all of our custom serialization stuff for 
Tika 4.x.



was (Author: talli...@mitre.org):
Proposed basic roadmap:

Serialize ParseContext as is...
Allow for serialization of current XConfigs, e.g. PDFParserConfig, etc.
Add creation of parsers with e.g. new PDFParser(ParseContext context).
Wire config stuff into tika-server, tika-pipes, tika-app
Merge tika-grpc-server with new config options

This would require serialization of classes that users want to be able to 
configure + serialization.

This would allow us to get rid of all of our custom serialization stuff for 
Tika 4.x.


> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.


