[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release

2024-07-25 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868678#comment-17868678
 ] 

Nicholas DiPiazza commented on TIKA-4280:
-

So for tika-server we normally produced a single jar file.

Now we will produce a jar file along with a directory of other jar files.

You can run the server using Maven via exec:java.

And when we build for production, do we have to add some sort of .sh/.bat file 
that shows users how to launch it? 

> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> Other things? Thank you [~tilman] for the first two!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4286) fix issues where MS graph fetcher is missing deps

2024-07-22 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-4286:
---

 Summary: fix issues where MS graph fetcher is missing deps
 Key: TIKA-4286
 URL: https://issues.apache.org/jira/browse/TIKA-4286
 Project: Tika
  Issue Type: Task
  Components: tika-pipes
Affects Versions: 3.0.0-BETA
Reporter: Nicholas DiPiazza


When trying to save the MS Graph Fetcher in Tika Grpc, it would error out due 
to missing classes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4272) create tika docker image for tika-grpc

2024-06-26 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4272:

Description: 
Now that the tika-grpc branch has been merged to main, we need a tika-grpc 
server image. 

I thought for a bit about using the same Tika docker image as we already use, 
but that is probably not a good idea because vastly different jar files are 
needed for tika-grpc. 

  was:now that the tika-grpc branch has been merged to main, the tika-docker 
image needs to be changed so that we can use tika-grpc... same thing as 
tika-server but with the grpc runner instead of the tika rest services


> create tika docker image for tika-grpc
> --
>
> Key: TIKA-4272
> URL: https://issues.apache.org/jira/browse/TIKA-4272
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Now that the tika-grpc branch has been merged to main, we need a tika-grpc 
> server image. 
> I thought for a bit about using the same Tika docker image as we already use, 
> but that is probably not a good idea because vastly different jar files are 
> needed for tika-grpc. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4272) make changes to tika docker image so that tika can run grpc server or rest server

2024-06-26 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4272:

Description: now that the tika-grpc branch has been merged to main, the 
tika-docker image needs to be changed so that we can use tika-grpc... same 
thing as tika-server but with the grpc runner instead of the tika rest services 
 (was: now that the tika-grpc branch has been merged to main, create a new 
tika-docker image for tika-grpc... same thing as tika-server but with the grpc 
runner instead of the tika rest services)

> make changes to tika docker image so that tika can run grpc server or rest 
> server
> -
>
> Key: TIKA-4272
> URL: https://issues.apache.org/jira/browse/TIKA-4272
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Now that the tika-grpc branch has been merged to main, the tika-docker image 
> needs to be changed so that we can use tika-grpc... same thing as tika-server 
> but with the grpc runner instead of the tika rest services



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4272) create tika docker image for tika-grpc

2024-06-26 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4272:

Summary: create tika docker image for tika-grpc  (was: make changes to tika 
docker image so that tika can run grpc server or rest server)

> create tika docker image for tika-grpc
> --
>
> Key: TIKA-4272
> URL: https://issues.apache.org/jira/browse/TIKA-4272
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Now that the tika-grpc branch has been merged to main, the tika-docker image 
> needs to be changed so that we can use tika-grpc... same thing as tika-server 
> but with the grpc runner instead of the tika rest services



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4272) make changes to tika docker image so that tika can run grpc server or rest server

2024-06-26 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4272:

Summary: make changes to tika docker image so that tika can run grpc server 
or rest server  (was: create a Docker image for tika-grpc-server)

> make changes to tika docker image so that tika can run grpc server or rest 
> server
> -
>
> Key: TIKA-4272
> URL: https://issues.apache.org/jira/browse/TIKA-4272
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Now that the tika-grpc branch has been merged to main, create a new 
> tika-docker image for tika-grpc... same thing as tika-server but with the 
> grpc runner instead of the tika rest services



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4272) create a Docker image for tika-grpc-server

2024-06-26 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4272:

Summary: create a Docker image for tika-grpc-server  (was: create an image 
for tika-grpc-server)

> create a Docker image for tika-grpc-server
> --
>
> Key: TIKA-4272
> URL: https://issues.apache.org/jira/browse/TIKA-4272
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Now that the tika-grpc branch has been merged to main, create a new 
> tika-docker image for tika-grpc... same thing as tika-server but with the 
> grpc runner instead of the tika rest services



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4273) create a helm deployment for tika-grpc

2024-06-26 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4273:

Description: after we have created a tika-grpc image, we need to create a 
deployment in the tika helm chart.

> create a helm deployment for tika-grpc
> --
>
> Key: TIKA-4273
> URL: https://issues.apache.org/jira/browse/TIKA-4273
> Project: Tika
>  Issue Type: New Feature
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> after we have created a tika-grpc image, we need to create a deployment in 
> the tika helm chart.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4273) create a helm deployment for tika-grpc

2024-06-26 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-4273:
---

 Summary: create a helm deployment for tika-grpc
 Key: TIKA-4273
 URL: https://issues.apache.org/jira/browse/TIKA-4273
 Project: Tika
  Issue Type: New Feature
Reporter: Nicholas DiPiazza






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4272) create an image for tika-grpc-server

2024-06-26 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-4272:
---

 Summary: create an image for tika-grpc-server
 Key: TIKA-4272
 URL: https://issues.apache.org/jira/browse/TIKA-4272
 Project: Tika
  Issue Type: New Feature
  Components: tika-pipes
Reporter: Nicholas DiPiazza


Now that the tika-grpc branch has been merged to main, create a new tika-docker 
image for tika-grpc... same thing as tika-server but with the grpc runner 
instead of the tika rest services



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-25 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860032#comment-17860032
 ] 

Nicholas DiPiazza commented on TIKA-4251:
-

I agree with Google format being the new standard, given that wildcard imports 
are set to .

 

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-25 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860011#comment-17860011
 ] 

Nicholas DiPiazza commented on TIKA-4251:
-

I volunteer to review the PR thoroughly. 

Here is how I will do it:

1) Use IntelliJ to format the code using the checkstyle profile.

2) Use Eclipse to format the code using the checkstyle profile.

That is two different programs doing the same thing.

Then diff the results.

We should find minimal to no differences, which helps guarantee confidence.

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-25 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860005#comment-17860005
 ] 

Nicholas DiPiazza edited comment on TIKA-4251 at 6/25/24 6:42 PM:
--

I guess we don't even need the maven plugin then.

We can use IntelliJ to format all Java source one time.

Then use the "format code" option in the git commit dialog so that you always 
have formatted commits (given that you used IntelliJ to commit).

Eclipse has this option as well, to format on save. Same thing: as long as they 
are using Eclipse, they will never have checkstyle issues.

This solves the "stop having checkstyle back-and-forth that wastes tons of 
time" issue.


was (Author: ndipiazza):
I guess we don't even need the maven plugin then.

We can use IntelliJ to format all Java source one time.

Then use the "format code" option in the git commit dialog so that you always 
have formatted commits (given that you used IntelliJ to commit).

This solves the "stop having checkstyle back-and-forth that wastes tons of 
time" issue.

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-25 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860005#comment-17860005
 ] 

Nicholas DiPiazza edited comment on TIKA-4251 at 6/25/24 6:30 PM:
--

I guess we don't even need the maven plugin then.

We can use IntelliJ to format all Java source one time.

Then use the "format code" option in the git commit dialog so that you always 
have formatted commits (given that you used IntelliJ to commit).

This solves the "stop having checkstyle back-and-forth that wastes tons of 
time" issue.


was (Author: ndipiazza):
i guess we don't even need the maven plugin then.

we can use intellij to format all java source one time.

Then use the "format code" option in the git commit dialog so that you always 
have formatted commits (given that you used intellij to commit).

this provides the "stop having checkstyle back-and-forth that wastes tons of 
time) issue

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-25 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860005#comment-17860005
 ] 

Nicholas DiPiazza commented on TIKA-4251:
-

I guess we don't even need the maven plugin then.

We can use IntelliJ to format all Java source one time.

Then use the "format code" option in the git commit dialog so that you always 
have formatted commits (given that you used IntelliJ to commit).

This solves the "stop having checkstyle back-and-forth that wastes tons of 
time" issue.

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-25 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860004#comment-17860004
 ] 

Nicholas DiPiazza edited comment on TIKA-4251 at 6/25/24 6:28 PM:
--

I think as long as the plugin isn't transparently formatting code after commit, 
we are mitigating the risk.

This becomes a tool you can plug in to a git hook locally, and it will produce 
PRs with formatted code that is going to be reviewed anyway. And the diffs 
should be very consumable, because we eat the one-time-format cost and 
reformatting again should incur no additional changes.


was (Author: ndipiazza):
I think as long as the plugin isn't transparently formatting code after commit, 
we are mitigating the risk.

This becomes a tool you can plug in to a git hook locally, and it will produce 
PRs with code that is going to be reviewed anyway. And the diffs should be very 
consumable, because we eat the one-time-format cost and reformatting again 
should incur no additional changes.

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-25 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860004#comment-17860004
 ] 

Nicholas DiPiazza commented on TIKA-4251:
-

I think as long as the plugin isn't transparently formatting code after commit, 
we are mitigating the risk.

This becomes a tool you can plug in to a git hook locally, and it will produce 
PRs with code that is going to be reviewed anyway. And the diffs should be very 
consumable, because we eat the one-time-format cost and reformatting again 
should incur no additional changes.

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4181) Tika Grpc Server using Tika Pipes

2024-06-25 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859987#comment-17859987
 ] 

Nicholas DiPiazza commented on TIKA-4181:
-

I will be merging this today. Any issues, let me know. 

> Tika Grpc Server using Tika Pipes
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Create a Tika Grpc server.
> You should be able to create Tika Pipes fetchers, then use those fetchers. 
> You can then use those fetchers to FetchAndParse in 3 ways:
>  * synchronous fashion - you send a single request to fetch a file, and get a 
> single FetchAndParse response tuple.
>  * streaming output - you send a single request and stream back the 
> FetchAndParse response tuple.
>  * bi-directional streaming - You stream in 1 or more Fetch requests and 
> stream back FetchAndParse response tuples.
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!
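
For illustration, here is a minimal grpc-java sketch of the synchronous mode. 
The generated class and message names (TikaGrpc, FetchAndParseRequest, 
FetchAndParseReply) and the port are assumptions standing in for whatever the 
service contract ends up defining, not the final API:

    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;

    public class FetchAndParseExample {
        public static void main(String[] args) throws Exception {
            ManagedChannel channel = ManagedChannelBuilder
                    .forAddress("localhost", 50051) // hypothetical port
                    .usePlaintext()
                    .build();
            // Blocking stub: one fetch request in, one FetchAndParse reply out.
            // TikaGrpc, FetchAndParseRequest and FetchAndParseReply are assumed
            // names for the classes protoc would generate from the contract.
            TikaGrpc.TikaBlockingStub tika = TikaGrpc.newBlockingStub(channel);
            FetchAndParseReply reply = tika.fetchAndParse(
                    FetchAndParseRequest.newBuilder()
                            .setFetcherId("my-fetcher") // a previously created fetcher
                            .setFetchKey("some-file.pdf")
                            .build());
            System.out.println(reply);
            channel.shutdown();
        }
    }

The streaming-output and bi-directional modes would use the async stub with 
StreamObservers instead of the blocking stub.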



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4247) HttpFetcher - add ability to send request headers

2024-06-25 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859985#comment-17859985
 ] 

Nicholas DiPiazza commented on TIKA-4247:
-

I will be merging this today. Any follow-ups or issues, let me know. 

> HttpFetcher - add ability to send request headers
> -
>
> Key: TIKA-4247
> URL: https://issues.apache.org/jira/browse/TIKA-4247
> Project: Tika
>  Issue Type: New Feature
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> add ability to send request headers
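
The Tika-side configuration is not shown here, but the underlying behavior is 
just attaching the configured headers to the outbound HTTP request. A minimal 
sketch of that idea with the JDK's own HTTP client (the header values and URL 
are made up, and this is not HttpFetcher's actual internals):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RequestHeaderExample {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // Each configured header is attached to the outbound fetch request.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/doc.pdf"))
                    .header("Authorization", "Bearer xyz123")
                    .header("X-Custom-Header", "some-value")
                    .build();
            HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());
            System.out.println(response.statusCode());
        }
    }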



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4237) Add JWT authentication ability to the http fetcher

2024-06-25 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859984#comment-17859984
 ] 

Nicholas DiPiazza commented on TIKA-4237:
-

I will be merging this shortly. Any issues, let me know.

> Add JWT authentication ability to the http fetcher
> --
>
> Key: TIKA-4237
> URL: https://issues.apache.org/jira/browse/TIKA-4237
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 3.0.0-BETA
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Add the ability to supply a JWT.
> Support both HS256 and RS256.
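
As an aside on the two algorithms: HS256 signs with a shared symmetric secret, 
while RS256 signs with an RSA private key and is verified with the public key. 
A minimal sketch using the jjwt library's 0.11.x API (the subject and key 
material are placeholders, and this is not the HttpFetcher configuration 
itself):

    import java.security.KeyPair;
    import javax.crypto.SecretKey;
    import io.jsonwebtoken.Jwts;
    import io.jsonwebtoken.SignatureAlgorithm;
    import io.jsonwebtoken.security.Keys;

    public class JwtExample {
        public static void main(String[] args) {
            // HS256: both sides share one symmetric secret (>= 256 bits).
            SecretKey secret = Keys.hmacShaKeyFor(
                    "0123456789012345678901234567890123456789".getBytes());
            String hs256 = Jwts.builder()
                    .setSubject("tika")
                    .signWith(secret)
                    .compact();

            // RS256: sign with the RSA private key; the receiving side
            // verifies against the matching public key.
            KeyPair pair = Keys.keyPairFor(SignatureAlgorithm.RS256);
            String rs256 = Jwts.builder()
                    .setSubject("tika")
                    .signWith(pair.getPrivate())
                    .compact();

            System.out.println(hs256);
            System.out.println(rs256);
        }
    }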



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4229) add microsoft graph fetcher

2024-06-25 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859980#comment-17859980
 ] 

Nicholas DiPiazza commented on TIKA-4229:
-

Will be merging this shortly. If anyone would like any changes, let me know and 
I'll work the changes in over the coming week or two.

> add microsoft graph fetcher
> ---
>
> Key: TIKA-4229
> URL: https://issues.apache.org/jira/browse/TIKA-4229
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> add a tika pipes fetcher capable of fetching files from the MS Graph API



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4181) Tika Grpc Server using Tika Pipes

2024-06-24 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4181:

Summary: Tika Grpc Server using Tika Pipes  (was: Grpc + Tika Pipes)

> Tika Grpc Server using Tika Pipes
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Create a Tika Grpc server.
> You should be able to create Tika Pipes fetchers, then use those fetchers. 
> You can then use those fetchers to FetchAndParse in 3 ways:
>  * synchronous fashion - you send a single request to fetch a file, and get a 
> single FetchAndParse response tuple.
>  * streaming output - you send a single request and stream back the 
> FetchAndParse response tuple.
>  * bi-directional streaming - You stream in 1 or more Fetch requests and 
> stream back FetchAndParse response tuples.
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4181) Grpc + Tika Pipes

2024-06-24 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4181:

Description: 
Create a Tika Grpc server.

You should be able to create Tika Pipes fetchers, then use those fetchers. 

You can then use those fetchers to FetchAndParse in 3 ways:
 * synchronous fashion - you send a single request to fetch a file, and get a 
single FetchAndParse response tuple.
 * streaming output - you send a single request and stream back the 
FetchAndParse response tuple.
 * bi-directional streaming - You stream in 1 or more Fetch requests and stream 
back FetchAndParse response tuples.

Requires we create a service contract that specifies the inputs we require from 
each method.

Then we will need to implement the different components with a grpc client 
generated using the contract.

This would enable developers to run tika-pipes as a persistently running daemon 
instead of just a single batch app, because it can continue to stream out more 
inputs.

!image-2024-02-06-07-54-50-116.png!

  was:
Add full tika-pipes support of grpc
 * pipe iterator
 * fetcher
 * emitter

Requires we create a service contract that specifies the inputs we require from 
each method.

Then we will need to implement the different components with a grpc client 
generated using the contract.

This would enable developers to run tika-pipes as a persistently running daemon 
instead of just a single batch app, because it can continue to stream out more 
inputs.

!image-2024-02-06-07-54-50-116.png!


> Grpc + Tika Pipes
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Create a Tika Grpc server.
> You should be able to create Tika Pipes fetchers, then use those fetchers. 
> You can then use those fetchers to FetchAndParse in 3 ways:
>  * synchronous fashion - you send a single request to fetch a file, and get a 
> single FetchAndParse response tuple.
>  * streaming output - you send a single request and stream back the 
> FetchAndParse response tuple.
>  * bi-directional streaming - You stream in 1 or more Fetch requests and 
> stream back FetchAndParse response tuples.
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4181) Grpc + Tika Pipes

2024-06-24 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4181:

Summary: Grpc + Tika Pipes  (was: Grpc + Tika Pipes - pipe iterator and 
emitter)

> Grpc + Tika Pipes
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Add full tika-pipes support of grpc
>  * pipe iterator
>  * fetcher
>  * emitter
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-24 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859757#comment-17859757
 ] 

Nicholas DiPiazza edited comment on TIKA-4251 at 6/24/24 6:35 PM:
--

we could keep everything how it is, but:
 * provide instructions for how to run the code formatter on the entire repo 
with google checkstyle.
 * run it on the entire codebase and commit the now-fully-formatted repo.
 * advise everyone to turn on automatic code formatting in IntelliJ/Eclipse so 
that your code is always formatted automatically.

Now the plugin doesn't control us so much, but we still have an easy way to 
stay fully formatted, so we stop getting the back-and-forth with maven and CI 
when we forget to format something.

 


was (Author: ndipiazza):
we could keep everything how it is, but:
 * provide instructions for how to run the code formatter manually.
 * run it on the entire codebase and commit the now-fully-formatted repo.
 * advise everyone to turn on automatic code formatting in IntelliJ/Eclipse so 
that your code is always formatted automatically.

Now the plugin doesn't control us so much, but we still have an easy way to 
stay fully formatted, so we stop getting the back-and-forth with maven and CI 
when we forget to format something.

 

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-24 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859757#comment-17859757
 ] 

Nicholas DiPiazza commented on TIKA-4251:
-

we could keep everything how it is, but:
 * provide instructions for how to run the code formatter manually.
 * run it on the entire codebase and commit the now-fully-formatted repo.
 * advise everyone to turn on automatic code formatting in IntelliJ/Eclipse so 
that your code is always formatted automatically.

Now the plugin doesn't control us so much, but we still have an easy way to 
stay fully formatted, so we stop getting the back-and-forth with maven and CI 
when we forget to format something.

 

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-06-06 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852895#comment-17852895
 ] 

Nicholas DiPiazza commented on TIKA-4243:
-

New ticket. Let's close this out.

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.
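
A minimal sketch of what the end state could look like with Jackson: one 
generated POJO, three interchangeable on-disk formats. The TikaPipesConfig 
fields below are hypothetical, standing in for whatever jsonschema2pojo 
generates from the real schema:

    import java.io.File;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.dataformat.xml.XmlMapper;
    import com.fasterxml.jackson.dataformat.yaml.YAMLMapper;

    public class ConfigLoadExample {
        // Hypothetical POJO; in practice jsonschema2pojo would generate this.
        public static class TikaPipesConfig {
            public int numClients;
            public String pluginsDir;
        }

        public static void main(String[] args) throws Exception {
            // Same typed model, three formats: pick the mapper by file extension.
            TikaPipesConfig fromXml = new XmlMapper()
                    .readValue(new File("tika-config.xml"), TikaPipesConfig.class);
            TikaPipesConfig fromJson = new ObjectMapper()
                    .readValue(new File("tika-config.json"), TikaPipesConfig.class);
            TikaPipesConfig fromYaml = new YAMLMapper()
                    .readValue(new File("tika-config.yaml"), TikaPipesConfig.class);
            System.out.println(fromJson.numClients);
        }
    }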



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-4243) tika configuration overhaul

2024-06-06 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza resolved TIKA-4243.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4264) Tika Pipes - Structured output (XHTML) support?

2024-05-28 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-4264:
---

 Summary: Tika Pipes - Structured output (XHTML) support?
 Key: TIKA-4264
 URL: https://issues.apache.org/jira/browse/TIKA-4264
 Project: Tika
  Issue Type: Bug
  Components: tika-pipes
Reporter: Nicholas DiPiazza


So I am able to use Tika Pipes to extract the text content from a document.

But is it possible to use Tika Pipes to obtain structured documents? I believe 
Tika does this in XHTML.

The plain text extracted from the document is great for indexing into a search 
engine. 

But what if you want structured text output like XHTML?
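
For reference, core Tika does expose the structured XHTML through its SAX 
handlers; the open question here is only how tika-pipes should surface it. A 
minimal sketch with plain Tika (the file name is a placeholder):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.ToXMLContentHandler;

    public class XhtmlExample {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            // ToXMLContentHandler keeps the XHTML structure instead of
            // flattening everything to plain text.
            ToXMLContentHandler handler = new ToXMLContentHandler();
            Metadata metadata = new Metadata();
            try (InputStream stream = Files.newInputStream(Paths.get("some-file.pdf"))) {
                parser.parse(stream, handler, metadata, new ParseContext());
            }
            System.out.println(handler.toString()); // <html ...><body><p>...
        }
    }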



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (TIKA-4262) In pipes XML config, List<String> serializes incorrectly causing the parameters to be empty when read

2024-05-26 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza closed TIKA-4262.
---
  Assignee: Nicholas DiPiazza
Resolution: Invalid

never mind - this was an issue in my branch reproducing in a crazy way.

> In pipes XML config, List<String> serializes incorrectly causing the 
> parameters to be empty when read
> ---
>
> Key: TIKA-4262
> URL: https://issues.apache.org/jira/browse/TIKA-4262
> Project: Tika
>  Issue Type: Bug
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Assignee: Nicholas DiPiazza
>Priority: Major
>
> tika configuration, when saving a fetcher with a list of strings, will look 
> like this:
>       []
>       [Authorization: xyz123]
> This is an invalid format. It's expecting them to be:
>       
>       
>   Authorization: xyz123
>   
> So the effect of this is that all List<String> configs in fetchers are 
> completely ignored after being saved/re-read.
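
To make the failure mode concrete (the element names below are hypothetical, 
since the archive stripped the real tags): a list-of-strings field should 
serialize to one child element per entry, not to the list's toString form. A 
sketch of that with Jackson's XmlMapper:

    import java.util.List;
    import com.fasterxml.jackson.dataformat.xml.XmlMapper;
    import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlElementWrapper;
    import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlProperty;

    public class ListConfigExample {
        // Hypothetical fetcher config with a list-of-strings parameter.
        public static class HttpFetcherConfig {
            @JacksonXmlElementWrapper(localName = "httpRequestHeaders")
            @JacksonXmlProperty(localName = "header")
            public List<String> httpRequestHeaders;
        }

        public static void main(String[] args) throws Exception {
            HttpFetcherConfig config = new HttpFetcherConfig();
            config.httpRequestHeaders = List.of("Authorization: xyz123");
            // One <header> element per entry -- what the reader expects --
            // rather than the single "[Authorization: xyz123]" toString form.
            System.out.println(new XmlMapper().writeValueAsString(config));
        }
    }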



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4262) In pipes XML config, List<String> serializes incorrectly causing the parameters to be empty when read

2024-05-26 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4262:

Description: 
tika configuration, when saving a fetcher with a list of strings, will look like 
this:

      []
      [Authorization: xyz123]

This is an invalid format. It's expecting them to be:

      
      
  Authorization: xyz123
  

So the effect of this is that all List<String> configs in fetchers are 
completely ignored after being saved/re-read.

  was:
tika configuration when saving a fetcher with a list of strings will look like 
this:

      []
      [Authorization: xyz123]

This is an invalid format. It's expecting them to be:

      
      
  Authorization: xyz123
  

 


> In pipes XML config, List<String> serializes incorrectly causing the 
> parameters to be empty when read
> ---
>
> Key: TIKA-4262
> URL: https://issues.apache.org/jira/browse/TIKA-4262
> Project: Tika
>  Issue Type: Bug
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> tika configuration, when saving a fetcher with a list of strings, will look 
> like this:
>       []
>       [Authorization: xyz123]
> This is an invalid format. It's expecting them to be:
>       
>       
>   Authorization: xyz123
>   
> So the effect of this is that all List<String> configs in fetchers are 
> completely ignored after being saved/re-read.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4262) In pipes XML config, List<String> serializes incorrectly causing the parameters to be empty when read

2024-05-26 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-4262:
---

 Summary: In pipes XML config, List<String> serializes incorrectly 
causing the parameters to be empty when read
 Key: TIKA-4262
 URL: https://issues.apache.org/jira/browse/TIKA-4262
 Project: Tika
  Issue Type: Bug
  Components: tika-pipes
Reporter: Nicholas DiPiazza


tika configuration, when saving a fetcher with a list of strings, will look like 
this:

      []
      [Authorization: xyz123]

This is an invalid format. It's expecting them to be:

      
      
  Authorization: xyz123
  

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-23 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848960#comment-17848960
 ] 

Nicholas DiPiazza commented on TIKA-4243:
-

Sure, that sounds good. When we chat later today/tomorrow, let's discuss a 
high-level plan here. I'll take my first stab at this Friday night.

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845083#comment-17845083
 ] 

Nicholas DiPiazza commented on TIKA-4252:
-

Even better.

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
>     PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(
>             request.getFetchKey(),
>             new FetchKey(fetcher.getName(), request.getFetchKey()),
>             new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>             FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>     UnsynchronizedByteArrayOutputStream bos =
>             UnsynchronizedByteArrayOutputStream.builder().get();
>     try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>         objectOutputStream.writeObject(t);
>     }
>     byte[] bytes = bos.toByteArray();
>     output.write(CALL.getByte());
>     output.writeInt(bytes.length);
>     output.write(bytes);
>     output.flush();
>  
> I verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.
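
A minimal sketch of the fix described in the UPDATE (the accessor names are 
assumed from FetchEmitTuple): fall back to a fresh Metadata only when the tuple 
did not carry one, so the caller's per-fetch metadata actually reaches the 
fetcher.

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.pipes.FetchEmitTuple;

    public class ParseFromTupleFix {
        // Only substitute an empty Metadata when the tuple's metadata is null;
        // otherwise the metadata sent by PipesClient#process is silently lost.
        static Metadata metadataFor(FetchEmitTuple tuple) {
            return tuple.getMetadata() == null ? new Metadata() : tuple.getMetadata();
        }
    }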



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845080#comment-17845080
 ] 

Nicholas DiPiazza commented on TIKA-4252:
-

Maybe

 

fetchInputMetadata

outputMetadata

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
>     PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(
>             request.getFetchKey(),
>             new FetchKey(fetcher.getName(), request.getFetchKey()),
>             new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>             FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>     UnsynchronizedByteArrayOutputStream bos =
>             UnsynchronizedByteArrayOutputStream.builder().get();
>     try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>         objectOutputStream.writeObject(t);
>     }
>     byte[] bytes = bos.toByteArray();
>     output.write(CALL.getByte());
>     output.writeInt(bytes.length);
>     output.write(bytes);
>     output.flush();
>  
> I verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845071#comment-17845071
 ] 

Nicholas DiPiazza commented on TIKA-4252:
-

Sure, I can do that.

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
>     PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(
>             request.getFetchKey(),
>             new FetchKey(fetcher.getName(), request.getFetchKey()),
>             new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>             FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>     UnsynchronizedByteArrayOutputStream bos =
>             UnsynchronizedByteArrayOutputStream.builder().get();
>     try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>         objectOutputStream.writeObject(t);
>     }
>     byte[] bytes = bos.toByteArray();
>     output.write(CALL.getByte());
>     output.writeInt(bytes.length);
>     output.write(bytes);
>     output.flush();
>  
> I verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845071#comment-17845071
 ] 

Nicholas DiPiazza edited comment on TIKA-4252 at 5/9/24 5:08 PM:
-

Sure, I can do that. If you have a moment, please do; otherwise I will get to 
it later this week or next week.


was (Author: ndipiazza):
sure I can do that.

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
>     PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(
>             request.getFetchKey(),
>             new FetchKey(fetcher.getName(), request.getFetchKey()),
>             new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>             FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>     UnsynchronizedByteArrayOutputStream bos =
>             UnsynchronizedByteArrayOutputStream.builder().get();
>     try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>         objectOutputStream.writeObject(t);
>     }
>     byte[] bytes = bos.toByteArray();
>     output.write(CALL.getByte());
>     output.writeInt(bytes.length);
>     output.write(bytes);
>     output.flush();
>  
> I verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845061#comment-17845061
 ] 

Nicholas DiPiazza edited comment on TIKA-4252 at 5/9/24 4:50 PM:
-

What I need is to be able to send "Fetch Metadata" such as a bearer token to a 
single request

per-fetch-request variable.


was (Author: ndipiazza):
What I need is to be able to send "Fetch Metadata" such as a bearer token to a 
single request 

per-fetch-request varaible

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
>     PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(
>             request.getFetchKey(),
>             new FetchKey(fetcher.getName(), request.getFetchKey()),
>             new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>             FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>     UnsynchronizedByteArrayOutputStream bos =
>             UnsynchronizedByteArrayOutputStream.builder().get();
>     try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>         objectOutputStream.writeObject(t);
>     }
>     byte[] bytes = bos.toByteArray();
>     output.write(CALL.getByte());
>     output.writeInt(bytes.length);
>     output.write(bytes);
>     output.flush();
>  
> I verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845061#comment-17845061
 ] 

Nicholas DiPiazza edited comment on TIKA-4252 at 5/9/24 4:50 PM:
-

What I need is to be able to send "Fetch Metadata", such as a bearer token, to 
a single fetch() request: a per-fetch-request variable.


was (Author: ndipiazza):
What I need is to be able to send "Fetch Metadata" such as a bearer token to a 
single request

per-fetch-request variable.

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is always using a new Metadata; it should use the tuple's metadata and only 
> fall back to an empty Metadata when the fetch tuple's metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845061#comment-17845061
 ] 

Nicholas DiPiazza commented on TIKA-4252:
-

What I need is to be able to send "Fetch Metadata", such as a bearer token, to 
a single request: a per-fetch-request variable.

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is always using a new Metadata; it should use the tuple's metadata and only 
> fall back to an empty Metadata when the fetch tuple's metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza closed TIKA-4252.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is always using a new Metadata; it should use the tuple's metadata and only 
> fall back to an empty Metadata when the fetch tuple's metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845010#comment-17845010
 ] 

Nicholas DiPiazza commented on TIKA-4252:
-

done

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is always using a new Metadata; it should use the tuple's metadata and only 
> fall back to an empty Metadata when the fetch tuple's metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4252:

Description: 
when calling:

PipesResult pipesResult = pipesClient.process(new 
FetchEmitTuple(request.getFetchKey(),
                    new FetchKey(fetcher.getName(), request.getFetchKey()), new 
EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));

the tikaMetadata is not present in the fetch data when the fetch method is 
called.

 

It's OK through this part: 
            UnsynchronizedByteArrayOutputStream bos = 
UnsynchronizedByteArrayOutputStream.builder().get();
            try (ObjectOutputStream objectOutputStream = new 
ObjectOutputStream(bos)) {
                objectOutputStream.writeObject(t);
            }

            byte[] bytes = bos.toByteArray();
            output.write(CALL.getByte());
            output.writeInt(bytes.length);
            output.write(bytes);
            output.flush();

 

i verified the bytes have the expected metadata from that point.

 

UPDATE: found issue

 

org.apache.tika.pipes.PipesServer#parseFromTuple

 

is always using a new Metadata; it should use the tuple's metadata and only 
fall back to an empty Metadata when the fetch tuple's metadata is null.

  was:
when calling:

PipesResult pipesResult = pipesClient.process(new 
FetchEmitTuple(request.getFetchKey(),
                    new FetchKey(fetcher.getName(), request.getFetchKey()), new 
EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));

the tikaMetadata is not present in the fetch data when the fetch method is 
called.

 

It's OK through this part: 
            UnsynchronizedByteArrayOutputStream bos = 
UnsynchronizedByteArrayOutputStream.builder().get();
            try (ObjectOutputStream objectOutputStream = new 
ObjectOutputStream(bos)) {
                objectOutputStream.writeObject(t);
            }

            byte[] bytes = bos.toByteArray();
            output.write(CALL.getByte());
            output.writeInt(bytes.length);
            output.write(bytes);
            output.flush();

 

i verified the bytes have the expected metadata from that point.


> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is always using a new Metadata; it should use the tuple's metadata and only 
> fall back to an empty Metadata when the fetch tuple's metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4252:

Description: 
when calling:

PipesResult pipesResult = pipesClient.process(new 
FetchEmitTuple(request.getFetchKey(),
                    new FetchKey(fetcher.getName(), request.getFetchKey()), new 
EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));

the tikaMetadata is not present in the fetch data when the fetch method is 
called.

 

It's OK through this part: 
            UnsynchronizedByteArrayOutputStream bos = 
UnsynchronizedByteArrayOutputStream.builder().get();
            try (ObjectOutputStream objectOutputStream = new 
ObjectOutputStream(bos)) {
                objectOutputStream.writeObject(t);
            }

            byte[] bytes = bos.toByteArray();
            output.write(CALL.getByte());
            output.writeInt(bytes.length);
            output.write(bytes);
            output.flush();

 

i verified the bytes have the expected metadata from that point.

  was:
when calling:

PipesResult pipesResult = pipesClient.process(new 
FetchEmitTuple(request.getFetchKey(),
                    new FetchKey(fetcher.getName(), request.getFetchKey()), new 
EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));

the tikaMetadata is not present in the fetch data when the fetch method is 
called.


> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-4252:
---

 Summary: PipesClient#process - seems to lose the Fetch input 
metadata?
 Key: TIKA-4252
 URL: https://issues.apache.org/jira/browse/TIKA-4252
 Project: Tika
  Issue Type: Bug
Reporter: Nicholas DiPiazza


when calling:

PipesResult pipesResult = pipesClient.process(new 
FetchEmitTuple(request.getFetchKey(),
                    new FetchKey(fetcher.getName(), request.getFetchKey()), new 
EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));

the tikaMetadata is not present in the fetch data when the fetch method is 
called.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-01 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842622#comment-17842622
 ] 

Nicholas DiPiazza commented on TIKA-4243:
-

Kinda seems like it might belong in tika-config module 

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want to.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-05-01 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842622#comment-17842622
 ] 

Nicholas DiPiazza edited comment on TIKA-4243 at 5/1/24 12:34 PM:
--

Kinda seems like it might belong in a new tika-config module.


was (Author: ndipiazza):
Kinda seems like it might belong in tika-config module 

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want to.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-04-29 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842158#comment-17842158
 ] 

Nicholas DiPiazza edited comment on TIKA-4243 at 4/29/24 8:56 PM:
--

This seems like a major feature, so I would recommend having it go with the 
Tika 3.0.0 release.

That makes sense as long as the Tika 2.x line stays compatible.


was (Author: ndipiazza):
this seems like a major feature thing so i would recommend with tika 3.x

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want to.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-04-29 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842158#comment-17842158
 ] 

Nicholas DiPiazza commented on TIKA-4243:
-

This seems like a major feature, so I would recommend going with Tika 3.x.

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want to.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-04-29 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842157#comment-17842157
 ] 

Nicholas DiPiazza commented on TIKA-4243:
-

[https://github.com/joelittlejohn/jsonschema2pojo] makes it so we can just 
author .json schema files in *src/main/jsonschema* and it will automatically 
create Java files that are part of the classpath.

It cuts down on unnecessary plumbing code by removing the need to maintain 
both a JSON Schema file and a Pojo by hand.

So we get the benefits of JSON Schema validation and automatically generated 
pojos.
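
As a concrete illustration of the XML-mapper idea from the issue description - 
reading a legacy tika-config.xml into the generated POJO model - here is a 
hedged sketch, where TikaConfigPojo stands in for a hypothetical 
jsonschema2pojo-generated class:

import java.io.File;

import com.fasterxml.jackson.dataformat.xml.XmlMapper;

public class LegacyConfigLoader {

    // Binds legacy XML onto the same POJO model the JSON/YAML paths would use.
    public static TikaConfigPojo load(File tikaConfigXml) throws Exception {
        XmlMapper xmlMapper = new XmlMapper();
        return xmlMapper.readValue(tikaConfigXml, TikaConfigPojo.class);
    }
}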

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want to.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4247) HttpFetcher - add ability to send request headers

2024-04-29 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-4247:
---

 Summary: HttpFetcher - add ability to send request headers
 Key: TIKA-4247
 URL: https://issues.apache.org/jira/browse/TIKA-4247
 Project: Tika
  Issue Type: New Feature
Reporter: Nicholas DiPiazza


add ability to send request headers



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4243) tika configuration overhaul

2024-04-24 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-4243:
---

 Summary: tika configuration overhaul
 Key: TIKA-4243
 URL: https://issues.apache.org/jira/browse/TIKA-4243
 Project: Tika
  Issue Type: New Feature
  Components: config
Affects Versions: 3.0.0
Reporter: Nicholas DiPiazza


In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
Configuration schema. 

In 3.x can we remove the old way of doing configs and replace with Json Schema?

Json Schema can be converted to Pojos using a maven plugin 
[https://github.com/joelittlejohn/jsonschema2pojo]

This automatically creates a Java Pojo model we can use for the configs. 

This can allow for the legacy tika-config XML to be read and converted to the 
new pojos easily using an XML mapper so that users don't have to use JSON 
configurations yet if they do not want to.

When complete, configurations can be set as XML, JSON or YAML

tika-config.xml

tika-config.json

tika-config.yaml

Replace all instances of tika config annotations that used the old syntax, and 
replace with the Pojo model serialized from the xml/json/yaml.

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4243) tika configuration overhaul

2024-04-24 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4243:

Description: 
In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
Configuration schema. 

In 3.x can we remove the old way of doing configs and replace with Json Schema?

Json Schema can be converted to Pojos using a maven plugin 
[https://github.com/joelittlejohn/jsonschema2pojo]

This automatically creates a Java Pojo model we can use for the configs. 

This can allow for the legacy tika-config XML to be read and converted to the 
new pojos easily using an XML mapper so that users don't have to use JSON 
configurations yet if they do not want to.

When complete, configurations can be set as XML, JSON or YAML

tika-config.xml

tika-config.json

tika-config.yaml

Replace all instances of tika config annotations that used the old syntax, and 
replace with the Pojo model serialized from the xml/json/yaml.

  was:
In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
Configuration schema. 

In 3.x can we remove the old way of doing configs and replace with Json Schema?

Json Schema can be converted to Pojos using a maven plugin 
[https://github.com/joelittlejohn/jsonschema2pojo]

This automatically creates a Java Pojo model we can use for the configs. 

This can allow for the legacy tika-config XML to be read and converted to the 
new pojos easily using an XML mapper so that users don't have to use JSON 
configurations yet if they do not want to.

When complete, configurations can be set as XML, JSON or YAML

tika-config.xml

tika-config.json

tika-config.yaml

Replace all instances of tika config annotations that used the old syntax, and 
replace with the Pojo model serialized from the xml/json/yaml.

 

 

 


> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want to.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4237) Add JWT authentication ability to the http fetcher

2024-04-05 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-4237:
---

 Summary: Add JWT authentication ability to the http fetcher
 Key: TIKA-4237
 URL: https://issues.apache.org/jira/browse/TIKA-4237
 Project: Tika
  Issue Type: New Feature
  Components: tika-pipes
Affects Versions: 3.0.0-BETA
Reporter: Nicholas DiPiazza


Add the ability to supply a JWT.

Support both HS256 and RS256; a hedged sketch of generating each token type 
follows below.
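
The sketch below uses the auth0 java-jwt library purely for illustration; it 
is not necessarily the library or configuration surface the fetcher will end 
up exposing:

import java.security.interfaces.RSAPrivateKey;

import com.auth0.jwt.JWT;
import com.auth0.jwt.algorithms.Algorithm;

public class JwtSketch {

    // HS256: symmetric - the fetcher and the server share a single secret.
    static String hs256(String sharedSecret, String subject) {
        return JWT.create()
                .withSubject(subject)
                .sign(Algorithm.HMAC256(sharedSecret));
    }

    // RS256: asymmetric - sign with the private key; the server verifies with
    // the matching public key (null is allowed here when only signing).
    static String rs256(RSAPrivateKey privateKey, String subject) {
        return JWT.create()
                .withSubject(subject)
                .sign(Algorithm.RSA256(null, privateKey));
    }
}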



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4229) add microsoft graph fetcher

2024-03-28 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-4229:
---

 Summary: add microsoft graph fetcher
 Key: TIKA-4229
 URL: https://issues.apache.org/jira/browse/TIKA-4229
 Project: Tika
  Issue Type: New Feature
  Components: tika-pipes
Reporter: Nicholas DiPiazza


add a tika pipes fetcher capable of fetching files from MS graph api



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-02-06 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4181:

Attachment: image-2024-02-06-07-54-50-116.png

> Grpc + Tika Pipes - pipe iterator and emitter
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Add full tika-pipes support of grpc
>  * pipe iterator
>  * fetcher
>  * emitter
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-02-06 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4181:

Description: 
Add full tika-pipes support of grpc
 * pipe iterator
 * fetcher
 * emitter

Requires we create a service contract that specifies the inputs we require from 
each method.

Then we will need to implement the different components with a grpc client 
generated using the contract.

This would enable developers to run tika-pipes as a persistently running daemon 
instead of just a single batch app, because it can continue to stream out more 
inputs.

!image-2024-02-06-07-54-50-116.png!

  was:
Add full tika-pipes support of grpc
 * pipe iterator
 * fetcher
 * emitter

Requires we create a service contract that specifies the inputs we require from 
each method.

Then we will need to implement the different components with a grpc client 
generated using the contract.

This would enable developers to run tika-pipes as a persistently running daemon 
instead of just a single batch app, because it can continue to stream out more 
inputs.

 


> Grpc + Tika Pipes - pipe iterator and emitter
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Add full tika-pipes support of grpc
>  * pipe iterator
>  * fetcher
>  * emitter
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-01-11 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805762#comment-17805762
 ] 

Nicholas DiPiazza edited comment on TIKA-4181 at 1/11/24 6:25 PM:
--

Tika pipes could get a full-fledged service - call it tika-server-http2 - to 
accompany tika-server and maybe one day replace it?

I'm not sure of the best way to handle packaging the app, but we could create 
a secondary main method for running tika-pipes as a grpc service.

Then we would create a protobuf contract for each of the new services:
 * pipe crud operations - create, update, delete, read, list, etc.
 * run a pipe job - takes bidirectional streams of data - incoming=fetch 
metadata objects, outgoing=emitDocuments
 ** this will use a configured fetcher

So you would then provide Go and Java examples, generated from our protobuf 
schema, that people could take and use; a rough sketch follows below.
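
As a rough Java illustration of the bidirectional-streaming shape (the service 
name TikaPipesGrpc, the runPipeJob method, and the FetchRequest/EmitDocument 
messages are all assumed stand-ins for whatever protoc would generate from the 
eventual contract):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.stub.StreamObserver;

public class PipeJobClient {
    public static void main(String[] args) {
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("localhost", 50051)  // assumed host/port
                .usePlaintext()
                .build();
        TikaPipesGrpc.TikaPipesStub stub = TikaPipesGrpc.newStub(channel);

        // Outgoing stream of fetch requests; incoming stream of emitted docs.
        StreamObserver<FetchRequest> requests =
                stub.runPipeJob(new StreamObserver<EmitDocument>() {
                    @Override public void onNext(EmitDocument doc) {
                        System.out.println("emitted: " + doc);
                    }
                    @Override public void onError(Throwable t) {
                        t.printStackTrace();
                    }
                    @Override public void onCompleted() {
                        channel.shutdown();
                    }
                });
        requests.onNext(FetchRequest.newBuilder().setFetchKey("doc-1").build());
        requests.onCompleted();
    }
}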

 

 


was (Author: ndipiazza):
Tika pipes could get a full fledged service that could be tika-server-http2 to 
accompany tika-server and maybe one day replace it? 

Not sure the best way to handle packaging the app, but we could create a 
secondary main method for running the tika-pipes as a grpc service.

Then we would create a protobuf contract for each of the new services that we 
do:
 * pipe crud operations - create, update, delete, read, list, etc
 * run a pipe job - takes bidirectional streams of data - incoming=fetch 
metadata objects, outgoing=emitDocuments

So you would then provide a Go example and Java example generated from our 
protobuf schema.  that people could take and use

 

 

> Grpc + Tika Pipes - pipe iterator and emitter
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Add full tika-pipes support of grpc
>  * pipe iterator
>  * fetcher
>  * emitter
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-01-11 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805762#comment-17805762
 ] 

Nicholas DiPiazza commented on TIKA-4181:
-

Tika pipes could get a full-fledged service - call it tika-server-http2 - to 
accompany tika-server and maybe one day replace it?

I'm not sure of the best way to handle packaging the app, but we could create 
a secondary main method for running tika-pipes as a grpc service.

Then we would create a protobuf contract for each of the new services:
 * pipe crud operations - create, update, delete, read, list, etc.
 * run a pipe job - takes bidirectional streams of data - incoming=fetch 
metadata objects, outgoing=emitDocuments

So you would then provide Go and Java examples, generated from our protobuf 
schema, that people could take and use.

 

 

> Grpc + Tika Pipes - pipe iterator and emitter
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Add full tika-pipes support of grpc
>  * pipe iterator
>  * fetcher
>  * emitter
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-01-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4181:

Description: 
Add full tika-pipes support of grpc
 * pipe iterator
 * fetcher
 * emitter

Requires we create a service contract that specifies the inputs we require from 
each method.

Then we will need to implement the different components with a grpc client 
generated using the contract.

This would enable developers to run tika-pipes as a persistently running daemon 
instead of just a single batch app, because it can continue to stream out more 
inputs.

 

  was:
Add full tika-pipes support of grpc
 * pipe iterator
 * fetcher
 * emitter

Requires we create a service contract that specifies the inputs we require from 
each method.

Then we will need to implement the different components with a grpc client 
generated using the contract.

 


> Grpc + Tika Pipes - pipe iterator and emitter
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Add full tika-pipes support of grpc
>  * pipe iterator
>  * fetcher
>  * emitter
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-01-11 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-4181:
---

 Summary: Grpc + Tika Pipes - pipe iterator and emitter
 Key: TIKA-4181
 URL: https://issues.apache.org/jira/browse/TIKA-4181
 Project: Tika
  Issue Type: New Feature
  Components: tika-pipes
Reporter: Nicholas DiPiazza


Add full tika-pipes support of grpc
 * pipe iterator
 * fetcher
 * emitter

Requires we create a service contract that specifies the inputs we require from 
each method.

Then we will need to implement the different components with a grpc client 
generated using the contract.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-3979) OneNoteParser - Improve performance for deserialization

2023-02-25 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3979:

Attachment: image-2023-02-25-12-01-40-311.png

> OneNoteParser - Improve performance for deserialization
> ---
>
> Key: TIKA-3979
> URL: https://issues.apache.org/jira/browse/TIKA-3979
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.7.0
>Reporter: David Xie
>Priority: Major
> Attachments: image-2023-02-20-14-42-10-590.png, 
> image-2023-02-25-12-01-40-311.png
>
>
> We noticed some performance issues specific to parsing OneNote files. Our cpu 
> profiler reports that the parser spends a lot of time on deserializing byte 
> arrays (image included below)
> !image-2023-02-20-14-42-10-590.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3979) OneNoteParser - Improve performance for deserialization

2023-02-25 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693512#comment-17693512
 ] 

Nicholas DiPiazza commented on TIKA-3979:
-

The old and new outputs appear to be binary-equivalent, so we are good here 
and I merged it.

!image-2023-02-25-12-01-40-311.png!

> OneNoteParser - Improve performance for deserialization
> ---
>
> Key: TIKA-3979
> URL: https://issues.apache.org/jira/browse/TIKA-3979
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.7.0
>Reporter: David Xie
>Priority: Major
> Attachments: image-2023-02-20-14-42-10-590.png, 
> image-2023-02-25-12-01-40-311.png
>
>
> We noticed some performance issues specific to parsing OneNote files. Our cpu 
> profiler reports that the parser spends a lot of time on deserializing byte 
> arrays (image included below)
> !image-2023-02-20-14-42-10-590.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3970) Certain OneNote documents produce duplicate text

2023-02-23 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17692989#comment-17692989
 ] 

Nicholas DiPiazza commented on TIKA-3970:
-

So on a Windows PC I log into 

[https://account.microsoft.com/services/microsoft365/details#install]

then click where it says Install Office.

Eventually you should have a copy of Office installed on your machine. Then you 
should be able to open all of these files:

 

tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote1.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote2.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote3.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote4.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote2007OrEarlier.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote2016.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNoteEmbeddedWordDoc.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNoteFromOffice365.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNoteFromOffice365-2.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/test-tika-3970-dupetext.one

> Certain OneNote documents produce duplicate text
> 
>
> Key: TIKA-3970
> URL: https://issues.apache.org/jira/browse/TIKA-3970
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 2.7.0
>Reporter: David Avant
>Priority: Minor
> Attachments: Screenshot 2023-02-21 at 3.43.08 PM.png, 
> lyrics-crawlAllFileNodesFromRoot-false.txt, lyrics.docx, lyrics.one, 
> lyrics.txt
>
>
> Extracting text from certain OneNote documents produces more text than is 
> actually in the document. In this case, the OneNote document was created 
> by opening a Word document and "printing" it to the OneNote.
> To reproduce the issue, open the attached "lyrics.one" using the Tika App 
> version 2.7.0 and view the plain text. Look for the phrase "Sunday 
> Morning" and observe that there are 14 occurrences.    However in the actual 
> displayed text, it occurs only once.  
> The original text in this document is only about 12K characters, but the 
> extracted text from tika is over 300K.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3970) Certain OneNote documents produce duplicate text

2023-02-23 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17692984#comment-17692984
 ] 

Nicholas DiPiazza commented on TIKA-3970:
-

> Should we reverse the iteration order of the pages? I notice that we're 
> getting page2 then page1 in one of our existing tests. So this might be a 
> feature or something we're missing in our implementation? I couldn't find 
> anything in the spec about this. Related: I noticed a "page number" property 
> for each of the page nodes in the attached file. Maybe we could use that info 
> to order the pages when it exists?

Yeah sure! That sounds like a really good idea.

> This would require some walking the tree and caching page order. I'm happy to 
> give it a try.

Yeah! That's what I spent a few hours doing in the PR above. I probably need to 
spend some more time on it; so far I just got this Jira's test case to work.

> Side note: I'm still really frustrated that I can't open a bunch of these 
> files in OneNote even after I set up my Microsoft account and save the files 
> in OneDrive.

Yeah, so there are two types of OneNote files: ones that follow the MS-ONESTORE 
spec, and ones that use the alternative packaging, MS-FSSHTTPD.

If you open a file saved from OneNote in Office 365 (the web app), it will use 
the alternative packaging. 

If you open a file saved from a locally installed Microsoft Office 365 OneNote, 
it will use the MS-ONESTORE spec.

So I think you might need to grab a copy of MS Office: 
[https://support.microsoft.com/en-us/office/use-the-office-offline-installer-f0a85fe7-118f-41cb-a791-d59cef96ad1c]
 - you could then work with this.

> Certain OneNote documents produce duplicate text
> 
>
> Key: TIKA-3970
> URL: https://issues.apache.org/jira/browse/TIKA-3970
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 2.7.0
>Reporter: David Avant
>Priority: Minor
> Attachments: Screenshot 2023-02-21 at 3.43.08 PM.png, 
> lyrics-crawlAllFileNodesFromRoot-false.txt, lyrics.docx, lyrics.one, 
> lyrics.txt
>
>
> Extracting text from certain OneNote documents produces more text than is 
> actually in the document. In this case, the OneNote document was created 
> by opening a Word document and "printing" it to the OneNote.
> To reproduce the issue, open the attached "lyrics.one" using the Tika App 
> version 2.7.0 and view the plain text. Look for the phrase "Sunday 
> Morning" and observe that there are 14 occurrences.    However in the actual 
> displayed text, it occurs only once.  
> The original text in this document is only about 12K characters, but the 
> extracted text from tika is over 300K.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-3881) fix testAttachingADebuggerOnTheForkedParserShouldWork test - do not use hard coded port

2022-10-15 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-3881:
---

 Summary: fix testAttachingADebuggerOnTheForkedParserShouldWork 
test - do not use hard coded port
 Key: TIKA-3881
 URL: https://issues.apache.org/jira/browse/TIKA-3881
 Project: Tika
  Issue Type: Test
  Components: tika-app
Reporter: Nicholas DiPiazza


testAttachingADebuggerOnTheForkedParserShouldWork is using a hard-coded port. 
It should instead look for an available port and use that; a sketch of the 
standard approach follows below.
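
A minimal sketch, assuming the discovered port is then handed to the forked 
JVM's debug arguments (note the small race: the port could be taken between 
closing the socket and reusing it):

import java.net.ServerSocket;

public class FreePortFinder {

    // Binding to port 0 asks the OS to hand back any free ephemeral port.
    public static int findFreePort() throws Exception {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }
}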



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-3879) add test containers test for s3 fetcher, emitter and pipe iterators

2022-10-14 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza resolved TIKA-3879.
-
Resolution: Implemented

> add test containers test for s3 fetcher, emitter and pipe iterators
> ---
>
> Key: TIKA-3879
> URL: https://issues.apache.org/jira/browse/TIKA-3879
> Project: Tika
>  Issue Type: Test
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> need to add a testcontainers integration test for s3.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-3879) add test containers test for s3 fetcher, emitter and pipe iterators

2022-10-13 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-3879:
---

 Summary: add test containers test for s3 fetcher, emitter and pipe 
iterators
 Key: TIKA-3879
 URL: https://issues.apache.org/jira/browse/TIKA-3879
 Project: Tika
  Issue Type: Test
  Components: tika-pipes
Reporter: Nicholas DiPiazza


need to add a testcontainers integration test for s3.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-09-07 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601463#comment-17601463
 ] 

Nicholas DiPiazza commented on TIKA-3835:
-

Yeah, I'm quickly realizing that in my case, because I already have Solr, it's 
better to just store the parsed output in Solr than in S3 - though the S3 
option is good too. So an interface that supports caching wherever you want is 
the right approach.

> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> These archived results can be returned in the case that the same exact 
> version of a document had already been parsed previously, pull the parsed 
> output from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe fetcher is using a cloud service, documents are rate 
> limited heavily. So if you manage to get a document and parse it, storing it 
> for future use is very important.
>  * Multi tier environments can be populated faster. Example: You are pulling 
> data from an app in dev, staging and production. When you run the tika pipe 
> job, it will parse each document 1 time. All the other environments can now 
> re-use the parsed output - saving days of run time (in my case).
>  ** In other words, "full crawls" for your initial tika index on duplicate 
> environments is reduced to cache lookups.
> So the process would be 
>  * pipe iterator has the next document: \{lastUpdated,docID}
>  ** pipe iterator documents have an optional field: *cache* _boolean -_ 
> default=true. If cache=false, will not cache this doc.
>  * if parse cache is enabled, *cache* field != false, and parse cache 
> contains \{lastUpdated,docID}
>  ** Get \{lastUpdated,docID} document from the cache and push to the emit 
> queue and return.
>  * Parse document
>  * If parse cache is enabled, and *cache* field != false, put into cache 
> key=\{lastUpdated,docID}, value=\{document,metadata}
>  ** Additional conditions can dictate what documents we store in the cache 
> and what ones we don't bother. Such as numBytesInBody, etc.
> The cache would need to be disk or network based storage because of the 
> storage size. In-memory cache would not be feasible. 
> The parser cache should be based on an interface so that the user can use 
> several varieties of implementations such as:
>  * File cache
>  * S3 implementation cache
>  * Others..
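
A minimal sketch of what such a parse-cache interface could look like (all 
names here are illustrative, not a shipped API; CachedParse stands in for 
whatever parsed-output holder is used):

import java.time.Instant;
import java.util.Map;
import java.util.Optional;

public interface ParseCache {

    // Look up a previously parsed result for this exact document version.
    Optional<CachedParse> get(String docId, Instant lastUpdated);

    // Store a parsed result; implementations could be file-, S3-, or
    // Solr-backed, per the discussion above.
    void put(String docId, Instant lastUpdated, CachedParse parsed);
}

// Illustrative value object pairing extracted text with its metadata.
record CachedParse(String text, Map<String, String> metadata) {}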



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578666#comment-17578666
 ] 

Nicholas DiPiazza edited comment on TIKA-3835 at 8/11/22 8:53 PM:
--

[~tallison] I was wondering the same thing. For now I'm just taking a tuple for 
a key, such as \{fileId, lastUpdatedOn}. We would then turn that into a 
UUID5-style identifier, or just use some string form of it such as 
`\{fileId}|\{timestamp}`. Some file sources actually have a checksum available 
(box.com has that); in those cases you could use the checksum as the parse 
cache key. A sketch of the key derivation follows below.
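
A hedged sketch of deriving such a key. Note that the JDK only ships name-based 
UUIDs as version 3 (MD5) via UUID.nameUUIDFromBytes; a true UUIDv5 (SHA-1) 
would need a little extra code or a library:

import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class ParseCacheKeys {

    // Deterministic key from the {fileId, lastUpdatedOn} tuple: the same
    // document version always maps to the same cache entry.
    public static UUID key(String fileId, long lastUpdatedOn) {
        String raw = fileId + "|" + lastUpdatedOn;
        return UUID.nameUUIDFromBytes(raw.getBytes(StandardCharsets.UTF_8));
    }
}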


was (Author: JIRAUSER294298):
[~tallison] i was wondering same thing. For now just taking a tuple for a key 
such as \{fileId, lastUpdatedOn}. we would then turn that into a UUID5 long 
integer, or just use some string form of it such as `\{fileId}|\{timestamp}`

> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> These archived results can be returned in the case that the same exact 
> version of a document had already been parsed previously, pull the parsed 
> output from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe fetcher is using a cloud service, documents are rate 
> limited heavily. So if you manage to get a document and parse it, storing it 
> for future use is very important.
>  * Multi tier environments can be populated faster. Example: You are pulling 
> data from an app in dev, staging and production. When you run the tika pipe 
> job, it will parse each document 1 time. All the other environments can now 
> re-use the parsed output - saving days of run time (in my case).
>  ** In other words, "full crawls" for your initial tika index on duplicate 
> environments is reduced to cache lookups.
> So the process would be 
>  * pipe iterator has the next document: \{lastUpdated,docID}
>  ** pipe iterator documents have an optional field: *cache* _boolean -_ 
> default=true. If cache=false, will not cache this doc.
>  * if parse cache is enabled, *cache* field != false, and parse cache 
> contains \{lastUpdated,docID}
>  ** Get \{lastUpdated,docID} document from the cache and push to the emit 
> queue and return.
>  * Parse document
>  * If parse cache is enabled, and *cache* field != false, put into cache 
> key=\{lastUpdated,docID}, value=\{document,metadata}
>  ** Additional conditions can dictate what documents we store in the cache 
> and what ones we don't bother. Such as numBytesInBody, etc.
> The cache would need to be disk or network based storage because of the 
> storage size. In-memory cache would not be feasible. 
> The parser cache should be based on an interface so that the user can use 
> several varieties of implementations such as:
>  * File cache
>  * S3 implementation cache
>  * Others..



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578666#comment-17578666
 ] 

Nicholas DiPiazza edited comment on TIKA-3835 at 8/11/22 8:52 PM:
--

[~tallison] I was wondering the same thing. For now I'm just taking a tuple for 
a key, such as \{fileId, lastUpdatedOn}. We would then turn that into a 
UUID5-style identifier, or just use some string form of it such as 
`\{fileId}|\{timestamp}`


was (Author: JIRAUSER294298):
[~tallison] i was wondering same thing. For now just taking a tuple for a key 
such as \{fileId, lastUpdatedOn}. we would then turn that into a UUID5 long 
integer. 

> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> When the exact same version of a document has already been parsed, the parsed 
> output can be pulled from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe fetcher is using a cloud service, documents are rate 
> limited heavily. So if you manage to get a document and parse it, storing it 
> for future use is very important.
>  * Multi tier environments can be populated faster. Example: You are pulling 
> data from an app in dev, staging and production. When you run the tika pipe 
> job, it will parse each document 1 time. All the other environments can now 
> re-use the parsed output - saving days of run time (in my case).
>  ** In other words, "full crawls" for your initial tika index on duplicate 
> environments is reduced to cache lookups.
> So the process would be 
>  * pipe iterator has the next document: \{lastUpdated,docID}
>  ** pipe iterator documents have an optional field: *cache* _boolean -_ 
> default=true. If cache=false, will not cache this doc.
>  * if parse cache is enabled, *cache* field != false, and parse cache 
> contains \{lastUpdated,docID}
>  ** Get \{lastUpdated,docID} document from the cache and push to the emit 
> queue and return.
>  * Parse document
>  * If parse cache is enabled, and *cache* field != false, put into cache 
> key=\{lastUpdated,docID}, value=\{document,metadata}
>  ** Additional conditions can dictate what documents we store in the cache 
> and what ones we don't bother. Such as numBytesInBody, etc.
> The cache would need to be disk or network based storage because of the 
> storage size. In-memory cache would not be feasible. 
> The parser cache should be based on an interface so that the user can use 
> several varieties of implementations such as:
>  * File cache
>  * S3 implementation cache
>  * Others..



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578666#comment-17578666
 ] 

Nicholas DiPiazza commented on TIKA-3835:
-

[~tallison] I was wondering the same thing. For now, just taking a tuple for a key 
such as \{fileId, lastUpdatedOn}; we would then turn that into a UUID5-derived long 
integer.

> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> When the exact same version of a document has already been parsed, the parsed 
> output can be pulled from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe fetcher is using a cloud service, documents are rate 
> limited heavily. So if you manage to get a document and parse it, storing it 
> for future use is very important.
>  * Multi tier environments can be populated faster. Example: You are pulling 
> data from an app in dev, staging and production. When you run the tika pipe 
> job, it will parse each document 1 time. All the other environments can now 
> re-use the parsed output - saving days of run time (in my case).
>  ** In other words, "full crawls" for your initial tika index on duplicate 
> environments is reduced to cache lookups.
> So the process would be 
>  * pipe iterator has the next document: \{lastUpdated,docID}
>  ** pipe iterator documents have an optional field: *cache* _boolean -_ 
> default=true. If cache=false, will not cache this doc.
>  * if parse cache is enabled, *cache* field != false, and parse cache 
> contains \{lastUpdated,docID}
>  ** Get \{lastUpdated,docID} document from the cache and push to the emit 
> queue and return.
>  * Parse document
>  * If parse cache is enabled, and *cache* field != false, put into cache 
> key=\{lastUpdated,docID}, value=\{document,metadata}
>  ** Additional conditions can dictate what documents we store in the cache 
> and what ones we don't bother. Such as numBytesInBody, etc.
> The cache would need to be disk or network based storage because of the 
> storage size. In-memory cache would not be feasible. 
> The parser cache should be based on an interface so that the user can use 
> several varieties of implementations such as:
>  * File cache
>  * S3 implementation cache
>  * Others..



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:

Description: 
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 ** pipe iterator documents have an optional field: *cache* _boolean -_ 
default=true. If cache=false, will not cache this doc.
 * if parse cache is enabled, *cache* field != false, and parse cache contains 
\{lastUpdated,docID}
 ** Get \{lastUpdated,docID} document from the cache and push to the emit queue 
and return.
 * Parse document
 * If parse cache is enabled, and *cache* field != false, put into cache 
key=\{lastUpdated,docID}, value=\{document,metadata}
 ** Additional conditions can dictate what documents we store in the cache and 
what ones we don't bother. Such as numBytesInBody, etc.
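Sketched as code, that loop might look roughly like the following. Every type name here (PipeDoc, ParseCache, Parser, Emitter) is a hypothetical placeholder, not a real tika-pipes class:

{code:java}
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch of the fetch/parse loop with a parse cache bolted on.
public class ParseCacheLoop {

    record PipeDoc(String docId, long lastUpdated, boolean cache) {}
    record ParsedResult(String document, Map<String, String> metadata) {}

    interface ParseCache {
        boolean contains(String key);
        ParsedResult get(String key);
        void put(String key, ParsedResult value);
    }

    interface Parser { ParsedResult fetchAndParse(PipeDoc doc); }
    interface Emitter { void emit(ParsedResult result); }

    static void run(Iterator<PipeDoc> it, ParseCache cache, Parser parser,
                    Emitter emitter, boolean cacheEnabled) {
        while (it.hasNext()) {
            PipeDoc doc = it.next();
            String key = doc.docId() + "|" + doc.lastUpdated();
            // the per-document cache field defaults to true; cache=false opts out
            if (cacheEnabled && doc.cache() && cache.contains(key)) {
                emitter.emit(cache.get(key)); // hit: skip the fetch+parse entirely
                continue;
            }
            ParsedResult result = parser.fetchAndParse(doc);
            if (cacheEnabled && doc.cache()) {
                // additional conditions (numBytesInBody etc.) could gate this put
                cache.put(key, result);
            }
            emitter.emit(result);
        }
    }
}
{code}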

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..
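A rough file-backed variant of such an interface might look like this (invented names; serialization format, eviction, and error handling are all elided):

{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical disk-backed parse cache: one file per cache key.
public class FileSystemParseCache {

    private final Path root;

    public FileSystemParseCache(Path root) throws IOException {
        this.root = Files.createDirectories(root);
    }

    // hash the key so it is always a filesystem-safe name
    private Path pathFor(String key) {
        return root.resolve(UUID.nameUUIDFromBytes(
                key.getBytes(StandardCharsets.UTF_8)) + ".json");
    }

    public boolean contains(String key) {
        return Files.exists(pathFor(key));
    }

    public String get(String key) {
        try {
            return Files.readString(pathFor(key)); // the cached {document, metadata} blob
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void put(String key, String value) {
        try {
            Files.writeString(pathFor(key), value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
{code}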

  was:
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 ** pipe iterator documents have an optional field: *cache* _boolean -_ 
default=true. If cache=false, will not cache this doc.
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Get \{lastUpdated,docID} document from the cache and push to the emit queue 
and return.
 * Parse document
 * If parse cache is enabled, and *cache* field != false, put into cache 
key=\{lastUpdated,docID}, value=\{document,metadata}
 ** Additional conditions can dictate what documents we store in the cache and 
what ones we don't bother. Such as numBytesInBody, etc.

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..


> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> When the exact same version of a document has already been parsed, the parsed 
> output can be pulled from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe 

[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:

Description: 
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 ** pipe iterator documents have an optional field: *cache* _boolean -_ 
default=true. If cache=false, will not cache this doc.
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Get \{lastUpdated,docID} document from the cache and push to the emit queue 
and return.
 * Parse document
 * If parse cache is enabled, and *cache* field != false, put into cache 
key=\{lastUpdated,docID}, value=\{document,metadata}
 ** Additional conditions can dictate what documents we store in the cache and 
what ones we don't bother. Such as numBytesInBody, etc.

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..

  was:
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 ** pipe iterator documents have an optional field: *cache* _boolean -_ 
default=true. If cache=false, will not cache this doc.
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Get \{lastUpdated,docID} document from the cache and push to the emit queue 
and return.
 * Parse document
 * If parse cache is enabled, and *cache* field != false, put into cache 
key=\{lastUpdated,docID}, value=\{document,metadata}
 ** Additional conditions can dictate what documents we store in the cache and 
what ones we don't bother. Such as numBytesInBody, etc.

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

A cache lookup HIT could be pushed via a separate queue so that batching can be 
utilized asynchronously.

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..


> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> When the exact same version of a document has already been parsed, the parsed 
> output can be pulled from a "parse cache" instead of repeating the fetch+parse.
> In other 

[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:

Description: 
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 ** pipe iterator documents have an optional field: *cache* _boolean -_ 
default=true. If cache=false, will not cache this doc.
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Get \{lastUpdated,docID} document from the cache and push to the emit queue 
and return.
 * Parse document
 * If parse cache is enabled, and *cache* field != false, put into cache 
key=\{lastUpdated,docID}, value=\{document,metadata}
 ** Additional conditions can dictate what documents we store in the cache and 
what ones we don't bother. Such as numBytesInBody, etc.

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

A cache lookup HIT could be pushed via a separate queue so that batching can be 
utilized asynchronously.
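A sketch of that idea, assuming a bounded in-memory queue between the cache-lookup path and a dedicated batching thread (all names invented):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical: cache HITs are queued and emitted in batches asynchronously.
public class CacheHitBatcher {

    private final BlockingQueue<String> hits = new LinkedBlockingQueue<>(10_000);

    // called by workers whenever a cache lookup hits
    public void offerHit(String cachedResult) throws InterruptedException {
        hits.put(cachedResult);
    }

    // drained by one dedicated thread; emits up to batchSize results at a time
    public void drainLoop(int batchSize) throws InterruptedException {
        List<String> batch = new ArrayList<>(batchSize);
        while (!Thread.currentThread().isInterrupted()) {
            String first = hits.poll(1, TimeUnit.SECONDS);
            if (first == null) {
                continue; // nothing queued; poll again
            }
            batch.add(first);
            hits.drainTo(batch, batchSize - 1);
            emitBatch(batch); // stand-in for pushing to the real emit queue
            batch.clear();
        }
    }

    private void emitBatch(List<String> batch) {
        System.out.println("emitting " + batch.size() + " cached results");
    }
}
{code}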

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..

  was:
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 ** pipe iterator documents have an optional field: *cache* _boolean -_ 
default=true. If cache=false, will not cache this doc.
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Get \{lastUpdated,docID} document from the cache and push to the emit queue 
and return.
 * Parse document
 * If parse cache is enabled, and *cache* field != false, put into cache 
key=\{lastUpdated,docID}, value=\{document,metadata}
 ** Additional conditions can dictate what documents we store in the cache and 
what ones we don't bother.

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..


> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> When the exact same version of a document has already been parsed, the parsed 
> output can be pulled from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if 

[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:

Description: 
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 ** pipe iterator documents have an optional field: *cache* _boolean -_ 
default=true. If cache=false, will not cache this doc.
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Get \{lastUpdated,docID} document from the cache and push to the emit queue 
and return.
 * Parse document
 * If parse cache is enabled, and *cache* field != false, put into cache 
key=\{lastUpdated,docID}, value=\{document,metadata}
 ** Additional conditions can dictate what documents we store in the cache and 
what ones we don't bother.

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..

  was:
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 ** pipe iterator documents have an optional field: *cache* _boolean -_ 
default=true. If cache=false, will not cache this doc.
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Get \{lastUpdated,docID} document from the cache and push to the emit queue 
and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..


> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> When the exact same version of a document has already been parsed, the parsed 
> output can be pulled from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe fetcher is using a cloud service, documents are rate 
> limited heavily. So if you manage to get a document and parse it, storing it 
> for future use is very important.
>  * Multi tier environments can be populated 

[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:

Description: 
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 ** pipe iterator documents have an optional field: *cache* _boolean -_ 
default=true. If cache=false, will not cache this doc.
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Get \{lastUpdated,docID} document from the cache and push to the emit queue 
and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..

  was:
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Get \{lastUpdated,docID} document from the cache and push to the emit queue 
and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..


> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> When the exact same version of a document has already been parsed, the parsed 
> output can be pulled from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe fetcher is using a cloud service, documents are rate 
> limited heavily. So if you manage to get a document and parse it, storing it 
> for future use is very important.
>  * Multi tier environments can be populated faster. Example: You are pulling 
> data from an app in dev, staging and production. When you run the tika pipe 
> job, it will parse each document 1 time. All the other environments can now 
> re-use the parsed output - saving days of run time (in my case).
>  ** 

[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:

Description: 
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Get \{lastUpdated,docID} document from the cache and push to the emit queue 
and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..
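For the S3 variety, a rough sketch against the AWS SDK v2 (the bucket name and key layout are invented, and a real implementation would also need credentials, retries, and a serialization format):

{code:java}
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.NoSuchKeyException;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

// Hypothetical S3-backed parse cache: one object per cache key.
public class S3ParseCache {

    private final S3Client s3 = S3Client.create();
    private final String bucket = "my-parse-cache"; // invented bucket name

    public boolean contains(String key) {
        try {
            s3.headObject(HeadObjectRequest.builder().bucket(bucket).key(key).build());
            return true;
        } catch (NoSuchKeyException e) {
            return false;
        }
    }

    public byte[] get(String key) {
        return s3.getObjectAsBytes(
                GetObjectRequest.builder().bucket(bucket).key(key).build()).asByteArray();
    }

    public void put(String key, byte[] value) {
        s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                RequestBody.fromBytes(value));
    }
}
{code}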

  was:
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Get \{lastUpdated,docID} document from the cache and push to the emit queue 
and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for people using services 
especially cloud file services with strict rate limits.

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..


> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> When the exact same version of a document has already been parsed, the parsed 
> output can be pulled from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe fetcher is using a cloud service, documents are rate 
> limited heavily. So if you manage to get a document and parse it, storing it 
> for future use is very important.
>  * Multi tier environments can be populated faster. Example: You are pulling 
> data from an app in dev, staging and production. When you run the tika pipe 
> job, it will parse each document 1 time. All the other environments can now 
> re-use the parsed output - saving days of run time (in my case).
>  

[jira] [Commented] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578591#comment-17578591
 ] 

Nicholas DiPiazza commented on TIKA-3835:
-

I added a bunch more edits. Done. Ha, sorry if that spammed your email super heavily.

> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> When the exact same version of a document has already been parsed, the parsed 
> output can be pulled from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe fetcher is using a cloud service, documents are rate 
> limited heavily. So if you manage to get a document and parse it, storing it 
> for future use is very important.
>  * Multi tier environments can be populated faster. Example: You are pulling 
> data from an app in dev, staging and production. When you run the tika pipe 
> job, it will parse each document 1 time. All the other environments can now 
> re-use the parsed output - saving days of run time (in my case).
>  ** In other words, "full crawls" for your initial tika index on duplicate 
> environments is reduced to cache lookups.
> So the process would be 
>  * pipe iterator has the next document: \{lastUpdated,docID}
>  * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
>  ** Get \{lastUpdated,docID} document from the cache and push to the emit 
> queue and return.
>  * Parse document
>  * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
> value=\{document,metadata}
> The cache would need to be disk or network based storage because of the 
> storage size. In-memory cache would not be feasible. 
> The parser cache should be based on an interface so that the user can use 
> several varieties of implementations such as:
>  * File cache
>  * S3 implementation cache
>  * Others..



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:

Description: 
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Get \{lastUpdated,docID} document from the cache and push to the emit queue 
and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for people using services 
especially cloud file services with strict rate limits.

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..

  was:
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Get \{lastUpdated,docID} document from the cache and push to the emit queue 
and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services 
especially cloud file services with strict rate limits.

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..


> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> When the exact same version of a document has already been parsed, the parsed 
> output can be pulled from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe fetcher is using a cloud service, documents are rate 
> limited heavily. So if you manage to get a document and parse it, storing it 
> for future use is very important.
>  * Multi tier environments can be populated faster. Example: You are pulling 
> data from an app in dev, staging and production. When you run the tika pipe 
> job, it will 

[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:

Description: 
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Get \{lastUpdated,docID} document from the cache and push to the emit queue 
and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services 
especially cloud file services with strict rate limits.

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..

  was:
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Emit the document to the emit queue and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services 
especially cloud file services with strict rate limits.

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..


> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> When the exact same version of a document has already been parsed, the parsed 
> output can be pulled from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe fetcher is using a cloud service, documents are rate 
> limited heavily. So if you manage to get a document and parse it, storing it 
> for future use is very important.
>  * Multi tier environments can be populated faster. Example: You are pulling 
> data from an app in dev, staging and production. When you run the tika pipe 
> job, it will parse each document 1 time. All the 

[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:

Description: 
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Emit the document to the emit queue and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services 
especially cloud file services with strict rate limits.

The cache would need to be disk or network based storage because of the storage 
size. In-memory cache would not be feasible. 

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..

  was:
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Emit the document to the emit queue and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services 
especially cloud file services with strict rate limits.

The cache would need to be disk or network based storage because of the storage 
size.

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..


> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> When the exact same version of a document has already been parsed, the parsed 
> output can be pulled from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe fetcher is using a cloud service, documents are rate 
> limited heavily. So if you manage to get a document and parse it, storing it 
> for future use is very important.
>  * Multi tier environments can be populated faster. Example: You are pulling 
> data from an app in dev, staging and production. When you run the tika pipe 
> job, it will parse each document 1 time. All the other environments can now 
> re-use the parsed output - saving days of run 

[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:

Description: 
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Emit the document to the emit queue and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services 
especially cloud file services with strict rate limits.

The cache would need to be disk or network based storage because of the storage 
size.

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..

  was:
Tika pipes should have an optional configuration to archive parsed results. 
When the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Emit the document to the emit queue and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services 
especially cloud file services with strict rate limits.

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..


> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> When the exact same version of a document has already been parsed, the parsed 
> output can be pulled from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe fetcher is using a cloud service, documents are rate 
> limited heavily. So if you manage to get a document and parse it, storing it 
> for future use is very important.
>  * Multi tier environments can be populated faster. Example: You are pulling 
> data from an app in dev, staging and production. When you run the tika pipe 
> job, it will parse each document 1 time. All the other environments can now 
> re-use the parsed output - saving days of run time (in my case).
>  ** In other words, "full crawls" for your initial tika index on duplicate 
> environments is reduced to 

[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:

Description: 
Tika pipes should have an optional configuration to archive parsed results. 
These archived results can be returned in the case that the same exact version 
of a document had already been parsed previously, pull the parsed output from a 
"parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Emit the document to the emit queue and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services, 
especially cloud file services, with strict rate limits.

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..

  was:
Tika pipes should have an optional configuration to archive parsed results. 
These archived results can be returned in the case that the same exact version 
of a document had already been parsed previously, pull the parsed output from a 
"parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Emit the document to the emit queue and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services, 
especially cloud file services, with strict rate limits.

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..


> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> These archived results can be returned in the case that the same exact 
> version of a document had already been parsed previously, pull the parsed 
> output from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe fetcher is using a cloud service, documents are rate 
> limited heavily. So if you manage to get a document and parse it, storing it 
> for future use is very important.
>  * Multi tier environments can be populated faster. Example: You are pulling 
> data from an app in dev, staging and production. When you run the tika pipe 
> job, it will parse each document 1 time. All the other environments can now 
> re-use the parsed output - saving days of run time (in my case).
>  ** In other words, "full crawls" for your initial tika index on duplicate 
> environments is reduced to cache lookups.
> So the process would be 
>  * pipe iterator has the next document: 

[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:

Description: 
Tika pipes should have an optional configuration to archive parsed results. 
These archived results can be returned in the case that the same exact version 
of a document had already been parsed previously, pull the parsed output from a 
"parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reduced to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Emit the document to the emit queue and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services, 
especially cloud file services, with strict rate limits.

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..

  was:
Tika pipes should have an optional configuration to archive parsed results. 
These archived results can be returned in the case that the same exact version 
of a document had already been parsed previously, pull the parsed output from a 
"parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reducded to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Emit the document to the emit queue and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services, 
especially cloud file services, with strict rate limits.

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..


> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> These archived results can be returned in the case that the same exact 
> version of a document had already been parsed previously, pull the parsed 
> output from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe fetcher is using a cloud service, documents are rate 
> limited heavily. So if you manage to get a document and parse it, storing it 
> for future use is very important.
>  * Multi tier environments can be populated faster. Example: You are pulling 
> data from an app in dev, staging and production. When you run the tika pipe 
> job, it will parse each document 1 time. All the other environments can now 
> re-use the parsed output - saving days run time (in my case).
>  ** In other words, "full crawls" for your initial tika index on duplicate 
> environments is reduced to cache lookups.
> So the process would be 
>  * pipe iterator has the next document: 

[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:

Description: 
Tika pipes should have an optional configuration to archive parsed results. 
These archived results can be returned in the case that the same exact version 
of a document had already been parsed previously, pull the parsed output from a 
"parse cache" instead of repeating the fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are rate 
limited heavily. So if you manage to get a document and parse it, storing it 
for future use is very important.
 * Multi tier environments can be populated faster. Example: You are pulling 
data from an app in dev, staging and production. When you run the tika pipe 
job, it will parse each document 1 time. All the other environments can now 
re-use the parsed output - saving days run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments is reducded to cache lookups.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Emit the document to the emit queue and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services, 
especially cloud file services, with strict rate limits.

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..

  was:
Tika pipes should have an optional configuration to archive parsed results.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Emit the document to the emit queue and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services, 
especially cloud file services, with strict rate limits.

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..


> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results. 
> These archived results can be returned in the case that the same exact 
> version of a document had already been parsed previously, pull the parsed 
> output from a "parse cache" instead of repeating the fetch+parse.
> In other words, skip the fetch+parse if you did it previously.
> Benefits of this:
>  * When the tika pipe fetcher is using a cloud service, documents are rate 
> limited heavily. So if you manage to get a document and parse it, storing it 
> for future use is very important.
>  * Multi tier environments can be populated faster. Example: You are pulling 
> data from an app in dev, staging and production. When you run the tika pipe 
> job, it will parse each document 1 time. All the other environments can now 
> re-use the parsed output - saving days run time (in my case).
>  ** In other words, "full crawls" for your initial tika index on duplicate 
> environments is reducded to cache lookups.
> So the process would be 
>  * pipe iterator has the next document: \{lastUpdated,docID}
>  * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
>  ** Emit the document to the emit queue and return.
>  * Parse document
>  * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
> value=\{document,metadata}
> This will drastically improve full crawl times for customers using services, 
> especially cloud file services, with strict rate limits.
> The parser cache should be based on an interface so that the user can use 
> several varieties of implementations such as:
>  * File cache
>  * S3 implementation cache
>  * Others..



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578583#comment-17578583
 ] 

Nicholas DiPiazza edited comment on TIKA-3835 at 8/11/22 5:37 PM:
--

Yes, good point. I didn't point out some important details; I attempted to add 
them to the bottom of the description. Did that help clarify some? 


was (Author: JIRAUSER294298):
Yes, good point. I didn't point out some important details; I attempted to add 
them to the bottom of that. Did that help clarify some? 

> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results.
> So the process would be 
>  * pipe iterator has the next document: \{lastUpdated,docID}
>  * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
>  ** Emit the document to the emit queue and return.
>  * Parse document
>  * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
> value=\{document,metadata}
> This will drastically improve full crawl times for customers using services, 
> especially cloud file services, with strict rate limits.
> The parser cache should be based on an interface so that the user can use 
> several varieties of implementations such as:
>  * File cache
>  * S3 implementation cache
>  * Others..



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578583#comment-17578583
 ] 

Nicholas DiPiazza commented on TIKA-3835:
-

Yes, good point. I didn't point out some important details; I attempted to add 
them to the bottom of that. Did that help clarify some? 

> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results.
> So the process would be 
>  * pipe iterator has the next document: \{lastUpdated,docID}
>  * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
>  ** Emit the document to the emit queue and return.
>  * Parse document
>  * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
> value=\{document,metadata}
> This will drastically improve full crawl times for customers using services, 
> especially cloud file services, with strict rate limits.
> The parser cache should be based on an interface so that the user can use 
> several varieties of implementations such as:
>  * File cache
>  * S3 implementation cache
>  * Others..



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:

Description: 
Tika pipes should have an optional configuration to archive parsed results.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Emit the document to the emit queue and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services, 
especially cloud file services, with strict rate limits.

The parser cache should be based on an interface so that the user can use 
several varieties of implementations such as:
 * File cache
 * S3 implementation cache
 * Others..

  was:
Tika pipes should have an optional configuration to archive parsed results.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Emit the document to the emit queue and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services, 
especially cloud file services, with strict rate limits.


> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results.
> So the process would be 
>  * pipe iterator has the next document: \{lastUpdated,docID}
>  * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
>  ** Emit the document to the emit queue and return.
>  * Parse document
>  * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
> value=\{document,metadata}
> This will drastically improve full crawl times for customers using services, 
> especially cloud file services, with strict rate limits.
> The parser cache should be based on an interface so that the user can use 
> several varieties of implementations such as:
>  * File cache
>  * S3 implementation cache
>  * Others..



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:

Summary: tika pipes parse cache - avoid re-parsing content that has not 
changed  (was: parse cache - avoid re-parsing content that has not changed)

> tika pipes parse cache - avoid re-parsing content that has not changed
> --
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.2.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results.
> So the process would be 
>  * pipe iterator has the next document: \{lastUpdated,docID}
>  * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
>  ** Emit the document to the emit queue and return.
>  * Parse document
>  * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
> value=\{document,metadata}
> This will drastically improve full crawl times for customers using services, 
> especially cloud file services, with strict rate limits.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-3835) parse cache - avoid re-parsing content that has not changed

2022-08-11 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-3835:
---

 Summary: parse cache - avoid re-parsing content that has not 
changed
 Key: TIKA-3835
 URL: https://issues.apache.org/jira/browse/TIKA-3835
 Project: Tika
  Issue Type: New Feature
  Components: tika-pipes
Affects Versions: 2.2.0
Reporter: Nicholas DiPiazza


Tika pipes should have an optional configuration to archive parsed results.

So the process would be 
 * pipe iterator has the next document: \{lastUpdated,docID}
 * if parse cache is enabled and parse cache contains \{lastUpdated,docID}
 ** Emit the document to the emit queue and return.
 * Parse document
 * If parse cache is enabled, put into cache key=\{lastUpdated,docID}, 
value=\{document,metadata}

This will drastically improve full crawl times for customers using services, 
especially cloud file services, with strict rate limits.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-3821) Pulsar Tika Pipes Support

2022-07-19 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-3821:
---

 Summary: Pulsar Tika Pipes Support
 Key: TIKA-3821
 URL: https://issues.apache.org/jira/browse/TIKA-3821
 Project: Tika
  Issue Type: New Feature
  Components: tika-pipes
Affects Versions: 2.4.1
Reporter: Nicholas DiPiazza


add kafka support to tika pipes:

* kafka pipe iterator
* kafka emitter





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-3821) Pulsar Tika Pipes Support

2022-07-19 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3821:

Description: 
add pulsar support to tika pipes:

* pulsar pipe iterator
* pulsar emitter



  was:
add kafka support to tika pipes:

* kafka pipe iterator
* kafka emitter




> Pulsar Tika Pipes Support
> -
>
> Key: TIKA-3821
> URL: https://issues.apache.org/jira/browse/TIKA-3821
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Affects Versions: 2.4.1
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> add pulsar support to tika pipes:
> * pulsar pipe iterator
> * pulsar emitter



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-3820) Kafka Tika Pipes Support

2022-07-18 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-3820:
---

 Summary: Kafka Tika Pipes Support
 Key: TIKA-3820
 URL: https://issues.apache.org/jira/browse/TIKA-3820
 Project: Tika
  Issue Type: New Feature
  Components: tika-pipes
Affects Versions: 2.4.1
Reporter: Nicholas DiPiazza


add kafka support to tika pipes:

* kafka pipe iterator
* kafka emitter
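
A rough sketch of how a Kafka pipe iterator could poll fetch keys using the 
plain kafka-clients consumer; the broker address, topic name, and the 
convention that record keys carry document IDs are assumptions for 
illustration, not a committed tika-pipes design:

{code:java}
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Polls a topic whose record keys are document IDs and prints each fetch
// key that a pipe iterator would hand off to the fetch/parse/emit pipeline.
public class KafkaPipeIteratorSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("group.id", "tika-pipes");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("tika-fetch-keys")); // assumed topic
            while (true) {
                for (ConsumerRecord<String, String> rec :
                        consumer.poll(Duration.ofSeconds(1))) {
                    // A real iterator would wrap this in a FetchEmitTuple
                    // and enqueue it rather than print it.
                    System.out.println("next document to fetch: " + rec.key());
                }
            }
        }
    }
}
{code}

The emitter would be the mirror image: a KafkaProducer writing the parsed 
metadata as the record value, keyed by the same document ID.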





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-3725) Add Authorization to Tika Server (Suggest Basic to start off with)

2022-04-22 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526632#comment-17526632
 ] 

Nicholas DiPiazza edited comment on TIKA-3725 at 4/22/22 7:03 PM:
--

[~tallison] in my case I have a bunch of other deployments and statefulsets 
that are all using JWT across all inner-pod communication, so having the 
ability to be consistent with those would be nice. 


was (Author: ndipiazza_gmail):
[~tallison] in my case I have a bunch of other deployments and statefulsets 
that are all using JWT to keep all inner-pod communication. so in my case 
having the ability to be consistent with those would be nice. 

> Add Authorization to Tika Server (Suggest Basic to start off with)
> --
>
> Key: TIKA-3725
> URL: https://issues.apache.org/jira/browse/TIKA-3725
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-server
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> It would be good to get some Authentication/Authorization added to Tika Server 
> to add another layer of security around the Tika Server REST 
> service.
> This could become a rabbit hole with the number of options available around 
> Authentication/Authorization (OAuth, OpenID, etc.), so the suggestion is to 
> start with basic Auth. 
> As for how to store user(s)/passwords, suggest looking at how other Apache 
> products do the same.  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3725) Add Authorization to Tika Server (Suggest Basic to start off with)

2022-04-22 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526632#comment-17526632
 ] 

Nicholas DiPiazza commented on TIKA-3725:
-

[~tallison] in my case I have a bunch of other deployments and statefulsets 
that are all using JWT to keep all inner-pod communication. so in my case 
having the ability to be consistent with those would be nice. 

> Add Authorization to Tika Server (Suggest Basic to start off with)
> --
>
> Key: TIKA-3725
> URL: https://issues.apache.org/jira/browse/TIKA-3725
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-server
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> It would be good to get some Authentication/Authorization added to Tika Server 
> to add another layer of security around the Tika Server REST 
> service.
> This could become a rabbit hole with the number of options available around 
> Authentication/Authorization (OAuth, OpenID, etc.), so the suggestion is to 
> start with basic Auth. 
> As for how to store user(s)/passwords, suggest looking at how other Apache 
> products do the same.  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3725) Add Authorization to Tika Server (Suggest Basic to start off with)

2022-04-22 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526557#comment-17526557
 ] 

Nicholas DiPiazza commented on TIKA-3725:
-

I am a couple of weeks out from needing this too, and I'll need JWT auth. I can 
add it if someone hasn't already. 
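
To give the JWT case some shape, here is a generic JAX-RS request-filter 
sketch; tika-server is CXF/JAX-RS based, but this filter and its verifyJwt() 
helper are illustrative assumptions, not existing tika-server code:

{code:java}
import java.io.IOException;

import javax.ws.rs.container.ContainerRequestContext;
import javax.ws.rs.container.ContainerRequestFilter;
import javax.ws.rs.core.HttpHeaders;
import javax.ws.rs.core.Response;
import javax.ws.rs.ext.Provider;

// Rejects any request that lacks a valid "Authorization: Bearer <jwt>" header.
@Provider
public class JwtAuthFilter implements ContainerRequestFilter {

    @Override
    public void filter(ContainerRequestContext ctx) throws IOException {
        String header = ctx.getHeaderString(HttpHeaders.AUTHORIZATION);
        if (header == null || !header.startsWith("Bearer ")) {
            ctx.abortWith(Response.status(Response.Status.UNAUTHORIZED).build());
            return;
        }
        if (!verifyJwt(header.substring("Bearer ".length()))) {
            ctx.abortWith(Response.status(Response.Status.UNAUTHORIZED).build());
        }
    }

    // Placeholder: a real implementation would check the signature and
    // expiry with a JWT library (e.g. jjwt) against a configured key.
    private boolean verifyJwt(String token) {
        return !token.isEmpty();
    }
}
{code}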

> Add Authorization to Tika Server (Suggest Basic to start off with)
> --
>
> Key: TIKA-3725
> URL: https://issues.apache.org/jira/browse/TIKA-3725
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-server
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> It would be good to get some Authentication/Authorization added to Tika Server 
> to add another layer of security around the Tika Server REST 
> service.
> This could become a rabbit hole with the number of options available around 
> Authentication/Authorization (OAuth, OpenID, etc.), so the suggestion is to 
> start with basic Auth. 
> As for how to store user(s)/passwords, suggest looking at how other Apache 
> products do the same.  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3659) SMB/NFS support

2022-01-22 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480514#comment-17480514
 ] 

Nicholas DiPiazza commented on TIKA-3659:
-

I will need to add a `smbj` client for SMB2/3 and `jcifs` for legacy SMB1 
shares soon for a project I'll be doing. Will probably have this done in the 
next few weeks. 
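
For the SMB2/3 side, a minimal smbj fetch might look like the following; the 
host, share, path, and credentials are placeholders, and a real tika-pipes 
fetcher would hand the stream to the parser rather than count its bytes:

{code:java}
import java.io.InputStream;
import java.util.EnumSet;

import com.hierynomus.msdtyp.AccessMask;
import com.hierynomus.mssmb2.SMB2CreateDisposition;
import com.hierynomus.mssmb2.SMB2ShareAccess;
import com.hierynomus.smbj.SMBClient;
import com.hierynomus.smbj.auth.AuthenticationContext;
import com.hierynomus.smbj.connection.Connection;
import com.hierynomus.smbj.session.Session;
import com.hierynomus.smbj.share.DiskShare;
import com.hierynomus.smbj.share.File;

// Connects to an SMB share, opens one file read-only, and reads it.
public class SmbFetchSketch {
    public static void main(String[] args) throws Exception {
        SMBClient client = new SMBClient();
        try (Connection connection = client.connect("fileserver.example.com")) {
            Session session = connection.authenticate(
                    new AuthenticationContext("user", "password".toCharArray(), "DOMAIN"));
            try (DiskShare share = (DiskShare) session.connectShare("docs")) {
                File remote = share.openFile("reports/q1.docx",
                        EnumSet.of(AccessMask.GENERIC_READ),
                        null,                            // default file attributes
                        SMB2ShareAccess.ALL,
                        SMB2CreateDisposition.FILE_OPEN, // fail if the file is missing
                        null);                           // default create options
                try (InputStream is = remote.getInputStream()) {
                    System.out.println("fetched bytes: " + is.readAllBytes().length);
                }
            }
        }
    }
}
{code}

Legacy SMB1 shares would go through jcifs instead, since smbj only speaks 
SMB2 and above.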

> SMB/NFS support
> ---
>
> Key: TIKA-3659
> URL: https://issues.apache.org/jira/browse/TIKA-3659
> Project: Tika
>  Issue Type: Wish
>  Components: handler, parser
>Affects Versions: 2.2.1
>Reporter: Michael
>Priority: Minor
>  Labels: features
>
> as referenced on 
> [https://discuss.opendistrocommunity.dev/t/alternative-to-fscrawler-in-opensearch/7157/11]
> please add support for the tika-pipes on SMB/NFS collections
>  
> Thank you in advance!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3446) OneNote - look into adding support for OneNote 365 documents

2021-12-08 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455997#comment-17455997
 ] 

Nicholas DiPiazza commented on TIKA-3446:
-

[~tallison] Do I need to do anything to make sure this gets in both 1.x and 
2.x? 

> OneNote - look into adding support for OneNote 365 documents
> 
>
> Key: TIKA-3446
> URL: https://issues.apache.org/jira/browse/TIKA-3446
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.27
>Reporter: Nicholas DiPiazza
>Assignee: Nicholas DiPiazza
>Priority: Major
>
> While doing some parsing of OneNote documents, I was investigating a slew of 
> them that did not seem to parse very well. 
> When I did some digging, I found out that these documents were generated from 
> SharePoint Online. 
> I had hoped that OneNote documents generated from SharePoint Online would 
> just be the same as OnPrem OneNote documents from 2016, 2019 etc. 
> But turns out this is NOT the case. 
> I checked out the Microsoft specification MS-ONESTORE and found that the 
> documents do not match the specifications that are published. 
> Opened a community post: [Looking for the MS spec for OneNote 365 version - 
> Microsoft 
> Q|https://docs.microsoft.com/en-us/answers/questions/436943/looking-for-the-ms-spec-for-onenote-365-version-1.html]
> And also opened an internal ticket with Microsoft. 
> They will be responding soon with an analysis of my issue and we'll see if 
> there is anything we can do. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (TIKA-3561) Tika throwing java.lang.OutOfMemoryError

2021-09-28 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421848#comment-17421848
 ] 

Nicholas DiPiazza edited comment on TIKA-3561 at 9/29/21, 1:06 AM:
---

Tika needs a lot of memory to parse a nested file like this, and time as well.

In order to give it a chance, you need to extract the file first.

Then I cranked up -Xmx20G and it took about 1.5 minutes to parse. 

Eventually it dumped out this parsed output (out.json, compressed in the 
following tar ball):  [^out.tar.gz] 
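
For reference, that kind of parse can be reproduced programmatically with the 
Tika 2.x RecursiveParserWrapper, run with a large heap (e.g. -Xmx20g); the 
file name comes from this ticket, and the handler type and unlimited write 
limit are choices made for illustration:

{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;

// Parses a container file recursively, collecting one Metadata object per
// embedded document -- roughly what tika-app's -J output serializes.
public class RecursiveParseSketch {
    public static void main(String[] args) throws Exception {
        RecursiveParserWrapper wrapper =
                new RecursiveParserWrapper(new AutoDetectParser());
        RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
                new BasicContentHandlerFactory(
                        BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1)); // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream is = Files.newInputStream(Path.of("item.xlsx"))) {
            wrapper.parse(is, handler, metadata, new ParseContext());
        }
        List<Metadata> metadataList = handler.getMetadataList();
        System.out.println("embedded documents parsed: " + metadataList.size());
    }
}
{code}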


was (Author: ndipiazza_gmail):
Tika needs a lot of memory to parse a nested file like this, and time as well.

I cranked up -Xmx20G and it took about 1.5 minutes to parse. 

eventually dumped out this parsed output (out.json compressed in the following 
tar ball)  [^out.tar.gz] 

> Tika throwing java.lang.OutOfMemoryError
> 
>
> Key: TIKA-3561
> URL: https://issues.apache.org/jira/browse/TIKA-3561
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.1.0
>Reporter: Abha
>Priority: Major
> Attachments: Item.zip, out.tar.gz
>
>
> Getting Fatal Exception when processing the attached document \{item.content 
> sub doc name is item.xlsx}.
> Below is the exception log -
> Caused by: java.lang.OutOfMemoryError: Java heap space
>   at java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:77)
>   at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:177)
>   at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:149)
>   at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:47)
>   at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:53)
>   at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
>   at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:113)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

