[jira] [Updated] (TIKA-4181) Tika Grpc Server using Tika Pipes

2024-06-24 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4181:

Summary: Tika Grpc Server using Tika Pipes  (was: Grpc + Tika Pipes)

> Tika Grpc Server using Tika Pipes
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
> Issue Type: New Feature
> Components: tika-pipes
> Reporter: Nicholas DiPiazza
> Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Create a Tika Grpc server.
> You should be able to create Tika Pipes fetchers, then use those fetchers. 
> You can then use those fetchers to FetchAndParse in 3 ways:
>  * synchronous fashion - you send a single request to fetch a file, and get a 
> single FetchAndParse response tuple.
>  * streaming output - you send a single request and stream back the 
> FetchAndParse response tuple.
>  * bi-directional streaming - You stream in 1 or more Fetch requests and 
> stream back FetchAndParse response tuples.
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!
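
The three FetchAndParse call styles listed in the description can be sketched with plain Python functions and generators. This is only an illustration of the call shapes (one request and one reply, one request and a stream of replies, a stream of requests and a stream of replies). It does not use a gRPC library, and the names FetchRequest, FetchAndParseReply, and the three function names are hypothetical, not taken from the Tika contract.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class FetchRequest:
    # Hypothetical request shape: which fetcher to use and what to fetch.
    fetcher_id: str
    fetch_key: str


@dataclass(frozen=True)
class FetchAndParseReply:
    # Hypothetical reply shape: the key that was fetched plus parsed text.
    fetch_key: str
    text: str


def _fetch_and_parse(req: FetchRequest) -> FetchAndParseReply:
    # Placeholder for "fetch with the named fetcher, then parse the bytes".
    return FetchAndParseReply(req.fetch_key, f"parsed:{req.fetcher_id}:{req.fetch_key}")


def fetch_and_parse(req: FetchRequest) -> FetchAndParseReply:
    # 1. Synchronous: one request in, one reply out.
    return _fetch_and_parse(req)


def fetch_and_parse_server_stream(req: FetchRequest) -> Iterator[FetchAndParseReply]:
    # 2. Streaming output: one request in, replies streamed back.
    yield _fetch_and_parse(req)


def fetch_and_parse_bidi(requests: Iterable[FetchRequest]) -> Iterator[FetchAndParseReply]:
    # 3. Bi-directional streaming: replies are produced as requests arrive,
    #    so the caller can keep sending work to a long-running process.
    for req in requests:
        yield _fetch_and_parse(req)
```

Because the third function consumes its input lazily, a caller can keep feeding it requests for as long as the process is up, which is the "persistently running daemon" behaviour the description asks for.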



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4181) Grpc + Tika Pipes

2024-06-24 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4181:

Description: 
Create a Tika Grpc server.

You should be able to create Tika Pipes fetchers, then use those fetchers. 

You can then use those fetchers to FetchAndParse in 3 ways:
 * synchronous fashion - you send a single request to fetch a file, and get a 
single FetchAndParse response tuple.
 * streaming output - you send a single request and stream back the 
FetchAndParse response tuple.
 * bi-directional streaming - You stream in 1 or more Fetch requests and stream 
back FetchAndParse response tuples.

Requires we create a service contract that specifies the inputs we require from 
each method.

Then we will need to implement the different components with a grpc client 
generated using the contract.

This would enable developers to run tika-pipes as a persistently running daemon 
instead of just a single batch app, because it can continue to stream out more 
inputs.

!image-2024-02-06-07-54-50-116.png!

  was:
Add full tika-pipes support of grpc
 * pipe iterator
 * fetcher
 * emitter

Requires we create a service contract that specifies the inputs we require from 
each method.

Then we will need to implement the different components with a grpc client 
generated using the contract.

This would enable developers to run tika-pipes as a persistently running daemon 
instead of just a single batch app, because it can continue to stream out more 
inputs.

!image-2024-02-06-07-54-50-116.png!




[jira] [Updated] (TIKA-4181) Grpc + Tika Pipes

2024-06-24 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4181:

Summary: Grpc + Tika Pipes  (was: Grpc + Tika Pipes - pipe iterator and 
emitter)

> Grpc + Tika Pipes
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
> Issue Type: New Feature
> Components: tika-pipes
> Reporter: Nicholas DiPiazza
> Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Add full tika-pipes support of grpc
>  * pipe iterator
>  * fetcher
>  * emitter
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!





[jira] [Created] (TIKA-4271) Support TCP liveness probe in Tika Helm chart

2024-06-24 Thread Bartek Ciszkowski (Jira)
Bartek Ciszkowski created TIKA-4271:
---

 Summary: Support TCP liveness probe in Tika Helm chart
 Key: TIKA-4271
 URL: https://issues.apache.org/jira/browse/TIKA-4271
 Project: Tika
 Issue Type: Improvement
 Reporter: Bartek Ciszkowski


Currently, the tika-helm chart supplied by 
[https://github.com/apache/tika-helm] supports only an HTTP(S) probe.

However, the gRPC functionality currently in development within Tika requires a 
different sort of probe. We can check the port using a TCP probe, or we can use 
the gRPC health check (included in the Tika gRPC functionality).
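
As a rough illustration of what a TCP check amounts to, the sketch below reports success when a connection to the port can be opened and failure otherwise. It is not the chart's implementation. The function name tcp_probe and its parameters are made up for this example.

```python
import socket


def tcp_probe(host: str, port: int, timeout_seconds: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        # create_connection performs the TCP handshake; no application
        # data is sent, so this works for a gRPC port as well as HTTP.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, and similar errors.
        return False
```

A check like this only shows that something is listening on the port. The gRPC health check mentioned above can additionally report whether the service itself considers itself healthy.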

*Definition of done*

The Tika gRPC functionality can be deployed by the tika-helm chart.





[jira] [Comment Edited] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-24 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859757#comment-17859757
 ] 

Nicholas DiPiazza edited comment on TIKA-4251 at 6/24/24 6:35 PM:
--

We could keep everything as it is, but:
 * provide instructions on how to run the code formatter on the entire repo with 
google checkstyle.
 * run it on the entire codebase and commit the now-fully-formatted repo
 * advise everyone to turn on automatic code formatting in IntelliJ/Eclipse so 
that code is formatted automatically.

That way the plugin doesn't control us so much, but we still have an easy way to 
stay fully formatted, so we stop getting the back-and-forth with Maven and CI 
when we forget to format something.

 


was (Author: ndipiazza):
We could keep everything as it is, but:
 * provide instructions on how to run the code formatter manually
 * run it on the entire codebase and commit the now-fully-formatted repo
 * advise everyone to turn on automatic code formatting in IntelliJ/Eclipse so 
that code is formatted automatically.

That way the plugin doesn't control us so much, but we still have an easy way to 
stay fully formatted, so we stop getting the back-and-forth with Maven and CI 
when we forget to format something.

 

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?
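
To make the wildcard-import gap mentioned in the description concrete, here is a minimal sketch of a standalone check that flags such imports in Java source text. It stands in for a real lint rule (Checkstyle provides one called AvoidStarImport) and is not part of either plugin. The function name find_wildcard_imports is invented for this example, and the regular expression does not account for imports inside comments.

```python
import re
from typing import List

# Matches lines such as "import java.util.*;" and
# "import static org.junit.Assert.*;" at the start of a line.
WILDCARD_IMPORT = re.compile(
    r"^\s*import\s+(?:static\s+)?[\w.]+\.\*\s*;", re.MULTILINE
)


def find_wildcard_imports(java_source: str) -> List[str]:
    """Return every wildcard import statement found in the given source."""
    return [m.group(0).strip() for m in WILDCARD_IMPORT.finditer(java_source)]
```

A build could fail when this list is non-empty, which would give the "one rule" behaviour described above while leaving all other formatting to the formatter.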





[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-24 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859757#comment-17859757
 ] 

Nicholas DiPiazza commented on TIKA-4251:
-

We could keep everything as it is, but:
 * provide instructions on how to run the code formatter manually
 * run it on the entire codebase and commit the now-fully-formatted repo
 * advise everyone to turn on automatic code formatting in IntelliJ/Eclipse so 
that code is formatted automatically.

That way the plugin doesn't control us so much, but we still have an easy way to 
stay fully formatted, so we stop getting the back-and-forth with Maven and CI 
when we forget to format something.

 



[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859739#comment-17859739
 ] 

Tim Allison commented on TIKA-4251:
---

Y. I agree. When I started with checkstyle, it modified nearly every file. Any 
recs for mitigating this?



Re: Automatically applying checkstyle fixes

2024-06-24 Thread Nicholas DiPiazza
I agree! This will save me so much time

On Mon, Jun 24, 2024, 10:24 AM Tim Allison wrote:

> Unless anyone has objections, I'm +1 to go forth and put it into tika's
> main branch via TIKA-4251.
>
> Can we use google's style and forbid wildcard imports?
>
> ref: https://docs.openrewrite.org/recipes/java/format/autoformat ?
>
> On Sat, Jun 22, 2024 at 11:37 AM Nicholas DiPiazza <
> nicholas.dipia...@gmail.com> wrote:
>
> > I just started using it for a big project and it is awesome
> >
> > On Sat, Jun 22, 2024, 6:11 AM Tim Allison wrote:
> >
> > > https://issues.apache.org/jira/browse/TIKA-4251
> > >
> > > Anything that works and doesn't allow wildcard imports I'm good with.
> > > Have you had luck with OpenRewrite?
> > >
> > > On Wed, Jun 19, 2024 at 12:55 PM Nicholas DiPiazza <
> > > nicholas.dipia...@gmail.com> wrote:
> > >
> > > > Hey Tim and Team:
> > > >
> > > > I remember someone stating at some point they were in the works of
> > > > putting in maven plugin to automatically correct checkstyle issues as
> > > > they come up.
> > > >
> > > > I want it. I want it. I want it.
> > > >
> > > > Are you using OpenRewrite maven plugin to do it?
> > > >
> > > > -Nicholas


[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-24 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859718#comment-17859718
 ] 

Tilman Hausherr commented on TIKA-4251:
---

I'm wondering if this means lots of changes to check at the beginning. This is 
the kind of plugin that would be ideal for a supply chain attack.



Re: Automatically applying checkstyle fixes

2024-06-24 Thread Tim Allison
Unless anyone has objections, I'm +1 to go forth and put it into tika's
main branch via TIKA-4251.

Can we use google's style and forbid wildcard imports?

ref: https://docs.openrewrite.org/recipes/java/format/autoformat ?

On Sat, Jun 22, 2024 at 11:37 AM Nicholas DiPiazza <
nicholas.dipia...@gmail.com> wrote:

> I just started using it for a big project and it is awesome
>
> On Sat, Jun 22, 2024, 6:11 AM Tim Allison wrote:
>
> > https://issues.apache.org/jira/browse/TIKA-4251
> >
> > Anything that works and doesn't allow wildcard imports I'm good with.
> > Have you had luck with OpenRewrite?
> >
> > On Wed, Jun 19, 2024 at 12:55 PM Nicholas DiPiazza <
> > nicholas.dipia...@gmail.com> wrote:
> >
> > > Hey Tim and Team:
> > >
> > > I remember someone stating at some point they were in the works of
> > > putting in maven plugin to automatically correct checkstyle issues as
> > > they come up.
> > >
> > > I want it. I want it. I want it.
> > >
> > > Are you using OpenRewrite maven plugin to do it?
> > >
> > > -Nicholas