[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845207#comment-17845207
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

nddipiazza merged PR #1753:
URL: https://github.com/apache/tika/pull/1753




> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
>         Key: TIKA-4252
>         URL: https://issues.apache.org/jira/browse/TIKA-4252
>     Project: Tika
>  Issue Type: Bug
>    Reporter: Nicholas DiPiazza
>    Priority: Major
>     Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(
>         new FetchEmitTuple(request.getFetchKey(),
>                 new FetchKey(fetcher.getName(), request.getFetchKey()),
>                 new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>                 FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos =
>                     UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> I verified that the bytes contain the expected metadata at that point.
>  
> UPDATE: found the issue.
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple is using a new Metadata when it
> should use an empty Metadata only if the fetch tuple's metadata is null.
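
The behaviour described in the issue can be reproduced in isolation. What follows is a minimal sketch, not the actual Tika code: `Tuple` is a hypothetical stand-in for FetchEmitTuple, a `HashMap<String, String>` stands in for Tika's Metadata, and `metadataFor` is a made-up helper that plays the role of the relevant step in parseFromTuple. It round-trips the tuple through Java serialization, as the client side does, and then falls back to an empty map only when the tuple carries no metadata.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;

// Hypothetical stand-in for FetchEmitTuple; a HashMap replaces Tika's Metadata.
class Tuple implements Serializable {
    private static final long serialVersionUID = 1L;
    final String fetchKey;
    final HashMap<String, String> metadata; // may be null

    Tuple(String fetchKey, HashMap<String, String> metadata) {
        this.fetchKey = fetchKey;
        this.metadata = metadata;
    }
}

class MetadataFallbackSketch {

    // Serialize the tuple as the client side does (framing bytes omitted).
    static byte[] serialize(Tuple t) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(t);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Read the tuple back, as the server side would.
    static Tuple deserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Tuple) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // The reported bug amounts to always returning a fresh, empty map here.
    // The fix: reuse the tuple's metadata and create an empty one only for null.
    static HashMap<String, String> metadataFor(Tuple t) {
        return t.metadata != null ? t.metadata : new HashMap<>();
    }

    public static void main(String[] args) {
        HashMap<String, String> md = new HashMap<>();
        md.put("token", "abc123");
        Tuple received = deserialize(serialize(new Tuple("doc-1", md)));
        System.out.println(metadataFor(received));                 // metadata survives the round trip
        System.out.println(metadataFor(new Tuple("doc-2", null))); // empty fallback for null
    }
}
```

The round trip mirrors the reporter's check that the serialized bytes still carry the metadata; the loss happens only afterwards, when the receiving side decides which metadata object to hand to the fetcher.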



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] TIKA-4252: add request metadata [tika]

2024-05-09 Thread via GitHub


nddipiazza merged PR #1753:
URL: https://github.com/apache/tika/pull/1753


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845204#comment-17845204
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

nddipiazza opened a new pull request, #1753:
URL: https://github.com/apache/tika/pull/1753

   add request metadata






[PR] TIKA-4252: add request metadata [tika]

2024-05-09 Thread via GitHub


nddipiazza opened a new pull request, #1753:
URL: https://github.com/apache/tika/pull/1753

   add request metadata





[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845083#comment-17845083
 ] 

Nicholas DiPiazza commented on TIKA-4252:
-

even better



[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845081#comment-17845081
 ] 

Tim Allison commented on TIKA-4252:
---

fetchRequestMetadata, fetchResponseMetadata?



[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845080#comment-17845080
 ] 

Nicholas DiPiazza commented on TIKA-4252:
-

Maybe

 

fetchInputMetadata

outputMetadata



[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845072#comment-17845072
 ] 

Tim Allison edited comment on TIKA-4252 at 5/9/24 5:14 PM:
---

fetcher.fetch(String key, Metadata writeMetadata, Metadata readMetadata) ?

where writeMetadata is what you want to send to the fetcher and readMetadata is 
the metadata as it currently is, e.g. metadata gathered from the fetcher and 
propagated through to the results?

Better names?

toMetadata, fromMetadata?


was (Author: talli...@mitre.org):
fetcher.fetch(String key, Metadata writeMetadata, Metadata readMetadata) ?

where writeMetadata is what you want to send to the fetcher and readMetadata is 
the metadata as it currently is, e.g. metadata gathered from the fetcher and 
propagated through to the results?

Better names?
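
The two-metadata signature proposed above can be sketched as follows. This is an illustration under assumptions, not the Tika Fetcher API: `SketchFetcher` and `EchoFetcher` are hypothetical, `Map<String, String>` stands in for Tika's Metadata, and the metadata keys are invented for the example. The point is that input sent to the fetcher and output gathered from it live in separate objects, so the input never ends up in the results.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface; Map<String, String> stands in for Tika's Metadata.
interface SketchFetcher {
    // writeMetadata: input sent to the fetcher (never copied into the results).
    // readMetadata: output the fetcher fills in, which flows on to the results.
    InputStream fetch(String key, Map<String, String> writeMetadata,
                      Map<String, String> readMetadata);
}

// Toy implementation that returns the key itself as the document body.
class EchoFetcher implements SketchFetcher {
    @Override
    public InputStream fetch(String key, Map<String, String> writeMetadata,
                             Map<String, String> readMetadata) {
        byte[] body = key.getBytes(StandardCharsets.UTF_8);
        // Record only what was learned from the fetch, not what was sent in.
        readMetadata.put("fetched-length", Integer.toString(body.length));
        readMetadata.put("authenticated",
                Boolean.toString(writeMetadata.containsKey("token")));
        return new ByteArrayInputStream(body);
    }
}

class TwoMetadataSketch {
    public static void main(String[] args) {
        Map<String, String> writeMetadata = Collections.singletonMap("token", "abc123");
        Map<String, String> readMetadata = new HashMap<>();
        new EchoFetcher().fetch("doc-1", writeMetadata, readMetadata);
        // The token influenced the fetch but does not appear in the output metadata.
        System.out.println(readMetadata);
    }
}
```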



[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845072#comment-17845072
 ] 

Tim Allison commented on TIKA-4252:
---

fetcher.fetch(String key, Metadata writeMetadata, Metadata readMetadata) ?

where writeMetadata is what you want to send to the fetcher and readMetadata is 
the metadata as it currently is, e.g. metadata gathered from the fetcher and 
propagated through to the results?

Better names?



[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845071#comment-17845071
 ] 

Nicholas DiPiazza commented on TIKA-4252:
-

sure I can do that.



[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845071#comment-17845071
 ] 

Nicholas DiPiazza edited comment on TIKA-4252 at 5/9/24 5:08 PM:
-

Sure, I can do that. If you have a moment, please go ahead; otherwise I will 
get to it later next week.


was (Author: ndipiazza):
sure I can do that.



[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845068#comment-17845068
 ] 

Tim Allison commented on TIKA-4252:
---

Should we add an optional Metadata object to the FetchKey? We could have this 
propagate through to the fetcher but never be confused with provenance data or 
extracted content.
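
One way to picture this suggestion is a fetch key that carries its own optional metadata. The class below is a hypothetical sketch, not the actual Tika FetchKey: the name `SketchFetchKey`, the `requestMetadata` field, and the use of a `Map<String, String>` in place of Tika's Metadata are all assumptions made for illustration.

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical variant of a fetch key; not the actual Tika FetchKey class.
class SketchFetchKey implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String fetcherName;
    private final String fetchKey;
    private final HashMap<String, String> requestMetadata; // optional, may be null

    SketchFetchKey(String fetcherName, String fetchKey) {
        this(fetcherName, fetchKey, null);
    }

    SketchFetchKey(String fetcherName, String fetchKey,
                   Map<String, String> requestMetadata) {
        this.fetcherName = fetcherName;
        this.fetchKey = fetchKey;
        // Defensive copy so later changes by the caller do not leak into the key.
        this.requestMetadata =
                requestMetadata == null ? null : new HashMap<>(requestMetadata);
    }

    String getFetcherName() {
        return fetcherName;
    }

    String getFetchKey() {
        return fetchKey;
    }

    // Callers always get a read-only view; an absent value reads as empty.
    Map<String, String> getRequestMetadata() {
        if (requestMetadata == null) {
            return Collections.<String, String>emptyMap();
        }
        return Collections.unmodifiableMap(requestMetadata);
    }
}
```

Because the metadata travels inside the key rather than in the parse metadata, it reaches the fetcher without being mixed into provenance data or extracted content.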



[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845062#comment-17845062
 ] 

Tim Allison commented on TIKA-4252:
---

K, but you don't want that coming back and being populated in the results, 
right?



[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845061#comment-17845061
 ] 

Nicholas DiPiazza edited comment on TIKA-4252 at 5/9/24 4:50 PM:
-

What I need is to be able to send "Fetch Metadata" such as a bearer token to a 
single request

per-fetch-request variable.


was (Author: ndipiazza):
What I need is to be able to send "Fetch Metadata" such as a bearer token to a 
single request 

per-fetch-request varaible



[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845061#comment-17845061
 ] 

Nicholas DiPiazza edited comment on TIKA-4252 at 5/9/24 4:50 PM:
-

What I need is to be able to send "Fetch Metadata", such as a bearer token, with a 
single fetch() request, i.e. a per-fetch-request variable.


was (Author: ndipiazza):
What I need is to be able to send "Fetch Metadata" such as a bearer token to a 
single request

per-fetch-request variable.
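
The per-request bearer token use case might look like the sketch below. It assumes an HTTP-based fetcher built on the JDK 11 `java.net.http` client; the class `BearerTokenSketch`, the `accessToken` metadata key, and the use of a `Map<String, String>` in place of Tika's Metadata are hypothetical and not taken from the Tika code base.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Map;

class BearerTokenSketch {
    // "accessToken" is a made-up metadata key for this sketch.
    static final String TOKEN_KEY = "accessToken";

    // Build a GET request, adding an Authorization header only when this
    // particular fetch request carries a token.
    static HttpRequest buildRequest(String url, Map<String, String> fetchRequestMetadata) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        String token = fetchRequestMetadata.get(TOKEN_KEY);
        if (token != null) {
            builder.header("Authorization", "Bearer " + token);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("https://example.com/doc-1",
                Map.of(TOKEN_KEY, "abc123"));
        // Inspect the header locally; no network call is made.
        System.out.println(request.headers().firstValue("Authorization").orElse("(none)"));
    }
}
```

Since the token arrives with each request rather than through fetcher configuration, two consecutive fetches through the same fetcher can use different credentials.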



[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845061#comment-17845061
 ] 

Nicholas DiPiazza commented on TIKA-4252:
-

What I need is to be able to send "Fetch Metadata" such as a bearer token to a 
single request 

per-fetch-request varaible



[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845058#comment-17845058
 ] 

Hudson commented on TIKA-4252:
--

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk11 #1624
(See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1624/])
TIKA-4252: fix metadata issue (#1752)
(github: [https://github.com/apache/tika/commit/2f8dbdfbdf5c52160ecfc663dfb981fea527c72e])
* (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java




[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845057#comment-17845057
 ] 

Hudson commented on TIKA-4250:
--

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk11 #1624
(See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1624/])
TIKA-4250 -- add optional parser for pst files -- wrapper for libpst/readpst (#1751)
(github: [https://github.com/apache/tika/commit/32baf2345abe1a04d767ea6641a567d5c924587e])
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/libpst/LibPstParserConfig.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/libpst/LibPstParser.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/pom.xml
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/org/apache/tika/parser/microsoft/libpst/tika-libpst-eml-config.xml
* (edit) CHANGES.txt
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/libpst/TestLibPstParser.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/org/apache/tika/parser/microsoft/libpst/tika-libpst-config.xml
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/libpst/EmailVisitor.java


> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obviously, PST doesn't store eml or msg, but some users want the 
> "original" emails even if they are constructed from PST records.





[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845051#comment-17845051
 ] 

Tim Allison commented on TIKA-4252:
---

Or, if you mean that metadata gathered from the fetcher isn't making it through 
into the results, I just added a few tests for that.

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> When calling:
> PipesResult pipesResult = pipesClient.process(
>         new FetchEmitTuple(request.getFetchKey(),
>                 new FetchKey(fetcher.getName(), request.getFetchKey()),
>                 new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>                 FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
>                     UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> I verified that the bytes contain the expected metadata at that point.
>  
> UPDATE: found the issue.
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only fall back to an empty Metadata if 
> the fetch tuple's metadata is null.
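
The snippet quoted in the issue writes a marker byte, a 4-byte length, and then the serialized object. As a rough illustration of how one might check that such a frame still carries the metadata, here is a self-contained round trip. The class name, the CALL value of 1, and the use of a HashMap as payload are all assumptions for the sketch; it uses only java.io and does not touch Tika or commons-io.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;

public class FramedObjectRoundTrip {

    // Hypothetical marker value; the real CALL constant lives in Tika and is not shown in the thread.
    static final byte CALL = 1;

    // Serializes the payload and wraps it as: marker byte, int length, payload bytes.
    static byte[] writeFrame(Serializable payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(payload);
        }
        byte[] bytes = bos.toByteArray();

        ByteArrayOutputStream frame = new ByteArrayOutputStream();
        DataOutputStream output = new DataOutputStream(frame);
        output.write(CALL);
        output.writeInt(bytes.length);
        output.write(bytes);
        output.flush();
        return frame.toByteArray();
    }

    // Reads the frame back in the same order and deserializes the payload.
    static Object readFrame(byte[] frame) throws IOException, ClassNotFoundException {
        DataInputStream input = new DataInputStream(new ByteArrayInputStream(frame));
        int marker = input.read();
        if (marker != CALL) {
            throw new IOException("unexpected marker: " + marker);
        }
        int length = input.readInt();
        byte[] bytes = new byte[length];
        input.readFully(bytes);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, String> metadata = new HashMap<>();
        metadata.put("tenant", "example");
        Object copy = readFrame(writeFrame(metadata));
        // Compares the deserialized copy with the original payload.
        System.out.println(metadata.equals(copy));
    }
}
```

A check like this only confirms that the client side of the wire is intact, which matches the reporter's observation that the loss happens later, on the server side.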





[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845048#comment-17845048
 ] 

Tim Allison commented on TIKA-4252:
---

My initial thought for injecting user metadata was to pass through provenance 
information etc into the final document/output.

I wanted to make sure that metadata extracted during the parse didn't overwrite 
user injected data so... I injected the user metadata _after_ the parse and 
after the metadata filters were applied.
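
To illustrate the ordering described above, here is a small sketch in which user-injected values are applied last and therefore win over values extracted by the parse. It assumes single-valued string maps in place of Tika's multi-valued Metadata class, and the helper name overlay and the sample keys are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UserMetadataOverlay {

    // Hypothetical helper (not a Tika API): copies the parsed values first,
    // then applies the user-injected values so they win on key conflicts.
    static Map<String, String> overlay(Map<String, String> parsed, Map<String, String> user) {
        Map<String, String> merged = new LinkedHashMap<>(parsed);
        merged.putAll(user); // applied last, so the parse cannot overwrite user values
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("title", "From the document");
        parsed.put("Content-Type", "application/pdf");

        Map<String, String> user = new LinkedHashMap<>();
        user.put("title", "Set by the caller");
        user.put("source-system", "crawler-7");

        System.out.println(overlay(parsed, user));
    }
}
```

With this ordering the user metadata reaches the emitted output, but it is never visible to the fetcher, which is the gap the issue is about.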

[~ndipiazza], to confirm, you want to inject user metadata so that it is 
available for the fetchers?



[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845047#comment-17845047
 ] 

Tim Allison commented on TIKA-4252:
---

I opened this branch: https://github.com/apache/tika/tree/TIKA-4252

This reverts the change I suggested above and adds a unit test to confirm 
behavior that I incorrectly thought was reported as broken.

Now that I actually read this issue more carefully -- sorry -- it looks like 
the issue is that you want to pass user-injected metadata through to the 
fetcher. 

The problem is _NOT_ that you are not getting user-injected metadata back 
through the results.



[jira] [Reopened] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-4252:
---

I pointed you to the wrong part of the code ... sorry. The design goal was to 
overwrite the extracted metadata with user metadata after the parse and before 
the emit.

This is what's leading to the new failing unit test in tika-server's 
testConcatenated();

I'm taking a look now.



Re: [PR] TIKA-4232 Create and execute unit tests for tika-helm [tika-helm]

2024-05-09 Thread via GitHub


lewismc commented on PR #17:
URL: https://github.com/apache/tika-helm/pull/17#issuecomment-2102889158

   PR updated to address prior blocker related to use of unapproved GitHub 
Actions. Waiting on https://issues.apache.org/jira/browse/INFRA-25775


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (TIKA-4232) Create and execute unit tests for tika-helm

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845031#comment-17845031
 ] 

ASF GitHub Bot commented on TIKA-4232:
--

lewismc commented on PR #17:
URL: https://github.com/apache/tika-helm/pull/17#issuecomment-2102889158

   PR updated to address prior blocker related to use of unapproved GitHub 
Actions. Waiting on https://issues.apache.org/jira/browse/INFRA-25775




> Create and execute unit tests for tika-helm
> ---
>
> Key: TIKA-4232
> URL: https://issues.apache.org/jira/browse/TIKA-4232
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> The goal is to execute chart unit tests against each tika-helm pull request.
> I found the Helm Unit Tests GitHub Action 
> (https://github.com/marketplace/actions/helm-unit-tests), 
> which uses https://github.com/helm-unittest/helm-unittest as a Helm plugin.
> The PR will consist of one or more unit tests automated via the GitHub action.





tesseract error failing build?

2024-05-09 Thread Nicholas DiPiazza

36.74 E: The repository 'https://ppa.launchpadcontent.net/alex-p/tesseract-ocr5/ubuntu noble Release' does not have a Release file.

Has anyone seen this error before?

-nicholas


[jira] [Commented] (TIKA-4253) Duplicate parsers loaded in AutoDetectParser in 3.x at least in some unit tests

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845022#comment-17845022
 ] 

Tim Allison commented on TIKA-4253:
---

This is happening in the unit tests because there are multiple service loading 
files on the classpath in tika-parsers-standard from the different modules.

We could change the list to a set in 
ServiceLoader#identifyStaticServiceProviders.
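
As a rough sketch of the list-to-set idea, the following removes repeated provider class names while keeping the order in which they were first seen. The helper name dedupe and the example class names are hypothetical, and this is not the actual ServiceLoader code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class ProviderDedup {

    // Hypothetical helper (not a Tika API): drops repeated class names
    // while preserving first-seen order, which a LinkedHashSet guarantees.
    static List<String> dedupe(List<String> providerClassNames) {
        return new ArrayList<>(new LinkedHashSet<>(providerClassNames));
    }

    public static void main(String[] args) {
        // The same provider listed twice, as if two service files on the classpath named it.
        List<String> names = Arrays.asList(
                "org.example.ParserA",
                "org.example.ParserB",
                "org.example.ParserA");
        System.out.println(dedupe(names));
    }
}
```

A LinkedHashSet is used here instead of a HashSet on the assumption that load order may matter to callers; if it does not, any Set would do.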

> Duplicate parsers loaded in AutoDetectParser in 3.x at least in some unit 
> tests
> ---
>
> Key: TIKA-4253
> URL: https://issues.apache.org/jira/browse/TIKA-4253
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I haven't checked 2.x yet, but it looks like the AutoDetectParser with and 
> without a custom TikaConfig is loading parsers twice at least in 
> tika-parsers-standard unit tests.
> We should figure out if this is happening elsewhere in tika-app and 
> tika-server and fix it where we find it. 





[jira] [Created] (TIKA-4253) Duplicate parsers loaded in AutoDetectParser in 3.x at least in some unit tests

2024-05-09 Thread Tim Allison (Jira)
Tim Allison created TIKA-4253:
-

 Summary: Duplicate parsers loaded in AutoDetectParser in 3.x at 
least in some unit tests
 Key: TIKA-4253
 URL: https://issues.apache.org/jira/browse/TIKA-4253
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


I haven't checked 2.x yet, but it looks like the AutoDetectParser with and 
without a custom TikaConfig is loading parsers twice at least in 
tika-parsers-standard unit tests.

We should figure out if this is happening elsewhere in tika-app and tika-server 
and fix it where we find it. 





[jira] [Closed] (TIKA-4233) Check tika-helm for deprecated k8s APIs

2024-05-09 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed TIKA-4233.
--

> Check tika-helm for deprecated k8s APIs
> ---
>
> Key: TIKA-4233
> URL: https://issues.apache.org/jira/browse/TIKA-4233
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.9.3
>
>
> It is useful to know when a Helm Chart uses deprecated k8s APIs. A check for 
> this would be ideal. The “Check deprecated k8s APIs” GitHub action 
> accomplishes this.
> [https://github.com/marketplace/actions/check-deprecated-k8s-apis]





[jira] [Resolved] (TIKA-4233) Check tika-helm for deprecated k8s APIs

2024-05-09 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved TIKA-4233.

Resolution: Fixed

This PR broke one of the GitHub Action workflows. I have written to INFRA about it:

https://issues.apache.org/jira/browse/INFRA-25775



[jira] [Updated] (TIKA-4233) Check tika-helm for deprecated k8s APIs

2024-05-09 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-4233:
---
Fix Version/s: 2.9.3



[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845012#comment-17845012
 ] 

ASF GitHub Bot commented on TIKA-4250:
--

tballison merged PR #1751:
URL: https://github.com/apache/tika/pull/1751






Re: [PR] TIKA-4250 -- add optional parser for pst files -- wrapper for libpst/… [tika]

2024-05-09 Thread via GitHub


tballison merged PR #1751:
URL: https://github.com/apache/tika/pull/1751





[jira] [Closed] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza closed TIKA-4252.
---
Fix Version/s: 3.0.0
   Resolution: Fixed



[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845010#comment-17845010
 ] 

Nicholas DiPiazza commented on TIKA-4252:
-

done



[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845005#comment-17845005
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

nddipiazza opened a new pull request, #1752:
URL: https://github.com/apache/tika/pull/1752

   * metadata was not getting sent to the fetch process






[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845006#comment-17845006
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

nddipiazza merged PR #1752:
URL: https://github.com/apache/tika/pull/1752






Re: [PR] TIKA-4252: fix metadata issue [tika]

2024-05-09 Thread via GitHub


nddipiazza merged PR #1752:
URL: https://github.com/apache/tika/pull/1752





[PR] TIKA-4252: fix metadata issue [tika]

2024-05-09 Thread via GitHub


nddipiazza opened a new pull request, #1752:
URL: https://github.com/apache/tika/pull/1752

   * metadata was not getting sent to the fetch process





[jira] [Commented] (TIKA-4233) Check tika-helm for deprecated k8s APIs

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845003#comment-17845003
 ] 

ASF GitHub Bot commented on TIKA-4233:
--

lewismc merged PR #18:
URL: https://github.com/apache/tika-helm/pull/18






Re: [PR] TIKA-4233 Check tika-helm for deprecated k8s APIs [tika-helm]

2024-05-09 Thread via GitHub


lewismc merged PR #18:
URL: https://github.com/apache/tika-helm/pull/18





[jira] [Updated] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4252:

Description: 
When calling:

PipesResult pipesResult = pipesClient.process(
        new FetchEmitTuple(request.getFetchKey(),
                new FetchKey(fetcher.getName(), request.getFetchKey()),
                new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
                FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));

the tikaMetadata is not present in the fetch data when the fetch method is 
called.

It's OK through this part: 
            UnsynchronizedByteArrayOutputStream bos = 
                    UnsynchronizedByteArrayOutputStream.builder().get();
            try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
                objectOutputStream.writeObject(t);
            }

            byte[] bytes = bos.toByteArray();
            output.write(CALL.getByte());
            output.writeInt(bytes.length);
            output.write(bytes);
            output.flush();

I verified that the bytes contain the expected metadata at that point.

UPDATE: found the issue.

org.apache.tika.pipes.PipesServer#parseFromTuple

is using a new Metadata when it should only fall back to an empty Metadata if 
the fetch tuple's metadata is null.

  was:
When calling:

PipesResult pipesResult = pipesClient.process(
        new FetchEmitTuple(request.getFetchKey(),
                new FetchKey(fetcher.getName(), request.getFetchKey()),
                new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
                FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));

the tikaMetadata is not present in the fetch data when the fetch method is 
called.

It's OK through this part: 
            UnsynchronizedByteArrayOutputStream bos = 
                    UnsynchronizedByteArrayOutputStream.builder().get();
            try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
                objectOutputStream.writeObject(t);
            }

            byte[] bytes = bos.toByteArray();
            output.write(CALL.getByte());
            output.writeInt(bytes.length);
            output.write(bytes);
            output.flush();

I verified that the bytes contain the expected metadata at that point.



[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844998#comment-17844998
 ] 

Tim Allison commented on TIKA-4252:
---

Good catch: 
https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java#L465

Shall I fix it or are you in progress?

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
> Issue Type: Bug
> Reporter: Nicholas DiPiazza
> Priority: Major
>
> when calling:
> PipesResult pipesResult = pipesClient.process(
>         new FetchEmitTuple(request.getFetchKey(),
>                 new FetchKey(fetcher.getName(), request.getFetchKey()),
>                 new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>                 FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos =
>                     UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> I verified that the serialized bytes contain the expected metadata at that point.





[jira] [Updated] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4252:

Description: 
when calling:

PipesResult pipesResult = pipesClient.process(
        new FetchEmitTuple(request.getFetchKey(),
                new FetchKey(fetcher.getName(), request.getFetchKey()),
                new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
                FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));

the tikaMetadata is not present in the fetch data when the fetch method is 
called.

 

It's OK through this part: 
            UnsynchronizedByteArrayOutputStream bos =
                    UnsynchronizedByteArrayOutputStream.builder().get();
            try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
                objectOutputStream.writeObject(t);
            }

            byte[] bytes = bos.toByteArray();
            output.write(CALL.getByte());
            output.writeInt(bytes.length);
            output.write(bytes);
            output.flush();

 

I verified that the serialized bytes contain the expected metadata at that point.

  was:
when calling:

PipesResult pipesResult = pipesClient.process(
        new FetchEmitTuple(request.getFetchKey(),
                new FetchKey(fetcher.getName(), request.getFetchKey()),
                new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
                FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));

the tikaMetadata is not present in the fetch data when the fetch method is 
called.


> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
> Issue Type: Bug
> Reporter: Nicholas DiPiazza
> Priority: Major
>
> when calling:
> PipesResult pipesResult = pipesClient.process(
>         new FetchEmitTuple(request.getFetchKey(),
>                 new FetchKey(fetcher.getName(), request.getFetchKey()),
>                 new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>                 FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos =
>                     UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>                 objectOutputStream.writeObject(t);
>             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> I verified that the serialized bytes contain the expected metadata at that point.





[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844997#comment-17844997
 ] 

ASF GitHub Bot commented on TIKA-4250:
--

tballison opened a new pull request, #1751:
URL: https://github.com/apache/tika/pull/1751

   …readpst
   
   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Tika in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obv, PST doesn't store eml or msg, but some users want the 
> "original" emails even if they are constructed from PST records.





[PR] TIKA-4250 -- add optional parser for pst files -- wrapper for libpst/… [tika]

2024-05-09 Thread via GitHub


tballison opened a new pull request, #1751:
URL: https://github.com/apache/tika/pull/1751

   …readpst
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Nicholas DiPiazza (Jira)
Nicholas DiPiazza created TIKA-4252:
---

 Summary: PipesClient#process - seems to lose the Fetch input metadata?
 Key: TIKA-4252
 URL: https://issues.apache.org/jira/browse/TIKA-4252
 Project: Tika
 Issue Type: Bug
 Reporter: Nicholas DiPiazza


when calling:

PipesResult pipesResult = pipesClient.process(
        new FetchEmitTuple(request.getFetchKey(),
                new FetchKey(fetcher.getName(), request.getFetchKey()),
                new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
                FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));

the tikaMetadata is not present in the fetch data when the fetch method is 
called.





[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844976#comment-17844976
 ] 

Tim Allison edited comment on TIKA-4250 at 5/9/24 12:59 PM:


libpst issue opened: https://github.com/pst-format/libpst/issues/14



was (Author: talli...@mitre.org):
libpff issue opened: https://github.com/libyal/libpff/issues/128

Note that I found non-deterministic behavior even without debug on -- sometimes 
I got 7 extracted files, sometimes 8. I noted that in the issue. 

> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obv, PST doesn't store eml or msg, but some users want the 
> "original" emails even if they are constructed from PST records.





[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844976#comment-17844976
 ] 

Tim Allison commented on TIKA-4250:
---

libpff issue opened: https://github.com/libyal/libpff/issues/128

Note that I found non-deterministic behavior even without debug on -- sometimes 
I got 7 extracted files, sometimes 8. I noted that in the issue. 

> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obv, PST doesn't store eml or msg, but some users want the 
> "original" emails even if they are constructed from PST records.


