[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845207#comment-17845207 ]

ASF GitHub Bot commented on TIKA-4252:
--------------------------------------

nddipiazza merged PR #1753:
URL: https://github.com/apache/tika/pull/1753

> PipesClient#process - seems to lose the Fetch input metadata?
> -------------------------------------------------------------
>
>                 Key: TIKA-4252
>                 URL: https://issues.apache.org/jira/browse/TIKA-4252
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>             Fix For: 3.0.0
>
>
> When calling:
>     PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(),
>             new FetchKey(fetcher.getName(), request.getFetchKey()),
>             new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>             FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is
> called.
>
> It's OK through this part:
>     UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get();
>     try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>         objectOutputStream.writeObject(t);
>     }
>     byte[] bytes = bos.toByteArray();
>     output.write(CALL.getByte());
>     output.writeInt(bytes.length);
>     output.write(bytes);
>     output.flush();
>
> I verified that the bytes have the expected metadata up to that point.
>
> UPDATE: found issue
>
> org.apache.tika.pipes.PipesServer#parseFromTuple
>
> is using a new Metadata when it should only use empty metadata if the fetch
> tuple metadata is null.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
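The UPDATE in the description above can be illustrated with a minimal sketch. This is not the Tika code and not the change made in the PR: `MetadataFallbackSketch`, its nested `Tuple` class and both method names are hypothetical, and a plain `Map<String, String>` stands in for Tika's `Metadata`. It only contrasts "always start from empty metadata" with "fall back to empty metadata only when the tuple carries none".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the server-side handling described in the issue.
// None of these names are real Tika classes or methods.
final class MetadataFallbackSketch {

    // Simplified tuple: a fetch key plus optional caller-supplied metadata.
    static final class Tuple {
        private final String fetchKey;
        private final Map<String, String> metadata; // may be null

        Tuple(String fetchKey, Map<String, String> metadata) {
            this.fetchKey = fetchKey;
            this.metadata = metadata;
        }

        String getFetchKey() { return fetchKey; }
        Map<String, String> getMetadata() { return metadata; }
    }

    // Behaviour reported as the bug: the caller's metadata is ignored.
    static Map<String, String> metadataForFetchBuggy(Tuple t) {
        return new LinkedHashMap<>();
    }

    // Behaviour the report asks for: use the tuple's metadata and create an
    // empty map only when the tuple has none.
    static Map<String, String> metadataForFetch(Tuple t) {
        Map<String, String> m = t.getMetadata();
        return m != null ? m : new LinkedHashMap<>();
    }

    public static void main(String[] args) {
        Map<String, String> userMetadata = new LinkedHashMap<>();
        userMetadata.put("my-key", "my-value");
        Tuple withMetadata = new Tuple("doc-1", userMetadata);
        Tuple withoutMetadata = new Tuple("doc-2", null);

        System.out.println(metadataForFetchBuggy(withMetadata)); // {}
        System.out.println(metadataForFetch(withMetadata));      // {my-key=my-value}
        System.out.println(metadataForFetch(withoutMetadata));   // {}
    }
}
```

Returning the caller's map directly keeps the sketch short; a real implementation might prefer a defensive copy.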
Re: [PR] TIKA-4252: add request metadata [tika]
nddipiazza merged PR #1753: URL: https://github.com/apache/tika/pull/1753 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845204#comment-17845204 ] ASF GitHub Bot commented on TIKA-4252: -- nddipiazza opened a new pull request, #1753: URL: https://github.com/apache/tika/pull/1753 add request metadata
[PR] TIKA-4252: add request metadata [tika]
nddipiazza opened a new pull request, #1753: URL: https://github.com/apache/tika/pull/1753 add request metadata
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845083#comment-17845083 ] Nicholas DiPiazza commented on TIKA-4252: - even better
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845081#comment-17845081 ] Tim Allison commented on TIKA-4252: --- fetchRequestMetadata, fetchResponseMetadata?
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845080#comment-17845080 ] Nicholas DiPiazza commented on TIKA-4252: - Maybe fetchInputMetadata outputMetadata
[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845072#comment-17845072 ] Tim Allison edited comment on TIKA-4252 at 5/9/24 5:14 PM: --- fetcher.fetch(String key, Metadata writeMetadata, Metadata readMetadata) ? where writeMetadata is what you want to send to the fetcher and readMetadata is the metadata as it currently is, e.g. metadata gathered from the fetcher and propagated through to the results? Better names? toMetadata, fromMetadata? was (Author: talli...@mitre.org): fetcher.fetch(String key, Metadata writeMetadata, Metadata readMetadata) ? where writeMetadata is what you want to send to the fetcher and readMetadata is the metadata as it currently is, e.g. metadata gathered from the fetcher and propagated through to the results? Better names?
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845072#comment-17845072 ] Tim Allison commented on TIKA-4252: --- fetcher.fetch(String key, Metadata writeMetadata, Metadata readMetadata) ? where writeMetadata is what you want to send to the fetcher and readMetadata is the metadata as it currently is, e.g. metadata gathered from the fetcher and propagated through to the results? Better names?
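A rough sketch of the two-metadata `fetch` signature proposed in the comment above, under stated assumptions: `SketchFetcher` and `InMemoryFetcher` are hypothetical names rather than Tika's `Fetcher` API, `Map<String, String>` stands in for `Metadata`, and the `Authorization` and `Content-Length` keys are chosen only to mirror the bearer-token use case raised elsewhere in this thread. The request map is read by the fetcher and never copied into the response map.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class TwoMetadataFetchSketch {

    // Hypothetical interface: requestMetadata carries per-request input for
    // the fetcher, responseMetadata collects what the fetcher learns.
    interface SketchFetcher {
        InputStream fetch(String key,
                          Map<String, String> requestMetadata,
                          Map<String, String> responseMetadata) throws IOException;
    }

    // Toy fetcher backed by a map, guarded by a bearer token.
    static final class InMemoryFetcher implements SketchFetcher {
        private final Map<String, byte[]> store = new HashMap<>();
        private final String expectedToken;

        InMemoryFetcher(String expectedToken) {
            this.expectedToken = expectedToken;
        }

        void put(String key, String content) {
            store.put(key, content.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public InputStream fetch(String key,
                                 Map<String, String> requestMetadata,
                                 Map<String, String> responseMetadata) throws IOException {
            // The token is read from the request side only.
            String auth = requestMetadata.get("Authorization");
            if (!("Bearer " + expectedToken).equals(auth)) {
                throw new IOException("missing or wrong bearer token for " + key);
            }
            byte[] bytes = store.get(key);
            if (bytes == null) {
                throw new IOException("no such key: " + key);
            }
            // Only fetcher-derived facts go into the response side.
            responseMetadata.put("Content-Length", Integer.toString(bytes.length));
            return new ByteArrayInputStream(bytes);
        }
    }

    public static void main(String[] args) throws IOException {
        InMemoryFetcher fetcher = new InMemoryFetcher("s3cret");
        fetcher.put("doc-1", "hello");

        Map<String, String> request = new HashMap<>();
        request.put("Authorization", "Bearer s3cret");
        Map<String, String> response = new HashMap<>();

        try (InputStream in = fetcher.fetch("doc-1", request, response)) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // hello
        }
        System.out.println(response);                              // {Content-Length=5}
        System.out.println(response.containsKey("Authorization")); // false
    }
}
```

Keeping the two maps separate is what prevents a credential sent to the fetcher from showing up in the emitted results.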
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845071#comment-17845071 ] Nicholas DiPiazza commented on TIKA-4252: - sure I can do that.
[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845071#comment-17845071 ] Nicholas DiPiazza edited comment on TIKA-4252 at 5/9/24 5:08 PM: - sure I can do that. if you have a moment please do otherwise will get to it later in week next week was (Author: ndipiazza): sure I can do that.
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845068#comment-17845068 ] Tim Allison commented on TIKA-4252: --- Should we add an optional Metadata object to the FetchKey? We could have this propagate through to the fetcher but never be confused with provenance data or extracted content.
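One way to picture the suggestion above is a fetch key that optionally carries its own metadata and survives the kind of `ObjectOutputStream` plus length-prefixed framing quoted in the issue description. Everything here is an assumption made for illustration: `SketchFetchKey`, `COMMAND_CALL`, `frame` and `unframe` are hypothetical, `java.io.ByteArrayOutputStream` replaces the Commons IO stream from the description, and a `HashMap` stands in for Tika's `Metadata`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

final class FetchKeyWithMetadataSketch {

    // Hypothetical command byte written ahead of each framed payload.
    static final int COMMAND_CALL = 1;

    // Hypothetical key type: fetcher name, key, and optional fetch-only metadata.
    static final class SketchFetchKey implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String fetcherName;
        private final String fetchKey;
        private final HashMap<String, String> fetchMetadata; // never null

        SketchFetchKey(String fetcherName, String fetchKey) {
            this(fetcherName, fetchKey, null);
        }

        SketchFetchKey(String fetcherName, String fetchKey, Map<String, String> fetchMetadata) {
            this.fetcherName = fetcherName;
            this.fetchKey = fetchKey;
            // Copy so later changes by the caller do not leak into the key.
            this.fetchMetadata = fetchMetadata == null
                    ? new HashMap<>() : new HashMap<>(fetchMetadata);
        }

        String getFetcherName() { return fetcherName; }
        String getFetchKey() { return fetchKey; }
        Map<String, String> getFetchMetadata() { return fetchMetadata; }
    }

    // Serialise the key, then write: command byte, payload length, payload.
    static byte[] frame(SketchFetchKey key) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(key);
        }
        byte[] payload = bos.toByteArray();
        ByteArrayOutputStream framed = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(framed);
        out.write(COMMAND_CALL);
        out.writeInt(payload.length);
        out.write(payload);
        out.flush();
        return framed.toByteArray();
    }

    // Reverse of frame(): check the command byte, read the length, deserialise.
    static SketchFetchKey unframe(byte[] framed) throws IOException, ClassNotFoundException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed));
        int command = in.read();
        if (command != COMMAND_CALL) {
            throw new IOException("unexpected command byte: " + command);
        }
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(payload))) {
            return (SketchFetchKey) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> fetchOnly = new HashMap<>();
        fetchOnly.put("Authorization", "Bearer example-token");
        SketchFetchKey key = new SketchFetchKey("my-fetcher", "doc-1", fetchOnly);

        SketchFetchKey restored = unframe(frame(key));
        System.out.println(restored.getFetcherName() + "/" + restored.getFetchKey()); // my-fetcher/doc-1
        System.out.println(restored.getFetchMetadata()); // {Authorization=Bearer example-token}
    }
}
```

Because the metadata lives on the key rather than in the document metadata, it has no path into the parse results unless code copies it there on purpose.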
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845062#comment-17845062 ] Tim Allison commented on TIKA-4252: --- K, but you don't want that coming back and being populated in the results, right?
[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845061#comment-17845061 ] Nicholas DiPiazza edited comment on TIKA-4252 at 5/9/24 4:50 PM: - What I need is to be able to send "Fetch Metadata" such as a bearer token to a single request per-fetch-request variable. was (Author: ndipiazza): What I need is to be able to send "Fetch Metadata" such as a bearer token to a single request per-fetch-request varaible
[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845061#comment-17845061 ] Nicholas DiPiazza edited comment on TIKA-4252 at 5/9/24 4:50 PM: - What I need is to be able to send "Fetch Metadata" such as a bearer token to a single fetch() request per-fetch-request variable. was (Author: ndipiazza): What I need is to be able to send "Fetch Metadata" such as a bearer token to a single request per-fetch-request variable.
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845061#comment-17845061 ] Nicholas DiPiazza commented on TIKA-4252: - What I need is to be able to send "Fetch Metadata" such as a bearer token to a single request per-fetch-request varaible
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845058#comment-17845058 ] Hudson commented on TIKA-4252: -- UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk11 #1624 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1624/]) TIKA-4252: fix metadata issue (#1752) (github: [https://github.com/apache/tika/commit/2f8dbdfbdf5c52160ecfc663dfb981fea527c72e]) * (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845057#comment-17845057 ]

Hudson commented on TIKA-4250:
------------------------------

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk11 #1624 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1624/])
TIKA-4250 -- add optional parser for pst files -- wrapper for libpst/readpst (#1751) (github: [https://github.com/apache/tika/commit/32baf2345abe1a04d767ea6641a567d5c924587e])
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/libpst/LibPstParserConfig.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/libpst/LibPstParser.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/pom.xml
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/org/apache/tika/parser/microsoft/libpst/tika-libpst-eml-config.xml
* (edit) CHANGES.txt
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/libpst/TestLibPstParser.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/org/apache/tika/parser/microsoft/libpst/tika-libpst-config.xml
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/libpst/EmailVisitor.java

> Add a libpst-based parser
> -------------------------
>
>                 Key: TIKA-4250
>                 URL: https://issues.apache.org/jira/browse/TIKA-4250
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be
> useful to add a wrapper for libpst as an optional parser.
> One of the benefits of libpst is that it creates .eml or .msg files from the
> PST records. This is critical for those who want the original bytes from
> embedded files. Obv, PST doesn't store eml or msg, but some users want the
> "original" emails even if they are constructed from PST records.
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845051#comment-17845051 ] Tim Allison commented on TIKA-4252: --- Or, if you mean that metadata gathered from the fetcher isn't making it through into the results, I just added a few tests for that.
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845048#comment-17845048 ] Tim Allison commented on TIKA-4252: --- My initial thought for injecting user metadata was to pass through provenance information etc into the final document/output. I wanted to make sure that metadata extracted during the parse didn't overwrite user injected data so... I injected the user metadata _after_ the parse and after the metadata filters were applied. [~ndipiazza], to confirm, you want to inject user metadata so that it is available for the fetchers?
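The ordering described in the comment above (parse first, then metadata filters, then user metadata last) can be sketched as follows. This is an illustration under assumptions, not the PipesServer code: `InjectAfterParseSketch` and `buildResult` are hypothetical, a `Predicate<String>` on key names stands in for Tika's metadata filters, and all keys and values are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

final class InjectAfterParseSketch {

    // Hypothetical merge step: filter the parsed metadata first, then add the
    // user-injected metadata so it can neither be overwritten nor filtered out.
    static Map<String, String> buildResult(Map<String, String> parsed,
                                           Predicate<String> keepKey,
                                           Map<String, String> userInjected) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : parsed.entrySet()) {
            if (keepKey.test(e.getKey())) {
                result.put(e.getKey(), e.getValue());
            }
        }
        // Applied last: user values win on key collisions and bypass the filter.
        result.putAll(userInjected);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("title", "From parser");
        parsed.put("X-Internal", "drop me");

        Map<String, String> user = new LinkedHashMap<>();
        user.put("title", "From user");
        user.put("X-Source", "crawl-42");

        Map<String, String> result =
                buildResult(parsed, key -> !key.startsWith("X-"), user);
        System.out.println(result); // {title=From user, X-Source=crawl-42}
    }
}
```

In this arrangement the user metadata only ever touches the output, which matches the point made in the thread that it was not designed to reach the fetcher.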
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845047#comment-17845047 ] Tim Allison commented on TIKA-4252: --- I opened this branch: https://github.com/apache/tika/tree/TIKA-4252 This reverts the change I suggested above and adds a unit test to confirm behavior that I incorrectly thought was reported as broken. Now that I actually read this issue more carefully -- sorry -- it looks like the issue is that you want to pass user-injected metadata through to the fetcher. The problem is _NOT_ that you are not getting user-injected metadata back through the results.
[jira] [Reopened] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-4252: --- I pointed you to the wrong part of the code ... sorry. The design goal was to overwrite the extracted metadata with user metadata after the parse and before the emit. This is what's leading to the new failing unit test in tika-server's testConcatenated(); I'm taking a look now.
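The design goal described in this message (user metadata overwrites extracted metadata after the parse and before the emit) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain maps rather than Tika's Metadata class, and the class and method names (UserMetadataOverlay, overlay) are hypothetical, not taken from PipesServer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the post-parse merge step; plain maps, not Tika's Metadata class.
final class UserMetadataOverlay {

    // Returns a new map in which every user-supplied key replaces the parsed values for that key.
    // Keys that only the parser produced are kept unchanged.
    static Map<String, List<String>> overlay(Map<String, List<String>> parsed,
                                             Map<String, List<String>> user) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        parsed.forEach((key, values) -> merged.put(key, new ArrayList<>(values)));
        if (user != null) {
            // Applied last, so user values win over anything extracted during the parse.
            user.forEach((key, values) -> merged.put(key, new ArrayList<>(values)));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<String>> parsed = new LinkedHashMap<>();
        parsed.put("dc:title", List.of("Title from parser"));
        parsed.put("Content-Type", List.of("application/pdf"));
        Map<String, List<String>> user = Map.of("dc:title", List.of("Title from caller"));
        System.out.println(overlay(parsed, user));
    }
}
```

Applying the user values last is what protects provenance fields from being clobbered by parser output; it says nothing about whether the same metadata is visible to the fetcher, which is the separate question raised in this thread.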
Re: [PR] TIKA-4232 Create and execute unit tests for tika-helm [tika-helm]
lewismc commented on PR #17: URL: https://github.com/apache/tika-helm/pull/17#issuecomment-2102889158 PR updated to address prior blocker related to use of unapproved GitHub Actions. Waiting on https://issues.apache.org/jira/browse/INFRA-25775 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TIKA-4232) Create and execute unit tests for tika-helm
[ https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845031#comment-17845031 ] ASF GitHub Bot commented on TIKA-4232: -- lewismc commented on PR #17: URL: https://github.com/apache/tika-helm/pull/17#issuecomment-2102889158 PR updated to address prior blocker related to use of unapproved GitHub Actions. Waiting on https://issues.apache.org/jira/browse/INFRA-25775 > Create and execute unit tests for tika-helm > --- > > Key: TIKA-4232 > URL: https://issues.apache.org/jira/browse/TIKA-4232 > Project: Tika > Issue Type: Improvement > Components: tika-helm >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > > The goal is to execute chart unit tests against each tika-helm pull request. > I found the [Helm Unit > Tests|[https://github.com/marketplace/actions/helm-unit-tests]] GitHub Action > which uses [https://github.com/helm-unittest/helm-unittest] as a Helm plugin. > The PR will consist of one or more unit tests automated via the GitHub action. -- This message was sent by Atlassian Jira (v8.20.10#820010)
tesseract error failing build?
36.74 E: The repository 'https://ppa.launchpadcontent.net/alex-p/tesseract-ocr5/ubuntu noble Release' does not have a Release file. Has anyone seen this error before? -nicholas
[jira] [Commented] (TIKA-4253) Duplicate parsers loaded in AutoDetectParser in 3.x at least in some unit tests
[ https://issues.apache.org/jira/browse/TIKA-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845022#comment-17845022 ] Tim Allison commented on TIKA-4253: --- This is happening in the unit tests because there are multiple service loading files on the classpath in tika-parsers-standard from the different modules. We could change the list to a set in ServiceLoader#identifyStaticServiceProviders. > Duplicate parsers loaded in AutoDetectParser in 3.x at least in some unit > tests > --- > > Key: TIKA-4253 > URL: https://issues.apache.org/jira/browse/TIKA-4253 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I haven't checked 2.x yet, but it looks like the AutoDetectParser with and > without a custom TikaConfig is loading parsers twice at least in > tika-parsers-standard unit tests. > We should figure out if this is happening elsewhere in tika-app and > tika-server and fix it where we find it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
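The "change the list to a set" idea from this comment could look something like the sketch below. It is not the code in ServiceLoader#identifyStaticServiceProviders; the class name ProviderDedupSketch, the method uniqueProviders, and the com.example provider names are all hypothetical, and the input is modelled as the already-read lines of each service file.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of de-duplicating provider names found in several service files.
final class ProviderDedupSketch {

    // Collects provider class names across all files, keeping first-seen order and dropping repeats.
    static List<String> uniqueProviders(List<List<String>> serviceFiles) {
        Set<String> seen = new LinkedHashSet<>();
        for (List<String> file : serviceFiles) {
            for (String line : file) {
                // In the service-file format, '#' starts a comment that runs to end of line.
                int hash = line.indexOf('#');
                String name = (hash >= 0 ? line.substring(0, hash) : line).trim();
                if (!name.isEmpty()) {
                    seen.add(name); // a Set silently ignores a name it already holds
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<List<String>> files = List.of(
                List.of("com.example.ParserA", "com.example.ParserB"),
                List.of("com.example.ParserB", "com.example.ParserC"));
        System.out.println(uniqueProviders(files));
    }
}
```

A LinkedHashSet is used here rather than a HashSet on the assumption that load order matters to callers; if it does not, any Set would do.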
[jira] [Created] (TIKA-4253) Duplicate parsers loaded in AutoDetectParser in 3.x at least in some unit tests
Tim Allison created TIKA-4253: - Summary: Duplicate parsers loaded in AutoDetectParser in 3.x at least in some unit tests Key: TIKA-4253 URL: https://issues.apache.org/jira/browse/TIKA-4253 Project: Tika Issue Type: Task Reporter: Tim Allison I haven't checked 2.x yet, but it looks like the AutoDetectParser with and without a custom TikaConfig is loading parsers twice at least in tika-parsers-standard unit tests. We should figure out if this is happening elsewhere in tika-app and tika-server and fix it where we find it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-4233) Check tika-helm for deprecated k8s APIs
[ https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed TIKA-4233. -- > Check tika-helm for deprecated k8s APIs > --- > > Key: TIKA-4233 > URL: https://issues.apache.org/jira/browse/TIKA-4233 > Project: Tika > Issue Type: New Feature > Components: tika-helm >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 2.9.3 > > > It is useful to know when a Helm Chart uses deprecated k8s APIs. A check for > this would be ideal. The “Check deprecated k8s APIs” GitHub action > accomplishes this. > [https://github.com/marketplace/actions/check-deprecated-k8s-apis] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4233) Check tika-helm for deprecated k8s APIs
[ https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved TIKA-4233. Resolution: Fixed This PR broke one of the GitHub Action workflows. I have written to INFRA about it https://issues.apache.org/jira/browse/INFRA-25775
[jira] [Updated] (TIKA-4233) Check tika-helm for deprecated k8s APIs
[ https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated TIKA-4233: --- Fix Version/s: 2.9.3
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845012#comment-17845012 ] ASF GitHub Bot commented on TIKA-4250: -- tballison merged PR #1751: URL: https://github.com/apache/tika/pull/1751 > Add a libpst-based parser > - > > Key: TIKA-4250 > URL: https://issues.apache.org/jira/browse/TIKA-4250 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Attachments: 8.eml, 8.msg > > > We currently use the com.pff Java-based PST parser for PST files. It would be > useful to add a wrapper for libpst as an optional parser. > One of the benefits of libpst is that it creates .eml or .msg files from the > PST records. This is critical for those who want the original bytes from > embedded files. Obv, PST doesn't store eml or msg, but some users want the > "original" emails even if they are constructed from PST records. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] TIKA-4250 -- add optional parser for pst files -- wrapper for libpst/… [tika]
tballison merged PR #1751: URL: https://github.com/apache/tika/pull/1751
[jira] [Closed] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas DiPiazza closed TIKA-4252. --- Fix Version/s: 3.0.0 Resolution: Fixed
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845010#comment-17845010 ] Nicholas DiPiazza commented on TIKA-4252: - done
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845005#comment-17845005 ] ASF GitHub Bot commented on TIKA-4252: -- nddipiazza opened a new pull request, #1752: URL: https://github.com/apache/tika/pull/1752 * metadata was not getting sent to the fetch process
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845006#comment-17845006 ] ASF GitHub Bot commented on TIKA-4252: -- nddipiazza merged PR #1752: URL: https://github.com/apache/tika/pull/1752
Re: [PR] TIKA-4252: fix metadata issue [tika]
nddipiazza merged PR #1752: URL: https://github.com/apache/tika/pull/1752
[PR] TIKA-4252: fix metadata issue [tika]
nddipiazza opened a new pull request, #1752: URL: https://github.com/apache/tika/pull/1752 * metadata was not getting sent to the fetch process
[jira] [Commented] (TIKA-4233) Check tika-helm for deprecated k8s APIs
[ https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845003#comment-17845003 ] ASF GitHub Bot commented on TIKA-4233: -- lewismc merged PR #18: URL: https://github.com/apache/tika-helm/pull/18
Re: [PR] TIKA-4233 Check tika-helm for deprecated k8s APIs [tika-helm]
lewismc merged PR #18: URL: https://github.com/apache/tika-helm/pull/18
[jira] [Updated] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas DiPiazza updated TIKA-4252:
------------------------------------
    Description:
when calling:
    PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(),
        new FetchKey(fetcher.getName(), request.getFetchKey()),
        new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
        FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
the tikaMetadata is not present in the fetch data when the fetch method is called.

It's OK through this part:
    UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get();
    try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
        objectOutputStream.writeObject(t);
    }
    byte[] bytes = bos.toByteArray();
    output.write(CALL.getByte());
    output.writeInt(bytes.length);
    output.write(bytes);
    output.flush();

i verified the bytes have the expected metadata from that point.

UPDATE: found issue

org.apache.tika.pipes.PipesServer#parseFromTuple

is using a new Metadata when it should only use empty metadata if fetch tuple metadata is null.

  was:
when calling:
    PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(),
        new FetchKey(fetcher.getName(), request.getFetchKey()),
        new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
        FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
the tikaMetadata is not present in the fetch data when the fetch method is called.

It's OK through this part:
    UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get();
    try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
        objectOutputStream.writeObject(t);
    }
    byte[] bytes = bos.toByteArray();
    output.write(CALL.getByte());
    output.writeInt(bytes.length);
    output.write(bytes);
    output.flush();

i verified the bytes have the expected metadata from that point.
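The fix described in the UPDATE (use the tuple's own metadata for the fetch, and fall back to empty metadata only when the tuple has none) amounts to a null check. The sketch below shows that shape only; it is not the PipesServer#parseFromTuple code, and Tuple, metadataForFetch and the String-to-String map are hypothetical simplifications.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified illustration of "empty metadata only if the tuple's metadata is null".
final class FetchMetadataFallback {

    // Minimal stand-in for a fetch/emit tuple: only the field relevant here.
    static final class Tuple {
        final Map<String, String> metadata; // may be null when the caller supplied none

        Tuple(Map<String, String> metadata) {
            this.metadata = metadata;
        }
    }

    // The bug pattern would be returning a fresh empty map unconditionally;
    // here the caller's metadata is passed through whenever it exists.
    static Map<String, String> metadataForFetch(Tuple tuple) {
        return tuple.metadata != null ? tuple.metadata : new HashMap<>();
    }

    public static void main(String[] args) {
        Map<String, String> user = new HashMap<>();
        user.put("auth-hint", "value-from-caller");
        System.out.println(metadataForFetch(new Tuple(user)));
        System.out.println(metadataForFetch(new Tuple(null)));
    }
}
```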
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844998#comment-17844998 ] Tim Allison commented on TIKA-4252: --- Good catch: https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java#L465 Shall I fix it or are you in progress?
[jira] [Updated] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas DiPiazza updated TIKA-4252:
------------------------------------
    Description:
when calling:
    PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(),
        new FetchKey(fetcher.getName(), request.getFetchKey()),
        new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
        FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
the tikaMetadata is not present in the fetch data when the fetch method is called.

It's OK through this part:
    UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get();
    try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
        objectOutputStream.writeObject(t);
    }
    byte[] bytes = bos.toByteArray();
    output.write(CALL.getByte());
    output.writeInt(bytes.length);
    output.write(bytes);
    output.flush();

i verified the bytes have the expected metadata from that point.

  was:
when calling:
    PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(),
        new FetchKey(fetcher.getName(), request.getFetchKey()),
        new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
        FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
the tikaMetadata is not present in the fetch data when the fetch method is called.
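The snippet in the description writes a command byte, a length prefix, and then the serialized tuple. A self-contained round trip of that framing, using only java.io, might look like the sketch below. It is written under assumptions: the CALL value, the class FramingSketch and its methods are hypothetical, a plain ByteArrayOutputStream stands in for UnsynchronizedByteArrayOutputStream, and the read side is a guess at the mirror image of the write side rather than the actual PipesServer code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical sketch of command-byte + length-prefixed framing around Java serialization.
final class FramingSketch {
    static final byte CALL = 1; // assumed value, for illustration only

    // Serializes the payload, then writes: command byte, 4-byte length, payload bytes.
    static void writeFrame(DataOutputStream out, Serializable payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(payload);
        }
        byte[] bytes = bos.toByteArray();
        out.write(CALL);
        out.writeInt(bytes.length);
        out.write(bytes);
        out.flush();
    }

    // Reads one frame back: checks the command byte, reads exactly `length` bytes, deserializes.
    static Object readFrame(DataInputStream in) throws IOException, ClassNotFoundException {
        int command = in.readUnsignedByte();
        if (command != CALL) {
            throw new IOException("unexpected command byte: " + command);
        }
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    // Convenience for checking that what goes in is what comes out.
    static Object roundTrip(Serializable payload) {
        try {
            ByteArrayOutputStream wire = new ByteArrayOutputStream();
            writeFrame(new DataOutputStream(wire), payload);
            return readFrame(new DataInputStream(new ByteArrayInputStream(wire.toByteArray())));
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        java.util.HashMap<String, String> metadata = new java.util.HashMap<>();
        metadata.put("user-key", "user-value");
        System.out.println(roundTrip(metadata));
    }
}
```

A round trip like this is one way to confirm, as the reporter did, that the metadata survives serialization, which narrows the loss to whatever the receiving side does after deserializing.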
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844997#comment-17844997 ] ASF GitHub Bot commented on TIKA-4250: -- tballison opened a new pull request, #1751: URL: https://github.com/apache/tika/pull/1751 …readpst Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! Your help is appreciated! Before opening the pull request, please verify that * there is an open issue on the [Tika issue tracker](https://issues.apache.org/jira/projects/TIKA) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes. * the issue ID (`TIKA-`) - is referenced in the title of the pull request - and placed in front of your commit messages surrounded by square brackets (`[TIKA-] Issue or pull request title`) * commits are squashed into a single one (or few commits for larger changes) * Tika is successfully built and unit tests pass by running `mvn clean test` * there should be no conflicts when merging the pull request branch into the *recent* `main` branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled `main` branch * if you add new module that downstream users will depend upon add it to relevant group in `tika-bom/pom.xml`. We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Tika in general, please sign up for the [Tika mailing list](http://tika.apache.org/mail-lists.html). Thanks!
[PR] TIKA-4250 -- add optional parser for pst files -- wrapper for libpst/… [tika]
tballison opened a new pull request, #1751: URL: https://github.com/apache/tika/pull/1751 …readpst
[jira] [Created] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
Nicholas DiPiazza created TIKA-4252:
---

             Summary: PipesClient#process - seems to lose the Fetch input metadata?
                 Key: TIKA-4252
                 URL: https://issues.apache.org/jira/browse/TIKA-4252
             Project: Tika
          Issue Type: Bug
            Reporter: Nicholas DiPiazza

when calling:

    PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(),
            new FetchKey(fetcher.getName(), request.getFetchKey()),
            new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
            FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));

the tikaMetadata is not present in the fetch data when the fetch method is called.
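The update later recorded on this issue says that org.apache.tika.pipes.PipesServer#parseFromTuple was using a new Metadata when it should use an empty one only if the fetch tuple's metadata is null. A minimal sketch of that null-fallback pattern is below. It assumes hypothetical stand-ins: the `Tuple` class, a plain `Map` in place of Tika's Metadata, and both method names are illustrative only, and this is not the code from the merged PR.

```java
import java.util.HashMap;
import java.util.Map;

public class RequestMetadataSketch {

    /** Hypothetical stand-in for a fetch tuple; only the metadata field matters here. */
    static final class Tuple {
        private final Map<String, String> metadata; // may be null

        Tuple(Map<String, String> metadata) {
            this.metadata = metadata;
        }

        Map<String, String> getMetadata() {
            return metadata;
        }
    }

    /** Reported behaviour: a fresh, empty metadata object is always used. */
    static Map<String, String> metadataBeforeFix(Tuple t) {
        return new HashMap<>();
    }

    /** Described fix: reuse the caller's metadata, fall back to empty only when it is null. */
    static Map<String, String> metadataAfterFix(Tuple t) {
        Map<String, String> m = t.getMetadata();
        return m != null ? m : new HashMap<>();
    }

    public static void main(String[] args) {
        Map<String, String> requestMetadata = new HashMap<>();
        requestMetadata.put("requestId", "42"); // illustrative key and value

        Tuple withMetadata = new Tuple(requestMetadata);
        Tuple withoutMetadata = new Tuple(null);

        System.out.println(metadataBeforeFix(withMetadata));   // caller's entry is lost
        System.out.println(metadataAfterFix(withMetadata));    // caller's entry survives
        System.out.println(metadataAfterFix(withoutMetadata)); // empty fallback
    }
}
```

With this shape, a fetcher that reads request-scoped values from the metadata sees what the caller passed in, while callers that pass no metadata still get a usable empty object.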
[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844976#comment-17844976 ]

Tim Allison edited comment on TIKA-4250 at 5/9/24 12:59 PM:
---

libpst issue opened: https://github.com/pst-format/libpst/issues/14

was (Author: talli...@mitre.org):
libpff issue opened: https://github.com/libyal/libpff/issues/128

Note that I found non-deterministic behavior even without debug on -- sometimes I got 7 extracted files, sometimes 8. I noted that in the issue.

> Add a libpst-based parser
> ---
>
>                 Key: TIKA-4250
>                 URL: https://issues.apache.org/jira/browse/TIKA-4250
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be
> useful to add a wrapper for libpst as an optional parser.
> One of the benefits of libpst is that it creates .eml or .msg files from the
> PST records. This is critical for those who want the original bytes from
> embedded files. Obv, PST doesn't store eml or msg, but some users want the
> "original" emails even if they are constructed from PST records.
[jira] [Commented] (TIKA-4250) Add a libpst-based parser
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844976#comment-17844976 ]

Tim Allison commented on TIKA-4250:
---

libpff issue opened: https://github.com/libyal/libpff/issues/128

Note that I found non-deterministic behavior even without debug on -- sometimes I got 7 extracted files, sometimes 8. I noted that in the issue.

> Add a libpst-based parser
> ---
>
>                 Key: TIKA-4250
>                 URL: https://issues.apache.org/jira/browse/TIKA-4250
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be
> useful to add a wrapper for libpst as an optional parser.
> One of the benefits of libpst is that it creates .eml or .msg files from the
> PST records. This is critical for those who want the original bytes from
> embedded files. Obv, PST doesn't store eml or msg, but some users want the
> "original" emails even if they are constructed from PST records.