[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-06-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853238#comment-17853238
 ] 

ASF GitHub Bot commented on TIKA-4243:
--

tballison opened a new pull request, #1805:
URL: https://github.com/apache/tika/pull/1805

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Tika in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> In 3.0.0, it would greatly help to have a typed configuration schema for Tika.
> In 3.x, can we remove the old way of doing configs and replace it with JSON Schema?
> JSON Schema can be converted to POJOs using a Maven plugin: 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java POJO model we can use for the configs. 
> It also allows the legacy tika-config XML to be read and converted to the new 
> POJOs easily using an XML mapper, so that users who don't want to switch to JSON 
> configurations yet are not forced to.
> When complete, configurations can be set as XML, JSON or YAML:
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax 
> with the POJO model deserialized from the XML/JSON/YAML.
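To make the proposal concrete, a minimal loading sketch, assuming a hypothetical jsonschema2pojo-generated class named TikaConfigPojo and the Jackson XML/YAML/JSON mappers; this illustrates the idea rather than code from the ticket:

    import java.io.File;

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.dataformat.xml.XmlMapper;
    import com.fasterxml.jackson.dataformat.yaml.YAMLMapper;

    public class TikaConfigLoaderSketch {
        // TikaConfigPojo is a hypothetical class generated by jsonschema2pojo from the JSON Schema.
        public static TikaConfigPojo load(File configFile) throws Exception {
            String name = configFile.getName();
            if (name.endsWith(".xml")) {
                // legacy tika-config.xml read straight into the same POJO model via an XML mapper
                return new XmlMapper().readValue(configFile, TikaConfigPojo.class);
            } else if (name.endsWith(".yaml") || name.endsWith(".yml")) {
                return new YAMLMapper().readValue(configFile, TikaConfigPojo.class);
            }
            return new ObjectMapper().readValue(configFile, TikaConfigPojo.class);
        }
    }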



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-06-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852865#comment-17852865
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

tballison merged PR #1778:
URL: https://github.com/apache/tika/pull/1778




> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(),
>         new FetchKey(fetcher.getName(), request.getFetchKey()),
>         new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>         FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is called.
>  
> It's OK through this part:
> UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get();
> try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>     objectOutputStream.writeObject(t);
> }
> byte[] bytes = bos.toByteArray();
> output.write(CALL.getByte());
> output.writeInt(bytes.length);
> output.write(bytes);
> output.flush();
>  
> I verified the bytes have the expected metadata from that point.
>  
> UPDATE: found the issue.
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use an empty Metadata if the fetch 
> tuple metadata is null.
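A minimal sketch of the null guard described in the update, assuming FetchEmitTuple#getMetadata() returns the caller-supplied Metadata; the merged patch may differ in detail:

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.pipes.FetchEmitTuple;

    // Sketch of the guard for PipesServer#parseFromTuple: reuse the metadata that came with
    // the tuple and only fall back to a fresh Metadata object when the tuple carried none.
    static Metadata metadataForFetch(FetchEmitTuple t) {
        return t.getMetadata() == null ? new Metadata() : t.getMetadata();
    }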



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-06-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851779#comment-17851779
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

tballison commented on PR #1778:
URL: https://github.com/apache/tika/pull/1778#issuecomment-2145904427

   Ha, @nddipiazza. I did earlier this morning. I chose your choices over mine 
in the merge, largely.
   
   See 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel=17851727#comment-17851727
   
   What we now need to do is figure out how to serialize+deserialize 
ParseContext with as little work as possible. :D
   




> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos))
> {                 objectOutputStream.writeObject(t);             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-06-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851777#comment-17851777
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

nddipiazza commented on PR #1778:
URL: https://github.com/apache/tika/pull/1778#issuecomment-2145900710

   sure will do @tballison sorry didn't see this until now 




> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos))
> {                 objectOutputStream.writeObject(t);             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-06-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851587#comment-17851587
 ] 

ASF GitHub Bot commented on TIKA-4260:
--

tballison closed pull request #1776: TIKA-4260 

> Add parse context to the fetcher interface in 3.x
> -
>
> Key: TIKA-4260
> URL: https://issues.apache.org/jira/browse/TIKA-4260
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-06-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851588#comment-17851588
 ] 

ASF GitHub Bot commented on TIKA-4260:
--

tballison commented on PR #1776:
URL: https://github.com/apache/tika/pull/1776#issuecomment-2144945579

   I merged this into @nddipiazza 's TIKA-4252 PR.




> Add parse context to the fetcher interface in 3.x
> -
>
> Key: TIKA-4260
> URL: https://issues.apache.org/jira/browse/TIKA-4260
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4220) Commons-compress too lenient on headless tar detection

2024-05-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850756#comment-17850756
 ] 

ASF GitHub Bot commented on TIKA-4220:
--

tballison merged PR #1790:
URL: https://github.com/apache/tika/pull/1790




> Commons-compress too lenient on headless tar detection
> --
>
> Key: TIKA-4220
> URL: https://issues.apache.org/jira/browse/TIKA-4220
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> On recent regression tests on TIKA-4218, we noticed a fairly major change 
> with an increased rate of false positives on headless tar detection from 
> commons-compress.
> I think for now we should copy/paste/fork the headless tar detection and 
> improve it/revert it or possibly remove it for our 2.9.2 release.
> On this ticket, I'll look into what changed recently in headless tar 
> detection in commons-compress and experiment with fixing it.
> One challenge is that our magic bytes detection happens _after_ our custom 
> detectors, which means that we can't put a low confidence on what comes out 
> of our custom detectors and let the magic detection fix it. We could  
> implement an x-tar special case, but I really don't like that.
> Let's see what we can do...
> The numbers below represent the number of files identified as A (in tika 
> 2.9.1) -> B (in tika-2.9.2-pre-rc1).
> application/octet-stream -> application/x-tar            826
> multipart/appledouble -> application/x-tar               701
> image/x-tga -> application/x-tar                         322
> image/vnd.microsoft.icon -> application/x-tar            312
> application/vnd.iccprofile -> application/x-tar          221
> video/mp4 -> application/x-tar                           177
> audio/mpeg -> application/x-tar                           59
> video/x-m4v -> application/x-tar                          59
> application/x-font-printer-metric -> application/x-tar    36
> audio/mp4 -> application/x-tar                            25
> application/x-tex-tfm -> application/x-tar                18
> image/x-pict -> application/x-tar                         15
> image/png -> application/x-tar                             8
> text/plain; charset=ISO-8859-1 -> application/x-tar        8
> application/x-endnote-style -> application/x-tar           7
> application/x-font-ttf -> application/x-tar                6
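For context on what "headless" detection has to rely on, a sketch of the classic header-checksum test; this is not the commons-compress implementation, just the standard technique such detection builds on:

    import java.nio.charset.StandardCharsets;

    final class TarHeaderCheck {
        // Recompute the checksum over the 512-byte tar header, treating the checksum
        // field (bytes 148-155) as eight ASCII spaces, and compare it with the stored
        // octal value. The weakness of this signal is exactly why false positives happen.
        static boolean looksLikeTar(byte[] header) {
            if (header == null || header.length < 512) {
                return false;
            }
            String storedField = new String(header, 148, 8, StandardCharsets.US_ASCII)
                    .replace('\0', ' ').trim();
            long stored;
            try {
                stored = Long.parseLong(storedField, 8);
            } catch (NumberFormatException e) {
                return false;
            }
            long sum = 0;
            for (int i = 0; i < 512; i++) {
                sum += (i >= 148 && i < 156) ? ' ' : (header[i] & 0xff);
            }
            return sum == stored;
        }
    }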



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4229) add microsoft graph fetcher

2024-05-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850708#comment-17850708
 ] 

ASF GitHub Bot commented on TIKA-4229:
--

bartek commented on code in PR #1698:
URL: https://github.com/apache/tika/pull/1698#discussion_r1620663162


##
tika-pipes/tika-fetchers/tika-fetcher-microsoft-graph/src/main/java/org/apache/tika/pipes/fetchers/microsoftgraph/MicrosoftGraphFetcher.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.pipes.fetchers.microsoftgraph;
+
+import java.io.ByteArrayInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Map;
+
+import com.azure.identity.ClientCertificateCredentialBuilder;
+import com.azure.identity.ClientSecretCredentialBuilder;
+import com.microsoft.graph.serviceclient.GraphServiceClient;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.tika.config.Field;
+import org.apache.tika.config.Initializable;
+import org.apache.tika.config.InitializableProblemHandler;
+import org.apache.tika.config.Param;
+import org.apache.tika.exception.TikaConfigException;
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.metadata.Metadata;
+import org.apache.tika.pipes.fetcher.AbstractFetcher;
+import org.apache.tika.pipes.fetchers.microsoftgraph.config.ClientCertificateCredentialsConfig;
+import org.apache.tika.pipes.fetchers.microsoftgraph.config.ClientSecretCredentialsConfig;
+import org.apache.tika.pipes.fetchers.microsoftgraph.config.MsGraphFetcherConfig;
+
+/**
+ * Fetches files from Microsoft Graph API.
+ * Fetch keys are ${siteDriveId},${driveItemId}
+ */
+public class MicrosoftGraphFetcher extends AbstractFetcher implements Initializable {
+    private static final Logger LOGGER = LoggerFactory.getLogger(MicrosoftGraphFetcher.class);
+    private GraphServiceClient graphClient;
+    private MsGraphFetcherConfig msGraphFetcherConfig;
+    private long[] throttleSeconds;
+
+    public MicrosoftGraphFetcher() {
+
+    }
+
+    public MicrosoftGraphFetcher(MsGraphFetcherConfig msGraphFetcherConfig) {
+        this.msGraphFetcherConfig = msGraphFetcherConfig;
+    }
+
+    /**
+     * Set seconds to throttle retries as a comma-delimited list, e.g.: 30,60,120,600
+     *
+     * @param commaDelimitedLongs
+     * @throws TikaConfigException
+     */
+    @Field
+    public void setThrottleSeconds(String commaDelimitedLongs) throws TikaConfigException {
+        String[] longStrings = commaDelimitedLongs.split(",");
+        long[] seconds = new long[longStrings.length];
+        for (int i = 0; i < longStrings.length; i++) {
+            try {
+                seconds[i] = Long.parseLong(longStrings[i]);
+            } catch (NumberFormatException e) {
+                throw new TikaConfigException(e.getMessage());
+            }
+        }
+        setThrottleSeconds(seconds);
+    }
+
+    public void setThrottleSeconds(long[] throttleSeconds) {
+        this.throttleSeconds = throttleSeconds;
+    }
+
+    @Override
+    public void initialize(Map<String, Param> map) {
+        String[] scopes = msGraphFetcherConfig.getScopes().toArray(new String[0]);
+        if (msGraphFetcherConfig.getCredentials() instanceof ClientCertificateCredentialsConfig) {
+            ClientCertificateCredentialsConfig credentials =
+                    (ClientCertificateCredentialsConfig) msGraphFetcherConfig.getCredentials();
+            graphClient = new GraphServiceClient(
+                    new ClientCertificateCredentialBuilder().clientId(credentials.getClientId())
+                            .tenantId(credentials.getTenantId()).pfxCertificate(
+                                    new ByteArrayInputStream(credentials.getCertificateBytes()))
+                            .clientCertificatePassword(credentials.getCertificatePassword())
+                            .build(), scopes);
+        } else if (msGraphFetcherConfig.getCredentials() instanceof ClientSecretCredentialsConfig) {
+            ClientSecretCredentialsConfig credentials =
+                    (ClientSecretCredentialsConfig) 

[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850706#comment-17850706
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

tballison commented on PR #1778:
URL: https://github.com/apache/tika/pull/1778#issuecomment-2139485380

   @nddipiazza I don't mean to cause you more work... is it possible to rebase 
on the TIKA-4260 branch or merge into that maybe and we can work together there?
   
   




> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos))
> {                 objectOutputStream.writeObject(t);             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4220) Commons-compress too lenient on headless tar detection

2024-05-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850703#comment-17850703
 ] 

ASF GitHub Bot commented on TIKA-4220:
--

tballison opened a new pull request, #1790:
URL: https://github.com/apache/tika/pull/1790

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Tika in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Commons-compress too lenient on headless tar detection
> --
>
> Key: TIKA-4220
> URL: https://issues.apache.org/jira/browse/TIKA-4220
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> On recent regression tests on TIKA-4218, we noticed a fairly major change 
> with an increased rate of false positives on headless tar detection from 
> commons-compress.
> I think for now we should copy/paste/fork the headless tar detection and 
> improve it/revert it or possibly remove it for our 2.9.2 release.
> On this ticket, I'll look into what changed recently in headless tar 
> detection in commons-compress and experiment with fixing it.
> One challenge is that our magic bytes detection happens _after_ our custom 
> detectors, which means that we can't put a low confidence on what comes out 
> of our custom detectors and let the magic detection fix it. We could  
> implement an x-tar special case, but I really don't like that.
> Let's see what we can do...
> The numbers below represent the number of files identified as A (in tika 
> 2.9.1) -> B (in tika-2.9.2-pre-rc1).
> application/octet-stream -> application/x-tar            826
> multipart/appledouble -> application/x-tar               701
> image/x-tga -> application/x-tar                         322
> image/vnd.microsoft.icon -> application/x-tar            312
> application/vnd.iccprofile -> application/x-tar          221
> video/mp4 -> application/x-tar                           177
> audio/mpeg -> application/x-tar                           59
> video/x-m4v -> application/x-tar                          59
> application/x-font-printer-metric -> application/x-tar    36
> audio/mp4 -> application/x-tar                            25
> application/x-tex-tfm -> application/x-tar                18
> image/x-pict -> application/x-tar                         15
> image/png -> application/x-tar                             8
> text/plain; charset=ISO-8859-1 -> application/x-tar        8
> application/x-endnote-style -> application/x-tar           7
> application/x-font-ttf -> application/x-tar                6



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4221) Regression in pack200 parsing in commons-compress

2024-05-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850704#comment-17850704
 ] 

ASF GitHub Bot commented on TIKA-4221:
--

tballison merged PR #1789:
URL: https://github.com/apache/tika/pull/1789




> Regression in pack200 parsing in commons-compress
> -
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> There's a regression in pack200 that leads to the InputStream being closed 
> even if wrapped in a CloseShieldInputStream.
> This was the original signal that something was wrong, but the real problem 
> is in pack200, not xz.
> We noticed ~10 xz files with fewer attachments in the recent regression tests 
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, 
> but not a blocker (IMHO).
> The stacktrace from 
> {{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
>   looks like this:
> 3: X-TIKA:EXCEPTION:embedded_exception : 
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.DefaultParser@56a4479a
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>   at 
> org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
>   at 
> org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
> ...
> Caused by: org.tukaani.xz.XZIOException: Stream closed
>   at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
>   at 
> org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.BufferedInputStream.available(BufferedInputStream.java:410)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:800)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:412)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:389)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:49)
>   at 
> org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:389)
>   at 
> org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:329)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   ... 85 more



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4221) Regression in pack200 parsing in commons-compress

2024-05-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850691#comment-17850691
 ] 

ASF GitHub Bot commented on TIKA-4221:
--

tballison opened a new pull request, #1789:
URL: https://github.com/apache/tika/pull/1789

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Tika in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Regression in pack200 parsing in commons-compress
> -
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> There's a regression in pack200 that leads to the InputStream being closed 
> even if wrapped in a CloseShieldInputStream.
> This was the original signal that something was wrong, but the real problem 
> is in pack200, not xz.
> We noticed ~10 xz files with fewer attachments in the recent regression tests 
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, 
> but not a blocker (IMHO).
> The stacktrace from 
> {{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
>   looks like this:
> 3: X-TIKA:EXCEPTION:embedded_exception : 
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.DefaultParser@56a4479a
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>   at 
> org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
>   at 
> org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
> ...
> Caused by: org.tukaani.xz.XZIOException: Stream closed
>   at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
>   at 
> org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.BufferedInputStream.available(BufferedInputStream.java:410)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   

[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849561#comment-17849561
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

nddipiazza opened a new pull request, #1778:
URL: https://github.com/apache/tika/pull/1778

   * add a parse context
   * allow additional data to be sent in the parse context to the fetch method
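   For orientation, a sketch of what passing extra per-request data into a fetcher via ParseContext could look like; FetcherWithContext is a hypothetical name and the exact signature merged in the PR may differ:

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;

    // Hypothetical shape of a fetch method that also receives a ParseContext, so callers
    // can hand request-scoped objects (credentials, range info, etc.) to the fetcher.
    public interface FetcherWithContext {
        InputStream fetch(String fetchKey, Metadata fetchMetadata, ParseContext parseContext)
                throws TikaException, IOException;
    }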




> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos))
> {                 objectOutputStream.writeObject(t);             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849560#comment-17849560
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

nddipiazza closed pull request #1774: TIKA-4252 fetch tuple metadata
URL: https://github.com/apache/tika/pull/1774




> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos))
> {                 objectOutputStream.writeObject(t);             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849384#comment-17849384
 ] 

ASF GitHub Bot commented on TIKA-4260:
--

tballison commented on PR #1776:
URL: https://github.com/apache/tika/pull/1776#issuecomment-2130252325

   I'm now getting a clean build with `-DskipTests` lol... That's a step at 
least.
   
   The big TODO is to add serialization of the ParseContext in 
https://github.com/apache/tika/blob/main/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonFetchEmitTuple.java
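
   One possible shape for that TODO, sketched under assumptions: entries are written under their class names so the other side of the pipes boundary can rebuild an equivalent context. This is not necessarily the format the project settled on.

    import java.io.IOException;

    import com.fasterxml.jackson.core.JsonGenerator;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ObjectNode;

    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.pipes.HandlerConfig;

    final class ParseContextJsonSketch {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Assumed sketch: serialize selected ParseContext entries keyed by class name.
        static void writeParseContext(JsonGenerator generator, ParseContext context) throws IOException {
            ObjectNode node = MAPPER.createObjectNode();
            HandlerConfig handlerConfig = context.get(HandlerConfig.class);
            if (handlerConfig != null) {
                node.set(HandlerConfig.class.getName(), MAPPER.valueToTree(handlerConfig));
            }
            generator.writeFieldName("parseContext");
            MAPPER.writeTree(generator, node);
        }
    }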




> Add parse context to the fetcher interface in 3.x
> -
>
> Key: TIKA-4260
> URL: https://issues.apache.org/jira/browse/TIKA-4260
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4261) Add attachment type metadata filter

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849379#comment-17849379
 ] 

ASF GitHub Bot commented on TIKA-4261:
--

tballison merged PR #1777:
URL: https://github.com/apache/tika/pull/1777




> Add attachment type metadata filter
> ---
>
> Key: TIKA-4261
> URL: https://issues.apache.org/jira/browse/TIKA-4261
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> For some users who are using the /rmeta endpoint or -J option in tika-app, 
> inlining ocr'd content, there is no need to include the metadata object for 
> the inlined image. Let's add a metadata filter to remove these metadata 
> objects.
> The default behavior will be as before. Everything is included. Users need to 
> configure this to remove these inline objects.
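A rough sketch of the idea, not the filter class that was merged: drop metadata objects whose embedded resource type marks them as inlined content, and keep everything else.

    import java.util.List;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.metadata.TikaCoreProperties;

    final class InlineImageMetadataPruner {
        // Remove metadata objects flagged as INLINE embedded resources (e.g. images that
        // were only inlined for OCR); the container document and regular attachments stay.
        static void dropInlineMetadata(List<Metadata> metadataList) {
            metadataList.removeIf(m -> TikaCoreProperties.EmbeddedResourceType.INLINE.name()
                    .equals(m.get(TikaCoreProperties.EMBEDDED_RESOURCE_TYPE)));
        }
    }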



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4261) Add attachment type metadata filter

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849369#comment-17849369
 ] 

ASF GitHub Bot commented on TIKA-4261:
--

tballison opened a new pull request, #1777:
URL: https://github.com/apache/tika/pull/1777

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Tika in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Add attachment type metadata filter
> ---
>
> Key: TIKA-4261
> URL: https://issues.apache.org/jira/browse/TIKA-4261
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> For some users who are using the /rmeta endpoint or -J option in tika-app, 
> inlining ocr'd content, there is no need to include the metadata object for 
> the inlined image. Let's add a metadata filter to remove these metadata 
> objects.
> The default behavior will be as before. Everything is included. Users need to 
> configure this to remove these inline objects.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4259) Decouple xml parser stuff from ParseContext

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849299#comment-17849299
 ] 

ASF GitHub Bot commented on TIKA-4259:
--

tballison merged PR #1775:
URL: https://github.com/apache/tika/pull/1775




> Decouple xml parser stuff from ParseContext
> ---
>
> Key: TIKA-4259
> URL: https://issues.apache.org/jira/browse/TIKA-4259
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> ParseContext has some xmlreader convenience methods. We should move those to 
> XMLReaderUtils in 3.x to simplify ParseContext's api.
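Concretely, the move shifts calls like the following from the context object to the static utility; the method names are assumed to mirror the existing ParseContext conveniences, so treat this as a sketch:

    import javax.xml.parsers.SAXParser;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.utils.XMLReaderUtils;

    final class XmlHelperMigrationSketch {
        static SAXParser before(ParseContext context) throws TikaException {
            return context.getSAXParser();          // 2.x convenience on ParseContext
        }

        static SAXParser after() throws TikaException {
            return XMLReaderUtils.getSAXParser();   // 3.x: same helper, only on XMLReaderUtils
        }
    }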



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849296#comment-17849296
 ] 

ASF GitHub Bot commented on TIKA-4260:
--

tballison opened a new pull request, #1776:
URL: https://github.com/apache/tika/pull/1776

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Tika in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Add parse context to the fetcher interface in 3.x
> -
>
> Key: TIKA-4260
> URL: https://issues.apache.org/jira/browse/TIKA-4260
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849297#comment-17849297
 ] 

ASF GitHub Bot commented on TIKA-4260:
--

tballison commented on PR #1776:
URL: https://github.com/apache/tika/pull/1776#issuecomment-2129532368

   Current status 

> Add parse context to the fetcher interface in 3.x
> -
>
> Key: TIKA-4260
> URL: https://issues.apache.org/jira/browse/TIKA-4260
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4259) Decouple xml parser stuff from ParseContext

2024-05-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849117#comment-17849117
 ] 

ASF GitHub Bot commented on TIKA-4259:
--

tballison opened a new pull request, #1775:
URL: https://github.com/apache/tika/pull/1775

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Tika in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Decouple xml parser stuff from ParseContext
> ---
>
> Key: TIKA-4259
> URL: https://issues.apache.org/jira/browse/TIKA-4259
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> ParseContext has some xmlreader convenience methods. We should move those to 
> XMLReaderUtils in 3.x to simplify ParseContext's api.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848959#comment-17848959
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

nddipiazza commented on PR #1774:
URL: https://github.com/apache/tika/pull/1774#issuecomment-2127120285

   oops not quite right - need to sync up with @tballison to make sure i'm 
covering his needs and not just my own 




> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos))
> {                 objectOutputStream.writeObject(t);             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848808#comment-17848808
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

nddipiazza opened a new pull request, #1774:
URL: https://github.com/apache/tika/pull/1774

   Add ability to add Tika Fetch Metadata




> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
>             UnsynchronizedByteArrayOutputStream bos = 
> UnsynchronizedByteArrayOutputStream.builder().get();
>             try (ObjectOutputStream objectOutputStream = new 
> ObjectOutputStream(bos))
> {                 objectOutputStream.writeObject(t);             }
>             byte[] bytes = bos.toByteArray();
>             output.write(CALL.getByte());
>             output.writeInt(bytes.length);
>             output.write(bytes);
>             output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848341#comment-17848341
 ] 

ASF GitHub Bot commented on TIKA-4258:
--

tballison closed pull request #19: Add Github CI workflows for multi-arch 
Docker images 

> Multi-arch support for docker images
> 
>
> Key: TIKA-4258
> URL: https://issues.apache.org/jira/browse/TIKA-4258
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> This is a post-PR ticket to cover the work on: 
> https://github.com/apache/tika-docker/pull/19
> Related: https://issues.apache.org/jira/browse/INFRA-25803 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848342#comment-17848342
 ] 

ASF GitHub Bot commented on TIKA-4258:
--

tballison commented on PR #19:
URL: https://github.com/apache/tika-docker/pull/19#issuecomment-2123275951

   At long last, I think we're all set. Thank you @fpiesche for opening this 
and for all of your work on it! I'm sorry I took the more daft option, but here 
we are. If you still want, please do open a separate issue for dependabot.
   
   Thank you to all who helped with this and tested the alpha release!




> Multi-arch support for docker images
> 
>
> Key: TIKA-4258
> URL: https://issues.apache.org/jira/browse/TIKA-4258
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> This is a post-PR ticket to cover the work on: 
> https://github.com/apache/tika-docker/pull/19
> Related: https://issues.apache.org/jira/browse/INFRA-25803 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848087#comment-17848087
 ] 

ASF GitHub Bot commented on TIKA-4258:
--

nextgens commented on PR #19:
URL: https://github.com/apache/tika-docker/pull/19#issuecomment-2122027255

   I have just tried ``apache/tika:2.9.2-alpha-multi-arch-full`` on an Ampere 
A1 (arm64) box and that seems to work fine




> Multi-arch support for docker images
> 
>
> Key: TIKA-4258
> URL: https://issues.apache.org/jira/browse/TIKA-4258
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> This is a post-PR ticket to cover the work on: 
> https://github.com/apache/tika-docker/pull/19
> Related: https://issues.apache.org/jira/browse/INFRA-25803 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4257) Tika detect() recognizes some p7m files as format x-dbf

2024-05-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847981#comment-17847981
 ] 

ASF GitHub Bot commented on TIKA-4257:
--

tballison merged PR #1773:
URL: https://github.com/apache/tika/pull/1773




> Tika detect() recognizes some p7m files as format x-dbf
> ---
>
> Key: TIKA-4257
> URL: https://issues.apache.org/jira/browse/TIKA-4257
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.9.2
>Reporter: Luca Bentivoglio
>Priority: Major
> Attachments: JUTestFileFormatDetectionTika.java, test.zip.p7m, 
> test_firmato.zip.p7m
>
>
> Tika detect method sometimes recognizes p7m files as format application/x-dbf.
> In the attachment I leave a pair of specific examples
> test_firmato.zip.p7m  that contains a zip containing a signed xml
> test.zip.p7m  that contains a zip containing an unsigned xml
> I also attach a possible JUnit test with which the problem occurs.
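A minimal detection harness along the lines of the attached JUnit test; the file names come from the attachments, and the detector wiring is the stock default, so this is a reproduction sketch rather than the attached test itself:

    import java.nio.file.Path;
    import java.nio.file.Paths;

    import org.apache.tika.config.TikaConfig;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;

    public class P7mDetectSketch {
        public static void main(String[] args) throws Exception {
            Detector detector = TikaConfig.getDefaultConfig().getDetector();
            for (String name : new String[]{"test.zip.p7m", "test_firmato.zip.p7m"}) {
                Path path = Paths.get(name);
                try (TikaInputStream stream = TikaInputStream.get(path)) {
                    // The report says some of these come back as application/x-dbf
                    // instead of a pkcs7 type.
                    MediaType type = detector.detect(stream, new Metadata());
                    System.out.println(name + " -> " + type);
                }
            }
        }
    }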



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4257) Tika detect() recognizes some p7m files as format x-dbf

2024-05-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847972#comment-17847972
 ] 

ASF GitHub Bot commented on TIKA-4257:
--

tballison opened a new pull request, #1773:
URL: https://github.com/apache/tika/pull/1773

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Tika in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Tika detect() recognizes some p7m files as format x-dbf
> ---
>
> Key: TIKA-4257
> URL: https://issues.apache.org/jira/browse/TIKA-4257
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.9.2
>Reporter: Luca Bentivoglio
>Priority: Major
> Attachments: JUTestFileFormatDetectionTika.java, test.zip.p7m, 
> test_firmato.zip.p7m
>
>
> The Tika detect method sometimes recognizes p7m files as format application/x-dbf.
> In the attachments I provide two specific examples:
> test_firmato.zip.p7m, which contains a zip containing a signed xml
> test.zip.p7m, which contains a zip containing an unsigned xml
> I also attach a possible JUnit test with which the problem occurs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847947#comment-17847947
 ] 

ASF GitHub Bot commented on TIKA-4258:
--

tballison commented on PR #19:
URL: https://github.com/apache/tika-docker/pull/19#issuecomment-2120876458

   > I think building multiarch with buildx requires QEMU, but as long as 
that's available on the host doing the builds, just running buildx should be 
perfectly fine - that's all the github workflow does after all!




> Multi-arch support for docker images
> 
>
> Key: TIKA-4258
> URL: https://issues.apache.org/jira/browse/TIKA-4258
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> This is a post-PR ticket to cover the work on: 
> https://github.com/apache/tika-docker/pull/19
> Related: https://issues.apache.org/jira/browse/INFRA-25803 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847945#comment-17847945
 ] 

ASF GitHub Bot commented on TIKA-4258:
--

hegerdes commented on PR #19:
URL: https://github.com/apache/tika-docker/pull/19#issuecomment-2120869924

   > Wow...it looks like it actually worked?!
   > 
   > Can you all give this a shot? 
https://hub.docker.com/layers/apache/tika/2.9.2-alpha-multi-arch/images/sha256-b8b6e02e3e9f98ddae33b74881f4ead7846ee12352d53149098857378bb3393d?context=repo
   
   Nice, thx 
   Runs just fine on a raspberry pi 4 (arm64)




> Multi-arch support for docker images
> 
>
> Key: TIKA-4258
> URL: https://issues.apache.org/jira/browse/TIKA-4258
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> This is a post-PR ticket to cover the work on: 
> https://github.com/apache/tika-docker/pull/19
> Related: https://issues.apache.org/jira/browse/INFRA-25803 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847937#comment-17847937
 ] 

ASF GitHub Bot commented on TIKA-4258:
--

tballison commented on PR #19:
URL: https://github.com/apache/tika-docker/pull/19#issuecomment-2120845395

   Wow...it looks like it actually worked?!
   
   Can you all give this a shot? 
https://hub.docker.com/layers/apache/tika/2.9.2-alpha-multi-arch/images/sha256-b8b6e02e3e9f98ddae33b74881f4ead7846ee12352d53149098857378bb3393d?context=repo




> Multi-arch support for docker images
> 
>
> Key: TIKA-4258
> URL: https://issues.apache.org/jira/browse/TIKA-4258
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> This is a post-PR ticket to cover the work on: 
> https://github.com/apache/tika-docker/pull/19
> Related: https://issues.apache.org/jira/browse/INFRA-25803 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847929#comment-17847929
 ] 

ASF GitHub Bot commented on TIKA-4258:
--

tballison commented on PR #19:
URL: https://github.com/apache/tika-docker/pull/19#issuecomment-2120807457

   Let's add other registries on a later ticket?
   
   How's this look? https://github.com/apache/tika-docker/pull/21
   
   I haven't tested it.




> Multi-arch support for docker images
> 
>
> Key: TIKA-4258
> URL: https://issues.apache.org/jira/browse/TIKA-4258
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> This is a post-PR ticket to cover the work on: 
> https://github.com/apache/tika-docker/pull/19
> Related: https://issues.apache.org/jira/browse/INFRA-25803 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847905#comment-17847905
 ] 

ASF GitHub Bot commented on TIKA-4258:
--

fpiesche commented on PR #19:
URL: https://github.com/apache/tika-docker/pull/19#issuecomment-2120688287

   I think building multiarch with buildx requires QEMU, but as long as that's 
available on the host doing the builds, just running buildx should be perfectly 
fine - that's all the github workflow does after all!




> Multi-arch support for docker images
> 
>
> Key: TIKA-4258
> URL: https://issues.apache.org/jira/browse/TIKA-4258
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> This is a post-PR ticket to cover the work on: 
> https://github.com/apache/tika-docker/pull/19
> Related: https://issues.apache.org/jira/browse/INFRA-25803 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847896#comment-17847896
 ] 

ASF GitHub Bot commented on TIKA-4258:
--

tballison commented on PR #19:
URL: https://github.com/apache/tika-docker/pull/19#issuecomment-2120577030

   > If securing the credentials required for dockerhub is the only concern, I 
think using github container registry instead may be a great solution. 
https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry
   > 
   > If you still want the images to be on dockerhub you could sync them 
(locally or otherwise) using a tool such as 
https://github.com/regclient/regclient/. We use it in Mailu, see 
https://github.com/Mailu/Mailu/blob/master/.github/workflows/mirror.yml#L35
   
   Awesome. Thank you. ASF infra has a way to do the auth. My current thinking 
is not to rework our workflow into github actions, but rather see if we can 
tweak our current workflow to get multi-arch images.




> Multi-arch support for docker images
> 
>
> Key: TIKA-4258
> URL: https://issues.apache.org/jira/browse/TIKA-4258
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> This is a post-PR ticket to cover the work on: 
> https://github.com/apache/tika-docker/pull/19
> Related: https://issues.apache.org/jira/browse/INFRA-25803 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847895#comment-17847895
 ] 

ASF GitHub Bot commented on TIKA-4258:
--

tballison commented on PR #19:
URL: https://github.com/apache/tika-docker/pull/19#issuecomment-2120574718

   How's this for a proposed way forward?
   
   We basically keep our current workflow on the release manager's 
laptop/hardware. We modify our build scripts to build a single-arch image, run 
our usual tests and then do a second call to docker buildx where we build 
multiarch images and then deploy to dockerhub?




> Multi-arch support for docker images
> 
>
> Key: TIKA-4258
> URL: https://issues.apache.org/jira/browse/TIKA-4258
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> This is a post-PR ticket to cover the work on: 
> https://github.com/apache/tika-docker/pull/19
> Related: https://issues.apache.org/jira/browse/INFRA-25803 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847887#comment-17847887
 ] 

ASF GitHub Bot commented on TIKA-4258:
--

nextgens commented on PR #19:
URL: https://github.com/apache/tika-docker/pull/19#issuecomment-2120530390

   If securing the credentials required for dockerhub is the only concern, I 
think using github container registry instead may be a great solution.
   
https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry
   
   If you still want the images to be on dockerhub you could sync them (locally 
or otherwise) using a tool such as https://github.com/regclient/regclient/. We 
use it in Mailu, see 
https://github.com/Mailu/Mailu/blob/master/.github/workflows/mirror.yml#L35




> Multi-arch support for docker images
> 
>
> Key: TIKA-4258
> URL: https://issues.apache.org/jira/browse/TIKA-4258
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> This is a post-PR ticket to cover the work on: 
> https://github.com/apache/tika-docker/pull/19
> Related: https://issues.apache.org/jira/browse/INFRA-25803 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847884#comment-17847884
 ] 

ASF GitHub Bot commented on TIKA-4258:
--

tballison commented on PR #19:
URL: https://github.com/apache/tika-docker/pull/19#issuecomment-2120501490

   It looks like Airflow at least has moved away from github actions and moved 
towards a release manager building locally and pushing to dockerhub.

> Multi-arch support for docker images
> 
>
> Key: TIKA-4258
> URL: https://issues.apache.org/jira/browse/TIKA-4258
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> This is a post-PR ticket to cover the work on: 
> https://github.com/apache/tika-docker/pull/19
> Related: https://issues.apache.org/jira/browse/INFRA-25803 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4256) Allow inlining of ocr'd text in container document

2024-05-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847874#comment-17847874
 ] 

ASF GitHub Bot commented on TIKA-4256:
--

tballison merged PR #1762:
URL: https://github.com/apache/tika/pull/1762




> Allow inlining of ocr'd text in container document
> --
>
> Key: TIKA-4256
> URL: https://issues.apache.org/jira/browse/TIKA-4256
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> For legacy tika, we're inlining all content from embedded files including ocr 
> content of embedded images.
> However, for the RecursiveParserWrapper, /rmeta , -J option, users have to 
> stitch inlined image ocr text back into the container file's content.
> For example, if a docx has an image in it and tesseract is invoked, the 
> structure will notionally be:
> [
>   { "type":"docx", "content": "main content of the file"}
>   { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
> ]
> It would be useful to allow an option to inline the extracted text in the 
> parent document. I think we want to keep the embedded inline object so that 
> we don't lose metadata from it. So I propose this kind of output:
> [
>   { "type":"docx", "content": "main content of the file <div type=\"ocr\">ocr'd content</div>"}
>   { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
> ]
> This proposal includes the ocr'd content marked by <div type="ocr"> in the container 
> file, and it includes the ocr'd text in the embedded image.
> For now this proposal does not include inlining ocr'd text from thumbnails. 
> We can do that on a later ticket if desired.
> This will allow a more intuitive search for non-file forensics users and will 
> be more similar to what we're doing with rendering a page -> ocr in PDFs when 
> that is configured.
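For readers less familiar with the RecursiveParserWrapper / -J output referenced
above, a minimal sketch (not part of the proposal) of how the list-of-metadata
structure is produced; the docx file name is a placeholder and
tika-parsers-standard-package is assumed to be on the classpath:

```
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;

public class RmetaStructureExample {
    public static void main(String[] args) throws Exception {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(new AutoDetectParser());
        RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
                new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.XML, -1));
        try (InputStream stream = Files.newInputStream(Paths.get("doc-with-image.docx"))) {
            wrapper.parse(stream, handler, new Metadata(), new ParseContext());
        }
        // element 0 is the container docx; embedded images (and their OCR text, when
        // tesseract is configured) show up as separate entries later in the list
        List<Metadata> metadataList = handler.getMetadataList();
        System.out.println(metadataList.size());
    }
}
```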



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4256) Allow inlining of ocr'd text in container document

2024-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847335#comment-17847335
 ] 

ASF GitHub Bot commented on TIKA-4256:
--

tballison opened a new pull request, #1762:
URL: https://github.com/apache/tika/pull/1762

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to the 
relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Allow inlining of ocr'd text in container document
> --
>
> Key: TIKA-4256
> URL: https://issues.apache.org/jira/browse/TIKA-4256
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> For legacy tika, we're inlining all content from embedded files including ocr 
> content of embedded images.
> However, for the RecursiveParserWrapper, /rmeta , -J option, users have to 
> stitch inlined image ocr text back into the container file's content.
> For example, if a docx has an image in it and tesseract is invoked, the 
> structure will notionally be:
> [
>   { "type":"docx", "content": "main content of the file"}
>   { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
> ]
> It would be useful to allow an option to inline the extracted text in the 
> parent document. I think we want to keep the embedded inline object so that 
> we don't lose metadata from it. So I propose this kind of output:
> [
>   { "type":"docx", "content": "main content of the file <div type=\"ocr\">ocr'd content</div>"}
>   { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
> ]
> This proposal includes the ocr'd content marked by <div type="ocr"> in the container 
> file, and it includes the ocr'd text in the embedded image.
> For now this proposal does not include inlining ocr'd text from thumbnails. 
> We can do that on a later ticket if desired.
> This will allow a more intuitive search for non-file forensics users and will 
> be more similar to what we're doing with rendering a page -> ocr in PDFs when 
> that is configured.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4255) TextAndCSVParser ignores Metadata.CONTENT_ENCODING

2024-05-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846908#comment-17846908
 ] 

ASF GitHub Bot commented on TIKA-4255:
--

axeld opened a new pull request, #1761:
URL: https://github.com/apache/tika/pull/1761

   If CSVParams.getCharset() is null, the passed in encoding is used before 
trying to auto detect it.
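   
   A minimal sketch of the scenario this PR addresses, passing a content-type 
and content-encoding hint via Metadata to the auto-detect parser (assumes 
tika-core and tika-parsers-standard-package on the classpath):
   
   ```
   import java.io.ByteArrayInputStream;
   import java.io.InputStream;
   import java.nio.charset.StandardCharsets;
   import org.apache.tika.metadata.Metadata;
   import org.apache.tika.parser.AutoDetectParser;
   import org.apache.tika.parser.ParseContext;
   import org.apache.tika.sax.BodyContentHandler;

   public class EncodingHintExample {
       public static void main(String[] args) throws Exception {
           Metadata metadata = new Metadata();
           metadata.set(Metadata.CONTENT_TYPE, "text/plain");
           metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
           BodyContentHandler handler = new BodyContentHandler();
           try (InputStream stream = new ByteArrayInputStream("ETL".getBytes(StandardCharsets.UTF_8))) {
               // with the fix, the UTF-8 hint should be used instead of charset auto-detection
               new AutoDetectParser().parse(stream, handler, metadata, new ParseContext());
           }
           System.out.println(handler.toString());
       }
   }
   ```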
   




> TextAndCSVParser ignores Metadata.CONTENT_ENCODING
> --
>
> Key: TIKA-4255
> URL: https://issues.apache.org/jira/browse/TIKA-4255
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.6.0, 3.0.0-BETA, 2.9.2
>Reporter: Axel Dörfler
>Priority: Major
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> I pass a text to the auto-detect parser that just contains the text "ETL". I 
> pass on content type and content encoding information via Metadata.
> However, TextAndCSVParser ignores the provided encoding (since CSVParams has 
> not been provided via TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE) and chooses 
> to detect it by itself instead. It turns out it detects some IBM424 Hebrew 
> charset and uses that, which results in rather surprising output.
> Tested with the mentioned versions, though the bug should be much older 
> already.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.

2024-05-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845649#comment-17845649
 ] 

ASF GitHub Bot commented on TIKA-4254:
--

kaiyaok2 commented on PR #1754:
URL: https://github.com/apache/tika/pull/1754#issuecomment-2106037067

   @THausherr @tballison  I confirmed that the two lines in `@BeforeEach` 
**do not** create a new repo if one exists from a previous test run:
   ```
   TikaConfig config = TikaConfig.getDefaultConfig();
   repo = config.getMimeRepository();
   ```
   
   
   `TikaConfig.getDefaultConfig()` simply calls the default `TikaConfig()` 
constructor 
(https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java#L390).
 
   
   When the system property `'tika.config'` and the environment variable 
`'TIKA_CONFIG'` are both not set, the `mimeTypes` field (accessible by 
`getMimeRepository()` - which is `repo` in our context) of the constructed 
config will be 
`getDefaultMimeTypes(getContextClassLoader())`(https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java#L246).
   
   Now take a look at `getDefaultMimeTypes()` - when a classloader is given 
(`getContextClassLoader()` in our context), it first tries to retrieve from a 
HashMap via `CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader);` 
(https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L150).
 Notice that `CLASSLOADER_SPECIFIC_DEFAULT_TYPES` is not an instance variable, 
but a **static** `HashMap`. 
   
   So in the first test execution, the `CLASSLOADER_SPECIFIC_DEFAULT_TYPES` is 
empty, so `types` after the line `types = 
CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader);` will be `null`, and is 
later initialized by `MimeTypesFactory.create()` as desired. After this, the 
initialized `types` is put to the static `CLASSLOADER_SPECIFIC_DEFAULT_TYPES` 
map 
(https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L166).
 
   
   Now in the second test execution, the `CLASSLOADER_SPECIFIC_DEFAULT_TYPES` 
already has the key of the context class loader, with corresponding `types` 
initialized from the previous run. So 
`CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader)` will return such 
initialized object directly. In other words, `repo` **would be the same object 
across repeated test runs**.
   
   I think the essential idea of `CLASSLOADER_SPECIFIC_DEFAULT_TYPES` is a 1-to-1 
map between classloaders and default types, so this implementation does not 
seem buggy to me, but please confirm.
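   
   A minimal sketch that makes the caching behavior described above concrete; 
it assumes only tika-core on the classpath and that neither the `tika.config` 
system property nor the `TIKA_CONFIG` environment variable is set:
   
   ```
   import org.apache.tika.config.TikaConfig;
   import org.apache.tika.mime.MimeTypes;

   public class DefaultMimeTypesCacheExample {
       public static void main(String[] args) {
           // both calls hand back the classloader-cached default MimeTypes instance,
           // so the two references point at the same object
           MimeTypes first = TikaConfig.getDefaultConfig().getMimeRepository();
           MimeTypes second = TikaConfig.getDefaultConfig().getMimeRepository();
           System.out.println(first == second); // true
       }
   }
   ```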
   




> The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the 
> first run and fails in repeated runs in the same environment. 
> 
>
> Key: TIKA-4254
> URL: https://issues.apache.org/jira/browse/TIKA-4254
> Project: Tika
>  Issue Type: Bug
>Reporter: Kaiyao Ke
>Priority: Major
>
> ### Brief Description of the Bug
> The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the 
> first run but fails in the second run in the same environment. The source of 
> the problem is that each test execution initializes a new media type 
> (`MimeType`) instance `testType` (same problem for `testType2`), and all 
> media types across different test executions attempt to use the same name 
> pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of 
> the test, the line `this.repo.addPattern(testType, pattern, true);` will 
> throw an error, since the name pattern is already used by the `testType` 
> instance initiated from the first test execution. Specifically, in the second 
> run, the `addGlob()` method of the `Pattern` class will assert conflict 
> patterns and throw a`MimeTypeException`(line 123 in `Patterns.java`).
> ### Failure Message in the 2nd Test Run:
> ```
> org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: 
> rtg_sst_grb_0\.5\.\d{8}
>   at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
>   at org.apache.tika.mime.Patterns.add(Patterns.java:71)
>   at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
>   at 
> org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
> ```
> ### Reproduce
> Use the `NIOInspector` plugin that supports rerunning individual tests 

[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845623#comment-17845623
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

nddipiazza commented on code in PR #1753:
URL: https://github.com/apache/tika/pull/1753#discussion_r1597463036


##
tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java:
##
@@ -455,33 +455,33 @@ private Fetcher getFetcher(FetchEmitTuple t) {
 }
 }
 
-protected MetadataListAndEmbeddedBytes parseFromTuple(FetchEmitTuple t, 
Fetcher fetcher) {
-FetchKey fetchKey = t.getFetchKey();
+protected MetadataListAndEmbeddedBytes parseFromTuple(FetchEmitTuple 
fetchEmitTuple, Fetcher fetcher) {
+FetchKey fetchKey = fetchEmitTuple.getFetchKey();
+Metadata fetchResponseMetadata = new Metadata();

Review Comment:
   shoot i didn't realize i was deploying broken builds! reverted. i'll make 
this change and make a new pr 





> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
> UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get();
> try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>     objectOutputStream.writeObject(t);
> }
> byte[] bytes = bos.toByteArray();
> output.write(CALL.getByte());
> output.writeInt(bytes.length);
> output.write(bytes);
> output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.

2024-05-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845595#comment-17845595
 ] 

ASF GitHub Bot commented on TIKA-4254:
--

kaiyaok2 commented on PR #1754:
URL: https://github.com/apache/tika/pull/1754#issuecomment-2105684175

   > getMimeRepository
   
   @THausherr I think it might be the case. I wrote this dummy test, and it fails 
under surefire:
   ```
   @Test
   public void testResetRepo() throws Exception {
   TikaConfig config0 = TikaConfig.getDefaultConfig();
   MimeTypes repo0 = config0.getMimeRepository();
   MimeType testType0 = new MimeType(MediaType.parse("baz/bar"));
   String pattern = "rtg_sst_grb_0\\.5\\.\\d{9}";
   repo0.addPattern(testType0, pattern, true);
   
   TikaConfig config1 = TikaConfig.getDefaultConfig();
   MimeTypes repo1 = config1.getMimeRepository();
   MimeType testType1 = new MimeType(MediaType.parse("baz/bar"));
   repo1.addPattern(testType1, pattern, true);
   }
   ```
   
   Error Message:
   ```
   [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
0.857 s <<< FAILURE! 

> The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the 
> first run and fails in repeated runs in the same environment. 
> 
>
> Key: TIKA-4254
> URL: https://issues.apache.org/jira/browse/TIKA-4254
> Project: Tika
>  Issue Type: Bug
>Reporter: Kaiyao Ke
>Priority: Major
>
> ### Brief Description of the Bug
> The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the 
> first run but fails in the second run in the same environment. The source of 
> the problem is that each test execution initializes a new media type 
> (`MimeType`) instance `testType` (same problem for `testType2`), and all 
> media types across different test executions attempt to use the same name 
> pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of 
> the test, the line `this.repo.addPattern(testType, pattern, true);` will 
> throw an error, since the name pattern is already used by the `testType` 
> instance initiated from the first test execution. Specifically, in the second 
> run, the `addGlob()` method of the `Pattern` class will assert conflict 
> patterns and throw a`MimeTypeException`(line 123 in `Patterns.java`).
> ### Failure Message in the 2nd Test Run:
> ```
> org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: 
> rtg_sst_grb_0\.5\.\d{8}
>   at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
>   at org.apache.tika.mime.Patterns.add(Patterns.java:71)
>   at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
>   at 
> org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
> ```
> ### Reproduce
> Use the `NIOInspector` plugin that supports rerunning individual tests in the 
> same environment:
> ```
> cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
> mvn edu.illinois:NIOInspector:rerun 
> -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex
> ```
> ### Proposed Fix
> Declare `testType` and `testType2` as static variables and initialize them at 
> class loading time. Therefore, repeated runs of `testJavaRegex()` will not 
> conflict each other. All tests pass and are idempotent after the fix.
> ### Necessity of Fix
> A fix is recommended as unit tests shall be idempotent, and state pollution 
> shall be mitigated so that newly introduced tests do not fail in the future 
> due to polluted shared states.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.

2024-05-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845590#comment-17845590
 ] 

ASF GitHub Bot commented on TIKA-4254:
--

THausherr commented on PR #1754:
URL: https://github.com/apache/tika/pull/1754#issuecomment-2105679546

   Maybe I get it: `repo = config.getMimeRepository();` isn't creating anything 
new, it's retrieving something that is changed later by the test? If my 
understanding is correct then it's a deeper problem.




> The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the 
> first run and fails in repeated runs in the same environment. 
> 
>
> Key: TIKA-4254
> URL: https://issues.apache.org/jira/browse/TIKA-4254
> Project: Tika
>  Issue Type: Bug
>Reporter: Kaiyao Ke
>Priority: Major
>
> ### Brief Description of the Bug
> The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the 
> first run but fails in the second run in the same environment. The source of 
> the problem is that each test execution initializes a new media type 
> (`MimeType`) instance `testType` (same problem for `testType2`), and all 
> media types across different test executions attempt to use the same name 
> pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of 
> the test, the line `this.repo.addPattern(testType, pattern, true);` will 
> throw an error, since the name pattern is already used by the `testType` 
> instance initiated from the first test execution. Specifically, in the second 
> run, the `addGlob()` method of the `Pattern` class will assert conflict 
> patterns and throw a`MimeTypeException`(line 123 in `Patterns.java`).
> ### Failure Message in the 2nd Test Run:
> ```
> org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: 
> rtg_sst_grb_0\.5\.\d{8}
>   at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
>   at org.apache.tika.mime.Patterns.add(Patterns.java:71)
>   at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
>   at 
> org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
> ```
> ### Reproduce
> Use the `NIOInspector` plugin that supports rerunning individual tests in the 
> same environment:
> ```
> cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
> mvn edu.illinois:NIOInspector:rerun 
> -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex
> ```
> ### Proposed Fix
> Declare `testType` and `testType2` as static variables and initialize them at 
> class loading time. Therefore, repeated runs of `testJavaRegex()` will not 
> conflict each other. All tests pass and are idempotent after the fix.
> ### Necessity of Fix
> A fix is recommended as unit tests shall be idempotent, and state pollution 
> shall be mitigated so that newly introduced tests do not fail in the future 
> due to polluted shared states.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.

2024-05-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845586#comment-17845586
 ] 

ASF GitHub Bot commented on TIKA-4254:
--

kaiyaok2 commented on PR #1754:
URL: https://github.com/apache/tika/pull/1754#issuecomment-2105675512

   > The `repo` is refreshed with each unit test in the `@BeforeEach` call, 
though. Is NIODetector respecting that?
   
   @tballison Yes, NIOInspector uses the JUnit Jupiter engine and takes into 
account all setup and teardown methods. Notice that although the `MimeTypes` 
instance `repo` is refreshed, `MimeTypes.addPattern()` calls `Patterns.add()` 
,which then calls `addGlob()`:
   ```
   private void addGlob(String glob, MimeType type) throws MimeTypeException {
   MimeType previous = globs.get(glob);
   if (previous == null || 
registry.isSpecializationOf(previous.getType(), type.getType())) {
   globs.put(glob, type);
   } else if (previous == type ||
   registry.isSpecializationOf(type.getType(), 
previous.getType())) {
   // do nothing
   } else {
   throw new MimeTypeException("Conflicting glob pattern: " + glob);
   }
   }
   ```
   In the second execution of the test, `previous` would be the `testType` 
object constructed in the first test run, while `type` is the `testType` object 
constructed in the second test run (from 2 different calls to `new 
MimeType(MediaType.parse("foo/bar"))`). Now since `previous` and `type` are not the 
same object, the exception is thrown. 
   
   Ideally we shall go to the `// do nothing` branch in repeated runs, thus the 
fix.
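   
   A short sketch of what the proposed fix would look like inside 
`org.apache.tika.mime.TestMimeTypes` itself (the second media type string is 
illustrative; the real test defines its own):
   
   ```
   // Build the MimeType instances once, at class-loading time, so every run of
   // testJavaRegex() re-registers the same objects and Patterns.addGlob() takes its
   // "previous == type" branch instead of throwing a MimeTypeException.
   private static final MimeType testType = new MimeType(MediaType.parse("foo/bar"));
   private static final MimeType testType2 = new MimeType(MediaType.parse("foo2/bar2"));
   ```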




> The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the 
> first run and fails in repeated runs in the same environment. 
> 
>
> Key: TIKA-4254
> URL: https://issues.apache.org/jira/browse/TIKA-4254
> Project: Tika
>  Issue Type: Bug
>Reporter: Kaiyao Ke
>Priority: Major
>
> ### Brief Description of the Bug
> The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the 
> first run but fails in the second run in the same environment. The source of 
> the problem is that each test execution initializes a new media type 
> (`MimeType`) instance `testType` (same problem for `testType2`), and all 
> media types across different test executions attempt to use the same name 
> pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of 
> the test, the line `this.repo.addPattern(testType, pattern, true);` will 
> throw an error, since the name pattern is already used by the `testType` 
> instance initiated from the first test execution. Specifically, in the second 
> run, the `addGlob()` method of the `Pattern` class will assert conflict 
> patterns and throw a`MimeTypeException`(line 123 in `Patterns.java`).
> ### Failure Message in the 2nd Test Run:
> ```
> org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: 
> rtg_sst_grb_0\.5\.\d{8}
>   at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
>   at org.apache.tika.mime.Patterns.add(Patterns.java:71)
>   at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
>   at 
> org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
> ```
> ### Reproduce
> Use the `NIOInspector` plugin that supports rerunning individual tests in the 
> same environment:
> ```
> cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
> mvn edu.illinois:NIOInspector:rerun 
> -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex
> ```
> ### Proposed Fix
> Declare `testType` and `testType2` as static variables and initialize them at 
> class loading time. Therefore, repeated runs of `testJavaRegex()` will not 
> conflict each other. All tests pass and are idempotent after the fix.
> ### Necessity of Fix
> A fix is recommended as unit tests shall be idempotent, and state pollution 
> shall be mitigated so that newly introduced tests do not fail in the future 
> due to polluted shared states.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845583#comment-17845583
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

tballison commented on code in PR #1753:
URL: https://github.com/apache/tika/pull/1753#discussion_r1597416611


##
tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java:
##
@@ -455,33 +455,33 @@ private Fetcher getFetcher(FetchEmitTuple t) {
 }
 }
 
-protected MetadataListAndEmbeddedBytes parseFromTuple(FetchEmitTuple t, 
Fetcher fetcher) {
-FetchKey fetchKey = t.getFetchKey();
+protected MetadataListAndEmbeddedBytes parseFromTuple(FetchEmitTuple 
fetchEmitTuple, Fetcher fetcher) {
+FetchKey fetchKey = fetchEmitTuple.getFetchKey();
+Metadata fetchResponseMetadata = new Metadata();

Review Comment:
   @nddipiazza any chance you can revert this in main so that we have a working 
build? Thank you!





> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
> UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get();
> try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>     objectOutputStream.writeObject(t);
> }
> byte[] bytes = bos.toByteArray();
> output.write(CALL.getByte());
> output.writeInt(bytes.length);
> output.write(bytes);
> output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.

2024-05-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845581#comment-17845581
 ] 

ASF GitHub Bot commented on TIKA-4254:
--

tballison commented on PR #1754:
URL: https://github.com/apache/tika/pull/1754#issuecomment-2105664986

   The `repo` is refreshed with each unit test in the `@BeforeEach` call, 
though. Is NIODetector respecting that?




> The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the 
> first run and fails in repeated runs in the same environment. 
> 
>
> Key: TIKA-4254
> URL: https://issues.apache.org/jira/browse/TIKA-4254
> Project: Tika
>  Issue Type: Bug
>Reporter: Kaiyao Ke
>Priority: Major
>
> ### Brief Description of the Bug
> The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the 
> first run but fails in the second run in the same environment. The source of 
> the problem is that each test execution initializes a new media type 
> (`MimeType`) instance `testType` (same problem for `testType2`), and all 
> media types across different test executions attempt to use the same name 
> pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of 
> the test, the line `this.repo.addPattern(testType, pattern, true);` will 
> throw an error, since the name pattern is already used by the `testType` 
> instance initiated from the first test execution. Specifically, in the second 
> run, the `addGlob()` method of the `Pattern` class will assert conflict 
> patterns and throw a`MimeTypeException`(line 123 in `Patterns.java`).
> ### Failure Message in the 2nd Test Run:
> ```
> org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: 
> rtg_sst_grb_0\.5\.\d{8}
>   at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
>   at org.apache.tika.mime.Patterns.add(Patterns.java:71)
>   at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
>   at 
> org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
> ```
> ### Reproduce
> Use the `NIOInspector` plugin that supports rerunning individual tests in the 
> same environment:
> ```
> cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
> mvn edu.illinois:NIODetector:rerun 
> -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex
> ```
> ### Proposed Fix
> Declare `testType` and `testType2` as static variables and initialize them at 
> class loading time. Therefore, repeated runs of `testJavaRegex()` will not 
> conflict each other. All tests pass and are idempotent after the fix.
> ### Necessity of Fix
> A fix is recommended as unit tests shall be idempotent, and state pollution 
> shall be mitigated so that newly introduced tests do not fail in the future 
> due to polluted shared states.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.

2024-05-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845560#comment-17845560
 ] 

ASF GitHub Bot commented on TIKA-4254:
--

kaiyaok2 opened a new pull request, #1754:
URL: https://github.com/apache/tika/pull/1754

   Fixes https://issues.apache.org/jira/projects/TIKA/issues/TIKA-4254
   
   ### Brief Description of the Bug
   
   The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in 
the first run but fails in the second run in the same environment. The source 
of the problem is that each test execution initializes a new media type 
(`MimeType`) instance `testType` (same problem for `testType2`), and all media 
types across different test executions attempt to use the same name pattern 
`"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of the test, 
the line `this.repo.addPattern(testType, pattern, true);` will throw an error, 
since the name pattern is already used by the `testType` instance initiated 
from the first test execution. Specifically, in the second run, the `addGlob()` 
method of the `Pattern` class will assert conflict patterns and throw 
a`MimeTypeException`(line 123 in `Patterns.java`).
   
   ### Failure Message in the 2nd Test Run:
   ```
   org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: 
rtg_sst_grb_0\.5\.\d{8}
at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
at org.apache.tika.mime.Patterns.add(Patterns.java:71)
at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
at 
org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
   ```
   
   ### Reproduce
   
   Use the `NIOInspector` plugin that supports rerunning individual tests in 
the same environment:
   ```
   cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
   mvn edu.illinois:NIODetector:rerun 
-Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex
   ```
   
   ### Proposed Fix
   
   Declare `testType` and `testType2` as static variables and initialize them 
at class loading time. Therefore, repeated runs of `testJavaRegex()` will not 
conflict each other. All tests pass and are idempotent after the fix.
   
   ### Necessity of Fix
   
   A fix is recommended as unit tests shall be idempotent, and state pollution 
shall be mitigated so that newly introduced tests do not fail in the future due 
to polluted shared states.
   
   
   




> The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the 
> first run and fails in repeated runs in the same environment. 
> 
>
> Key: TIKA-4254
> URL: https://issues.apache.org/jira/browse/TIKA-4254
> Project: Tika
>  Issue Type: Bug
>Reporter: Kaiyao Ke
>Priority: Major
>
> ### Brief Description of the Bug
> The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the 
> first run but fails in the second run in the same environment. The source of 
> the problem is that each test execution initializes a new media type 
> (`MimeType`) instance `testType` (same problem for `testType2`), and all 
> media types across different test executions attempt to use the same name 
> pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of 
> the test, the line `this.repo.addPattern(testType, pattern, true);` will 
> throw an error, since the name pattern is already used by the `testType` 
> instance initiated from the first test execution. Specifically, in the second 
> run, the `addGlob()` method of the `Pattern` class will assert conflict 
> patterns and throw a`MimeTypeException`(line 123 in `Patterns.java`).
> ### Failure Message in the 2nd Test Run:
> ```
> org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: 
> rtg_sst_grb_0\.5\.\d{8}
>   at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
>   at org.apache.tika.mime.Patterns.add(Patterns.java:71)
>   at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
>   at 
> org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
> ```
> ### Reproduce
> Use the `NIOInspector` plugin that supports rerunning individual tests in the 
> same environment:
> ```
> cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
> mvn edu.illinois:NIODetector:rerun 

[jira] [Commented] (TIKA-4232) Create and execute unit tests for tika-helm

2024-05-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845510#comment-17845510
 ] 

ASF GitHub Bot commented on TIKA-4232:
--

lewismc commented on PR #17:
URL: https://github.com/apache/tika-helm/pull/17#issuecomment-2105292770

   The INFRA ticket was resolved and everything is passing now. 




> Create and execute unit tests for tika-helm
> ---
>
> Key: TIKA-4232
> URL: https://issues.apache.org/jira/browse/TIKA-4232
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> The goal is to execute chart unit tests against each tika-helm pull request.
> I found the [Helm Unit 
> Tests|https://github.com/marketplace/actions/helm-unit-tests] GitHub Action 
> which uses [https://github.com/helm-unittest/helm-unittest] as a Helm plugin.
> The PR will consist of one or more unit tests automated via the GitHub action.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4232) Create and execute unit tests for tika-helm

2024-05-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845509#comment-17845509
 ] 

ASF GitHub Bot commented on TIKA-4232:
--

lewismc merged PR #17:
URL: https://github.com/apache/tika-helm/pull/17




> Create and execute unit tests for tika-helm
> ---
>
> Key: TIKA-4232
> URL: https://issues.apache.org/jira/browse/TIKA-4232
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> The goal is to execute chart unit tests against each tika-helm pull request.
> I found the [Helm Unit 
> Tests|https://github.com/marketplace/actions/helm-unit-tests] GitHub Action 
> which uses [https://github.com/helm-unittest/helm-unittest] as a Helm plugin.
> The PR will consist of one or more unit tests automated via the GitHub action.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4232) Create and execute unit tests for tika-helm

2024-05-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845508#comment-17845508
 ] 

ASF GitHub Bot commented on TIKA-4232:
--

lewismc opened a new pull request, #17:
URL: https://github.com/apache/tika-helm/pull/17

   PR to address https://issues.apache.org/jira/browse/TIKA-4232




> Create and execute unit tests for tika-helm
> ---
>
> Key: TIKA-4232
> URL: https://issues.apache.org/jira/browse/TIKA-4232
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> The goal is to execute chart unit tests against each tika-helm pull request.
> I found the [Helm Unit 
> Tests|https://github.com/marketplace/actions/helm-unit-tests] GitHub Action 
> which uses [https://github.com/helm-unittest/helm-unittest] as a Helm plugin.
> The PR will consist of one or more unit tests automated via the GitHub action.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4232) Create and execute unit tests for tika-helm

2024-05-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845507#comment-17845507
 ] 

ASF GitHub Bot commented on TIKA-4232:
--

lewismc closed pull request #17: TIKA-4232 Create and execute unit tests for 
tika-helm
URL: https://github.com/apache/tika-helm/pull/17




> Create and execute unit tests for tika-helm
> ---
>
> Key: TIKA-4232
> URL: https://issues.apache.org/jira/browse/TIKA-4232
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> The goal is to execute chart unit tests against each tika-helm pull request.
> I found the [Helm Unit 
> Tests|https://github.com/marketplace/actions/helm-unit-tests] GitHub Action 
> which uses [https://github.com/helm-unittest/helm-unittest] as a Helm plugin.
> The PR will consist of one or more unit tests automated via the GitHub action.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845302#comment-17845302
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

tballison commented on code in PR #1753:
URL: https://github.com/apache/tika/pull/1753#discussion_r1596634451


##
tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java:
##
@@ -455,33 +455,33 @@ private Fetcher getFetcher(FetchEmitTuple t) {
 }
 }
 
-protected MetadataListAndEmbeddedBytes parseFromTuple(FetchEmitTuple t, 
Fetcher fetcher) {
-FetchKey fetchKey = t.getFetchKey();
+protected MetadataListAndEmbeddedBytes parseFromTuple(FetchEmitTuple 
fetchEmitTuple, Fetcher fetcher) {
+FetchKey fetchKey = fetchEmitTuple.getFetchKey();
+Metadata fetchResponseMetadata = new Metadata();

Review Comment:
   The metadata that goes in the fetchemittuple was envisioned to be 
user-injected metadata that was injected after the parse and then emitted (e.g. 
provenance metadata).
   
   I think we need to put both metadatas on the fetchemittuple.
   
   This is what I'm thinking...let me know what you think.
   
   So, there will be three metadatas in play. The fetchemit tuple will have a 
fetchRequestMetadata (???) and a userMetadata (???). At parse time, we'll 
create a fresh metadata object, which we'll call "responseMetadata" in the 
following call: fetcher.fetch(requestMetadata, responseMetadata).
   
   The parse will then use the responseMetadata and, after the parse, inject 
the userMetadata from the fetchEmitTuple.
   
   The fetcher may use the fetchRequestMetadata to carry out its request, but 
info from that one should not make it into the "responseMetadata" nor make it 
into the emit data.





> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new 
> FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()), 
> new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, 
> FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
> UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get();
> try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>     objectOutputStream.writeObject(t);
> }
> byte[] bytes = bos.toByteArray();
> output.write(CALL.getByte());
> output.writeInt(bytes.length);
> output.write(bytes);
> output.flush();
>  
> i verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845299#comment-17845299
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

tballison commented on code in PR #1753:
URL: https://github.com/apache/tika/pull/1753#discussion_r1596634451


##
tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java:
##
@@ -455,33 +455,33 @@ private Fetcher getFetcher(FetchEmitTuple t) {
 }
 }
 
-protected MetadataListAndEmbeddedBytes parseFromTuple(FetchEmitTuple t, 
Fetcher fetcher) {
-FetchKey fetchKey = t.getFetchKey();
+protected MetadataListAndEmbeddedBytes parseFromTuple(FetchEmitTuple 
fetchEmitTuple, Fetcher fetcher) {
+FetchKey fetchKey = fetchEmitTuple.getFetchKey();
+Metadata fetchResponseMetadata = new Metadata();

Review Comment:
   The metadata that goes in the fetchemittuple was envisioned to be 
user-injected metadata that passed through the parse process and was emitted 
(provenance metadata).
   
   I think we need to put both metadatas on the fetchemittuple.





> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(),
>         new FetchKey(fetcher.getName(), request.getFetchKey()), new EmitKey(),
>         tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>         FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is called.
>  
> It's OK through this part:
>     UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get();
>     try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>         objectOutputStream.writeObject(t);
>     }
>     byte[] bytes = bos.toByteArray();
>     output.write(CALL.getByte());
>     output.writeInt(bytes.length);
>     output.write(bytes);
>     output.flush();
>  
> I verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use an empty Metadata if the fetch
> tuple metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845207#comment-17845207
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

nddipiazza merged PR #1753:
URL: https://github.com/apache/tika/pull/1753




> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(),
>         new FetchKey(fetcher.getName(), request.getFetchKey()), new EmitKey(),
>         tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>         FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is called.
>  
> It's OK through this part:
>     UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get();
>     try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>         objectOutputStream.writeObject(t);
>     }
>     byte[] bytes = bos.toByteArray();
>     output.write(CALL.getByte());
>     output.writeInt(bytes.length);
>     output.write(bytes);
>     output.flush();
>  
> I verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use an empty Metadata if the fetch
> tuple metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845204#comment-17845204
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

nddipiazza opened a new pull request, #1753:
URL: https://github.com/apache/tika/pull/1753

   add request metadata




> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(),
>         new FetchKey(fetcher.getName(), request.getFetchKey()), new EmitKey(),
>         tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>         FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is called.
>  
> It's OK through this part:
>     UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get();
>     try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>         objectOutputStream.writeObject(t);
>     }
>     byte[] bytes = bos.toByteArray();
>     output.write(CALL.getByte());
>     output.writeInt(bytes.length);
>     output.write(bytes);
>     output.flush();
>  
> I verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use an empty Metadata if the fetch
> tuple metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4232) Create and execute unit tests for tika-helm

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845031#comment-17845031
 ] 

ASF GitHub Bot commented on TIKA-4232:
--

lewismc commented on PR #17:
URL: https://github.com/apache/tika-helm/pull/17#issuecomment-2102889158

   PR updated to address prior blocker related to use of unapproved GitHub 
Actions. Waiting on https://issues.apache.org/jira/browse/INFRA-25775




> Create and execute unit tests for tika-helm
> ---
>
> Key: TIKA-4232
> URL: https://issues.apache.org/jira/browse/TIKA-4232
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> The goal is to execute chart unit tests against each tika-helm pull request.
> I found the [Helm Unit Tests|https://github.com/marketplace/actions/helm-unit-tests]
> GitHub Action which uses [https://github.com/helm-unittest/helm-unittest] as a Helm plugin.
> The PR will consist of one or more unit tests automated via the GitHub action.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845012#comment-17845012
 ] 

ASF GitHub Bot commented on TIKA-4250:
--

tballison merged PR #1751:
URL: https://github.com/apache/tika/pull/1751




> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obviously, PST doesn't store .eml or .msg, but some users want
> the "original" emails even if they are constructed from PST records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845005#comment-17845005
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

nddipiazza opened a new pull request, #1752:
URL: https://github.com/apache/tika/pull/1752

   * metadata was not getting sent to the fetch process




> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(),
>         new FetchKey(fetcher.getName(), request.getFetchKey()), new EmitKey(),
>         tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>         FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is called.
>  
> It's OK through this part:
>     UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get();
>     try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>         objectOutputStream.writeObject(t);
>     }
>     byte[] bytes = bos.toByteArray();
>     output.write(CALL.getByte());
>     output.writeInt(bytes.length);
>     output.write(bytes);
>     output.flush();
>  
> I verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use an empty Metadata if the fetch
> tuple metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845006#comment-17845006
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

nddipiazza merged PR #1752:
URL: https://github.com/apache/tika/pull/1752




> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> when calling:
> PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(),
>         new FetchKey(fetcher.getName(), request.getFetchKey()), new EmitKey(),
>         tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>         FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is called.
>  
> It's OK through this part:
>     UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get();
>     try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>         objectOutputStream.writeObject(t);
>     }
>     byte[] bytes = bos.toByteArray();
>     output.write(CALL.getByte());
>     output.writeInt(bytes.length);
>     output.write(bytes);
>     output.flush();
>  
> I verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use an empty Metadata if the fetch
> tuple metadata is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4233) Check tika-helm for deprecated k8s APIs

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845003#comment-17845003
 ] 

ASF GitHub Bot commented on TIKA-4233:
--

lewismc merged PR #18:
URL: https://github.com/apache/tika-helm/pull/18




> Check tika-helm for deprecated k8s APIs
> ---
>
> Key: TIKA-4233
> URL: https://issues.apache.org/jira/browse/TIKA-4233
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> It is useful to know when a Helm Chart uses deprecated k8s APIs. A check for 
> this would be ideal. The “Check deprecated k8s APIs” GitHub action 
> accomplishes this.
> [https://github.com/marketplace/actions/check-deprecated-k8s-apis]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844997#comment-17844997
 ] 

ASF GitHub Bot commented on TIKA-4250:
--

tballison opened a new pull request, #1751:
URL: https://github.com/apache/tika/pull/1751

   …readpst
   
   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add new module that downstream users will depend upon add it to 
relevant group in `tika-bom/pom.xml`.
   
   We will be able to faster integrate your pull request if these conditions 
are met. If you have any questions how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Add a libpst-based parser
> -
>
> Key: TIKA-4250
> URL: https://issues.apache.org/jira/browse/TIKA-4250
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 8.eml, 8.msg
>
>
> We currently use the com.pff Java-based PST parser for PST files. It would be 
> useful to add a wrapper for libpst as an optional parser. 
> One of the benefits of libpst is that it creates .eml or .msg files from the 
> PST records. This is critical for those who want the original bytes from 
> embedded files. Obviously, PST doesn't store .eml or .msg, but some users want
> the "original" emails even if they are constructed from PST records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4221) Regression in pack200 parsing in commons-compress

2024-05-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844651#comment-17844651
 ] 

ASF GitHub Bot commented on TIKA-4221:
--

tballison merged PR #1750:
URL: https://github.com/apache/tika/pull/1750




> Regression in pack200 parsing in commons-compress
> -
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> There's a regression in pack200 that leads to the InputStream being closed 
> even if wrapped in a CloseShieldInputStream.
> This was the original signal that something was wrong, but the real problem 
> is in pack200, not xz.
> We noticed ~10 xz files with fewer attachments in the recent regression tests 
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, 
> but not a blocker (IMHO).
> The stacktrace from 
> {{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
>   looks like this:
> 3: X-TIKA:EXCEPTION:embedded_exception : 
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.DefaultParser@56a4479a
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>   at 
> org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
>   at 
> org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
> ...
> Caused by: org.tukaani.xz.XZIOException: Stream closed
>   at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
>   at 
> org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.BufferedInputStream.available(BufferedInputStream.java:410)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:800)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:412)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:389)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:49)
>   at 
> org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:389)
>   at 
> org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:329)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   ... 85 more
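For context, the wrapping pattern the description refers to looks roughly like this; a sketch assuming commons-io on the classpath and a hypothetical local file name, not the Tika parser code itself:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.io.input.CloseShieldInputStream;

public class CloseShieldSketch {
    public static void main(String[] args) throws Exception {
        // "example.pack" is a placeholder path for illustration
        try (InputStream raw = Files.newInputStream(Paths.get("example.pack"))) {
            // The shield is supposed to swallow close() so that a consumer closing
            // the wrapper (e.g. a decompressor) does not close 'raw' underneath it.
            InputStream shielded = CloseShieldInputStream.wrap(raw);
            shielded.close();      // no-op on the underlying stream
            int b = raw.read();    // 'raw' should still be readable here
            System.out.println("first byte after shielded close: " + b);
        }
    }
}
```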



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4221) Regression in pack200 parsing in commons-compress

2024-05-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844406#comment-17844406
 ] 

ASF GitHub Bot commented on TIKA-4221:
--

tballison opened a new pull request, #1750:
URL: https://github.com/apache/tika/pull/1750

   This cherry-picks from 2.x the workaround for TIKA-4221
   
   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add new module that downstream users will depend upon add it to 
relevant group in `tika-bom/pom.xml`.
   
   We will be able to faster integrate your pull request if these conditions 
are met. If you have any questions how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Regression in pack200 parsing in commons-compress
> -
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> There's a regression in pack200 that leads to the InputStream being closed 
> even if wrapped in a CloseShieldInputStream.
> This was the original signal that something was wrong, but the real problem 
> is in pack200, not xz.
> We noticed ~10 xz files with fewer attachments in the recent regression tests 
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, 
> but not a blocker (IMHO).
> The stacktrace from 
> {{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
>   looks like this:
> 3: X-TIKA:EXCEPTION:embedded_exception : 
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.DefaultParser@56a4479a
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>   at 
> org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
>   at 
> org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
> ...
> Caused by: org.tukaani.xz.XZIOException: Stream closed
>   at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
>   at 
> org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.BufferedInputStream.available(BufferedInputStream.java:410)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at 

[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-05-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844391#comment-17844391
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

tballison commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2098906252

   I just asked on our dev list. I'd like to get 3.x out soon. We need a beta2 
release, though, I think.




> Upgrade to PDFBox 3.x when available
> 
>
> Key: TIKA-3347
> URL: https://issues.apache.org/jira/browse/TIKA-3347
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> 3.0.0-RC1 was recently released.  We should integrate it on a dev branch asap 
> so that we can help with regression testing...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-05-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844325#comment-17844325
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

dsvensson commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2098585131

   @tballison Will this be backported to Tika 2.x, or if not, how far off is 
Tika 3.x?




> Upgrade to PDFBox 3.x when available
> 
>
> Key: TIKA-3347
> URL: https://issues.apache.org/jira/browse/TIKA-3347
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> 3.0.0-RC1 was recently released.  We should integrate it on a dev branch asap 
> so that we can help with regression testing...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3347) Upgrade to PDFBox 3.x when available

2024-05-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844324#comment-17844324
 ] 

ASF GitHub Bot commented on TIKA-3347:
--

danielstravito commented on PR #1473:
URL: https://github.com/apache/tika/pull/1473#issuecomment-2098582675

   @tballison Will this be backported to Tika 2.x, or if not, how far off is 
Tika 3.x?




> Upgrade to PDFBox 3.x when available
> 
>
> Key: TIKA-3347
> URL: https://issues.apache.org/jira/browse/TIKA-3347
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> 3.0.0-RC1 was recently released.  We should integrate it on a dev branch asap 
> so that we can help with regression testing...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-05-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842603#comment-17842603
 ] 

ASF GitHub Bot commented on TIKA-4249:
--

tballison merged PR #1739:
URL: https://github.com/apache/tika/pull/1739




> EML file is treating it as text file in 3.9.2 version
> -
>
> Key: TIKA-4249
> URL: https://issues.apache.org/jira/browse/TIKA-4249
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Blocker
>
> We recently upgraded from 2.9.0 to 2.9.2. We found that the attached file is
> treated as a text file instead of an email file. Please look into this
> issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4248) Improve PST handling of attachments

2024-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842432#comment-17842432
 ] 

ASF GitHub Bot commented on TIKA-4248:
--

tballison merged PR #1738:
URL: https://github.com/apache/tika/pull/1738




> Improve PST handling of attachments
> ---
>
> Key: TIKA-4248
> URL: https://issues.apache.org/jira/browse/TIKA-4248
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> The PST parser doesn't handle attachments in quite the same way as other 
> parsers which hinders analysis of attachments.
> The problem is that the PST parser handles the text content of an email and 
> the embedded attachments. And, the PST parser processes attachments before 
> the main body. These two features make the normal patterns for embedded 
> attachments break down in the RecursiveParserWrapper. For example, when the 
> attachments are being processed, the RecursiveParserWrapper can't figure out 
> what the path will be through the "body" because that hasn't been parsed yet.
> We should probably create a PSTMailItemParser that handles the content and 
> the attachments like other parsers so that embedded paths can be maintained.
> This will be a breaking change, and I'm targeting it only to the 3.x branch.
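A rough sketch of the embedded-document pattern the proposal points to (write the mail body first, then hand each attachment to the EmbeddedDocumentExtractor so the RecursiveParserWrapper can track paths); the class name and the attachment iteration are placeholders, not the proposed PSTMailItemParser:

```java
import java.io.InputStream;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.extractor.EmbeddedDocumentUtil;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.XHTMLContentHandler;

// Sketch only: the usual "body first, attachments second" pattern used by other
// Tika parsers, which is what the proposed PSTMailItemParser would follow.
class MailItemPattern {
    void handleItem(XHTMLContentHandler xhtml, Metadata mailMetadata,
                    ParseContext context, Iterable<InputStream> attachments) throws Exception {
        xhtml.element("p", "mail body text goes here");   // body first

        EmbeddedDocumentExtractor ex =
                EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
        for (InputStream attachment : attachments) {
            Metadata attachmentMetadata = new Metadata();
            if (ex.shouldParseEmbedded(attachmentMetadata)) {
                ex.parseEmbedded(attachment, xhtml, attachmentMetadata, true);
            }
        }
    }
}
```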



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842431#comment-17842431
 ] 

ASF GitHub Bot commented on TIKA-4249:
--

tballison opened a new pull request, #1739:
URL: https://github.com/apache/tika/pull/1739

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add new module that downstream users will depend upon add it to 
relevant group in `tika-bom/pom.xml`.
   
   We will be able to faster integrate your pull request if these conditions 
are met. If you have any questions how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> EML file is treating it as text file in 3.9.2 version
> -
>
> Key: TIKA-4249
> URL: https://issues.apache.org/jira/browse/TIKA-4249
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Blocker
> Attachments: Email_Received.txt
>
>
> We recently upgraded from 3.9.0 to 3.9.2. We found that the attached file is
> treated as a text file instead of an email file. Please look into this
> issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4248) Improve PST handling of attachments

2024-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842399#comment-17842399
 ] 

ASF GitHub Bot commented on TIKA-4248:
--

tballison opened a new pull request, #1738:
URL: https://github.com/apache/tika/pull/1738

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add new module that downstream users will depend upon add it to 
relevant group in `tika-bom/pom.xml`.
   
   We will be able to faster integrate your pull request if these conditions 
are met. If you have any questions how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Improve PST handling of attachments
> ---
>
> Key: TIKA-4248
> URL: https://issues.apache.org/jira/browse/TIKA-4248
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> The PST parser doesn't handle attachments in quite the same way as other 
> parsers which hinders analysis of attachments.
> The problem is that the PST parser handles the text content of an email and 
> the embedded attachments. And, the PST parser processes attachments before 
> the main body. These two features make the normal patterns for embedded 
> attachments break down in the RecursiveParserWrapper. For example, when the 
> attachments are being processed, the RecursiveParserWrapper can't figure out 
> what the path will be through the "body" because that hasn't been parsed yet.
> We should probably create a PSTMailItemParser that handles the content and 
> the attachments like other parsers so that embedded paths can be maintained.
> This will be a breaking change, and I'm targeting it only to the 3.x branch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4247) HttpFetcher - add ability to send request headers

2024-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842134#comment-17842134
 ] 

ASF GitHub Bot commented on TIKA-4247:
--

nddipiazza commented on code in PR #1737:
URL: https://github.com/apache/tika/pull/1737#discussion_r1583602257


##
tika-pipes/tika-fetchers/tika-fetcher-http/src/main/java/org/apache/tika/pipes/fetcher/http/HttpFetcher.java:
##
@@ -143,10 +143,24 @@ public InputStream fetch(String fetchKey, Metadata metadata) throws IOException,
                 .setMaxRedirects(maxRedirects)
                 .setRedirectsEnabled(true).build();
         get.setConfig(requestConfig);
-        if (! StringUtils.isBlank(userAgent)) {
+        setHttpRequestHeaders(metadata, get);
+        return execute(get, metadata, httpClient, true);
+    }
+
+    private void setHttpRequestHeaders(Metadata metadata, HttpGet get) {
+        if (!StringUtils.isBlank(userAgent)) {
             get.setHeader(USER_AGENT, userAgent);
         }
-        return execute(get, metadata, httpClient, true);
+        // additional http request headers can be sent in here.
+        String[] httpRequestHeaders = metadata.getValues("httpRequestHeaders");

Review Comment:
   httpHeaders is a config parameter used to store some of the HTTP headers from
the response. I will look into documenting the JSON schemas somehow to prevent
confusion in the future.





> HttpFetcher - add ability to send request headers
> -
>
> Key: TIKA-4247
> URL: https://issues.apache.org/jira/browse/TIKA-4247
> Project: Tika
>  Issue Type: New Feature
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> add ability to send request headers



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4247) HttpFetcher - add ability to send request headers

2024-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842123#comment-17842123
 ] 

ASF GitHub Bot commented on TIKA-4247:
--

bartek commented on code in PR #1737:
URL: https://github.com/apache/tika/pull/1737#discussion_r1583509595


##
tika-pipes/tika-fetchers/tika-fetcher-http/src/main/java/org/apache/tika/pipes/fetcher/http/HttpFetcher.java:
##
@@ -143,10 +143,24 @@ public InputStream fetch(String fetchKey, Metadata metadata) throws IOException,
                 .setMaxRedirects(maxRedirects)
                 .setRedirectsEnabled(true).build();
         get.setConfig(requestConfig);
-        if (! StringUtils.isBlank(userAgent)) {
+        setHttpRequestHeaders(metadata, get);
+        return execute(get, metadata, httpClient, true);
+    }
+
+    private void setHttpRequestHeaders(Metadata metadata, HttpGet get) {
+        if (!StringUtils.isBlank(userAgent)) {
             get.setHeader(USER_AGENT, userAgent);
         }
-        return execute(get, metadata, httpClient, true);
+        // additional http request headers can be sent in here.
+        String[] httpRequestHeaders = metadata.getValues("httpRequestHeaders");

Review Comment:
   Should the key here be `httpHeaders` as per the http-fetcher schema? 
(embedded here as I am not sure where this is hosted). It looks like this code 
is reading from the user-provided Metadata
   
   ```json
   {
   "$schema": "http://json-schema.org/draft-07/schema#;,
   "type": "object",
   "properties": {
   "authScheme": {
   "type": "string"
   },
   "connectTimeout": {
   "type": "integer"
   },
   "httpHeaders": {
   "type": "array",
   "items": {
   "type": "string"
   }
   },
   "jwtExpiresInSeconds": {
   "type": "integer"
   },
   "jwtIssuer": {
   "type": "string"
   },
   "jwtPrivateKeyBase64": {
   "type": "string"
   },
   "jwtSecret": {
   "type": "string"
   },
   "jwtSubject": {
   "type": "string"
   },
   "maxConnections": {
   "type": "integer"
   },
   "maxConnectionsPerRoute": {
   "type": "integer"
   },
   "maxErrMsgSize": {
   "type": "integer"
   },
   "maxRedirects": {
   "type": "integer"
   },
   "maxSpoolSize": {
   "type": "integer"
   },
   "ntDomain": {
   "type": "string"
   },
   "overallTimeout": {
   "type": "integer"
   },
   "password": {
   "type": "string"
   },
   "proxyHost": {
   "type": "string"
   },
   "proxyPort": {
   "type": "integer"
   },
   "requestTimeout": {
   "type": "integer"
   },
   "socketTimeout": {
   "type": "integer"
   },
   "userAgent": {
   "type": "string"
   },
   "userName": {
   "type": "string"
   }
   }
   }
   ```





> HttpFetcher - add ability to send request headers
> -
>
> Key: TIKA-4247
> URL: https://issues.apache.org/jira/browse/TIKA-4247
> Project: Tika
>  Issue Type: New Feature
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> add ability to send request headers



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4247) HttpFetcher - add ability to send request headers

2024-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842120#comment-17842120
 ] 

ASF GitHub Bot commented on TIKA-4247:
--

bartek commented on code in PR #1737:
URL: https://github.com/apache/tika/pull/1737#discussion_r1583509595


##
tika-pipes/tika-fetchers/tika-fetcher-http/src/main/java/org/apache/tika/pipes/fetcher/http/HttpFetcher.java:
##
@@ -143,10 +143,24 @@ public InputStream fetch(String fetchKey, Metadata metadata) throws IOException,
                 .setMaxRedirects(maxRedirects)
                 .setRedirectsEnabled(true).build();
         get.setConfig(requestConfig);
-        if (! StringUtils.isBlank(userAgent)) {
+        setHttpRequestHeaders(metadata, get);
+        return execute(get, metadata, httpClient, true);
+    }
+
+    private void setHttpRequestHeaders(Metadata metadata, HttpGet get) {
+        if (!StringUtils.isBlank(userAgent)) {
             get.setHeader(USER_AGENT, userAgent);
         }
-        return execute(get, metadata, httpClient, true);
+        // additional http request headers can be sent in here.
+        String[] httpRequestHeaders = metadata.getValues("httpRequestHeaders");

Review Comment:
   Should the key here be `httpHeaders` as per the http-fetcher schema? 
(embedded here as I am not sure where this is hosted). It looks like this code 
is reading from the user-provided Metadata
   
   ```
   {
   "$schema": "http://json-schema.org/draft-07/schema#;,
   "type": "object",
   "properties": {
   "authScheme": {
   "type": "string"
   },
   "connectTimeout": {
   "type": "integer"
   },
   "httpHeaders": {
   "type": "array",
   "items": {
   "type": "string"
   }
   },
   "jwtExpiresInSeconds": {
   "type": "integer"
   },
   "jwtIssuer": {
   "type": "string"
   },
   "jwtPrivateKeyBase64": {
   "type": "string"
   },
   "jwtSecret": {
   "type": "string"
   },
   "jwtSubject": {
   "type": "string"
   },
   "maxConnections": {
   "type": "integer"
   },
   "maxConnectionsPerRoute": {
   "type": "integer"
   },
   "maxErrMsgSize": {
   "type": "integer"
   },
   "maxRedirects": {
   "type": "integer"
   },
   "maxSpoolSize": {
   "type": "integer"
   },
   "ntDomain": {
   "type": "string"
   },
   "overallTimeout": {
   "type": "integer"
   },
   "password": {
   "type": "string"
   },
   "proxyHost": {
   "type": "string"
   },
   "proxyPort": {
   "type": "integer"
   },
   "requestTimeout": {
   "type": "integer"
   },
   "socketTimeout": {
   "type": "integer"
   },
   "userAgent": {
   "type": "string"
   },
   "userName": {
   "type": "string"
   }
   }
   }
   ```





> HttpFetcher - add ability to send request headers
> -
>
> Key: TIKA-4247
> URL: https://issues.apache.org/jira/browse/TIKA-4247
> Project: Tika
>  Issue Type: New Feature
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> add ability to send request headers



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4247) HttpFetcher - add ability to send request headers

2024-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842118#comment-17842118
 ] 

ASF GitHub Bot commented on TIKA-4247:
--

nddipiazza opened a new pull request, #1737:
URL: https://github.com/apache/tika/pull/1737

   Set headers in a metadata value for "httpRequestHeaders"; those will be sent
along with the HTTP request.
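A hedged usage sketch of the feature as described above. The "httpRequestHeaders" metadata key comes from this PR; the "Name: value" string format is an assumption for illustration, so check the fetcher for the exact format it expects.

```java
import org.apache.tika.metadata.Metadata;

public class HttpRequestHeadersExample {
    public static void main(String[] args) {
        Metadata metadata = new Metadata();
        // each entry is one request header; the "Name: value" format is assumed
        metadata.add("httpRequestHeaders", "X-Custom-Header: some-value");
        metadata.add("httpRequestHeaders", "Accept-Language: en-US");
        // pass 'metadata' to HttpFetcher#fetch(fetchKey, metadata) so the headers
        // are added to the outgoing GET request
        System.out.println(String.join(", ", metadata.getValues("httpRequestHeaders")));
    }
}
```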
   
   




> HttpFetcher - add ability to send request headers
> -
>
> Key: TIKA-4247
> URL: https://issues.apache.org/jira/browse/TIKA-4247
> Project: Tika
>  Issue Type: New Feature
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> add ability to send request headers



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4244) Tika idenifies MIME type of ics files with html content as text/html

2024-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840860#comment-17840860
 ] 

ASF GitHub Bot commented on TIKA-4244:
--

tballison merged PR #1731:
URL: https://github.com/apache/tika/pull/1731




> Tika idenifies MIME type of ics files with html content as text/html
> 
>
> Key: TIKA-4244
> URL: https://issues.apache.org/jira/browse/TIKA-4244
> Project: Tika
>  Issue Type: Bug
>Reporter: Kartik Jain
>Priority: Major
> Attachments: Sample.ics
>
>
> When the tika-core detect(InputStream input, Metadata metadata) API is used to
> determine the MIME type of an .ics file, it returns the media type `text/html`
> when it should return `text/calendar`.
> For .ics files that contain HTML content (the additional attribute
> X-ALT-DESC;FMTTYPE=text/html), *tika-core* returns the MIME type of such files
> as text/html. Ideally it should come up as text/calendar, but according to
> tika-core, text/html is not among the base types of text/calendar, so it
> doesn't consider the text/calendar type; however, for all .ics files the MIME
> type should be text/calendar.
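For reference, a detection call along the lines the report describes, using the Tika facade for illustration; Sample.ics stands in for the attached file:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;

public class DetectIcsExample {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        Metadata metadata = new Metadata();
        metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "Sample.ics");
        try (InputStream is = Files.newInputStream(Paths.get("Sample.ics"))) {
            // expected: text/calendar; the report says text/html is returned for
            // .ics files that embed an X-ALT-DESC;FMTTYPE=text/html part
            System.out.println(tika.detect(is, metadata));
        }
    }
}
```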



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4244) Tika idenifies MIME type of ics files with html content as text/html

2024-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840850#comment-17840850
 ] 

ASF GitHub Bot commented on TIKA-4244:
--

tballison opened a new pull request, #1731:
URL: https://github.com/apache/tika/pull/1731

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add new module that downstream users will depend upon add it to 
relevant group in `tika-bom/pom.xml`.
   
   We will be able to faster integrate your pull request if these conditions 
are met. If you have any questions how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Tika idenifies MIME type of ics files with html content as text/html
> 
>
> Key: TIKA-4244
> URL: https://issues.apache.org/jira/browse/TIKA-4244
> Project: Tika
>  Issue Type: Bug
>Reporter: Kartik Jain
>Priority: Major
> Attachments: Sample.ics
>
>
> When the tika-core detect(InputStream input, Metadata metadata) API is used to
> determine the MIME type of an .ics file, it returns the media type `text/html`
> when it should return `text/calendar`.
> For .ics files that contain HTML content (the additional attribute
> X-ALT-DESC;FMTTYPE=text/html), *tika-core* returns the MIME type of such files
> as text/html. Ideally it should come up as text/calendar, but according to
> tika-core, text/html is not among the base types of text/calendar, so it
> doesn't consider the text/calendar type; however, for all .ics files the MIME
> type should be text/calendar.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4242) Tika depends on non-existing plexus-utils version

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838279#comment-17838279
 ] 

ASF GitHub Bot commented on TIKA-4242:
--

tballison merged PR #1727:
URL: https://github.com/apache/tika/pull/1727




> Tika depends on non-existing plexus-utils version
> -
>
> Key: TIKA-4242
> URL: https://issues.apache.org/jira/browse/TIKA-4242
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0-BETA
>Reporter: Björn Kautler
>Priority: Major
>
> In [https://github.com/apache/tika/pull/1461] [~tallison] moved the versions 
> to Maven properties, but unfortunately he thereby upgraded {{plexus-utils}} 
> to {{5.0.0}} which does not exist.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4242) Tika depends on non-existing plexus-utils version

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838271#comment-17838271
 ] 

ASF GitHub Bot commented on TIKA-4242:
--

tballison opened a new pull request, #1727:
URL: https://github.com/apache/tika/pull/1727

   …junrar exclusion for modernity
   
   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add new module that downstream users will depend upon add it to 
relevant group in `tika-bom/pom.xml`.
   
   We will be able to faster integrate your pull request if these conditions 
are met. If you have any questions how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Tika depends on non-existing plexus-utils version
> -
>
> Key: TIKA-4242
> URL: https://issues.apache.org/jira/browse/TIKA-4242
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0-BETA
>Reporter: Björn Kautler
>Priority: Major
>
> In [https://github.com/apache/tika/pull/1461] [~tallison] moved the versions 
> to Maven properties, but unfortunately he thereby upgraded {{plexus-utils}} 
> to {{5.0.0}} which does not exist.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4242) Tika depends on non-existing plexus-utils version

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838270#comment-17838270
 ] 

ASF GitHub Bot commented on TIKA-4242:
--

tballison merged PR #1726:
URL: https://github.com/apache/tika/pull/1726




> Tika depends on non-existing plexus-utils version
> -
>
> Key: TIKA-4242
> URL: https://issues.apache.org/jira/browse/TIKA-4242
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0-BETA
>Reporter: Björn Kautler
>Priority: Major
>
> In [https://github.com/apache/tika/pull/1461] [~tallison] moved the versions 
> to Maven properties, but unfortunately he thereby upgraded {{plexus-utils}} 
> to {{5.0.0}} which does not exist.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4242) Tika depends on non-existing plexus-utils version

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838200#comment-17838200
 ] 

ASF GitHub Bot commented on TIKA-4242:
--

Vampire opened a new pull request, #1726:
URL: https://github.com/apache/tika/pull/1726

   (no comment)




> Tika depends on non-existing plexus-utils version
> -
>
> Key: TIKA-4242
> URL: https://issues.apache.org/jira/browse/TIKA-4242
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0-BETA
>Reporter: Björn Kautler
>Priority: Major
>
> In [https://github.com/apache/tika/pull/1461] [~tallison] moved the versions 
> to Maven properties, but unfortunately he thereby upgraded {{plexus-utils}} 
> to 5.0.0 which does not exist.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-04-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837378#comment-17837378
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

nddipiazza commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1566194105


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##
@@ -0,0 +1,92 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+package tika;
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+
+service Tika   {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}
+  rpc ListFetchers(ListFetchersRequest) returns (ListFetchersReply) {}
+  rpc DeleteFetcher(DeleteFetcherRequest) returns (DeleteFetcherReply) {}
+  rpc FetchAndParse(FetchAndParseRequest) returns (FetchAndParseReply) {}
+  rpc FetchAndParseServerSideStreaming(FetchAndParseRequest)
+returns (stream FetchAndParseReply) {}
+  rpc FetchAndParseBiDirectionalStreaming(stream FetchAndParseRequest) 
+returns (stream FetchAndParseReply) {}
+}
+
+message CreateFetcherRequest {
+  string name = 1;
+  string fetcher_class = 2;

Review Comment:
   A string is needed so people can add fetcher classes dynamically. Validation
will make sure the class exists and will return a helpful error message.
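A sketch of calling the CreateFetcher RPC defined above, assuming standard grpc-java code generation from tika.proto (TikaGrpc, CreateFetcherRequest) and a locally running tika-grpc server; the port and the fetcher class name are examples only:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

import org.apache.tika.CreateFetcherReply;
import org.apache.tika.CreateFetcherRequest;
import org.apache.tika.TikaGrpc;

public class CreateFetcherClientSketch {
    public static void main(String[] args) {
        // assumed address/port of a tika-grpc server
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("localhost", 50051)
                .usePlaintext()
                .build();
        TikaGrpc.TikaBlockingStub tika = TikaGrpc.newBlockingStub(channel);

        CreateFetcherRequest request = CreateFetcherRequest.newBuilder()
                .setName("my-http-fetcher")
                .setFetcherClass("org.apache.tika.pipes.fetcher.http.HttpFetcher")
                .build();
        CreateFetcherReply reply = tika.createFetcher(request);
        System.out.println("created fetcher: " + reply);

        channel.shutdownNow();
    }
}
```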





> Grpc + Tika Pipes - pipe iterator and emitter
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Add full tika-pipes support of grpc
>  * pipe iterator
>  * fetcher
>  * emitter
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-04-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837377#comment-17837377
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

nddipiazza commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1566193576


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##
@@ -0,0 +1,92 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+package tika;
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+
+service Tika   {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}

Review Comment:
   added not-found detection





> Grpc + Tika Pipes - pipe iterator and emitter
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Add full tika-pipes support of grpc
>  * pipe iterator
>  * fetcher
>  * emitter
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4235) Add pipeline parameter to OpenSearch emitter

2024-04-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834041#comment-17834041
 ] 

ASF GitHub Bot commented on TIKA-4235:
--

tballison opened a new pull request, #1709:
URL: https://github.com/apache/tika/pull/1709

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to the 
relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Add pipeline parameter to OpenSearch emitter
> 
>
> Key: TIKA-4235
> URL: https://issues.apache.org/jira/browse/TIKA-4235
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4232) Create and execute unit tests for tika-helm

2024-04-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833960#comment-17833960
 ] 

ASF GitHub Bot commented on TIKA-4232:
--

lewismc commented on PR #17:
URL: https://github.com/apache/tika-helm/pull/17#issuecomment-2037434458

   Blocked by https://issues.apache.org/jira/browse/INFRA-25667




> Create and execute unit tests for tika-helm
> ---
>
> Key: TIKA-4232
> URL: https://issues.apache.org/jira/browse/TIKA-4232
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.9.2
>
>
> The goal is to execute chart unit tests against each tika-helm pull request.
> I found the [Helm Unit Tests|https://github.com/marketplace/actions/helm-unit-tests] 
> GitHub Action, which uses [https://github.com/helm-unittest/helm-unittest] as a 
> Helm plugin.
> The PR will consist of one or more unit tests automated via the GitHub action.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4233) Check tika-helm for deprecated k8s APIs

2024-04-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833959#comment-17833959
 ] 

ASF GitHub Bot commented on TIKA-4233:
--

lewismc commented on PR #18:
URL: https://github.com/apache/tika-helm/pull/18#issuecomment-2037433774

   Blocked by https://issues.apache.org/jira/browse/INFRA-25667




> Check tika-helm for deprecated k8s APIs
> ---
>
> Key: TIKA-4233
> URL: https://issues.apache.org/jira/browse/TIKA-4233
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.9.2
>
>
> It is useful to know when a Helm Chart uses deprecated k8s APIs. A check for 
> this would be ideal. The “Check deprecated k8s APIs” GitHub action 
> accomplishes this.
> [https://github.com/marketplace/actions/check-deprecated-k8s-apis]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4234) Further improvements to jdbc pipes reporter

2024-04-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833940#comment-17833940
 ] 

ASF GitHub Bot commented on TIKA-4234:
--

tballison opened a new pull request, #1708:
URL: https://github.com/apache/tika/pull/1708

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to the 
relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Further improvements to jdbc pipes reporter
> ---
>
> Key: TIKA-4234
> URL: https://issues.apache.org/jira/browse/TIKA-4234
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
>
> Allow users to set the table name.
> Allow users to choose whether the reporter drops and creates the table or 
> whether they are responsible for creating the table themselves.
> Allow users to configure insert/upsert/update. The default is "insert id, 
> status, timestamp".
> This and the earlier jdbc reporter introduce breaking changes and will only 
> be applied to 3.x.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-04-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832819#comment-17832819
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

bartek commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1546235205


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##
@@ -0,0 +1,92 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+package tika;
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+
+service Tika   {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}
+  rpc ListFetchers(ListFetchersRequest) returns (ListFetchersReply) {}
+  rpc DeleteFetcher(DeleteFetcherRequest) returns (DeleteFetcherReply) {}
+  rpc FetchAndParse(FetchAndParseRequest) returns (FetchAndParseReply) {}
+  rpc FetchAndParseServerSideStreaming(FetchAndParseRequest)
+returns (stream FetchAndParseReply) {}
+  rpc FetchAndParseBiDirectionalStreaming(stream FetchAndParseRequest) 
+returns (stream FetchAndParseReply) {}
+}
+
+message CreateFetcherRequest {
+  string name = 1;
+  string fetcher_class = 2;

Review Comment:
   Should this be a protobuf enum containing the constrained set of classes? Or 
does Tika need to support arbitrary strings here in case of custom fetchers not 
included in the Tika project?
   





> Grpc + Tika Pipes - pipe iterator and emitter
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Add full tika-pipes support of grpc
>  * pipe iterator
>  * fetcher
>  * emitter
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-04-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832817#comment-17832817
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

bartek commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1546232766


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##
@@ -0,0 +1,92 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+package tika;
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+
+service Tika   {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}

Review Comment:
   Could we document these RPCs to clarify the high-level behaviour? For 
example, if I try to create a fetcher which already exists, what is the 
expected reply? Is it an error response on the RPC, or will CreateFetcherReply carry 
error-identifying information?
   
   If CreateFetcher were made idempotent, could we collapse these into a single 
RPC (UpdateFetcher) which either creates, updates, or no-ops (no changes 
despite the call) on the Fetcher?
   
   I don't want to overcomplicate the Tika side, of course, but I am curious whether we can 
improve the client interface.
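To make the two options concrete, here is a small, hypothetical sketch of the behaviours being weighed, assuming an in-memory registry behind the service; the method names, plain-string payloads, and status-code choices are illustrative rather than Tika's actual implementation.

```java
// Hypothetical contrast between a strict CreateFetcher (errors on duplicates)
// and an idempotent save-style RPC (create, update, or no-op). Plain strings
// stand in for the generated request/reply messages.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import io.grpc.Status;
import io.grpc.stub.StreamObserver;

public class FetcherRegistrySketch {

    private final ConcurrentMap<String, String> fetcherClassByName = new ConcurrentHashMap<>();

    // Strict create: a duplicate name yields ALREADY_EXISTS so callers notice
    // they are about to clobber an existing configuration.
    public void createFetcher(String name, String fetcherClass, StreamObserver<String> observer) {
        String previous = fetcherClassByName.putIfAbsent(name, fetcherClass);
        if (previous != null) {
            observer.onError(Status.ALREADY_EXISTS
                    .withDescription("Fetcher already exists: " + name)
                    .asRuntimeException());
            return;
        }
        observer.onNext(name);
        observer.onCompleted();
    }

    // Idempotent save: creates, updates, or no-ops, and always succeeds.
    public void saveFetcher(String name, String fetcherClass, StreamObserver<String> observer) {
        fetcherClassByName.put(name, fetcherClass);
        observer.onNext(name);
        observer.onCompleted();
    }
}
```

The trade-off is the usual one: the strict variant protects against accidental overwrites, while the idempotent variant simplifies client retry logic.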



##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##
@@ -0,0 +1,92 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+package tika;
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+
+service Tika   {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}

Review Comment:
   Similar to above regarding documentation, it would be great to understand 
what happens if I try to get a fetcher which does not exist. Is there a 
distinct error, or do I simply get an empty GetFetcherReply?
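For comparison, a client could tell the two outcomes apart roughly as follows; this sketch assumes the server maps a missing fetcher to NOT_FOUND, and it uses a stand-in lookup interface instead of the generated blocking stub.

```java
// Hypothetical client-side handling of a missing fetcher, assuming the server
// signals it with Status.NOT_FOUND rather than returning an empty reply.
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

public class GetFetcherClientSketch {

    // Stand-in for the generated blocking stub; a real client would call
    // something like stub.getFetcher(request).
    interface FetcherLookup {
        String getFetcher(String name) throws StatusRuntimeException;
    }

    public static String describeFetcher(FetcherLookup lookup, String name) {
        try {
            return lookup.getFetcher(name);
        } catch (StatusRuntimeException e) {
            if (e.getStatus().getCode() == Status.Code.NOT_FOUND) {
                // Unambiguous "missing" case, easy to branch on
                return "no fetcher named " + name;
            }
            throw e; // anything else (UNAVAILABLE, INTERNAL, ...) propagates
        }
    }
}
```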





> Grpc + Tika Pipes - pipe iterator and emitter
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Add full tika-pipes support of grpc
>  * pipe iterator
>  * fetcher
>  * emitter
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4229) add microsoft graph fetcher

2024-04-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832814#comment-17832814
 ] 

ASF GitHub Bot commented on TIKA-4229:
--

bartek commented on code in PR #1698:
URL: https://github.com/apache/tika/pull/1698#discussion_r1546211893


##
tika-pipes/tika-fetchers/tika-fetcher-microsoft-graph/src/main/java/org/apache/tika/pipes/fetchers/microsoftgraph/MicrosoftGraphFetcher.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.pipes.fetchers.microsoftgraph;
+
+import java.io.ByteArrayInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Map;
+
+import com.azure.identity.ClientCertificateCredentialBuilder;
+import com.azure.identity.ClientSecretCredentialBuilder;
+import com.microsoft.graph.serviceclient.GraphServiceClient;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.tika.config.Field;
+import org.apache.tika.config.Initializable;
+import org.apache.tika.config.InitializableProblemHandler;
+import org.apache.tika.config.Param;
+import org.apache.tika.exception.TikaConfigException;
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.metadata.Metadata;
+import org.apache.tika.pipes.fetcher.AbstractFetcher;
+import 
org.apache.tika.pipes.fetchers.microsoftgraph.config.ClientCertificateCredentialsConfig;
+import 
org.apache.tika.pipes.fetchers.microsoftgraph.config.ClientSecretCredentialsConfig;
+import 
org.apache.tika.pipes.fetchers.microsoftgraph.config.MsGraphFetcherConfig;
+
+/**
+ * Fetches files from Microsoft Graph API.
+ * Fetch keys are ${siteDriveId},${driveItemId}
+ */
+public class MicrosoftGraphFetcher extends AbstractFetcher implements 
Initializable {
+private static final Logger LOGGER = 
LoggerFactory.getLogger(MicrosoftGraphFetcher.class);
+private GraphServiceClient graphClient;
+private MsGraphFetcherConfig msGraphFetcherConfig;
+private long[] throttleSeconds;
+
+public MicrosoftGraphFetcher() {
+
+}
+
+public MicrosoftGraphFetcher(MsGraphFetcherConfig msGraphFetcherConfig) {
+this.msGraphFetcherConfig = msGraphFetcherConfig;
+}
+
+/**
+ * Set seconds to throttle retries as a comma-delimited list, e.g.: 
30,60,120,600
+ *
+ * @param commaDelimitedLongs
+ * @throws TikaConfigException
+ */
+@Field
+public void setThrottleSeconds(String commaDelimitedLongs) throws 
TikaConfigException {
+String[] longStrings = commaDelimitedLongs.split(",");
+long[] seconds = new long[longStrings.length];
+for (int i = 0; i < longStrings.length; i++) {
+try {
+seconds[i] = Long.parseLong(longStrings[i]);
+} catch (NumberFormatException e) {
+throw new TikaConfigException(e.getMessage());
+}
+}
+setThrottleSeconds(seconds);
+}
+
+public void setThrottleSeconds(long[] throttleSeconds) {
+this.throttleSeconds = throttleSeconds;
+}
+
+@Override
+public void initialize(Map map) {
+String[] scopes = msGraphFetcherConfig.getScopes().toArray(new 
String[0]);
+if (msGraphFetcherConfig.getCredentials() instanceof 
ClientCertificateCredentialsConfig) {
+ClientCertificateCredentialsConfig credentials =
+(ClientCertificateCredentialsConfig) 
msGraphFetcherConfig.getCredentials();
+graphClient = new GraphServiceClient(
+new 
ClientCertificateCredentialBuilder().clientId(credentials.getClientId())
+
.tenantId(credentials.getTenantId()).pfxCertificate(
+new 
ByteArrayInputStream(credentials.getCertificateBytes()))
+
.clientCertificatePassword(credentials.getCertificatePassword())
+.build(), scopes);
+} else if (msGraphFetcherConfig.getCredentials() instanceof 
ClientSecretCredentialsConfig) {
+ClientSecretCredentialsConfig credentials =
+(ClientSecretCredentialsConfig) 

[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-03-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832676#comment-17832676
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

nddipiazza commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1545858447


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##
@@ -0,0 +1,90 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+package tika;
+
+service Tika {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}
+  rpc ListFetchers(ListFetchersRequest) returns (ListFetchersReply) {}
+  rpc DeleteFetcher(DeleteFetcherRequest) returns (DeleteFetcherReply) {}
+  rpc FetchAndParse(FetchAndParseRequest) returns (FetchAndParseReply) {}
+  rpc FetchAndParseServerSideStreaming(FetchAndParseRequest) returns (stream 
FetchAndParseReply) {}
+  rpc FetchAndParseBiDirectionalStreaming(stream FetchAndParseRequest) returns 
(stream FetchAndParseReply) {}
+}
+
+message CreateFetcherRequest {
+  string name = 1;

Review Comment:
   Yes, we are using "name" as the ID. @tballison, any thoughts here? Maybe we 
should rename that for 3.x.





> Grpc + Tika Pipes - pipe iterator and emitter
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Add full tika-pipes support of grpc
>  * pipe iterator
>  * fetcher
>  * emitter
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-03-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832675#comment-17832675
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

nddipiazza commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1545858324


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##


Review Comment:
   I ran the normal proto linter. I'm going to leave the rest as it is; the 
buf extension checks didn't seem to add much value for my context and would have added hours.





> Grpc + Tika Pipes - pipe iterator and emitter
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Add full tika-pipes support of grpc
>  * pipe iterator
>  * fetcher
>  * emitter
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-03-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832568#comment-17832568
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

bartek commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1545596130


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##
@@ -0,0 +1,90 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+package tika;
+
+service Tika {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}
+  rpc ListFetchers(ListFetchersRequest) returns (ListFetchersReply) {}
+  rpc DeleteFetcher(DeleteFetcherRequest) returns (DeleteFetcherReply) {}
+  rpc FetchAndParse(FetchAndParseRequest) returns (FetchAndParseReply) {}
+  rpc FetchAndParseServerSideStreaming(FetchAndParseRequest) returns (stream 
FetchAndParseReply) {}
+  rpc FetchAndParseBiDirectionalStreaming(stream FetchAndParseRequest) returns 
(stream FetchAndParseReply) {}
+}
+
+message CreateFetcherRequest {
+  string name = 1;

Review Comment:
   Must `name` be unique across all initialized fetchers? To me, `name` implies 
a descriptive label; is this more of an ID?
   
   The use case I am thinking of is creating multiple fetchers with the same class. 
Right now I would create a unique name for each one. Is that the correct 
expectation?
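As a concrete illustration of that expectation, the sketch below registers two fetchers of the same class under two unique names. It assumes the Java stubs that grpc-java would generate from the proto above (`TikaGrpc`, `CreateFetcherRequest` in `org.apache.tika`); the fetcher class name, host, and port are only examples.

```java
// Hypothetical client usage: two fetchers sharing one class, registered under
// two unique names. TikaGrpc and CreateFetcherRequest are assumed to be the
// classes grpc-java generates from the proto above.
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

import org.apache.tika.CreateFetcherRequest;
import org.apache.tika.TikaGrpc;

public class CreateTwoFetchersSketch {
    public static void main(String[] args) {
        ManagedChannel channel =
                ManagedChannelBuilder.forAddress("localhost", 50051).usePlaintext().build();
        TikaGrpc.TikaBlockingStub tika = TikaGrpc.newBlockingStub(channel);

        // Same fetcher class, two distinct configurations -> two unique names (IDs)
        String fsFetcher = "org.apache.tika.pipes.fetcher.fs.FileSystemFetcher"; // illustrative
        tika.createFetcher(CreateFetcherRequest.newBuilder()
                .setName("fs-incoming")
                .setFetcherClass(fsFetcher)
                .build());
        tika.createFetcher(CreateFetcherRequest.newBuilder()
                .setName("fs-archive")
                .setFetcherClass(fsFetcher)
                .build());

        channel.shutdown();
    }
}
```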





> Grpc + Tika Pipes - pipe iterator and emitter
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Add full tika-pipes support of grpc
>  * pipe iterator
>  * fetcher
>  * emitter
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4233) Check tika-helm for deprecated k8s APIs

2024-03-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832514#comment-17832514
 ] 

ASF GitHub Bot commented on TIKA-4233:
--

lewismc opened a new pull request, #18:
URL: https://github.com/apache/tika-helm/pull/18

   Addresses https://issues.apache.org/jira/browse/TIKA-4233




> Check tika-helm for deprecated k8s APIs
> ---
>
> Key: TIKA-4233
> URL: https://issues.apache.org/jira/browse/TIKA-4233
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.9.2
>
>
> It is useful to know when a Helm Chart uses deprecated k8s APIs. A check for 
> this would be ideal. The “Check deprecated k8s APIs” GitHub action 
> accomplishes this.
> [https://github.com/marketplace/actions/check-deprecated-k8s-apis]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4232) Create and execute unit tests for tika-helm

2024-03-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832513#comment-17832513
 ] 

ASF GitHub Bot commented on TIKA-4232:
--

lewismc opened a new pull request, #17:
URL: https://github.com/apache/tika-helm/pull/17

   PR to address https://issues.apache.org/jira/browse/TIKA-4232




> Create and execute unit tests for tika-helm
> ---
>
> Key: TIKA-4232
> URL: https://issues.apache.org/jira/browse/TIKA-4232
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.9.2
>
>
> The goal is to execute chart unit tests against each tika-helm pull request.
> I found the [Helm Unit Tests|https://github.com/marketplace/actions/helm-unit-tests] 
> GitHub Action, which uses [https://github.com/helm-unittest/helm-unittest] as a 
> Helm plugin.
> The PR will consist of one or more unit tests automated via the GitHub action.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4227) Register tika-helm Chart in artifacthub.io

2024-03-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832491#comment-17832491
 ] 

ASF GitHub Bot commented on TIKA-4227:
--

lewismc merged PR #16:
URL: https://github.com/apache/tika-helm/pull/16




> Register tika-helm Chart in artifacthub.io
> --
>
> Key: TIKA-4227
> URL: https://issues.apache.org/jira/browse/TIKA-4227
> Project: Tika
>  Issue Type: Task
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.9.2
>
>
> [https://artifacthub.io/] represents the most popular search interface for 
> (amongst lots of other artifacts) Helm Charts.
> This task will register the tika-helm Chart with [https://artifacthub.io/].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4227) Register tika-helm Chart in artifacthub.io

2024-03-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832490#comment-17832490
 ] 

ASF GitHub Bot commented on TIKA-4227:
--

lewismc opened a new pull request, #16:
URL: https://github.com/apache/tika-helm/pull/16

   PR for https://issues.apache.org/jira/browse/TIKA-4227




> Register tika-helm Chart in artifacthub.io
> --
>
> Key: TIKA-4227
> URL: https://issues.apache.org/jira/browse/TIKA-4227
> Project: Tika
>  Issue Type: Task
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.9.2
>
>
> [https://artifacthub.io/] represents the most popular search interface for 
> (amongst lots of other artifacts) Helm Charts.
> This task will register the tika-helm Chart with [https://artifacthub.io/].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-03-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832336#comment-17832336
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

bartek commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1544981545


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##


Review Comment:
   For your consideration @nddipiazza, I ran `buf lint` on this protobuf (as I 
am syncing it to a local repository for development purposes) and here's the 
report:
   
   ```
   services/tika/pbtika/tika.proto:29:9:Service name "Tika" should be suffixed 
with "Service".
   services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseReply" is used as 
the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseRequest" is used as 
the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseReply" is used as 
the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseRequest" is used as 
the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:40:RPC request type 
"FetchAndParseRequest" should be named 
"FetchAndParseServerSideStreamingRequest" or 
"TikaFetchAndParseServerSideStreamingRequest".
   services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseReply" is used as 
the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseRequest" is used as 
the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:37:50:RPC request type 
"FetchAndParseRequest" should be named 
"FetchAndParseBiDirectionalStreamingRequest" or 
"TikaFetchAndParseBiDirectionalStreamingRequest".
   services/tika/pbtika/tika.proto:42:10:Field name "fetcherClass" should be 
lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:52:10:Field name "fetcherClass" should be 
lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:61:10:Field name "fetcherName" should be 
lower_snake_case, such as "fetcher_name".
   services/tika/pbtika/tika.proto:62:10:Field name "fetchKey" should be 
lower_snake_case, such as "fetch_key".
   services/tika/pbtika/tika.proto:67:10:Field name "fetchKey" should be 
lower_snake_case, such as "fetch_key".
   services/tika/pbtika/tika.proto:85:10:Field name "fetcherClass" should be 
lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:90:9:Field name "pageNumber" should be 
lower_snake_case, such as "page_number".
   services/tika/pbtika/tika.proto:91:9:Field name "numFetchersPerPage" should 
be lower_snake_case, such as "num_fetchers_per_page".
   services/tika/pbtika/tika.proto:95:28:Field name "getFetcherReply" should be 
lower_snake_case, such as "get_fetcher_reply".
   Generating protobufs for ./proto/pbingest
   services/tika/pbtika/tika.proto:29:9:Service name "Tika" should be suffixed 
with "Service".
   services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseReply" is used as 
the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseRequest" is used as 
the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseReply" is used as 
the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseRequest" is used as 
the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:40:RPC request type 
"FetchAndParseRequest" should be named 
"FetchAndParseServerSideStreamingRequest" or 
"TikaFetchAndParseServerSideStreamingRequest".
   services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseReply" is used as 
the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseRequest" is used as 
the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:37:50:RPC request type 
"FetchAndParseRequest" should be named 
"FetchAndParseBiDirectionalStreamingRequest" or 
"TikaFetchAndParseBiDirectionalStreamingRequest".
   services/tika/pbtika/tika.proto:42:10:Field name "fetcherClass" should be 
lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:52:10:Field name "fetcherClass" should be 
lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:61:10:Field name "fetcherName" should be 
lower_snake_case, such as "fetcher_name".
   services/tika/pbtika/tika.proto:62:10:Field name "fetchKey" should be 
lower_snake_case, such as "fetch_key".
   services/tika/pbtika/tika.proto:67:10:Field name "fetchKey" should be 
lower_snake_case, such as "fetch_key".
   services/tika/pbtika/tika.proto:85:10:Field name "fetcherClass" should be 
lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:90:9:Field name 

