[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-03-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832336#comment-17832336
 ] 

ASF GitHub Bot commented on TIKA-4181:
--

bartek commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1544981545


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##


Review Comment:
   For your consideration @nddipiazza, I ran `buf lint` on this protobuf (as I 
am syncing it to a local repository for development purposes) and here's the 
report:
   
   ```
   services/tika/pbtika/tika.proto:29:9:Service name "Tika" should be suffixed with "Service".
   services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:40:RPC request type "FetchAndParseRequest" should be named "FetchAndParseServerSideStreamingRequest" or "TikaFetchAndParseServerSideStreamingRequest".
   services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:37:50:RPC request type "FetchAndParseRequest" should be named "FetchAndParseBiDirectionalStreamingRequest" or "TikaFetchAndParseBiDirectionalStreamingRequest".
   services/tika/pbtika/tika.proto:42:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:52:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:61:10:Field name "fetcherName" should be lower_snake_case, such as "fetcher_name".
   services/tika/pbtika/tika.proto:62:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
   services/tika/pbtika/tika.proto:67:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
   services/tika/pbtika/tika.proto:85:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:90:9:Field name "pageNumber" should be lower_snake_case, such as "page_number".
   services/tika/pbtika/tika.proto:91:9:Field name "numFetchersPerPage" should be lower_snake_case, such as "num_fetchers_per_page".
   services/tika/pbtika/tika.proto:95:28:Field name "getFetcherReply" should be lower_snake_case, such as "get_fetcher_reply".
   Generating protobufs for ./proto/pbingest
   services/tika/pbtika/tika.proto:29:9:Service name "Tika" should be suffixed with "Service".
   services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:40:RPC request type "FetchAndParseRequest" should be named "FetchAndParseServerSideStreamingRequest" or "TikaFetchAndParseServerSideStreamingRequest".
   services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:37:50:RPC request type "FetchAndParseRequest" should be named "FetchAndParseBiDirectionalStreamingRequest" or "TikaFetchAndParseBiDirectionalStreamingRequest".
   services/tika/pbtika/tika.proto:42:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:52:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:61:10:Field name "fetcherName" should be lower_snake_case, such as "fetcher_name".
   services/tika/pbtika/tika.proto:62:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
   services/tika/pbtika/tika.proto:67:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
   services/tika/pbtika/tika.proto:85:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:90:9:Field name 

Re: [PR] TIKA-4181 - Tika Pipes Grpc Server [tika]

2024-03-29 Thread via GitHub


bartek commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1544981545


##
tika-pipes/tika-grpc/src/main/proto/tika.proto:
##


Review Comment:
   For your consideration @nddipiazza, I ran `buf lint` on this protobuf (as I 
am syncing it to a local repository for development purposes) and here's the 
report:
   
   ```
   services/tika/pbtika/tika.proto:29:9:Service name "Tika" should be suffixed with "Service".
   services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:40:RPC request type "FetchAndParseRequest" should be named "FetchAndParseServerSideStreamingRequest" or "TikaFetchAndParseServerSideStreamingRequest".
   services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:37:50:RPC request type "FetchAndParseRequest" should be named "FetchAndParseBiDirectionalStreamingRequest" or "TikaFetchAndParseBiDirectionalStreamingRequest".
   services/tika/pbtika/tika.proto:42:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:52:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:61:10:Field name "fetcherName" should be lower_snake_case, such as "fetcher_name".
   services/tika/pbtika/tika.proto:62:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
   services/tika/pbtika/tika.proto:67:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
   services/tika/pbtika/tika.proto:85:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:90:9:Field name "pageNumber" should be lower_snake_case, such as "page_number".
   services/tika/pbtika/tika.proto:91:9:Field name "numFetchersPerPage" should be lower_snake_case, such as "num_fetchers_per_page".
   services/tika/pbtika/tika.proto:95:28:Field name "getFetcherReply" should be lower_snake_case, such as "get_fetcher_reply".
   Generating protobufs for ./proto/pbingest
   services/tika/pbtika/tika.proto:29:9:Service name "Tika" should be suffixed with "Service".
   services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:36:40:RPC request type "FetchAndParseRequest" should be named "FetchAndParseServerSideStreamingRequest" or "TikaFetchAndParseServerSideStreamingRequest".
   services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
   services/tika/pbtika/tika.proto:37:50:RPC request type "FetchAndParseRequest" should be named "FetchAndParseBiDirectionalStreamingRequest" or "TikaFetchAndParseBiDirectionalStreamingRequest".
   services/tika/pbtika/tika.proto:42:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:52:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:61:10:Field name "fetcherName" should be lower_snake_case, such as "fetcher_name".
   services/tika/pbtika/tika.proto:62:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
   services/tika/pbtika/tika.proto:67:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
   services/tika/pbtika/tika.proto:85:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
   services/tika/pbtika/tika.proto:90:9:Field name "pageNumber" should be lower_snake_case, such as "page_number".
   services/tika/pbtika/tika.proto:91:9:Field name "numFetchersPerPage" should be lower_snake_case, such as "num_fetchers_per_page".
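
The naming suggestions in the report above follow a regular pattern, which can be sketched in a few lines of Python. This is only an illustration that reproduces the suggested names from the report; it is not how `buf lint` is implemented, and the helper names `to_lower_snake_case` and `suggested_request_names` are made up for this sketch. The service name "Tika" and the RPC names are inferred from the lint messages, not read from the .proto file.

```python
import re


def to_lower_snake_case(name: str) -> str:
    """Convert a camelCase field name to lower_snake_case.

    Hypothetical helper: puts '_' before every uppercase letter that
    follows a lowercase letter or digit, then lowercases everything.
    """
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def suggested_request_names(service: str, rpc: str) -> tuple[str, str]:
    """Return the two request-type names the report proposes for an RPC.

    Hypothetical helper: "<Rpc>Request" and "<Service><Rpc>Request".
    """
    return (f"{rpc}Request", f"{service}{rpc}Request")


# Field names flagged in the report.
for field in ["fetcherClass", "fetcherName", "fetchKey",
              "pageNumber", "numFetchersPerPage", "getFetcherReply"]:
    print(f"{field} -> {to_lower_snake_case(field)}")

# RPC name inferred from the suggested request-type names in the report.
print(suggested_request_names("Tika", "FetchAndParseServerSideStreaming"))
```

Renaming fields this way keeps the field numbers, so the binary wire format should be unaffected, but generated accessor names and the JSON mapping can change.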
   

[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832293#comment-17832293
 ] 

Aamir commented on TIKA-4231:
-

No, this doesn't look better. Actually, I would say that it looks worse than 
before.

> Parsing Arabic PDF is returning bad data
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.6.0, 2.9.1
> Environment: I am using Java 18. And using maven dependency 
> tika-parsers-standard-package 
> (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
> Reporter: Aamir
> Priority: Major
> Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt
>
>
> Attached is a PDF with arabic text in it. 
> When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish 
> characters. 
> The generated text doc is also attached which contains the parsed text. 
> Most of the other Arabic PDFs parse fine, but this one is giving this output. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832291#comment-17832291
 ] 

Tilman Hausherr commented on TIKA-4231:
---

I have attached an extraction with PDFBox 2.0.31: [^arabic-pdfbox.txt]. 
Is this better, or not? I've added a BOM and removed the 00 bytes. In the Tika 
extraction there are many "ef bf bd" bytes instead, which is the UTF-8 encoding of the 
replacement character �.
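
A quick way to check an extracted text file for the byte patterns mentioned here is sketched below in Python. It is a minimal illustration, not part of Tika or PDFBox; the function name `describe_extraction` and the sample bytes are invented for this example.

```python
# U+FFFD (the replacement character) is encoded in UTF-8 as ef bf bd.
REPLACEMENT_UTF8 = "\ufffd".encode("utf-8")
UTF8_BOM = b"\xef\xbb\xbf"


def describe_extraction(raw: bytes) -> dict:
    """Count the byte patterns discussed above in raw extracted text.

    Hypothetical helper for illustration only.
    """
    return {
        "has_bom": raw.startswith(UTF8_BOM),
        "nul_bytes": raw.count(b"\x00"),
        "replacement_sequences": raw.count(REPLACEMENT_UTF8),
    }


# Made-up sample: BOM, three ASCII letters, one replacement character, one 00 byte.
sample = UTF8_BOM + b"abc" + REPLACEMENT_UTF8 + b"\x00"
print(REPLACEMENT_UTF8.hex(" "))
print(describe_extraction(sample))
```

To inspect a real file such as the attached arabic.txt, one could pass the result of `open(path, "rb").read()` to the same function.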

A possible explanation why Adobe Reader works better is that this file uses the 
"ActualText"-feature which PDFBox doesn't support (PDFBOX-3248).



[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4231:
--
Attachment: arabic-pdfbox.txt



[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aamir updated TIKA-4231:

Affects Version/s: 2.9.1

> Parsing Arabic PDF is returning bad data
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.6.0, 2.9.1
> Environment: I am using Java 18. And using maven dependency 
> tika-parsers-standard-package 
> (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
> Reporter: Aamir
> Priority: Major
> Attachments: arabic.pdf, arabic.txt
>
>
> Attached is a PDF with arabic text in it. 
> When parsed using tika version 2.6.0, it produces gibberish characters. 
> The generated text doc is also attached which contains the parsed text. 
> Most of the other Arabic PDFs parse fine, but this one is giving this output. 





[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aamir updated TIKA-4231:

Description: 
Attached is a PDF with arabic text in it. 
When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish 
characters. 

The generated text doc is also attached which contains the parsed text. 

Most of the other Arabic PDFs parse fine, but this one is giving this output. 

  was:
Attached is a PDF with arabic text in it. 
When parsed using tika version 2.6.0, it produces gibberish characters. 

The generated text doc is also attached which contains the parsed text. 

Most of the other Arabic PDFs parse fine, but this one is giving this output. 


> Parsing Arabic PDF is returning bad data
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.6.0, 2.9.1
> Environment: I am using Java 18. And using maven dependency 
> tika-parsers-standard-package 
> (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
> Reporter: Aamir
> Priority: Major
> Attachments: arabic.pdf, arabic.txt
>
>
> Attached is a PDF with arabic text in it. 
> When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish 
> characters. 
> The generated text doc is also attached which contains the parsed text. 
> Most of the other Arabic PDFs parse fine, but this one is giving this output. 





[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832289#comment-17832289
 ] 

Aamir commented on TIKA-4231:
-

The problem persists with 2.9.1.
I am updating the versions in this ticket so that it is clear that the 
latest version has the issue as well.

> Parsing Arabic PDF is returning bad data
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.6.0
> Environment: I am using Java 18. And using maven dependency 
> tika-parsers-standard-package 
> (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
> Reporter: Aamir
> Priority: Major
> Attachments: arabic.pdf, arabic.txt
>
>
> Attached is a PDF with arabic text in it. 
> When parsed using tika version 2.6.0, it produces gibberish characters. 
> The generated text doc is also attached which contains the parsed text. 
> Most of the other Arabic PDFs parse fine, but this one is giving this output. 





[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832284#comment-17832284
 ] 

Tilman Hausherr commented on TIKA-4231:
---

This doesn't change my argument. The latest version is 2.9.1, please try with 
that one.



[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aamir updated TIKA-4231:

Description: 
Attached is a PDF with arabic text in it. 
When parsed using tika version 2.6.0, it produces gibberish characters. 

The generated text doc is also attached which contains the parsed text. 

Most of the other Arabic PDFs parse fine, but this one is giving this output. 

  was:
Attached is a PDF with arabic text in it. 
When parsed using PDFBox version 2.6.0, it produces gibberish characters. 

The generated text doc is also attached which contains the parsed text. 

Most of the other Arabic PDFs parse fine, but this one is giving this output. 




[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832260#comment-17832260
 ] 

Aamir commented on TIKA-4231:
-

Sorry, I meant tika-parsers-standard-package 2.6.0

> Parsing Arabic PDF is returning bad data
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.6.0
> Environment: I am using Java 18. And using maven dependency 
> tika-parsers-standard-package 
> (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
> Reporter: Aamir
> Priority: Major
> Attachments: arabic.pdf, arabic.txt
>
>
> Attached is a PDF with arabic text in it. 
> When parsed using PDFBox version 2.6.0, it produces gibberish characters. 
> The generated text doc is also attached which contains the parsed text. 
> Most of the other Arabic PDFs parse fine, but this one is giving this output. 





[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832258#comment-17832258
 ] 

Tilman Hausherr commented on TIKA-4231:
---

The current Tika version is 2.9.1, soon to be 2.9.2. There is no "PDFBox
version 2.6.0".






[jira] [Created] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)
Aamir created TIKA-4231:
---

 Summary: Parsing Arabic PDF is returning bad data
 Key: TIKA-4231
 URL: https://issues.apache.org/jira/browse/TIKA-4231
 Project: Tika
  Issue Type: Bug
Affects Versions: 2.6.0
 Environment: I am using Java 18. And using maven dependency 
tika-parsers-standard-package 
(https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)

 
Reporter: Aamir
 Attachments: arabic.pdf, arabic.txt

Attached is a PDF with Arabic text in it.
When parsed using PDFBox version 2.6.0, it produces gibberish characters. 

The generated text doc is also attached which contains the parsed text. 

Most of the other Arabic PDFs parse fine, but this one is giving this output. 
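
For anyone trying to spot files like this one in a larger batch, a rough check
is to measure how much of the extracted text is actually in the Arabic script.
The sketch below is only a heuristic using the JDK's Character.UnicodeScript;
the class and method names (ScriptShare, arabicShare) are made up for
illustration and are not part of Tika or PDFBox. It will not catch output that
is in the Arabic script but still wrong.

```java
final class ScriptShare {
    private ScriptShare() {}

    /**
     * Returns the fraction of letters in the text that belong to the Arabic
     * script, or 0.0 if the text contains no letters at all.
     * Hypothetical helper, not a Tika API.
     */
    static double arabicShare(String text) {
        // Count all letters first so digits, spaces and punctuation are ignored.
        long letters = text.codePoints().filter(Character::isLetter).count();
        if (letters == 0) {
            return 0.0;
        }
        long arabic = text.codePoints()
                .filter(Character::isLetter)
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC)
                .count();
        return (double) arabic / letters;
    }
}
```

A low share for a document that is known to be Arabic would suggest the
extraction went wrong for that file; where to put the threshold is a judgment
call.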





[PR] Tika 4181 grpc [tika]

2024-03-29 Thread via GitHub


nddipiazza opened a new pull request, #1702:
URL: https://github.com/apache/tika/pull/1702

   Add an Apache Tika GRPC Server


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: JUnit4 dependency with Grpc

2024-03-29 Thread Nicholas DiPiazza
Never mind, I found a way to make it work with JUnit 5 after some googling.

On Fri, Mar 29, 2024 at 3:01 AM Nicholas DiPiazza <
nicholas.dipia...@gmail.com> wrote:

> Is there some easy way I can relax the JUnit4 ban for the gRPC service?
>
>
>


JUnit4 dependency with Grpc

2024-03-29 Thread Nicholas DiPiazza
Is there some easy way I can relax the JUnit4 ban for the gRPC service?


[jira] [Commented] (TIKA-2696) Support output of Tesseract OSD output for psm mode 0

2024-03-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832055#comment-17832055
 ] 

ASF GitHub Bot commented on TIKA-2696:
--

THausherr commented on PR #246:
URL: https://github.com/apache/tika/pull/246#issuecomment-2026763252

   This is a closed issue from years ago; please ask this on the user mailing
list (don't forget to subscribe) or on stackoverflow.com.




> Support output of Tesseract OSD output for psm mode 0
> -
>
> Key: TIKA-2696
> URL: https://issues.apache.org/jira/browse/TIKA-2696
> Project: Tika
>  Issue Type: Improvement
>  Components: ocr
>Reporter: August Valera
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.2.0
>
>
> TIKA-2357 added support for additional PSM (page segmentation modes) for 
> Tesseract OCR, including mode 0, which is {{Orientation and script detection 
> (OSD) only}}, meaning it does not perform OCR, just outputs orientation and 
> script information.
> An example usage of mode 0:
> {code:java}
> $ tesseract infile.png outfile --psm 0 -l osd
> {code}
> In this mode, the usual {{outfile.txt}} is not created. Instead, and similar 
> to other modes that run OSD in addition to extraction, the result is an 
> {{outfile.osd}} file, like so:
> {code:java}
> Page 1
> Warning. Invalid resolution 0 dpi. Using 70 instead.
> Estimating resolution as 212
> Page number: 0
> Orientation in degrees: 0
> Rotate: 0
> Orientation confidence: 13.73
> Script: Latin
> Script confidence: 4.78
> {code}
> However, {{TesseractOCRParser#parse(...)}} is 
> [coded|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java#L437]
>  to only read the contents of {{outfile.txt}} (alternatively 
> {{outfile.hocr}}) in all modes, so mode 0 outputs nothing regardless of input.
> This is consistent with Tika's goal to output extracted text, but against the 
> intention of the user expecting OSD output.
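
The OSD report quoted above is a list of "key: value" lines, so it can be
turned into a map with very little code. The following is a minimal sketch of
that idea, not the way TesseractOCRParser handles it; the class name OsdOutput
and its parse method are hypothetical, and it assumes the report has already
been read into a String.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OsdOutput {
    private OsdOutput() {}

    /**
     * Parses "key: value" lines from an OSD report into an ordered map.
     * Lines without a ": " separator (such as "Page 1" or resolution
     * warnings) are skipped. Hypothetical helper, not a Tika API.
     */
    static Map<String, String> parse(String osdText) {
        Map<String, String> fields = new LinkedHashMap<>();
        // \R matches any line break, so both \n and \r\n input work.
        for (String line : osdText.split("\\R")) {
            int sep = line.indexOf(": ");
            if (sep <= 0) {
                continue; // not a key/value line
            }
            String key = line.substring(0, sep).trim();
            String value = line.substring(sep + 2).trim();
            fields.put(key, value);
        }
        return fields;
    }
}
```

With the sample report above, parse(...).get("Script") would give "Latin" and
parse(...).get("Orientation in degrees") would give "0". Values are left as
strings so the caller can decide how to convert the confidence numbers.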





Re: [PR] TIKA-2696 Add support for OSD output, contributed by @4U6U57 [tika]

2024-03-29 Thread via GitHub


THausherr commented on PR #246:
URL: https://github.com/apache/tika/pull/246#issuecomment-2026763252

   This is a closed issue from years ago; please ask this on the user mailing
list (don't forget to subscribe) or on stackoverflow.com.





Re: [PR] Bump commons-io:commons-io from 2.15.1 to 2.16.0 [tika]

2024-03-29 Thread via GitHub


THausherr merged PR #1701:
URL: https://github.com/apache/tika/pull/1701





[jira] [Commented] (TIKA-2696) Support output of Tesseract OSD output for psm mode 0

2024-03-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832050#comment-17832050
 ] 

ASF GitHub Bot commented on TIKA-2696:
--

Tarik37 commented on PR #246:
URL: https://github.com/apache/tika/pull/246#issuecomment-2026729362

   Hello, I am currently using the Tika 2.9.1 server version and need the
output of the OSD in my metadata, particularly the value of the script (Latin,
Cyrillic, etc.). So my questions are the following:
   Does my Tika server version support this? Is it possible?
   If yes, how can I configure my Tika server?
   Thanks for your work (also, English is not my native language).









Re: [PR] TIKA-2696 Add support for OSD output, contributed by @4U6U57 [tika]

2024-03-29 Thread via GitHub


Tarik37 commented on PR #246:
URL: https://github.com/apache/tika/pull/246#issuecomment-2026729362

   Hello, I am currently using the Tika 2.9.1 server version and need the
output of the OSD in my metadata, particularly the value of the script (Latin,
Cyrillic, etc.). So my questions are the following:
   Does my Tika server version support this? Is it possible?
   If yes, how can I configure my Tika server?
   Thanks for your work (also, English is not my native language).

