[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832336#comment-17832336 ]
ASF GitHub Bot commented on TIKA-4181: --
bartek commented on code in PR #1702: URL: https://github.com/apache/tika/pull/1702#discussion_r1544981545
## tika-pipes/tika-grpc/src/main/proto/tika.proto: ## Review Comment:
For your consideration @nddipiazza, I ran `buf lint` on this protobuf (as I am syncing it to a local repository for development purposes) and here's the report:
```
services/tika/pbtika/tika.proto:29:9:Service name "Tika" should be suffixed with "Service".
services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:40:RPC request type "FetchAndParseRequest" should be named "FetchAndParseServerSideStreamingRequest" or "TikaFetchAndParseServerSideStreamingRequest".
services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:37:50:RPC request type "FetchAndParseRequest" should be named "FetchAndParseBiDirectionalStreamingRequest" or "TikaFetchAndParseBiDirectionalStreamingRequest".
services/tika/pbtika/tika.proto:42:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:52:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:61:10:Field name "fetcherName" should be lower_snake_case, such as "fetcher_name".
services/tika/pbtika/tika.proto:62:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
services/tika/pbtika/tika.proto:67:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
services/tika/pbtika/tika.proto:85:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:90:9:Field name "pageNumber" should be lower_snake_case, such as "page_number".
services/tika/pbtika/tika.proto:91:9:Field name "numFetchersPerPage" should be lower_snake_case, such as "num_fetchers_per_page".
services/tika/pbtika/tika.proto:95:28:Field name "getFetcherReply" should be lower_snake_case, such as "get_fetcher_reply".
Generating protobufs for ./proto/pbingest
services/tika/pbtika/tika.proto:29:9:Service name "Tika" should be suffixed with "Service".
services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:40:RPC request type "FetchAndParseRequest" should be named "FetchAndParseServerSideStreamingRequest" or "TikaFetchAndParseServerSideStreamingRequest".
services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:37:50:RPC request type "FetchAndParseRequest" should be named "FetchAndParseBiDirectionalStreamingRequest" or "TikaFetchAndParseBiDirectionalStreamingRequest".
services/tika/pbtika/tika.proto:42:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:52:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:61:10:Field name "fetcherName" should be lower_snake_case, such as "fetcher_name".
services/tika/pbtika/tika.proto:62:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
services/tika/pbtika/tika.proto:67:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
services/tika/pbtika/tika.proto:85:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:90:9:Field name
```
Re: [PR] TIKA-4181 - Tika Pipes Grpc Server [tika]
bartek commented on code in PR #1702: URL: https://github.com/apache/tika/pull/1702#discussion_r1544981545
## tika-pipes/tika-grpc/src/main/proto/tika.proto: ## Review Comment:
For your consideration @nddipiazza, I ran `buf lint` on this protobuf (as I am syncing it to a local repository for development purposes) and here's the report:
```
services/tika/pbtika/tika.proto:29:9:Service name "Tika" should be suffixed with "Service".
services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:40:RPC request type "FetchAndParseRequest" should be named "FetchAndParseServerSideStreamingRequest" or "TikaFetchAndParseServerSideStreamingRequest".
services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:37:50:RPC request type "FetchAndParseRequest" should be named "FetchAndParseBiDirectionalStreamingRequest" or "TikaFetchAndParseBiDirectionalStreamingRequest".
services/tika/pbtika/tika.proto:42:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:52:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:61:10:Field name "fetcherName" should be lower_snake_case, such as "fetcher_name".
services/tika/pbtika/tika.proto:62:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
services/tika/pbtika/tika.proto:67:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
services/tika/pbtika/tika.proto:85:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:90:9:Field name "pageNumber" should be lower_snake_case, such as "page_number".
services/tika/pbtika/tika.proto:91:9:Field name "numFetchersPerPage" should be lower_snake_case, such as "num_fetchers_per_page".
services/tika/pbtika/tika.proto:95:28:Field name "getFetcherReply" should be lower_snake_case, such as "get_fetcher_reply".
Generating protobufs for ./proto/pbingest
services/tika/pbtika/tika.proto:29:9:Service name "Tika" should be suffixed with "Service".
services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:40:RPC request type "FetchAndParseRequest" should be named "FetchAndParseServerSideStreamingRequest" or "TikaFetchAndParseServerSideStreamingRequest".
services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:37:50:RPC request type "FetchAndParseRequest" should be named "FetchAndParseBiDirectionalStreamingRequest" or "TikaFetchAndParseBiDirectionalStreamingRequest".
services/tika/pbtika/tika.proto:42:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:52:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:61:10:Field name "fetcherName" should be lower_snake_case, such as "fetcher_name".
services/tika/pbtika/tika.proto:62:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
services/tika/pbtika/tika.proto:67:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
services/tika/pbtika/tika.proto:85:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:90:9:Field name "pageNumber" should be lower_snake_case, such as "page_number".
services/tika/pbtika/tika.proto:91:9:Field name "numFetchersPerPage" should be lower_snake_case, such as "num_fetchers_per_page".
```
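The field-name findings in the lint report are mechanical: each camelCase name maps to a lower_snake_case one. A minimal Python sketch of that mapping follows, assuming a simple regex-based conversion is enough for the names flagged here. The helper `to_lower_snake_case` and the `FLAGGED_FIELDS` list are hypothetical, written for illustration, and are not part of buf or Tika.

```python
import re

# Field names flagged in the report (copied from the lint output).
FLAGGED_FIELDS = [
    "fetcherClass",
    "fetcherName",
    "fetchKey",
    "pageNumber",
    "numFetchersPerPage",
    "getFetcherReply",
]


def to_lower_snake_case(name: str) -> str:
    """Convert a camelCase identifier to lower_snake_case (hypothetical helper)."""
    # Put an underscore before every uppercase letter that follows a
    # lowercase letter or digit, then lowercase the whole identifier.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


# Old name -> suggested new name, in the same spirit as the lint hints.
RENAMES = {name: to_lower_snake_case(name) for name in FLAGGED_FIELDS}

if __name__ == "__main__":
    for old, new in RENAMES.items():
        print(f"{old} -> {new}")
```

Since protobuf's binary encoding identifies fields by number rather than by name, a rename of this kind should leave the wire format untouched as long as the field numbers stay the same.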
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832293#comment-17832293 ] Aamir commented on TIKA-4231: - No, this doesn't look better. Actually, I would say that it looks worse than before. > Parsing Arabic PDF is returning bad data > > > Key: TIKA-4231 > URL: https://issues.apache.org/jira/browse/TIKA-4231 > Project: Tika > Issue Type: Bug >Affects Versions: 2.6.0, 2.9.1 > Environment: I am using Java 18. And using maven dependency > tika-parsers-standard-package > ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)] > >Reporter: Aamir >Priority: Major > Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt > > > Attached is a PDF with arabic text in it. > When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish > characters. > The generated text doc is also attached which contains the parsed text. > Most of the other Arabic PDFs parse fine, but this one is giving this output. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832291#comment-17832291 ] Tilman Hausherr commented on TIKA-4231: --- I have attached an extraction made with PDFBox 2.0.31: [^arabic-pdfbox.txt]. Is this better, or not? I've added a BOM and removed the 00 bytes. The Tika extraction instead contains many "ef bf bd" byte sequences, which is the UTF-8 encoding of the replacement character �. A possible explanation for why Adobe Reader works better is that this file uses the "ActualText" feature, which PDFBox doesn't support (PDFBOX-3248).
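The byte-level observation in this comment can be checked with a few lines of Python. This is a minimal sketch, assuming the extracted text is UTF-8; the helper names `count_replacement_chars` and `strip_nul_and_add_bom` are hypothetical and only illustrate the two steps mentioned (counting U+FFFD, and removing 00 bytes while adding a BOM). It is not the procedure that was actually used to produce the attachment.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, encoded in UTF-8 as the bytes ef bf bd


def count_replacement_chars(data: bytes) -> int:
    """Count U+FFFD characters in UTF-8 extraction output (hypothetical helper)."""
    # errors="replace" also turns undecodable bytes into U+FFFD, so the
    # count covers both pre-existing and newly substituted characters.
    return data.decode("utf-8", errors="replace").count(REPLACEMENT_CHAR)


def strip_nul_and_add_bom(text: str) -> bytes:
    """Drop NUL characters and encode as UTF-8 with a leading BOM."""
    cleaned = text.replace("\x00", "")
    # The "utf-8-sig" codec prepends the BOM bytes ef bb bf when encoding.
    return cleaned.encode("utf-8-sig")


if __name__ == "__main__":
    print(REPLACEMENT_CHAR.encode("utf-8").hex(" "))
    print(count_replacement_chars(b"ab\xef\xbf\xbdc\xef\xbf\xbd"))
    print(strip_nul_and_add_bom("a\x00b"))
```

A high count from `count_replacement_chars` on arabic.txt would suggest that the original characters were already lost before the text was written out, which is consistent with the explanation offered in the comment.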
[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4231: -- Attachment: arabic-pdfbox.txt
[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aamir updated TIKA-4231: Affects Version/s: 2.9.1
[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aamir updated TIKA-4231: Description: Attached is a PDF with arabic text in it. When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish characters. The generated text doc is also attached which contains the parsed text. Most of the other Arabic PDFs parse fine, but this one is giving this output. was: Attached is a PDF with arabic text in it. When parsed using tika version 2.6.0, it produces gibberish characters. The generated text doc is also attached which contains the parsed text. Most of the other Arabic PDFs parse fine, but this one is giving this output.
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832289#comment-17832289 ] Aamir commented on TIKA-4231: - The problem persists with 2.9.1. I am updating the versions in this ticket as well, so that it is clear that the latest version also has the issue.
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832284#comment-17832284 ] Tilman Hausherr commented on TIKA-4231: --- This doesn't change my argument. The latest version is 2.9.1, please try with that one.
[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aamir updated TIKA-4231: Description: Attached is a PDF with arabic text in it. When parsed using tika version 2.6.0, it produces gibberish characters. The generated text doc is also attached which contains the parsed text. Most of the other Arabic PDFs parse fine, but this one is giving this output. was: Attached is a PDF with arabic text in it. When parsed using PDFBox version 2.6.0, it produces gibberish characters. The generated text doc is also attached which contains the parsed text. Most of the other Arabic PDFs parse fine, but this one is giving this output.
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832260#comment-17832260 ] Aamir commented on TIKA-4231: - Sorry, I meant tika-parsers-standard-package 2.6.0
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832258#comment-17832258 ] Tilman Hausherr commented on TIKA-4231: --- The current tika version is 2.9.1, soon to be 2.9.2. There is no "PDFBox version 2.6.0".
[jira] [Created] (TIKA-4231) Parsing Arabic PDF is returning bad data
Aamir created TIKA-4231: --- Summary: Parsing Arabic PDF is returning bad data Key: TIKA-4231 URL: https://issues.apache.org/jira/browse/TIKA-4231 Project: Tika Issue Type: Bug Affects Versions: 2.6.0 Environment: I am using Java 18. And using maven dependency tika-parsers-standard-package ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)] Reporter: Aamir Attachments: arabic.pdf, arabic.txt Attached is a PDF with arabic text in it. When parsed using PDFBox version 2.6.0, it produces gibberish characters. The generated text doc is also attached which contains the parsed text. Most of the other Arabic PDFs parse fine, but this one is giving this output. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] Tika 4181 grpc [tika]
nddipiazza opened a new pull request, #1702: URL: https://github.com/apache/tika/pull/1702 Add an Apache Tika GRPC Server -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: JUnit4 dependency with Grpc
Never mind, I found a way to make it work with JUnit 5 after some googling.

On Fri, Mar 29, 2024 at 3:01 AM Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:

> Is there some easy way I can relax the JUnit4 ban for the gRPC service?
JUnit4 dependency with Grpc
Is there some easy way I can relax the JUnit4 ban for the gRPC service?
[jira] [Commented] (TIKA-2696) Support output of Tesseract OSD output for psm mode 0
[ https://issues.apache.org/jira/browse/TIKA-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832055#comment-17832055 ]

ASF GitHub Bot commented on TIKA-2696:

THausherr commented on PR #246:
URL: https://github.com/apache/tika/pull/246#issuecomment-2026763252

This is a closed issue from years ago; please ask this on the users mailing list (don't forget to subscribe) or on stackoverflow.com.

> Support output of Tesseract OSD output for psm mode 0
>
> Key: TIKA-2696
> URL: https://issues.apache.org/jira/browse/TIKA-2696
> Project: Tika
> Issue Type: Improvement
> Components: ocr
> Reporter: August Valera
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 2.2.0
>
> TIKA-2357 added support for additional PSM (page segmentation modes) for
> Tesseract OCR, including mode 0, which is {{Orientation and script detection
> (OSD) only}}, meaning it does not perform OCR, just outputs orientation and
> script information.
> An example usage of mode 0:
> {code:java}
> $ tesseract infile.png outfile --psm 0 -l osd
> {code}
> In this mode, the usual {{outfile.txt}} is not created. Instead, and similar
> to other modes that run OSD in addition to extraction, the result is an
> {{outfile.osd}} file, like so:
> {code:java}
> Page 1
> Warning. Invalid resolution 0 dpi. Using 70 instead.
> Estimating resolution as 212
> Page number: 0
> Orientation in degrees: 0
> Rotate: 0
> Orientation confidence: 13.73
> Script: Latin
> Script confidence: 4.78
> {code}
> However, {{TesseractOCRParser#parse(...)}} is
> [coded|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java#L437]
> to only read the contents of {{outfile.txt}} (alternatively
> {{outfile.hocr}}) in all modes, so mode 0 outputs nothing regardless of input.
> This is consistent with Tika's goal to output extracted text, but against the
> intention of the user expecting OSD output.
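The {{outfile.osd}} listing quoted in TIKA-2696 above is a set of "key: value" lines mixed with a page marker and Tesseract warnings. As a rough illustration of what reading that file could involve, here is a minimal sketch that assumes the layout shown in the issue. The names OsdOutputSketch and parseOsd are hypothetical; this is not part of Tika and is not how TesseractOCRParser handles OSD output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OsdOutputSketch {

    // Hypothetical helper (not a Tika API): collects "key: value" lines from
    // the text of an .osd file into an insertion-ordered map. Lines without a
    // colon, such as "Page 1" or resolution warnings, are skipped.
    public static Map<String, String> parseOsd(String osdText) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : osdText.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (!key.isEmpty() && !value.isEmpty()) {
                fields.put(key, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample text copied from the issue description.
        String sample = String.join("\n",
                "Page 1",
                "Warning. Invalid resolution 0 dpi. Using 70 instead.",
                "Estimating resolution as 212",
                "Page number: 0",
                "Orientation in degrees: 0",
                "Rotate: 0",
                "Orientation confidence: 13.73",
                "Script: Latin",
                "Script confidence: 4.78");
        Map<String, String> fields = parseOsd(sample);
        System.out.println("Script = " + fields.get("Script"));
        System.out.println("Rotate = " + fields.get("Rotate"));
    }
}
```

A caller interested in the script value would look up the "Script" key; how such values would be mapped onto Tika metadata is left open here.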
Re: [PR] TIKA-2696 Add support for OSD output, contributed by @4U6U57 [tika]
THausherr commented on PR #246:
URL: https://github.com/apache/tika/pull/246#issuecomment-2026763252

This is a closed issue from years ago; please ask this on the users mailing list (don't forget to subscribe) or on stackoverflow.com.
Re: [PR] Bump commons-io:commons-io from 2.15.1 to 2.16.0 [tika]
THausherr merged PR #1701: URL: https://github.com/apache/tika/pull/1701
[jira] [Commented] (TIKA-2696) Support output of Tesseract OSD output for psm mode 0
[ https://issues.apache.org/jira/browse/TIKA-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832050#comment-17832050 ]

ASF GitHub Bot commented on TIKA-2696:

Tarik37 commented on PR #246:
URL: https://github.com/apache/tika/pull/246#issuecomment-2026729362

Hello, I am currently using the Tika 2.9.1 server version and need the output of the OSD in my metadata, particularly the value of the script (Latin, Cyrillic, etc.). So my questions are the following: Does my server version of Tika integrate it? Is it possible? If yes, how can I configure my Tika server? Thanks for your work (also, English is not my native language).

> Support output of Tesseract OSD output for psm mode 0
>
> Key: TIKA-2696
> URL: https://issues.apache.org/jira/browse/TIKA-2696
> Project: Tika
> Issue Type: Improvement
> Components: ocr
> Reporter: August Valera
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 2.2.0
>
> TIKA-2357 added support for additional PSM (page segmentation modes) for
> Tesseract OCR, including mode 0, which is {{Orientation and script detection
> (OSD) only}}, meaning it does not perform OCR, just outputs orientation and
> script information.
> An example usage of mode 0:
> {code:java}
> $ tesseract infile.png outfile --psm 0 -l osd
> {code}
> In this mode, the usual {{outfile.txt}} is not created. Instead, and similar
> to other modes that run OSD in addition to extraction, the result is an
> {{outfile.osd}} file, like so:
> {code:java}
> Page 1
> Warning. Invalid resolution 0 dpi. Using 70 instead.
> Estimating resolution as 212
> Page number: 0
> Orientation in degrees: 0
> Rotate: 0
> Orientation confidence: 13.73
> Script: Latin
> Script confidence: 4.78
> {code}
> However, {{TesseractOCRParser#parse(...)}} is
> [coded|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java#L437]
> to only read the contents of {{outfile.txt}} (alternatively
> {{outfile.hocr}}) in all modes, so mode 0 outputs nothing regardless of input.
> This is consistent with Tika's goal to output extracted text, but against the
> intention of the user expecting OSD output.
Re: [PR] TIKA-2696 Add support for OSD output, contributed by @4U6U57 [tika]
Tarik37 commented on PR #246:
URL: https://github.com/apache/tika/pull/246#issuecomment-2026729362

Hello, I am currently using the Tika 2.9.1 server version and need the output of the OSD in my metadata, particularly the value of the script (Latin, Cyrillic, etc.). So my questions are the following: Does my server version of Tika integrate it? Is it possible? If yes, how can I configure my Tika server? Thanks for your work (also, English is not my native language).