[jira] [Commented] (TIKA-3417) Running tika-docker as non-root user

2021-05-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351863#comment-17351863
 ] 

ASF GitHub Bot commented on TIKA-3417:
--

lewismc commented on pull request #4:
URL: https://github.com/apache/tika-docker/pull/4#issuecomment-848865170


   @philipsoutham I can replicate the build and test results
   ```
   404c9ade89429296fb846060d2d3f13105b6f9b1bf8d96e9998fae304470c863
   Image: apache/tika:1.26 - Passed
   1.26
   1.26
   d0bb05c60afff50ed8d6c84995984dc7d8ecd0cedce8e044f9f60470bcc4aac9
   Image: apache/tika:1.26-full - Passed
   1.26-full
   1.26-full
   ```
   I experienced NO issues with the docker compositions for 
`docker-compose-tika-customocr.yml`, `docker-compose-tika-grobid.yml`, 
`docker-compose-tika-ner.yml` or `docker-compose-tika-vision.yml`. 
   
   I was **UNABLE** to reproduce the issue you describe above regarding the NER 
example. Can you provide more detail so I can reproduce it?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Running tika-docker as non-root user
> 
>
> Key: TIKA-3417
> URL: https://issues.apache.org/jira/browse/TIKA-3417
> Project: Tika
>  Issue Type: Improvement
>  Components: docker, tika-docker
>Reporter: Lewis John McGibbney
>Assignee: Philip Southam
>Priority: Major
>
> The PR and context can be found at 
> https://github.com/apache/tika-docker/pull/4



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3400) Use equals for Object and String Comparison Instead of ==

2021-05-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349384#comment-17349384
 ] 

ASF GitHub Bot commented on TIKA-3400:
--

tballison commented on pull request #441:
URL: https://github.com/apache/tika/pull/441#issuecomment-846109800


   I was just able to replicate that with Java 11 on a Mac. Ubuntu with Java 8 
passes... Ugh... I pushed a simple fix for now.




> Use equals for Object and String Comparison Instead of ==
> -
>
> Key: TIKA-3400
> URL: https://issues.apache.org/jira/browse/TIKA-3400
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> equals() should be used for object and string comparison, because == 
> compares them by identity.
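
As a minimal illustration of the difference between identity and value comparison (a standalone sketch, not code from the Tika repository; the class name `EqualsVsIdentity` and its methods are made up for this example):

```java
// Standalone sketch; not taken from the Tika code base.
class EqualsVsIdentity {
    // Compares by identity: true only when both references point to the same object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // Compares by value: true when both strings hold the same characters.
    static boolean sameValue(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String literal = "tika";
        // new String(...) always creates a distinct object with equal content.
        String copy = new String("tika");

        System.out.println(sameReference(literal, copy)); // false
        System.out.println(sameValue(literal, copy));     // true
    }
}
```

The two strings hold the same characters, so only the `equals` check reports them as equal.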





[jira] [Commented] (TIKA-3400) Use equals for Object and String Comparison Instead of ==

2021-05-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349050#comment-17349050
 ] 

ASF GitHub Bot commented on TIKA-3400:
--

kamaci commented on pull request #441:
URL: https://github.com/apache/tika/pull/441#issuecomment-845736121


   @tballison I've updated the PR. Checks fail due to `MP4ParserTest.java:101`. 
I don't get the same exception in my local environment. Do you have any idea 
what the cause might be?




> Use equals for Object and String Comparison Instead of ==
> -
>
> Key: TIKA-3400
> URL: https://issues.apache.org/jira/browse/TIKA-3400
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> equals() should be used for object and string comparison, because == 
> compares them by identity.





[jira] [Commented] (TIKA-3357) Remove ambiguity in request handlers

2021-05-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347618#comment-17347618
 ] 

ASF GitHub Bot commented on TIKA-3357:
--

tballison merged pull request #430:
URL: https://github.com/apache/tika/pull/430


   




> Remove ambiguity in request handlers
> 
>
> Key: TIKA-3357
> URL: https://issues.apache.org/jira/browse/TIKA-3357
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 2.0.0, 1.26
>Reporter: Subhajit Das
>Priority: Major
>
> In Tika server, if a request with Accept */* or multiple Accept values 
> matches multiple resource handlers, the server issues a warning and the 
> choice of handler is somewhat uncertain.
>  
> This should be programmatically controlled, to maintain consistency or 
> change standards in future.





[jira] [Commented] (TIKA-3403) Create example for Transcription

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346429#comment-17346429
 ] 

ASF GitHub Bot commented on TIKA-3403:
--

tballison merged pull request #444:
URL: https://github.com/apache/tika/pull/444


   




> Create example for Transcription
> 
>
> Key: TIKA-3403
> URL: https://issues.apache.org/jira/browse/TIKA-3403
> Project: Tika
>  Issue Type: Improvement
>  Components: transcription
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.0.0
>
>
> Post-TIKA-94, we lack a transcription tutorial.
> I have implemented a tutorial and several improvements for the 
> [AmazonTranscribe|https://github.com/apache/tika/blob/main/tika-transcribe/src/main/java/org/apache/tika/transcribe/AmazonTranscribe.java].
> PR coming up!!!





[jira] [Commented] (TIKA-3400) Use equals for Object and String Comparison Instead of ==

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346388#comment-17346388
 ] 

ASF GitHub Bot commented on TIKA-3400:
--

kamaci commented on a change in pull request #441:
URL: https://github.com/apache/tika/pull/441#discussion_r633784794



##
File path: 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/onenote/FileNode.java
##
@@ -257,11 +257,11 @@ public void print(OneNoteDocument document, OneNotePtr pointer, int indentLevel)
 subType.revisionManifest.revisionRole);
 
 }
-if ((gctxid != ExtendedGUID.nil() ||
+if ((!gctxid.equals(ExtendedGUID.nil()) ||

Review comment:
   You are welcome!






> Use equals for Object and String Comparison Instead of ==
> -
>
> Key: TIKA-3400
> URL: https://issues.apache.org/jira/browse/TIKA-3400
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> equals() should be used for object and string comparison, because == 
> compares them by identity.





[jira] [Commented] (TIKA-3400) Use equals for Object and String Comparison Instead of ==

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346380#comment-17346380
 ] 

ASF GitHub Bot commented on TIKA-3400:
--

kamaci commented on a change in pull request #441:
URL: https://github.com/apache/tika/pull/441#discussion_r633781531



##
File path: 
tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/tokens/URLEmailNormalizingFilterFactory.java
##
@@ -69,11 +69,10 @@ public boolean incrementToken() throws IOException {
 return false;
 }
 //== is actually substantially faster than .equals(String)
-if (typeAtt.type() == UAX29URLEmailTokenizer.TOKEN_TYPES[UAX29URLEmailTokenizer.URL]) {
+if (typeAtt.type().equals(UAX29URLEmailTokenizer.TOKEN_TYPES[UAX29URLEmailTokenizer.URL])) {

Review comment:
   OK. I think they should have used an enum instead of a string array for 
such a thing :blush: I'll roll back those lines in my PR.
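
To illustrate the enum idea mentioned in this comment, here is a hypothetical sketch. `TokenType` and `TokenTypeCheck` are invented names for this example only; Lucene's `UAX29URLEmailTokenizer` exposes its token types as a `String[]`, as discussed above, not as an enum.

```java
// Hypothetical sketch; TokenType is an invented enum, not a Lucene or Tika type.
enum TokenType {
    ALPHANUM, NUM, URL, EMAIL
}

class TokenTypeCheck {
    // Enum constants are singletons, so == is both safe and cheap here.
    static boolean isUrl(TokenType type) {
        return type == TokenType.URL;
    }

    public static void main(String[] args) {
        System.out.println(isUrl(TokenType.URL));   // true
        System.out.println(isUrl(TokenType.EMAIL)); // false
    }
}
```

With an enum, identity comparison would not depend on which `String` instance a producer happens to pass around.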






> Use equals for Object and String Comparison Instead of ==
> -
>
> Key: TIKA-3400
> URL: https://issues.apache.org/jira/browse/TIKA-3400
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> equals() should be used for object and string comparison, because == 
> compares them by identity.





[jira] [Commented] (TIKA-3400) Use equals for Object and String Comparison Instead of ==

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346376#comment-17346376
 ] 

ASF GitHub Bot commented on TIKA-3400:
--

tballison commented on a change in pull request #441:
URL: https://github.com/apache/tika/pull/441#discussion_r633762174



##
File path: 
tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/tokens/URLEmailNormalizingFilterFactory.java
##
@@ -69,11 +69,10 @@ public boolean incrementToken() throws IOException {
 return false;
 }
 //== is actually substantially faster than .equals(String)
-if (typeAtt.type() == UAX29URLEmailTokenizer.TOKEN_TYPES[UAX29URLEmailTokenizer.URL]) {
+if (typeAtt.type().equals(UAX29URLEmailTokenizer.TOKEN_TYPES[UAX29URLEmailTokenizer.URL])) {

Review comment:
   This relies on Lucene not changing the underlying static strings: 
https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/email/UAX29URLEmailTokenizer.java#L61
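
A rough sketch of why the identity check depends on the producer reusing the exact same `String` instance. Everything here is hypothetical: `SharedConstantIdentity`, `TYPES` and the two helper methods are stand-ins invented for illustration and do not reproduce Lucene's code.

```java
// Hypothetical sketch; TYPES stands in for a tokenizer's static type table.
class SharedConstantIdentity {
    static final String[] TYPES = {"<ALPHANUM>", "<URL>"};
    static final int URL = 1;

    // Returns the exact String instance stored in TYPES.
    static String typeFromTable() {
        return TYPES[URL];
    }

    // Returns an equal String built at runtime, which is a different object.
    static String typeBuiltAtRuntime() {
        return new StringBuilder("<URL").append('>').toString();
    }

    public static void main(String[] args) {
        System.out.println(typeFromTable() == TYPES[URL]);           // true
        System.out.println(typeBuiltAtRuntime() == TYPES[URL]);      // false
        System.out.println(typeBuiltAtRuntime().equals(TYPES[URL])); // true
    }
}
```

As long as the producer hands out the table entry itself, `==` works; as soon as an equal but separately built string appears, only `equals` still matches.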

##
File path: 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/onenote/FileNode.java
##
@@ -257,11 +257,11 @@ public void print(OneNoteDocument document, OneNotePtr pointer, int indentLevel)
 subType.revisionManifest.revisionRole);
 
 }
-if ((gctxid != ExtendedGUID.nil() ||
+if ((!gctxid.equals(ExtendedGUID.nil()) ||

Review comment:
   To be clear, I'm not asking you to do the static thing on this issue.  
Your catch is important.  Thank you!






> Use equals for Object and String Comparison Instead of ==
> -
>
> Key: TIKA-3400
> URL: https://issues.apache.org/jira/browse/TIKA-3400
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> equals() should be used for object and string comparison, because == 
> compares them by identity.





[jira] [Commented] (TIKA-3400) Use equals for Object and String Comparison Instead of ==

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346377#comment-17346377
 ] 

ASF GitHub Bot commented on TIKA-3400:
--

kamaci commented on a change in pull request #441:
URL: https://github.com/apache/tika/pull/441#discussion_r633748947



##
File path: 
tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/tokens/URLEmailNormalizingFilterFactory.java
##
@@ -69,11 +69,10 @@ public boolean incrementToken() throws IOException {
 return false;
 }
 //== is actually substantially faster than .equals(String)
-if (typeAtt.type() == UAX29URLEmailTokenizer.TOKEN_TYPES[UAX29URLEmailTokenizer.URL]) {
+if (typeAtt.type().equals(UAX29URLEmailTokenizer.TOKEN_TYPES[UAX29URLEmailTokenizer.URL])) {

Review comment:
   So the parameter of `TypeAttribute#setType` can be exactly that String 
(`UAX29URLEmailTokenizer.TOKEN_TYPES[UAX29URLEmailTokenizer.URL]`)?






> Use equals for Object and String Comparison Instead of ==
> -
>
> Key: TIKA-3400
> URL: https://issues.apache.org/jira/browse/TIKA-3400
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> equals() should be used for object and string comparison, because == 
> compares them by identity.





[jira] [Commented] (TIKA-3317) Tika Pipes - add a solr fetch iterator

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346372#comment-17346372
 ] 

ASF GitHub Bot commented on TIKA-3317:
--

nddipiazza closed pull request #412:
URL: https://github.com/apache/tika/pull/412


   




> Tika Pipes - add a solr fetch iterator
> --
>
> Key: TIKA-3317
> URL: https://issues.apache.org/jira/browse/TIKA-3317
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-pipes
>Affects Versions: 2.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Add a solr-fetch-iterator to tika-pipes.





[jira] [Commented] (TIKA-3402) Remove Redundant Local Variables

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346367#comment-17346367
 ] 

ASF GitHub Bot commented on TIKA-3402:
--

tballison merged pull request #443:
URL: https://github.com/apache/tika/pull/443


   




> Remove Redundant Local Variables
> 
>
> Key: TIKA-3402
> URL: https://issues.apache.org/jira/browse/TIKA-3402
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Minor
> Fix For: 2.0.0
>
>
> Redundant local variables should be removed, except where they aid code 
> readability.
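
A small standalone sketch of what a redundant local variable can look like, and of a case where a local may be worth keeping. The class and method names are invented for this example and do not come from the Tika code base.

```java
// Standalone sketch; not taken from the Tika code base.
class RedundantLocal {
    // The local only holds the value for one line before it is returned.
    static int areaVerbose(int width, int height) {
        int result = width * height;
        return result;
    }

    // Same behavior without the intermediate variable.
    static int areaDirect(int width, int height) {
        return width * height;
    }

    // Here the local names an intermediate step, which can help the reader.
    static double hypotenuse(double a, double b) {
        double sumOfSquares = a * a + b * b;
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        System.out.println(areaVerbose(3, 4)); // 12
        System.out.println(areaDirect(3, 4));  // 12
        System.out.println(hypotenuse(3, 4));  // 5.0
    }
}
```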





[jira] [Commented] (TIKA-3317) Tika Pipes - add a solr fetch iterator

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346361#comment-17346361
 ] 

ASF GitHub Bot commented on TIKA-3317:
--

nddipiazza commented on pull request #412:
URL: https://github.com/apache/tika/pull/412#issuecomment-841848299


   closed in favor of https://github.com/apache/tika/pull/445




> Tika Pipes - add a solr fetch iterator
> --
>
> Key: TIKA-3317
> URL: https://issues.apache.org/jira/browse/TIKA-3317
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-pipes
>Affects Versions: 2.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Add a solr-fetch-iterator to tika-pipes.





[jira] [Commented] (TIKA-3399) Fix Non-Atomic Operations on Volatile Fields

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346358#comment-17346358
 ] 

ASF GitHub Bot commented on TIKA-3399:
--

tballison merged pull request #440:
URL: https://github.com/apache/tika/pull/440


   




> Fix Non-Atomic Operations on Volatile Fields
> 
>
> Key: TIKA-3399
> URL: https://issues.apache.org/jira/browse/TIKA-3399
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 2.0.0
>
>
> During a non-atomic operation on a volatile field, the field's value can 
> change between the read and the write, possibly invalidating the operation.
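
The read-then-write hazard on a volatile field can be sketched as follows. This is a standalone illustration, not the change made in the pull request; `VolatileCounter` and its methods are invented names, and `AtomicInteger` is shown as one common remedy rather than as the fix Tika adopted.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Standalone sketch; not taken from the Tika code base.
class VolatileCounter {
    // volatile makes writes visible to other threads, but ++ is still
    // three steps (read, add, write), so concurrent updates can be lost.
    private volatile int unsafeCount = 0;

    // AtomicInteger performs the read-modify-write as one atomic step.
    private final AtomicInteger safeCount = new AtomicInteger();

    void increment() {
        unsafeCount++;
        safeCount.incrementAndGet();
    }

    int unsafeValue() {
        return unsafeCount;
    }

    int safeValue() {
        return safeCount.get();
    }

    // Runs `threads` threads that each call increment() `perThread` times.
    static VolatileCounter run(int threads, int perThread) {
        VolatileCounter counter = new VolatileCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        VolatileCounter counter = run(4, 100_000);
        System.out.println("atomic:   " + counter.safeValue());   // always 400000
        System.out.println("volatile: " + counter.unsafeValue()); // may be lower
    }
}
```

The atomic counter always reaches the full total, while the volatile one can fall short whenever two threads interleave between the read and the write.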





[jira] [Commented] (TIKA-3401) Remove Pointless Bitwise Expressions

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346356#comment-17346356
 ] 

ASF GitHub Bot commented on TIKA-3401:
--

tballison merged pull request #442:
URL: https://github.com/apache/tika/pull/442


   




> Remove Pointless Bitwise Expressions
> 
>
> Key: TIKA-3401
> URL: https://issues.apache.org/jira/browse/TIKA-3401
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Minor
> Fix For: 2.0.0
>
>
> Pointless bitwise expressions should be removed for better readability of the 
> code.
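
For readers unfamiliar with the term, a few bitwise expressions that never change their operand are sketched below. This is a standalone example with invented names; it does not show the expressions that were actually removed from Tika.

```java
// Standalone sketch; not taken from the Tika code base.
class BitwiseIdentities {
    // Each operation below leaves the value unchanged, so it only adds noise.
    static int noisy(int value) {
        int a = value | 0;  // OR with 0 is a no-op
        int b = a & -1;     // AND with all bits set is a no-op
        int c = b ^ 0;      // XOR with 0 is a no-op
        return c << 0;      // shifting by 0 is a no-op
    }

    // Equivalent method with the pointless expressions removed.
    static int clean(int value) {
        return value;
    }

    public static void main(String[] args) {
        System.out.println(noisy(42) == clean(42)); // true
        System.out.println(noisy(-7) == clean(-7)); // true
    }
}
```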





[jira] [Commented] (TIKA-3400) Use equals for Object and String Comparison Instead of ==

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346347#comment-17346347
 ] 

ASF GitHub Bot commented on TIKA-3400:
--

tballison commented on a change in pull request #441:
URL: https://github.com/apache/tika/pull/441#discussion_r633735419



##
File path: 
tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/tokens/URLEmailNormalizingFilterFactory.java
##
@@ -69,11 +69,10 @@ public boolean incrementToken() throws IOException {
 return false;
 }
 //== is actually substantially faster than .equals(String)
-if (typeAtt.type() == UAX29URLEmailTokenizer.TOKEN_TYPES[UAX29URLEmailTokenizer.URL]) {
+if (typeAtt.type().equals(UAX29URLEmailTokenizer.TOKEN_TYPES[UAX29URLEmailTokenizer.URL])) {

Review comment:
   This was done out of a notional sense of efficiency. I'm not sure we 
need to change it.

##
File path: 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/onenote/FileNode.java
##
@@ -257,11 +257,11 @@ public void print(OneNoteDocument document, OneNotePtr pointer, int indentLevel)
 subType.revisionManifest.revisionRole);
 
 }
-if ((gctxid != ExtendedGUID.nil() ||
+if ((!gctxid.equals(ExtendedGUID.nil()) ||

Review comment:
   Good catch!  We should probably make a static constant ExtendedGUID.NIL 
to avoid unnecessary object creation.
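
The static-constant suggestion could look roughly like the sketch below. `Guid` is a simplified, hypothetical stand-in and is not Tika's `ExtendedGUID`; its fields and constructor are assumptions made only so the example compiles on its own.

```java
// Hypothetical sketch; Guid is a simplified stand-in, not Tika's ExtendedGUID.
final class Guid {
    // One shared immutable instance instead of a new object per nil() call.
    static final Guid NIL = new Guid(0L, 0L);

    private final long high;
    private final long low;

    Guid(long high, long low) {
        this.high = high;
        this.low = low;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof Guid)) {
            return false;
        }
        Guid that = (Guid) other;
        return high == that.high && low == that.low;
    }

    @Override
    public int hashCode() {
        return 31 * Long.hashCode(high) + Long.hashCode(low);
    }

    public static void main(String[] args) {
        Guid parsed = new Guid(0L, 0L);
        System.out.println(parsed == Guid.NIL);      // false
        System.out.println(parsed.equals(Guid.NIL)); // true
    }
}
```

Even with a shared constant, callers still need `equals`, because a value parsed from a file is a separate object with the same content.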






> Use equals for Object and String Comparison Instead of ==
> -
>
> Key: TIKA-3400
> URL: https://issues.apache.org/jira/browse/TIKA-3400
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> equals() should be used for object and string comparison, because == 
> compares them by identity.





[jira] [Commented] (TIKA-3400) Use equals for Object and String Comparison Instead of ==

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346311#comment-17346311
 ] 

ASF GitHub Bot commented on TIKA-3400:
--

tballison commented on a change in pull request #441:
URL: https://github.com/apache/tika/pull/441#discussion_r633736826



##
File path: 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/onenote/FileNode.java
##
@@ -257,11 +257,11 @@ public void print(OneNoteDocument document, OneNotePtr pointer, int indentLevel)
 subType.revisionManifest.revisionRole);
 
 }
-if ((gctxid != ExtendedGUID.nil() ||
+if ((!gctxid.equals(ExtendedGUID.nil()) ||

Review comment:
   Good catch!  We should probably make a static constant ExtendedGUID.NIL 
to avoid unnecessary object creation.






> Use equals for Object and String Comparison Instead of ==
> -
>
> Key: TIKA-3400
> URL: https://issues.apache.org/jira/browse/TIKA-3400
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> equals() should be used for object and string comparison, because == 
> compares them by identity.





[jira] [Commented] (TIKA-3400) Use equals for Object and String Comparison Instead of ==

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346308#comment-17346308
 ] 

ASF GitHub Bot commented on TIKA-3400:
--

tballison commented on a change in pull request #441:
URL: https://github.com/apache/tika/pull/441#discussion_r633735419



##
File path: 
tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/tokens/URLEmailNormalizingFilterFactory.java
##
@@ -69,11 +69,10 @@ public boolean incrementToken() throws IOException {
 return false;
 }
 //== is actually substantially faster than .equals(String)
-if (typeAtt.type() == UAX29URLEmailTokenizer.TOKEN_TYPES[UAX29URLEmailTokenizer.URL]) {
+if (typeAtt.type().equals(UAX29URLEmailTokenizer.TOKEN_TYPES[UAX29URLEmailTokenizer.URL])) {

Review comment:
   This was done out of a notional sense of efficiency. I'm not sure we 
need to change it.






> Use equals for Object and String Comparison Instead of ==
> -
>
> Key: TIKA-3400
> URL: https://issues.apache.org/jira/browse/TIKA-3400
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> equals() should be used for object and string comparison, because == 
> compares them by identity.





[jira] [Commented] (TIKA-3399) Fix Non-Atomic Operations on Volatile Fields

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346304#comment-17346304
 ] 

ASF GitHub Bot commented on TIKA-3399:
--

tballison merged pull request #440:
URL: https://github.com/apache/tika/pull/440


   




> Fix Non-Atomic Operations on Volatile Fields
> 
>
> Key: TIKA-3399
> URL: https://issues.apache.org/jira/browse/TIKA-3399
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 2.0.0
>
>
> During a non-atomic operation on a volatile field, the field's value can 
> change between the read and the write, possibly invalidating the operation.





[jira] [Commented] (TIKA-3402) Remove Redundant Local Variables

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346290#comment-17346290
 ] 

ASF GitHub Bot commented on TIKA-3402:
--

tballison merged pull request #443:
URL: https://github.com/apache/tika/pull/443


   




> Remove Redundant Local Variables
> 
>
> Key: TIKA-3402
> URL: https://issues.apache.org/jira/browse/TIKA-3402
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Minor
> Fix For: 1.27
>
>
> Redundant local variables should be removed, except where they aid code 
> readability.





[jira] [Commented] (TIKA-3401) Remove Pointless Bitwise Expressions

2021-05-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346291#comment-17346291
 ] 

ASF GitHub Bot commented on TIKA-3401:
--

tballison merged pull request #442:
URL: https://github.com/apache/tika/pull/442


   




> Remove Pointless Bitwise Expressions
> 
>
> Key: TIKA-3401
> URL: https://issues.apache.org/jira/browse/TIKA-3401
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Minor
> Fix For: 1.27
>
>
> Pointless bitwise expressions should be removed for better readability of the 
> code.





[jira] [Commented] (TIKA-3317) Tika Pipes - add a solr fetch iterator

2021-05-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345764#comment-17345764
 ] 

ASF GitHub Bot commented on TIKA-3317:
--

nddipiazza commented on pull request #412:
URL: https://github.com/apache/tika/pull/412#issuecomment-841848299


   closed in favor of https://github.com/apache/tika/pull/445




> Tika Pipes - add a solr fetch iterator
> --
>
> Key: TIKA-3317
> URL: https://issues.apache.org/jira/browse/TIKA-3317
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-pipes
>Affects Versions: 2.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Add a solr-fetch-iterator to tika-pipes.





[jira] [Commented] (TIKA-3317) Tika Pipes - add a solr fetch iterator

2021-05-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345748#comment-17345748
 ] 

ASF GitHub Bot commented on TIKA-3317:
--

nddipiazza closed pull request #412:
URL: https://github.com/apache/tika/pull/412


   




> Tika Pipes - add a solr fetch iterator
> --
>
> Key: TIKA-3317
> URL: https://issues.apache.org/jira/browse/TIKA-3317
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-pipes
>Affects Versions: 2.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Add a solr-fetch-iterator to tika-pipes.





[jira] [Commented] (TIKA-3357) Remove ambiguity in request handlers

2021-05-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344782#comment-17344782
 ] 

ASF GitHub Bot commented on TIKA-3357:
--

Subhajitdas298 commented on pull request #430:
URL: https://github.com/apache/tika/pull/430#issuecomment-841399256


   > @Subhajitdas298 I didn't get a chance to try yet. Does this PR address the 
annoying logging output we see when starting tika-server?
   
   Hi, this does not affect any startup logs. It only deals with the choice of 
REST handler method. When multiple handlers are valid, for example with an 
"Accept: */*" header, this priority list decides which one to choose.
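
The idea of a priority list resolving an ambiguous `Accept` header can be sketched as below. This is a hypothetical illustration only: `HandlerPriority`, the media types in `PRIORITY` and their order are all assumptions, and the sketch ignores quality values and partial wildcards such as `text/*`. It does not reproduce the code in the pull request.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the media types and their order are invented.
class HandlerPriority {
    // Earlier entries win when more than one handler matches the request.
    static final List<String> PRIORITY =
            Arrays.asList("text/plain", "text/html", "application/json");

    // Returns the highest-priority produced type allowed by the Accept header,
    // or null when no handler matches.
    static String choose(String acceptHeader) {
        for (String produced : PRIORITY) {
            if (accepts(acceptHeader, produced)) {
                return produced;
            }
        }
        return null;
    }

    // True when the header lists the produced type or the */* wildcard.
    static boolean accepts(String acceptHeader, String produced) {
        for (String part : acceptHeader.split(",")) {
            // Drop parameters such as ";q=0.8" before comparing.
            String range = part.split(";")[0].trim();
            if (range.equals("*/*") || range.equals(produced)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(choose("*/*"));                         // text/plain
        System.out.println(choose("application/json, text/html")); // text/html
        System.out.println(choose("image/png"));                   // null
    }
}
```

With a fixed order, the same request always resolves to the same handler, which is the consistency the issue asks for.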




> Remove ambiguity in request handlers
> 
>
> Key: TIKA-3357
> URL: https://issues.apache.org/jira/browse/TIKA-3357
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 2.0.0, 1.26
>Reporter: Subhajit Das
>Priority: Major
>
> In Tika server, if a request with Accept */* or multiple Accept values 
> matches multiple resource handlers, the server issues a warning and the 
> choice of handler is somewhat uncertain.
>  
> This should be programmatically controlled, to maintain consistency or 
> change standards in future.





[jira] [Commented] (TIKA-3340) LanguageProfile for Myanmar

2021-05-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344682#comment-17344682
 ] 

ASF GitHub Bot commented on TIKA-3340:
--

lewismc commented on pull request #421:
URL: https://github.com/apache/tika/pull/421#issuecomment-841330976


   @arky can you please update this PR so we can review and attempt to merge 
into main? Thank you




> LanguageProfile for Myanmar
> ---
>
> Key: TIKA-3340
> URL: https://issues.apache.org/jira/browse/TIKA-3340
> Project: Tika
>  Issue Type: Improvement
>  Components: languageidentifier
>Reporter: Arky
>Priority: Major
> Fix For: 2.0.0
>
> Attachments: 20210401-model.report.txt, 20210413.report.txt, 
> lang_comparisons.xlsx, table-summarized-truncated.txt.gz
>
>
> A language profile for detecting Myanmar/Burmese (my).





[jira] [Commented] (TIKA-3357) Remove ambiguity in request handlers

2021-05-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344680#comment-17344680
 ] 

ASF GitHub Bot commented on TIKA-3357:
--

lewismc commented on pull request #430:
URL: https://github.com/apache/tika/pull/430#issuecomment-841330165


   @Subhajitdas298 I didn't get a chance to try yet. Does this PR address the 
annoying logging output we see when starting tika-server?




> Remove ambiguity in request handlers
> 
>
> Key: TIKA-3357
> URL: https://issues.apache.org/jira/browse/TIKA-3357
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 2.0.0, 1.26
>Reporter: Subhajit Das
>Priority: Major
>
> In Tika server, if a request has Accept */* or multiple Accept values that 
> match multiple resource handlers, it raises a warning and leads 
> to somewhat uncertain handling.
>  
> This should be programmatically controlled, to maintain consistency or 
> change standards in future.





[jira] [Commented] (TIKA-3367) Add Bill of Materials (BOM) artifact

2021-05-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344679#comment-17344679
 ] 

ASF GitHub Bot commented on TIKA-3367:
--

lewismc commented on pull request #431:
URL: https://github.com/apache/tika/pull/431#issuecomment-841328978


   Have not tried this yet @grossws but I will try to this weekend :)




> Add Bill of Materials (BOM) artifact
> 
>
> Key: TIKA-3367
> URL: https://issues.apache.org/jira/browse/TIKA-3367
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Major
> Fix For: 2.0.0
>
>






[jira] [Commented] (TIKA-3403) Create example for Transcription

2021-05-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344269#comment-17344269
 ] 

ASF GitHub Bot commented on TIKA-3403:
--

rohan2810 commented on pull request #444:
URL: https://github.com/apache/tika/pull/444#issuecomment-840971142


   Sure @lewismc 
   @phantuanminh @abehara2 @nprate2




> Create example for Transcription
> 
>
> Key: TIKA-3403
> URL: https://issues.apache.org/jira/browse/TIKA-3403
> Project: Tika
>  Issue Type: Improvement
>  Components: transcription
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.0.0
>
>
> Post-TIKA-94, we lack a transcription tutorial.
> I have implemented a tutorial and several improvements for the 
> [AmazonTranscribe|https://github.com/apache/tika/blob/main/tika-transcribe/src/main/java/org/apache/tika/transcribe/AmazonTranscribe.java].
> PR coming up!!!





[jira] [Commented] (TIKA-3403) Create example for Transcription

2021-05-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344260#comment-17344260
 ] 

ASF GitHub Bot commented on TIKA-3403:
--

lewismc commented on pull request #444:
URL: https://github.com/apache/tika/pull/444#issuecomment-840967172


   @rohan2810 can you please tag the rest of the HackIllinois crew? Thanks




> Create example for Transcription
> 
>
> Key: TIKA-3403
> URL: https://issues.apache.org/jira/browse/TIKA-3403
> Project: Tika
>  Issue Type: Improvement
>  Components: transcription
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.0.0
>
>
> Post-TIKA-94, we lack a transcription tutorial.
> I have implemented a tutorial and several improvements for the 
> [AmazonTranscribe|https://github.com/apache/tika/blob/main/tika-transcribe/src/main/java/org/apache/tika/transcribe/AmazonTranscribe.java].
> PR coming up!!!





[jira] [Commented] (TIKA-3403) Create example for Transcription

2021-05-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344259#comment-17344259
 ] 

ASF GitHub Bot commented on TIKA-3403:
--

lewismc opened a new pull request #444:
URL: https://github.com/apache/tika/pull/444


   This PR addresses https://issues.apache.org/jira/browse/TIKA-3403
   In addition to implementing the example file, it proposes the following 
improvements
   * minor upgrade of aws libraries to `1.11.1018`
   * adds a new configuration option for the AWS transcriber allowing the client 
to write to a specific region, cf. `transcribe.REGION`
   * makes use of 
[SelectObjectContentRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/SelectObjectContentRequest.html)
 which filters the contents of an Amazon S3 object (transcription) based on a 
simple Structured Query Language (SQL) statement. In the request, along with 
the SQL expression, we specify JSON as the data serialization format of the 
object. Amazon S3 uses this to parse object data into records, and returns only 
records that match the specified SQL expression. In our case this means we ONLY 
return the transcription text. This dramatically (orders of magnitude) reduces 
the amount of data we egress from S3 to the client.
   * the implementation will now automatically create the bucket (to store the 
transcription) if one does not already exist. This is merely a utility 
feature.
   * introduces a LOT of exception handling and checks which will assist the 
client in debugging errors/anomalies. 
   * reformats GoogleTranslator.java with 4-space indents.
   
   Thanks about it.
   
   CC @rohan2810 FYI




> Create example for Transcription
> 
>
> Key: TIKA-3403
> URL: https://issues.apache.org/jira/browse/TIKA-3403
> Project: Tika
>  Issue Type: Improvement
>  Components: transcription
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.0.0
>
>
> Post-TIKA-94, we lack a transcription tutorial.
> I have implemented a tutorial and several improvements for the 
> [AmazonTranscribe|https://github.com/apache/tika/blob/main/tika-transcribe/src/main/java/org/apache/tika/transcribe/AmazonTranscribe.java].
> PR coming up!!!





[jira] [Commented] (TIKA-3402) Remove Redundant Local Variables

2021-05-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344086#comment-17344086
 ] 

ASF GitHub Bot commented on TIKA-3402:
--

kamaci opened a new pull request #443:
URL: https://github.com/apache/tika/pull/443


   




> Remove Redundant Local Variables
> 
>
> Key: TIKA-3402
> URL: https://issues.apache.org/jira/browse/TIKA-3402
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Minor
> Fix For: 1.27
>
>
> Redundant local variables should be removed, except where they aid code 
> readability.
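
A minimal Java sketch of the kind of cleanup described above. The class and 
method names are hypothetical and are not taken from the Tika codebase:

```java
// Hypothetical example, not Tika code.
class RedundantLocalExample {

    // Before: 'result' only holds a value that is returned on the next line.
    static int lengthWithRedundantLocal(String s) {
        int result = s.trim().length();
        return result;
    }

    // After: the expression is returned directly; behavior is unchanged.
    static int length(String s) {
        return s.trim().length();
    }

    public static void main(String[] args) {
        System.out.println(length("  tika  "));
    }
}
```

A local that gives a long expression a meaningful name can still be worth 
keeping, which is the readability exception the issue mentions.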





[jira] [Commented] (TIKA-3401) Remove Pointless Bitwise Expressions

2021-05-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344067#comment-17344067
 ] 

ASF GitHub Bot commented on TIKA-3401:
--

kamaci opened a new pull request #442:
URL: https://github.com/apache/tika/pull/442


   




> Remove Pointless Bitwise Expressions
> 
>
> Key: TIKA-3401
> URL: https://issues.apache.org/jira/browse/TIKA-3401
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Minor
> Fix For: 1.27
>
>
> Pointless bitwise expressions should be removed for better readability of the 
> code.
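
For illustration, a small Java sketch of what "pointless" bitwise expressions 
look like. The class and method names are hypothetical, not taken from the 
Tika codebase:

```java
// Hypothetical example, not Tika code.
class PointlessBitwiseExample {

    // Before: every operation below leaves the value unchanged.
    static int before(int flags) {
        int a = flags | 0;   // OR with 0 is a no-op
        int b = a & -1;      // AND with all bits set is a no-op
        int c = b ^ 0;       // XOR with 0 is a no-op
        return c << 0;       // shifting by 0 is a no-op
    }

    // After: the same result without the noise.
    static int after(int flags) {
        return flags;
    }

    public static void main(String[] args) {
        System.out.println(before(0x2A) == after(0x2A));
    }
}
```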





[jira] [Commented] (TIKA-1570) Seeking a stop method for better use with Apache Commons Daemon

2021-05-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344020#comment-17344020
 ] 

ASF GitHub Bot commented on TIKA-1570:
--

erotavlas edited a comment on pull request #324:
URL: https://github.com/apache/tika/pull/324#issuecomment-840559104


   Any update on this? 




> Seeking a stop method for better use with Apache Commons Daemon
> ---
>
> Key: TIKA-1570
> URL: https://issues.apache.org/jira/browse/TIKA-1570
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.7
>Reporter: Jason Borg
>Priority: Minor
>
> I've got tika-server-1.7.jar from http://tika.apache.org/download.html
> I've downloaded v1.0.15 of the Windows binaries for Apache Commons Daemon 
> from http://commons.apache.org/proper/commons-daemon/binaries.html
> I can get Tika started as a service, but I can't determine what to use for a 
> stop method.
> prunsrv.exe //IS//tika-daemon --DisplayName "Tika Daemon" --Classpath 
> "C:\Tika Service\tika-server-1.7.jar" --StartClass 
> "org.apache.tika.server.TikaServerCli" --StopClass 
> "org.apache.tika.server.TikaServerCli" --StartMethod main --StopMethod main 
> --Description "Tika Daemon Windows Service" --StartMode java --StopMode java
> This starts, and works as I'd hope, but when trying to stop the service it 
> doesn't respond. Obviously org.apache.tika.server.TikaServerCli.main(String[] 
> args) isn't a suitable stop method, but I'm lost for alternatives.
> Using Daemon in exe mode works for start, but gives inconsistent results for 
> stop. Adding a stop method to Tika would be ideal.
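
A rough sketch of the general shape such a stop hook could take, assuming the 
service wrapper can call a static method taking a `String[]` in the same JVM 
as the start method. `StoppableServiceSketch` and its methods are 
hypothetical; this is not the Tika server API and not the change proposed in 
the PR:

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of a start/stop pair; not Tika code.
class StoppableServiceSketch {

    // Released exactly once, when stop(...) is called.
    private static final CountDownLatch STOP_SIGNAL = new CountDownLatch(1);

    // Start entry point: would launch the server, then block until stopped.
    public static void start(String[] args) throws InterruptedException {
        // ... a real implementation would start the server here ...
        STOP_SIGNAL.await();
        // ... and shut the server down here before returning ...
    }

    // Stop entry point: releases the thread blocked in start(...).
    public static void stop(String[] args) {
        STOP_SIGNAL.countDown();
    }

    static boolean isStopped() {
        return STOP_SIGNAL.getCount() == 0;
    }
}
```

The design choice is that the stop method only signals; the thread that 
started the server stays responsible for shutting it down.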





[jira] [Commented] (TIKA-1570) Seeking a stop method for better use with Apache Commons Daemon

2021-05-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343854#comment-17343854
 ] 

ASF GitHub Bot commented on TIKA-1570:
--

erotavlas commented on pull request #324:
URL: https://github.com/apache/tika/pull/324#issuecomment-840559104


   Any update on this?  How come it hasn't been merged yet?




> Seeking a stop method for better use with Apache Commons Daemon
> ---
>
> Key: TIKA-1570
> URL: https://issues.apache.org/jira/browse/TIKA-1570
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.7
>Reporter: Jason Borg
>Priority: Minor
>
> I've got tika-server-1.7.jar from http://tika.apache.org/download.html
> I've downloaded v1.0.15 of the Windows binaries for Apache Commons Daemon 
> from http://commons.apache.org/proper/commons-daemon/binaries.html
> I can get Tika started as a service, but I can't determine what to use for a 
> stop method.
> prunsrv.exe //IS//tika-daemon --DisplayName "Tika Daemon" --Classpath 
> "C:\Tika Service\tika-server-1.7.jar" --StartClass 
> "org.apache.tika.server.TikaServerCli" --StopClass 
> "org.apache.tika.server.TikaServerCli" --StartMethod main --StopMethod main 
> --Description "Tika Daemon Windows Service" --StartMode java --StopMode java
> This starts, and works as I'd hope, but when trying to stop the service it 
> doesn't respond. Obviously org.apache.tika.server.TikaServerCli.main(String[] 
> args) isn't a suitable stop method, but I'm lost for alternatives.
> Using Daemon in exe mode works for start, but gives inconsistent results for 
> stop. Adding a stop method to Tika would be ideal.





[jira] [Commented] (TIKA-3400) Use equals for Object and String Comparison Instead of ==

2021-05-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343629#comment-17343629
 ] 

ASF GitHub Bot commented on TIKA-3400:
--

kamaci opened a new pull request #441:
URL: https://github.com/apache/tika/pull/441


   




> Use equals for Object and String Comparison Instead of ==
> -
>
> Key: TIKA-3400
> URL: https://issues.apache.org/jira/browse/TIKA-3400
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> equals() should be used for object and string comparison, because == compares 
> them by identity.
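
A short Java illustration of the difference. The class name is hypothetical 
and the snippet is not taken from the Tika codebase:

```java
import java.util.Objects;

// Hypothetical example, not Tika code.
class EqualsVsIdentityExample {

    // Null-safe content comparison.
    static boolean sameContent(String x, String y) {
        return Objects.equals(x, y);
    }

    public static void main(String[] args) {
        String a = new String("tika");
        String b = new String("tika");

        System.out.println(a == b);            // false: two distinct objects
        System.out.println(a.equals(b));       // true: same characters
        System.out.println(sameContent(a, b)); // true, and safe if either is null
    }
}
```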





[jira] [Commented] (TIKA-3399) Fix Non-Atomic Operations on Volatile Fields

2021-05-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343432#comment-17343432
 ] 

ASF GitHub Bot commented on TIKA-3399:
--

tballison commented on pull request #440:
URL: https://github.com/apache/tika/pull/440#issuecomment-839990196


   Thank you @kamaci!




> Fix Non-Atomic Operations on Volatile Fields
> 
>
> Key: TIKA-3399
> URL: https://issues.apache.org/jira/browse/TIKA-3399
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> During a non-atomic operation, the value of a volatile field can change 
> between the read and the write, possibly invalidating the operation.
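
A minimal Java sketch of the problem and one common remedy. The class and 
member names are hypothetical; the actual changes in the PR may use a 
different approach, such as synchronization:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example, not Tika code.
class VolatileCounterExample {

    // 'volatile' guarantees visibility, but hits++ is still read, add, write,
    // so two threads can read the same value and lose an update.
    static volatile int hits = 0;

    static void unsafeHit() {
        hits++;
    }

    // One fix: an AtomicInteger performs the read-modify-write atomically.
    static int countWithAtomic(int threads, int perThread) {
        AtomicInteger counter = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.incrementAndGet();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countWithAtomic(4, 10_000));
    }
}
```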





[jira] [Commented] (TIKA-3399) Fix Non-Atomic Operations on Volatile Fields

2021-05-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343429#comment-17343429
 ] 

ASF GitHub Bot commented on TIKA-3399:
--

kamaci opened a new pull request #440:
URL: https://github.com/apache/tika/pull/440


   




> Fix Non-Atomic Operations on Volatile Fields
> 
>
> Key: TIKA-3399
> URL: https://issues.apache.org/jira/browse/TIKA-3399
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> During a non-atomic operation, the value of a volatile field can change 
> between the read and the write, possibly invalidating the operation.





[jira] [Commented] (TIKA-3398) Tidy Up Code for Performance Improvements

2021-05-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343346#comment-17343346
 ] 

ASF GitHub Bot commented on TIKA-3398:
--

tballison merged pull request #439:
URL: https://github.com/apache/tika/pull/439


   




> Tidy Up Code for Performance Improvements
> -
>
> Key: TIKA-3398
> URL: https://issues.apache.org/jira/browse/TIKA-3398
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> The codebase has some performance issues, such as:
>  * Concatenating strings in loops
>  * Redundant calls
>  * Not breaking out of loops early when possible
>  
>  
>  
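
Two of the patterns listed above, sketched in Java. The class and method 
names are hypothetical and are not taken from the Tika codebase:

```java
// Hypothetical example, not Tika code.
class LoopCleanupExample {

    // Instead of 'result += part' inside the loop, which copies the whole
    // string on every iteration, append to a single StringBuilder.
    static String join(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    // Stop iterating as soon as the answer is known.
    static boolean contains(int[] values, int target) {
        boolean found = false;
        for (int value : values) {
            if (value == target) {
                found = true;
                break; // the remaining elements cannot change the result
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(join(new String[] {"ti", "ka"}));
        System.out.println(contains(new int[] {1, 2, 3}, 2));
    }
}
```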





[jira] [Commented] (TIKA-3398) Tidy Up Code for Performance Improvements

2021-05-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1734#comment-1734
 ] 

ASF GitHub Bot commented on TIKA-3398:
--

kamaci opened a new pull request #439:
URL: https://github.com/apache/tika/pull/439


   




> Tidy Up Code for Performance Improvements
> -
>
> Key: TIKA-3398
> URL: https://issues.apache.org/jira/browse/TIKA-3398
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> The codebase has some performance issues, such as:
>  * Concatenating strings in loops
>  * Redundant calls
>  * Not breaking out of loops early when possible
>  
>  
>  





[jira] [Commented] (TIKA-3395) Make Inner Classes Static If Possible to Prevent Memory Leaks

2021-05-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343236#comment-17343236
 ] 

ASF GitHub Bot commented on TIKA-3395:
--

tballison merged pull request #438:
URL: https://github.com/apache/tika/pull/438


   




> Make Inner Classes Static If Possible to Prevent Memory Leaks
> -
>
> Key: TIKA-3395
> URL: https://issues.apache.org/jira/browse/TIKA-3395
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> A static inner class does not keep an implicit reference to its enclosing 
> instance. This prevents a common cause of memory leaks and uses less memory 
> per instance of the class.
> Details can be found here: 
> [https://www.infoworld.com/article/3526554/avoid-memory-leaks-in-inner-classes.html]
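
A small Java sketch of the distinction. `OuterHolder` and its nested classes 
are hypothetical names, not Tika classes:

```java
// Hypothetical example, not Tika code.
class OuterHolder {

    private final byte[] payload = new byte[1024];

    // Inner (non-static) class: it reads 'payload', so each instance is tied
    // to an OuterHolder and keeps that OuterHolder reachable.
    class InnerTask {
        int payloadSize() {
            return payload.length;
        }
    }

    // Static nested class: no implicit link to any OuterHolder instance;
    // whatever it needs is passed in explicitly.
    static class NestedTask {
        int payloadSize(byte[] data) {
            return data.length;
        }
    }

    public static void main(String[] args) {
        OuterHolder outer = new OuterHolder();
        InnerTask inner = outer.new InnerTask(); // needs an outer instance
        NestedTask nested = new NestedTask();    // does not
        System.out.println(inner.payloadSize() + " " + nested.payloadSize(new byte[8]));
    }
}
```

If a long-lived `InnerTask` is handed to a cache or executor, the `OuterHolder` 
and its 1 KB payload stay alive with it; a `NestedTask` has no such tie.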





[jira] [Commented] (TIKA-3395) Make Inner Classes Static If Possible to Prevent Memory Leaks

2021-05-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342903#comment-17342903
 ] 

ASF GitHub Bot commented on TIKA-3395:
--

kamaci opened a new pull request #438:
URL: https://github.com/apache/tika/pull/438


   




> Make Inner Classes Static If Possible to Prevent Memory Leaks
> -
>
> Key: TIKA-3395
> URL: https://issues.apache.org/jira/browse/TIKA-3395
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> A static inner class does not keep an implicit reference to its enclosing 
> instance. This prevents a common cause of memory leaks and uses less memory 
> per instance of the class.
> Details can be found here: 
> [https://www.infoworld.com/article/3526554/avoid-memory-leaks-in-inner-classes.html]





[jira] [Commented] (TIKA-3390) Migrate Language Level to Java 8

2021-05-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342567#comment-17342567
 ] 

ASF GitHub Bot commented on TIKA-3390:
--

tballison merged pull request #437:
URL: https://github.com/apache/tika/pull/437


   




> Migrate Language Level to Java 8
> 
>
> Key: TIKA-3390
> URL: https://issues.apache.org/jira/browse/TIKA-3390
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Minor
> Fix For: 1.27
>
>
> Apache Tika supports JDK 8. However, the source code does not use the new 
> syntax and improvements introduced since Java 5. This issue aims to migrate 
> to the most recent supported JDK level for more readable and performant code.





[jira] [Commented] (TIKA-3390) Migrate Language Level to Java 8

2021-05-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342256#comment-17342256
 ] 

ASF GitHub Bot commented on TIKA-3390:
--

kamaci commented on a change in pull request #437:
URL: https://github.com/apache/tika/pull/437#discussion_r629789195



##
File path: 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/onenote/ExtendedGUID.java
##
@@ -35,7 +35,7 @@ public static ExtendedGUID nil() {
 @Override
 public int compareTo(ExtendedGUID other) {
 if (other.guid.equals(guid)) {
-new Long(n).compareTo(other.n);
+return Long.compare(n, other.n);

Review comment:
   I've changed here too since it seems like a bug.
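
   For context, a self-contained sketch of why the old line looks like a bug: 
the comparison result was computed and then discarded. `GuidKeySketch` is a 
hypothetical stand-in for `ExtendedGUID`, and the fall-through ordering by 
`guid` is an assumption, since the rest of the method is not shown in the diff:

```java
// Hypothetical stand-in for ExtendedGUID; not the Tika class.
class GuidKeySketch implements Comparable<GuidKeySketch> {

    final String guid;
    final long n;

    GuidKeySketch(String guid, long n) {
        this.guid = guid;
        this.n = n;
    }

    @Override
    public int compareTo(GuidKeySketch other) {
        if (other.guid.equals(guid)) {
            // Before: "new Long(n).compareTo(other.n);" had no 'return', so
            // the result was dropped. Long.compare also avoids boxing.
            return Long.compare(n, other.n);
        }
        // Assumption: otherwise order by guid.
        return guid.compareTo(other.guid);
    }

    public static void main(String[] args) {
        System.out.println(new GuidKeySketch("a", 1).compareTo(new GuidKeySketch("a", 2)) < 0);
    }
}
```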






> Migrate Language Level to Java 8
> 
>
> Key: TIKA-3390
> URL: https://issues.apache.org/jira/browse/TIKA-3390
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Minor
> Fix For: 1.27
>
>
> Apache Tika supports JDK 8. However, the source code does not use the new 
> syntax and improvements introduced since Java 5. This issue aims to migrate 
> to the most recent supported JDK level for more readable and performant code.





[jira] [Commented] (TIKA-3390) Migrate Language Level to Java 8

2021-05-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342236#comment-17342236
 ] 

ASF GitHub Bot commented on TIKA-3390:
--

kamaci commented on a change in pull request #437:
URL: https://github.com/apache/tika/pull/437#discussion_r629789195



##
File path: 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/onenote/ExtendedGUID.java
##
@@ -35,7 +35,7 @@ public static ExtendedGUID nil() {
 @Override
 public int compareTo(ExtendedGUID other) {
 if (other.guid.equals(guid)) {
-new Long(n).compareTo(other.n);
+return Long.compare(n, other.n);

Review comment:
   I've changed it like that since it seems like a bug.






> Migrate Language Level to Java 8
> 
>
> Key: TIKA-3390
> URL: https://issues.apache.org/jira/browse/TIKA-3390
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Minor
> Fix For: 1.27
>
>
> Apache Tika supports JDK 8. However, the source code does not use the new 
> syntax and improvements introduced since Java 5. This issue aims to migrate 
> to the most recent supported JDK level for more readable and performant code.





[jira] [Commented] (TIKA-3390) Migrate Language Level to Java 8

2021-05-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342231#comment-17342231
 ] 

ASF GitHub Bot commented on TIKA-3390:
--

kamaci opened a new pull request #437:
URL: https://github.com/apache/tika/pull/437


   




> Migrate Language Level to Java 8
> 
>
> Key: TIKA-3390
> URL: https://issues.apache.org/jira/browse/TIKA-3390
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Minor
> Fix For: 1.27
>
>
> Apache Tika supports JDK 8. However, the source code does not use the new 
> syntax and improvements introduced since Java 5. This issue aims to migrate 
> to the most recent supported JDK level for more readable and performant code.





[jira] [Commented] (TIKA-3389) Close Open Resources

2021-05-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341963#comment-17341963
 ] 

ASF GitHub Bot commented on TIKA-3389:
--

tballison merged pull request #436:
URL: https://github.com/apache/tika/pull/436


   




> Close Open Resources
> 
>
> Key: TIKA-3389
> URL: https://issues.apache.org/jira/browse/TIKA-3389
> Project: Tika
>  Issue Type: Bug
>  Components: languageidentifier, parser, serialization, translation
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> Connections, streams, files, and other classes that implement the 
> {{Closeable}} interface or its super-interface, {{AutoCloseable}}, need to 
> be closed after use.
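
A minimal Java sketch of the usual way to guarantee this, try-with-resources. 
The class and member names are hypothetical and not taken from the Tika 
codebase:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical example, not Tika code.
class CloseResourcesExample {

    // A tiny resource that records whether close() was called.
    static final class Tracked implements AutoCloseable {
        boolean closed;

        @Override
        public void close() {
            closed = true;
        }
    }

    // The reader is closed when the try block exits, even on an exception.
    static String firstLine(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Shows that close() has run by the time the block is left.
    static boolean closedAfterUse() {
        Tracked tracked = new Tracked();
        try (Tracked resource = tracked) {
            // ... use the resource ...
        }
        return tracked.closed;
    }

    public static void main(String[] args) {
        System.out.println(firstLine("first\nsecond"));
        System.out.println(closedAfterUse());
    }
}
```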





[jira] [Commented] (TIKA-3389) Close Open Resources

2021-05-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341494#comment-17341494
 ] 

ASF GitHub Bot commented on TIKA-3389:
--

kamaci commented on pull request #436:
URL: https://github.com/apache/tika/pull/436#issuecomment-835784457


   Some resources should not be closed, and I didn't close them. This PR 
passes all the tests right now. Anyone is welcome to review for such cases.




> Close Open Resources
> 
>
> Key: TIKA-3389
> URL: https://issues.apache.org/jira/browse/TIKA-3389
> Project: Tika
>  Issue Type: Bug
>  Components: languageidentifier, parser, serialization, translation
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> Connections, streams, files, and other classes that implement the 
> {{Closeable}} interface or its super-interface, {{AutoCloseable}}, need to 
> be closed after use.





[jira] [Commented] (TIKA-3389) Close Open Resources

2021-05-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341395#comment-17341395
 ] 

ASF GitHub Bot commented on TIKA-3389:
--

kamaci opened a new pull request #436:
URL: https://github.com/apache/tika/pull/436


   




> Close Open Resources
> 
>
> Key: TIKA-3389
> URL: https://issues.apache.org/jira/browse/TIKA-3389
> Project: Tika
>  Issue Type: Bug
>  Components: core, languageidentifier, parser, serialization, 
> translation
>Affects Versions: 1.26
>Reporter: Furkan Kamaci
>Priority: Major
> Fix For: 1.27
>
>
> Connections, streams, files, and other classes that implement the 
> {{Closeable}} interface or its super-interface, {{AutoCloseable}}, need to 
> be closed after use.





[jira] [Commented] (TIKA-94) Speech-to-text transcription

2021-05-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338663#comment-17338663
 ] 

ASF GitHub Bot commented on TIKA-94:


lewismc commented on pull request #406:
URL: https://github.com/apache/tika/pull/406#issuecomment-831595625


   @tballison I know you and I spoke about refactoring this as a simple parser 
interface... 
   I would like to merge it for the time being, and I can begin work on the 
refactoring in a separate ticket.




> Speech-to-text transcription
> 
>
> Key: TIKA-94
> URL: https://issues.apache.org/jira/browse/TIKA-94
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: new-parser
>
> Like OCR for image files (TIKA-93), we could try using speech recognition to 
> extract text content (where available) from audio (and video!) files.
> The CMU Sphinx engine (http://cmusphinx.sourceforge.net/) looks promising and 
> comes with a friendly license.





[jira] [Commented] (TIKA-94) Speech-to-text transcription

2021-05-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338662#comment-17338662
 ] 

ASF GitHub Bot commented on TIKA-94:


lewismc merged pull request #406:
URL: https://github.com/apache/tika/pull/406


   




> Speech-to-text transcription
> 
>
> Key: TIKA-94
> URL: https://issues.apache.org/jira/browse/TIKA-94
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: new-parser
>
> Like OCR for image files (TIKA-93), we could try using speech recognition to 
> extract text content (where available) from audio (and video!) files.
> The CMU Sphinx engine (http://cmusphinx.sourceforge.net/) looks promising and 
> comes with a friendly license.





[jira] [Commented] (TIKA-3383) Add .asf.yaml files to all Tika repositories

2021-05-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338644#comment-17338644
 ] 

ASF GitHub Bot commented on TIKA-3383:
--

lewismc merged pull request #434:
URL: https://github.com/apache/tika/pull/434


   




> Add .asf.yaml files to all Tika repositories
> 
>
> Key: TIKA-3383
> URL: https://issues.apache.org/jira/browse/TIKA-3383
> Project: Tika
>  Issue Type: Improvement
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.27
>
>
> I propose to add the 
> [.asf.yaml|https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features]
>  file to tika, tika-docker and tika-helm repositories to assist with services 
> integration. 





[jira] [Commented] (TIKA-94) Speech-to-text transcription

2021-05-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338639#comment-17338639
 ] 

ASF GitHub Bot commented on TIKA-94:


lewismc opened a new pull request #406:
URL: https://github.com/apache/tika/pull/406


   This is a WIP on the work we are doing in fulfillment of the HackIllinois 
program.
   We will be adding to this and I will be making comments in here.
   Great work so far, team... 




> Speech-to-text transcription
> 
>
> Key: TIKA-94
> URL: https://issues.apache.org/jira/browse/TIKA-94
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: new-parser
>
> Like OCR for image files (TIKA-93), we could try using speech recognition to 
> extract text content (where available) from audio (and video!) files.
> The CMU Sphinx engine (http://cmusphinx.sourceforge.net/) looks promising and 
> comes with a friendly license.





[jira] [Commented] (TIKA-94) Speech-to-text transcription

2021-05-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338638#comment-17338638
 ] 

ASF GitHub Bot commented on TIKA-94:


lewismc closed pull request #435:
URL: https://github.com/apache/tika/pull/435


   




> Speech-to-text transcription
> 
>
> Key: TIKA-94
> URL: https://issues.apache.org/jira/browse/TIKA-94
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: new-parser
>
> Like OCR for image files (TIKA-93), we could try using speech recognition to 
> extract text content (where available) from audio (and video!) files.
> The CMU Sphinx engine (http://cmusphinx.sourceforge.net/) looks promising and 
> comes with a friendly license.





[jira] [Commented] (TIKA-94) Speech-to-text transcription

2021-05-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338631#comment-17338631
 ] 

ASF GitHub Bot commented on TIKA-94:


lewismc closed pull request #406:
URL: https://github.com/apache/tika/pull/406


   




> Speech-to-text transcription
> 
>
> Key: TIKA-94
> URL: https://issues.apache.org/jira/browse/TIKA-94
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: new-parser
>
> Like OCR for image files (TIKA-93), we could try using speech recognition to 
> extract text content (where available) from audio (and video!) files.
> The CMU Sphinx engine (http://cmusphinx.sourceforge.net/) looks promising and 
> comes with a friendly license.





[jira] [Commented] (TIKA-3383) Add .asf.yaml files to all Tika repositories

2021-05-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338629#comment-17338629
 ] 

ASF GitHub Bot commented on TIKA-3383:
--

lewismc opened a new pull request #434:
URL: https://github.com/apache/tika/pull/434


   This PR addresses https://issues.apache.org/jira/browse/TIKA-3383




> Add .asf.yaml files to all Tika repositories
> 
>
> Key: TIKA-3383
> URL: https://issues.apache.org/jira/browse/TIKA-3383
> Project: Tika
>  Issue Type: Improvement
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.27
>
>
> I propose to add the 
> [.asf.yaml|https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features]
>  file to tika, tika-docker and tika-helm repositories to assist with services 
> integration. 





[jira] [Commented] (TIKA-94) Speech-to-text transcription

2021-05-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338628#comment-17338628
 ] 

ASF GitHub Bot commented on TIKA-94:


lewismc opened a new pull request #406:
URL: https://github.com/apache/tika/pull/406


   This is a WIP on the work we are doing in fulfillment of the HackIllinois 
program.
   We will be adding to this and I will be making comments in here.
   Great work so far, team... 




> Speech-to-text transcription
> 
>
> Key: TIKA-94
> URL: https://issues.apache.org/jira/browse/TIKA-94
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: new-parser
>
> Like OCR for image files (TIKA-93), we could try using speech recognition to 
> extract text content (where available) from audio (and video!) files.
> The CMU Sphinx engine (http://cmusphinx.sourceforge.net/) looks promising and 
> comes with a friendly license.





[jira] [Commented] (TIKA-94) Speech-to-text transcription

2021-05-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338627#comment-17338627
 ] 

ASF GitHub Bot commented on TIKA-94:


lewismc closed pull request #406:
URL: https://github.com/apache/tika/pull/406


   




> Speech-to-text transcription
> 
>
> Key: TIKA-94
> URL: https://issues.apache.org/jira/browse/TIKA-94
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: new-parser
>
> Like OCR for image files (TIKA-93), we could try using speech recognition to 
> extract text content (where available) from audio (and video!) files.
> The CMU Sphinx engine (http://cmusphinx.sourceforge.net/) looks promising and 
> comes with a friendly license.





[jira] [Commented] (TIKA-3329) RTG Translator with many-to-eng translation

2021-05-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337883#comment-17337883
 ] 

ASF GitHub Bot commented on TIKA-3329:
--

chrismattmann merged pull request #419:
URL: https://github.com/apache/tika/pull/419


   




> RTG Translator with many-to-eng translation
> ---
>
> Key: TIKA-3329
> URL: https://issues.apache.org/jira/browse/TIKA-3329
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Reporter: Thamme Gowda
>Assignee: Chris Mattmann
>Priority: Major
>
> The existing translation services in tika-translate are either 
> commercial/paid engines (e.g. Google, Microsoft, etc.) or not state of the 
> art (such as Joshua, Moses, etc.). 
> Reader Translator Generator (RTG) is a neural machine translation toolkit 
> [https://isi-nlp.github.io/rtg/]
>  and has an implementation of the Transformer NMT model (current state of the 
> art). 
> It also has a massively multilingual pretrained NMT model (many-to-English 
> translation direction) 
> [https://hub.docker.com/repository/docker/tgowda/rtg-model] 
> in which about 500 source languages are represented, with at least ~300 source 
> languages having good enough quality (for comparison, Google Translate has 
> ~106 languages, and Microsoft has about 80 languages). 
> This issue is for integrating RTG Translator into tika-translate.
>  





[jira] [Commented] (TIKA-3329) RTG Translator with many-to-eng translation

2021-05-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337882#comment-17337882
 ] 

ASF GitHub Bot commented on TIKA-3329:
--

chrismattmann commented on pull request #419:
URL: https://github.com/apache/tika/pull/419#issuecomment-830692404


   OK, I had to make some changes so that it would pass the forbidden-APIs 
check (RTG and RTGTest), and also to the tika-server classic modules that were 
failing checkstyle (they probably have been for a while, but my Maven version 
seems to care and treats checkstyle as a failure, which I hadn't seen before). 
I also had to update the test because the translation returned was slightly 
different from the one in the original PR (it had an extra comma and a period). 
Anyway, it's fixed and works!
   
   ```
   [INFO] tika-server-classic  SUCCESS [ 16.967 s]
   [INFO] tika-server-client . SUCCESS [  1.039 s]
   [INFO] Apache Tika eval ... SUCCESS [  0.063 s]
   [INFO] tika-eval-core . SUCCESS [ 13.070 s]
   [INFO] tika-eval-app .. SUCCESS [ 21.834 s]
   [INFO] Apache Tika fuzzing  SUCCESS [  0.938 s]
   [INFO] Apache Tika examples ... SUCCESS [  9.334 s]
   [INFO] Apache Tika Java-7 Components .. SUCCESS [  1.628 s]
   [INFO] Apache Tika  SUCCESS [  0.024 s]
   [INFO] 

   [INFO] BUILD SUCCESS
   [INFO] 

   [INFO] Total time:  10:42 min
   [INFO] Finished at: 2021-05-01T13:41:08-07:00
   [INFO] 

   [2]-  Done    emacs tika-translate/src/test/java/org/apache/tika/language/translate/RTGTranslatorTest.java
   [3]+  Done    emacs RTGTranslator.java  (wd: ~/git/tika/tika-translate/src/main/java/org/apache/tika/language/translate)
   (wd now: ~/git/tika)
   (base) mattmann@proscuitto:~/git/tika$ 
   ```




> RTG Translator with many-to-eng translation
> ---
>
> Key: TIKA-3329
> URL: https://issues.apache.org/jira/browse/TIKA-3329
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Reporter: Thamme Gowda
>Assignee: Chris Mattmann
>Priority: Major
>
> The existing translation services in tika-translate are either 
> commercial/paid engines (e.g. Google, Microsoft, etc.) or not state of the 
> art (such as Joshua, Moses, etc.). 
> Reader Translator Generator (RTG) is a neural machine translation toolkit 
> [https://isi-nlp.github.io/rtg/]
>  and has an implementation of the Transformer NMT model (current state of the 
> art). 
> It also has a massively multilingual pretrained NMT model (many-to-English 
> translation direction) 
> [https://hub.docker.com/repository/docker/tgowda/rtg-model] 
> in which about 500 source languages are represented, with at least ~300 source 
> languages having good enough quality (for comparison, Google Translate has 
> ~106 languages, and Microsoft has about 80 languages). 
> This issue is for integrating RTG Translator into tika-translate.
>  





[jira] [Commented] (TIKA-3329) RTG Translator with many-to-eng translation

2021-05-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337876#comment-17337876
 ] 

ASF GitHub Bot commented on TIKA-3329:
--

chrismattmann commented on pull request #419:
URL: https://github.com/apache/tika/pull/419#issuecomment-830679775


   Going to test this today in Tika. If everything passes, I'll get it 
committed and then work to integrate it directly into tika-python as the 
default translation package. Thanks @thammegowda!




> RTG Translator with many-to-eng translation
> ---
>
> Key: TIKA-3329
> URL: https://issues.apache.org/jira/browse/TIKA-3329
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Reporter: Thamme Gowda
>Assignee: Chris Mattmann
>Priority: Major
>
> The existing translation services in tika-translate are either 
> commercial/paid engines (e.g. Google, Microsoft, etc.) or not state of the 
> art (such as Joshua, Moses, etc.). 
> Reader Translator Generator (RTG) is a neural machine translation toolkit 
> [https://isi-nlp.github.io/rtg/]
>  and has an implementation of the Transformer NMT model (current state of the 
> art). 
> It also has a massively multilingual pretrained NMT model (many-to-English 
> translation direction) 
> [https://hub.docker.com/repository/docker/tgowda/rtg-model] 
> in which about 500 source languages are represented, with at least ~300 source 
> languages having good enough quality (for comparison, Google Translate has 
> ~106 languages, and Microsoft has about 80 languages). 
> This issue is for integrating RTG Translator into tika-translate.
>  





[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

2021-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336999#comment-17336999
 ] 

ASF GitHub Bot commented on TIKA-3374:
--

Ryan421 commented on a change in pull request #433:
URL: https://github.com/apache/tika/pull/433#discussion_r623514986



##
File path: tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
##
@@ -392,6 +392,15 @@ private void parseEntry(ArchiveInputStream archive, ArchiveEntry entry,
 XHTMLContentHandler xhtml)
 throws SAXException, IOException, TikaException {
 String name = entry.getName();
+
+//Try to detect charset of archive entry in case of non-unicode filename is used
+if (entry instanceof ZipArchiveEntry) {
+detector.setText(((ZipArchiveEntry) entry).getRawName());

Review comment:
   No need to be sorry ^^. It was really my fault: I moved the code block 
here from our project and did not check it properly. I really appreciate your 
review and suggestions.






> Non-Unicode archive entry name is garbled
> -
>
> Key: TIKA-3374
> URL: https://issues.apache.org/jira/browse/TIKA-3374
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: Ryan Liu
>Priority: Major
> Attachments: gbk.zip
>
>
> PackageParser retrieves archive entry names through the commons-compress 
> archiver's ArchiveEntry#getName function and does not have automatic charset 
> detection for entry names.
> Although one could set the encoding by passing ArchiveStreamFactory(charset) 
> into the parser context, 
> this is not practical since all kinds of charsets could be used in an archive 
> file.
> Instead of directly calling entry.getName() in the PackageParser#parseEntry() 
> function, 
> it is recommended to use entry.getRawName() and apply charset detection to 
> reduce the possibility of getting a garbled string.
>  
> The attachment is an example of a non-Unicode archive entry name being used in 
> a zip file.
> The filename in the zip file should be *集团邮件审计系统2021年自动巡检需求文档_V4.0.doc*
> but is garbled in TIKA 1.26 since the PackageParser treats it as Unicode.
>  
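
The idea in the description above (decode the raw entry-name bytes rather than trusting the default decoding) can be sketched in plain Java. This is a simplified stand-in, not the detection used in the pull request: instead of a statistical charset detector it tries strict UTF-8 first and otherwise falls back to a charset chosen by the caller. `RawNameDecodeSketch`, `tryDecode` and `decodeName` are hypothetical names; in Tika the raw bytes would come from `ZipArchiveEntry#getRawName`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class RawNameDecodeSketch {

    /** Strictly decodes the bytes; returns null if they are not valid in the charset. */
    public static String tryDecode(byte[] raw, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /** Prefers UTF-8 when the bytes are valid UTF-8; otherwise uses the fallback charset. */
    public static String decodeName(byte[] raw, Charset fallback) {
        String utf8 = tryDecode(raw, StandardCharsets.UTF_8);
        return utf8 != null ? utf8 : new String(raw, fallback);
    }
}
```

A fixed fallback only helps when the likely legacy charset is known in advance; the issue argues for automatic detection precisely because that is often not the case.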





[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

2021-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335483#comment-17335483
 ] 

ASF GitHub Bot commented on TIKA-3374:
--

tballison merged pull request #433:
URL: https://github.com/apache/tika/pull/433


   




> Non-Unicode archive entry name is garbled
> -
>
> Key: TIKA-3374
> URL: https://issues.apache.org/jira/browse/TIKA-3374
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: Ryan Liu
>Priority: Major
> Attachments: gbk.zip
>
>
> PackageParser retrieves archive entry names through the commons-compress 
> archiver's ArchiveEntry#getName function and does not have automatic charset 
> detection for entry names.
> Although one could set the encoding by passing ArchiveStreamFactory(charset) 
> into the parser context, 
> this is not practical since all kinds of charsets could be used in an archive 
> file.
> Instead of directly calling entry.getName() in the PackageParser#parseEntry() 
> function, 
> it is recommended to use entry.getRawName() and apply charset detection to 
> reduce the possibility of getting a garbled string.
>  
> The attachment is an example of a non-Unicode archive entry name being used in 
> a zip file.
> The filename in the zip file should be *集团邮件审计系统2021年自动巡检需求文档_V4.0.doc*
> but is garbled in TIKA 1.26 since the PackageParser treats it as Unicode.
>  





[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

2021-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335476#comment-17335476
 ] 

ASF GitHub Bot commented on TIKA-3374:
--

tballison commented on a change in pull request #433:
URL: https://github.com/apache/tika/pull/433#discussion_r623052485



##
File path: tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
##
@@ -392,6 +392,15 @@ private void parseEntry(ArchiveInputStream archive, ArchiveEntry entry,
 XHTMLContentHandler xhtml)
 throws SAXException, IOException, TikaException {
 String name = entry.getName();
+
+//Try to detect charset of archive entry in case of non-unicode filename is used
+if (entry instanceof ZipArchiveEntry) {
+detector.setText(((ZipArchiveEntry) entry).getRawName());

Review comment:
   Sorry, please forgive me.  I meant embarrassing for me because I figured 
I was missing something!!!






> Non-Unicode archive entry name is garbled
> -
>
> Key: TIKA-3374
> URL: https://issues.apache.org/jira/browse/TIKA-3374
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: Ryan Liu
>Priority: Major
> Attachments: gbk.zip
>
>
> PackageParser retrieves archive entry names through the commons-compress 
> archiver's ArchiveEntry#getName function and does not have automatic charset 
> detection for entry names.
> Although one could set the encoding by passing ArchiveStreamFactory(charset) 
> into the parser context, 
> this is not practical since all kinds of charsets could be used in an archive 
> file.
> Instead of directly calling entry.getName() in the PackageParser#parseEntry() 
> function, 
> it is recommended to use entry.getRawName() and apply charset detection to 
> reduce the possibility of getting a garbled string.
>  
> The attachment is an example of a non-Unicode archive entry name being used in 
> a zip file.
> The filename in the zip file should be *集团邮件审计系统2021年自动巡检需求文档_V4.0.doc*
> but is garbled in TIKA 1.26 since the PackageParser treats it as Unicode.
>  





[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

2021-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335171#comment-17335171
 ] 

ASF GitHub Bot commented on TIKA-3374:
--

Ryan421 commented on pull request #433:
URL: https://github.com/apache/tika/pull/433#issuecomment-828968582


   Added a unit test with a dummy EncodingDetector to verify that the charset 
detection flow is executed.




> Non-Unicode archive entry name is garbled
> -
>
> Key: TIKA-3374
> URL: https://issues.apache.org/jira/browse/TIKA-3374
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: Ryan Liu
>Priority: Major
> Attachments: gbk.zip
>
>
> PackageParser retrieves archive entry names through the commons-compress 
> archiver's ArchiveEntry#getName function and does not have automatic charset 
> detection for entry names.
> Although one could set the encoding by passing ArchiveStreamFactory(charset) 
> into the parser context, 
> this is not practical since all kinds of charsets could be used in an archive 
> file.
> Instead of directly calling entry.getName() in the PackageParser#parseEntry() 
> function, 
> it is recommended to use entry.getRawName() and apply charset detection to 
> reduce the possibility of getting a garbled string.
>  
> The attachment is an example of a non-Unicode archive entry name being used in 
> a zip file.
> The filename in the zip file should be *集团邮件审计系统2021年自动巡检需求文档_V4.0.doc*
> but is garbled in TIKA 1.26 since the PackageParser treats it as Unicode.
>  





[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

2021-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335103#comment-17335103
 ] 

ASF GitHub Bot commented on TIKA-3374:
--

Ryan421 commented on a change in pull request #433:
URL: https://github.com/apache/tika/pull/433#discussion_r622696337



##
File path: tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
##
@@ -392,6 +392,15 @@ private void parseEntry(ArchiveInputStream archive, ArchiveEntry entry,
 XHTMLContentHandler xhtml)
 throws SAXException, IOException, TikaException {
 String name = entry.getName();
+
+//Try to detect charset of archive entry in case of non-unicode filename is used
+if (entry instanceof ZipArchiveEntry) {
+detector.setText(((ZipArchiveEntry) entry).getRawName());

Review comment:
   Yes, it is really embarrassing. I will change PackageParser to extend 
AbstractEncodingDetectorParser and use getEncodingDetector to do the job. 
Thank you so much.






> Non-Unicode archive entry name is garbled
> -
>
> Key: TIKA-3374
> URL: https://issues.apache.org/jira/browse/TIKA-3374
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: Ryan Liu
>Priority: Major
> Attachments: gbk.zip
>
>
> PackageParser retrieves archive entry names through the commons-compress 
> archiver's ArchiveEntry#getName function and does not have automatic charset 
> detection for entry names.
> Although one could set the encoding by passing ArchiveStreamFactory(charset) 
> into the parser context, 
> this is not practical since all kinds of charsets could be used in an archive 
> file.
> Instead of directly calling entry.getName() in the PackageParser#parseEntry() 
> function, 
> it is recommended to use entry.getRawName() and apply charset detection to 
> reduce the possibility of getting a garbled string.
>  
> The attachment is an example of a non-Unicode archive entry name being used in 
> a zip file.
> The filename in the zip file should be *集团邮件审计系统2021年自动巡检需求文档_V4.0.doc*
> but is garbled in TIKA 1.26 since the PackageParser treats it as Unicode.
>  





[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

2021-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335005#comment-17335005
 ] 

ASF GitHub Bot commented on TIKA-3374:
--

tballison commented on a change in pull request #433:
URL: https://github.com/apache/tika/pull/433#discussion_r622541769



##
File path: 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
##
@@ -392,6 +392,15 @@ private void parseEntry(ArchiveInputStream archive, ArchiveEntry entry,
 XHTMLContentHandler xhtml)
 throws SAXException, IOException, TikaException {
 String name = entry.getName();
+
+//Try to detect charset of archive entry in case of non-unicode filename is used
+if (entry instanceof ZipArchiveEntry) {
+detector.setText(((ZipArchiveEntry) entry).getRawName());

Review comment:
   This is embarrassing, but where is the detector initialized...how are 
you getting it?
   
   Maybe have PackageParser extend AbstractEncodingDetectorParser?






[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

2021-04-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334428#comment-17334428
 ] 

ASF GitHub Bot commented on TIKA-3374:
--

Ryan421 opened a new pull request #433:
URL: https://github.com/apache/tika/pull/433


   Fixes #TIKA-3374




[jira] [Commented] (TIKA-3368) Add Bill of Materials (BOM) artifact (Tika 1.x)

2021-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331107#comment-17331107
 ] 

ASF GitHub Bot commented on TIKA-3368:
--

grossws opened a new pull request #432:
URL: https://github.com/apache/tika/pull/432


   Fixes #TIKA-3368




> Add Bill of Materials (BOM) artifact (Tika 1.x)
> ---
>
> Key: TIKA-3368
> URL: https://issues.apache.org/jira/browse/TIKA-3368
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Major
> Fix For: 1.27
>
>






[jira] [Commented] (TIKA-3367) Add Bill of Materials (BOM) artifact

2021-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331102#comment-17331102
 ] 

ASF GitHub Bot commented on TIKA-3367:
--

grossws opened a new pull request #431:
URL: https://github.com/apache/tika/pull/431


   




> Add Bill of Materials (BOM) artifact
> 
>
> Key: TIKA-3367
> URL: https://issues.apache.org/jira/browse/TIKA-3367
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Major
> Fix For: 2.0.0
>
>






[jira] [Commented] (TIKA-3353) Tika Server Production ready monitoring (Prometheus and JMX)

2021-04-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329165#comment-17329165
 ] 

ASF GitHub Bot commented on TIKA-3353:
--

tballison commented on pull request #429:
URL: https://github.com/apache/tika/pull/429#issuecomment-824905053


   @Subhajitdas298 , let's hold off on this for main/2.x for now.  I'd like to 
upgrade to log4j2, and I'd like to make these extra dependencies loaded only if 
the user selects "monitor"...prob next week?
   
   Thank you for your work on this!




> Tika Server Production ready monitoring (Prometheus and JMX)
> 
>
> Key: TIKA-3353
> URL: https://issues.apache.org/jira/browse/TIKA-3353
> Project: Tika
>  Issue Type: New Feature
>  Components: server
>Affects Versions: 2.0.0, 1.26
>Reporter: Subhajit Das
>Priority: Major
>  Labels: features
>
> Tika Server only has Server status (/status and its MBean).
> The MBean, in conjunction with the JMX exporter, can be used for Prometheus 
> exporting, but it is oblivious to details such as the time taken to process 
> a request.
>  
> A new standard metrics collection system is to be implemented with the help of 
> the [Micrometer|https://micrometer.io/] metrics system and its [CXF 
> pluggability|https://cxf.apache.org/docs/micrometer.html].
> The metrics data can be exported in the formats of most industry-standard 
> monitoring tools (such as Prometheus/Grafana, Ganglia, etc.).
>  
> Prometheus and JMX metrics can be implemented on top of the core metrics 
> collection.
> The choice of metrics reporting and the nature of the metrics can be 
> configured by the user.
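
To make the gap concrete, here is a rough sketch of the kind of per-request detail (count and mean processing time) that a status-only MBean does not expose. It uses only JDK classes; `RequestTimer` is a hypothetical stand-in for a metrics-library timer and does not use or reproduce the Micrometer API.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical stand-in for a metrics timer; not Micrometer and not Tika code.
class RequestTimer {
    private final LongAdder count = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();

    // Runs the request and records how long it took, even if it throws.
    <T> T record(Supplier<T> request) {
        long start = System.nanoTime();
        try {
            return request.get();
        } finally {
            totalNanos.add(System.nanoTime() - start);
            count.increment();
        }
    }

    long count() {
        return count.sum();
    }

    // Mean processing time in milliseconds, or 0.0 before any request.
    double meanMillis() {
        long n = count.sum();
        return n == 0 ? 0.0 : totalNanos.sum() / (double) n / 1_000_000.0;
    }

    public static void main(String[] args) {
        RequestTimer timer = new RequestTimer();
        String body = timer.record(() -> "parsed text");
        System.out.println(body + " requests=" + timer.count()
                + " meanMillis=" + timer.meanMillis());
    }
}
```

A metrics library would additionally tag such timings (for example by endpoint) and publish them to a registry that the monitoring backend scrapes.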





[jira] [Commented] (TIKA-3353) Tika Server Production ready monitoring (Prometheus and JMX)

2021-04-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329163#comment-17329163
 ] 

ASF GitHub Bot commented on TIKA-3353:
--

tballison merged pull request #429:
URL: https://github.com/apache/tika/pull/429


   




[jira] [Commented] (TIKA-3357) Remove ambiguity in request handlers

2021-04-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327090#comment-17327090
 ] 

ASF GitHub Bot commented on TIKA-3357:
--

Subhajitdas298 opened a new pull request #430:
URL: https://github.com/apache/tika/pull/430


   Added a resource comparator based on the produced media type.
   In an ambiguous call, the request handler will be chosen based on the type of 
data it returns.
   
   *Current priority is set as:*
   MediaType.TEXT_PLAIN_TYPE,
   MediaType.APPLICATION_JSON_TYPE,
   MediaType.TEXT_HTML_TYPE,
   MediaType.TEXT_XML_TYPE
   *The lower in the list (higher index value), the higher the priority.*
   If nothing in this list matches, it will be treated as media type all.
   
   *Note: Please change the priority list if some other order is more suitable.*
   
   This is kept in sync with branch_1x
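
The selection rule described in this pull request can be sketched as follows. This is a simplified illustration, not the code from the PR: `ProducePriority`, `rank` and `choose` are hypothetical names, media types are plain strings rather than JAX-RS `MediaType` constants, and ranking unlisted types below every listed one is an assumption about how "media type all" is ordered.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; these names do not come from Tika or CXF.
class ProducePriority {

    // Later entries (higher index) have higher priority, as listed in the PR.
    static final List<String> PRIORITY = Arrays.asList(
            "text/plain", "application/json", "text/html", "text/xml");

    // Unlisted media types get -1, so they lose to any listed type (assumption).
    static int rank(String mediaType) {
        return PRIORITY.indexOf(mediaType);
    }

    // Picks the candidate whose produced media type has the highest rank.
    static String choose(List<String> candidates) {
        return candidates.stream()
                .max(Comparator.comparingInt(ProducePriority::rank))
                .orElseThrow(() -> new IllegalArgumentException("no candidates"));
    }

    public static void main(String[] args) {
        System.out.println(choose(Arrays.asList("text/plain", "text/xml")));
    }
}
```

Keeping the order in a single list means that changing the priority, as the note in the PR invites, only requires reordering that list.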




> Remove ambiguity in request handlers
> 
>
> Key: TIKA-3357
> URL: https://issues.apache.org/jira/browse/TIKA-3357
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.26
>Reporter: Subhajit Das
>Priority: Major
>
> In Tika server, if a request with Accept */* or multiple Accept values 
> matches multiple resource handlers, it throws a warning and leads 
> to somewhat uncertain handling.
>  
> This should be programmatically controlled, to maintain consistency or 
> change standards in future.





[jira] [Commented] (TIKA-3353) Tika Server Production ready monitoring (Prometheus and JMX)

2021-04-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327069#comment-17327069
 ] 

ASF GitHub Bot commented on TIKA-3353:
--

Subhajitdas298 commented on pull request #429:
URL: https://github.com/apache/tika/pull/429#issuecomment-824499807


   @tballison Please review the PR and let me know if anything has to be 
modified.




[jira] [Commented] (TIKA-3353) Tika Server Production ready monitoring (Prometheus and JMX)

2021-04-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325174#comment-17325174
 ] 

ASF GitHub Bot commented on TIKA-3353:
--

Subhajitdas298 commented on pull request #429:
URL: https://github.com/apache/tika/pull/429#issuecomment-822609910


   **Please note**: There are new dependencies added for this purpose.




[jira] [Commented] (TIKA-3353) Tika Server Production ready monitoring (Prometheus and JMX)

2021-04-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325170#comment-17325170
 ] 

ASF GitHub Bot commented on TIKA-3353:
--

Subhajitdas298 opened a new pull request #429:
URL: https://github.com/apache/tika/pull/429


   A new standard metrics collection system is to be implemented with the help 
of the Micrometer metrics system and its CXF pluggability.
   
   The system is extensible and can be extended to support other monitoring 
systems later on.
   
   This is for Branch 1x




[jira] [Commented] (TIKA-3357) Remove ambiguity in request handlers

2021-04-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325038#comment-17325038
 ] 

ASF GitHub Bot commented on TIKA-3357:
--

tballison merged pull request #427:
URL: https://github.com/apache/tika/pull/427


   




[jira] [Commented] (TIKA-3196) PackageParser should attempt to parse entries from zip files with STORED entries with data descriptor

2021-04-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325037#comment-17325037
 ] 

ASF GitHub Bot commented on TIKA-3196:
--

tballison commented on pull request #364:
URL: https://github.com/apache/tika/pull/364#issuecomment-822472042


   Ugh...sorry about that!




> PackageParser should attempt to parse entries from zip files with STORED 
> entries with data descriptor
> -
>
> Key: TIKA-3196
> URL: https://issues.apache.org/jira/browse/TIKA-3196
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Trevor Bentley
>Priority: Major
> Fix For: 2.0.0, 1.27
>
> Attachments: OOO-107047-0.oxt-145.zip
>
>
> We are currently using tika for text extraction. Currently some sites are 
> returning zips that have entries with stored data descriptors which fail to 
> extract due to the ZipArchiveInputStream (in commons-compress) defaulting to 
> false for 'allowStoredEntriesWithDataDescriptor'.
> Since ZipArchiveInputStream has support for reading zips with data 
> descriptors we should attempt to read the zip with that feature enabled when 
> we get a data descriptor UnsupportedZipFeatureException.
> Pull Request: 
> [https://github.com/apache/tika/pull/356|https://github.com/apache/tika/pull/355]





[jira] [Commented] (TIKA-3196) PackageParser should attempt to parse entries from zip files with STORED entries with data descriptor

2021-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324384#comment-17324384
 ] 

ASF GitHub Bot commented on TIKA-3196:
--

lfcnassif edited a comment on pull request #364:
URL: https://github.com/apache/tika/pull/364#issuecomment-821914480


   This was merged in 1.26 at least without fixing the thread safety issue 
noticed by Tim.




> PackageParser should attempt to parse entries from zip files with STORED 
> entries with data descriptor
> -
>
> Key: TIKA-3196
> URL: https://issues.apache.org/jira/browse/TIKA-3196
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Trevor Bentley
>Priority: Major
> Fix For: 2.0.0, 1.25
>
> Attachments: OOO-107047-0.oxt-145.zip
>
>
> We are currently using tika for text extraction. Currently some sites are 
> returning zips that have entries with stored data descriptors which fail to 
> extract due to the ZipArchiveInputStream (in commons-compress) defaulting to 
> false for 'allowStoredEntriesWithDataDescriptor'.
> Since ZipArchiveInputStream has support for reading zips with data 
> descriptors we should attempt to read the zip with that feature enabled when 
> we get a data descriptor UnsupportedZipFeatureException.
> Pull Request: 
> [https://github.com/apache/tika/pull/356|https://github.com/apache/tika/pull/355]





[jira] [Commented] (TIKA-3196) PackageParser should attempt to parse entries from zip files with STORED entries with data descriptor

2021-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324383#comment-17324383
 ] 

ASF GitHub Bot commented on TIKA-3196:
--

lfcnassif edited a comment on pull request #364:
URL: https://github.com/apache/tika/pull/364#issuecomment-821914480


   This was merged in 1.26 at least without fixing the thread safety issue 
noted by Tim.




[jira] [Commented] (TIKA-3196) PackageParser should attempt to parse entries from zip files with STORED entries with data descriptor

2021-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324382#comment-17324382
 ] 

ASF GitHub Bot commented on TIKA-3196:
--

lfcnassif commented on pull request #364:
URL: https://github.com/apache/tika/pull/364#issuecomment-821914480


   This has broken the thread safety of PackageParser because of the newly 
created instance field; it should be passed through method params.
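
A minimal sketch of the suggestion, assuming nothing about the real PackageParser internals: `RetryingParser`, `DataDescriptorException` and `parseOnce` are invented names, and the byte check is only a placeholder for "this archive needs data-descriptor support". The point is that the retry flag travels as a method parameter instead of being stored in an instance field shared by all threads.

```java
// Hypothetical sketch, not Tika's PackageParser.
class RetryingParser {

    // Placeholder for the "unsupported zip feature" failure on the first attempt.
    static class DataDescriptorException extends Exception {
    }

    // Thread-safe: the flag is a local value passed down the call chain, so
    // concurrent parse() calls on one shared instance cannot affect each other.
    String parse(byte[] archive) {
        try {
            return parseOnce(archive, false);
        } catch (DataDescriptorException first) {
            try {
                return parseOnce(archive, true);
            } catch (DataDescriptorException second) {
                throw new IllegalStateException("retry failed", second);
            }
        }
    }

    private String parseOnce(byte[] archive, boolean allowDataDescriptor)
            throws DataDescriptorException {
        // Placeholder rule: a leading 1 marks an archive that needs the feature.
        boolean needsFeature = archive.length > 0 && archive[0] == 1;
        if (needsFeature && !allowDataDescriptor) {
            throw new DataDescriptorException();
        }
        return allowDataDescriptor ? "parsed with data-descriptor support" : "parsed";
    }

    public static void main(String[] args) {
        RetryingParser parser = new RetryingParser();
        System.out.println(parser.parse(new byte[] {0}));
        System.out.println(parser.parse(new byte[] {1}));
    }
}
```

With an instance field instead, one thread could flip the flag while another thread is in the middle of its first attempt, which is the hazard the comment describes.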




[jira] [Commented] (TIKA-3361) Improve intelligence of OCRStrategy=AUTO

2021-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324303#comment-17324303
 ] 

ASF GitHub Bot commented on TIKA-3361:
--

peterkronenberg opened a new pull request #428:
URL: https://github.com/apache/tika/pull/428


   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch.
   
   We will be able to faster integrate your pull request if these conditions 
are met. If you have any questions how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




>  Improve intelligence of OCRStrategy=AUTO
> -
>
> Key: TIKA-3361
> URL: https://issues.apache.org/jira/browse/TIKA-3361
> Project: Tika
>  Issue Type: Improvement
>Reporter: Peter Kronenberg
>Priority: Major
>
> Didn’t get a whole lot of feedback on the mailing list, so here’s my attempt 
> at improving OCRStrategy=Auto
> Currently, this strategy performs the following test
> {code:java}
> if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
> doOCROnCurrentPage(AUTO);
> }
> {code}
> I added a way to change the numbers involved: the threshold for the total 
> characters per page (below which we OCR the page) and the threshold for 
> unmapped characters (above which we OCR the page).
> My main concern is with the unmapped characters. OCR adds a lot of overhead, 
> which might not be necessary for just a few unmapped characters.
> I added a new config, *OCRStrategyAuto*, which is only used if 
> OCRStrategy=AUTO. Its format is
> {code:java}
> ocrStrategyAuto = best|fast|m[%], n
> {code}
> ‘best’ and ‘fast’ are shortcuts. More later
> m, n – m is the threshold for the number of unmapped characters per page. It 
> can also be specified as a percentage. So, m=20 means if your page has more 
> than 20 unmapped characters, it will OCR. m=20% means if the unmapped 
> characters are more than 20% of the total characters, then it will OCR.
> n is the threshold for the total number of characters on the page. n does not 
> need to be specified and defaults to 10
> {code:java}
> 20
> {code}
> is equivalent to
> {code:java}
> 20, 10
> {code}
> *best* is shorthand for *20,10*
> {code:java}
> best
> {code}
> is equivalent to
> {code:java}
> 20, 10
> {code}
> *best* is the default and is equivalent to the current behavior
>  *fast* is a shortcut for *10%, 10*, which will avoid OCR unless the number 
> of unmapped characters is greater than 10%
> {code:java}
> fast
> {code}
> is equivalent to
> {code:java}
> 10%, 10
> {code}
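
The format described above could be parsed along these lines. This is a sketch under stated assumptions, not the code from the pull request: `OcrAutoThresholds` and its members are hypothetical names, the values for `best` and `fast` are copied from the description, and the decision rule follows the existing test (OCR when the page has fewer total characters than n, or more unmapped characters than m).

```java
import java.util.Locale;

// Hypothetical sketch of parsing "best|fast|m[%], n"; not a Tika class.
class OcrAutoThresholds {
    final int unmappedThreshold;       // m
    final boolean unmappedIsPercent;   // true when m was written as "m%"
    final int totalCharsThreshold;     // n, defaults to 10

    OcrAutoThresholds(int unmappedThreshold, boolean unmappedIsPercent,
                      int totalCharsThreshold) {
        this.unmappedThreshold = unmappedThreshold;
        this.unmappedIsPercent = unmappedIsPercent;
        this.totalCharsThreshold = totalCharsThreshold;
    }

    static OcrAutoThresholds parse(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        if (v.equals("best")) {
            v = "20, 10";       // shorthand as given in the description
        } else if (v.equals("fast")) {
            v = "10%, 10";      // shorthand as given in the description
        }
        String[] parts = v.split(",");
        String m = parts[0].trim();
        boolean percent = m.endsWith("%");
        if (percent) {
            m = m.substring(0, m.length() - 1).trim();
        }
        int n = parts.length > 1 ? Integer.parseInt(parts[1].trim()) : 10;
        return new OcrAutoThresholds(Integer.parseInt(m), percent, n);
    }

    // OCR the page if it has too few characters or too many unmapped ones.
    boolean shouldOcr(int totalChars, int unmappedChars) {
        if (totalChars < totalCharsThreshold) {
            return true;
        }
        if (unmappedIsPercent) {
            return (long) unmappedChars * 100 > (long) unmappedThreshold * totalChars;
        }
        return unmappedChars > unmappedThreshold;
    }

    public static void main(String[] args) {
        OcrAutoThresholds fast = parse("fast");
        System.out.println(fast.shouldOcr(1000, 50));
        System.out.println(fast.shouldOcr(1000, 150));
    }
}
```

Comparing `unmappedChars * 100` against `m * totalChars` avoids a floating-point division when m is a percentage.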





[jira] [Commented] (TIKA-3329) RTG Translator with many-to-eng translation

2021-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324157#comment-17324157
 ] 

ASF GitHub Bot commented on TIKA-3329:
--

lewismc commented on pull request #419:
URL: https://github.com/apache/tika/pull/419#issuecomment-821753228


   +1 @thammegowda excellent job




> RTG Translator with many-to-eng translation
> ---
>
> Key: TIKA-3329
> URL: https://issues.apache.org/jira/browse/TIKA-3329
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
>Priority: Major
>
> The existing translation services in tika-translate are either 
> commercial/paid engines (e.g. Google, Microsoft, etc.) or not state of the 
> art (such as Joshua, Moses, etc.). 
> Reader Translator Generator (RTG) is a neural machine translation toolkit 
> [https://isi-nlp.github.io/rtg/]
> and has an implementation of the Transformer NMT model (the current state of 
> the art). 
> It also has a massively multilingual pretrained NMT model (many-to-English 
> translation direction) 
> [https://hub.docker.com/repository/docker/tgowda/rtg-model] 
> in which about 500 source languages are represented, of which at least ~300 
> have good enough quality (for comparison, Google Translate has 
> ~106 languages and Microsoft has about 80). 
> This issue is for integrating RTG Translator into tika-translate
>  





[jira] [Commented] (TIKA-3357) Remove ambiguity in request handlers

2021-04-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322398#comment-17322398
 ] 

ASF GitHub Bot commented on TIKA-3357:
--

Subhajitdas298 opened a new pull request #427:
URL: https://github.com/apache/tika/pull/427


   Added a resource comparator based on the produced media type.
   In an ambiguous call, the request handler will be chosen based on the type of 
data it returns.
   
   **Current priority is set as:**
   MediaType.TEXT_PLAIN_TYPE,
   MediaType.APPLICATION_JSON_TYPE,
   MediaType.TEXT_XML_TYPE,
   MediaType.TEXT_HTML_TYPE
   **The lower an entry is in the list (higher index value), the higher its priority.**
   If a type matches no entry in this list, it will be treated as media type all.
   
   **_Note: Please change the priority list if some other order is more suitable._**
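
A minimal sketch of how such a priority ordering might be expressed. It uses plain media type strings rather than the JAX-RS `MediaType` constants so that it stays self-contained. The names `ProduceTypePrioritySketch`, `rank` and `choose` are hypothetical and not taken from the PR, and ranking unlisted types below every listed one is an assumption about where "media type all" falls in the order.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ProduceTypePrioritySketch {

    // Same order as the list above: a higher index means a higher priority.
    static final List<String> PRIORITY = Arrays.asList(
            "text/plain", "application/json", "text/xml", "text/html");

    // Assumption: a type missing from the list gets rank -1, so it is
    // ordered below every listed type.
    static int rank(String producedType) {
        return PRIORITY.indexOf(producedType);
    }

    // Picks the candidate with the highest rank, or "*/*" if there are none.
    static String choose(List<String> producedTypes) {
        return producedTypes.stream()
                .max(Comparator.comparingInt(ProduceTypePrioritySketch::rank))
                .orElse("*/*");
    }
}
```

With the candidates `text/plain`, `application/json` and `text/html`, `choose` picks `text/html`, since it has the highest index in the list.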




> Remove ambiguity in request handlers
> 
>
> Key: TIKA-3357
> URL: https://issues.apache.org/jira/browse/TIKA-3357
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.26
>Reporter: Subhajit Das
>Priority: Major
>
> In Tika server, if a request has Accept */* or multiple Accept values 
> that match multiple resource handlers, it throws a warning and leads 
> to somewhat uncertain handling.
>  
> This should be programmatically controlled, to maintain consistency or 
> change standards in future.





[jira] [Commented] (TIKA-3329) RTG Translator with many-to-eng translation

2021-04-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319874#comment-17319874
 ] 

ASF GitHub Bot commented on TIKA-3329:
--

kieraCurtis commented on a change in pull request #419:
URL: https://github.com/apache/tika/pull/419#discussion_r612114393



##
File path: 
tika-translate/src/main/java/org/apache/tika/language/translate/RTGTranslator.java
##
@@ -0,0 +1,137 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.tika.language.translate;
+
+import com.fasterxml.jackson.jaxrs.json.JacksonJsonProvider;
+import org.apache.cxf.jaxrs.client.WebClient;
+import org.apache.tika.exception.TikaException;
+import org.json.simple.JSONObject;
+import org.json.simple.parser.JSONParser;
+import org.json.simple.parser.ParseException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.ws.rs.core.MediaType;
+import javax.ws.rs.core.Response;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+
+
+/**
+ * This translator is designed to work with a TCP-IP available
+ * RTG translation server, specifically the
+ * <a href="https://isi-nlp.github.io/rtg/#_rtg_serve">
+ * REST-based RTG server</a>.
+ * To get Docker image:
+ *   https://hub.docker.com/repository/docker/tgowda/rtg-model 
+ * 
+ * {code
+ * # without GPU
+ *   docker run --rm -i -p 6060:6060 tgowda/rtg-model:500toEng-v1
+ * # Or, with GPU device 0
+ *   docker run --rm -i -p 6060:6060 --gpus '"device=0"' 
tgowda/rtg-model:500toEng-v1
+ * }
+ * 
+ *
+ * If you were to interact with the server via curl a request
+ * would look as follows
+ *
+ * 
+ * {code
+ * curl --data "source=Comment allez-vous?" \
+ *  --data "source=Bonne journée" \
+ *  http://localhost:6060/translate
+ * }
+ * 
+ *
+ * RTG requires input to be pre-formatted into sentences, one per line,
+ * so this translation implementation takes care of that.
+ */
+public class RTGTranslator extends AbstractTranslator {
+
+public static final String RTG_TRANSLATE_URL_BASE = 
"http://localhost:6060";
+public static final String RTG_PROPS = "translator.rtg.properties";
+private static final Logger LOG = 
LoggerFactory.getLogger(RTGTranslator.class);
+private WebClient client;
+private boolean isAvailable = false;
+
+public RTGTranslator() {
+String rtgBaseUrl = RTG_TRANSLATE_URL_BASE;
+Properties config = new Properties();
+try (InputStream stream = getClass().getResourceAsStream(RTG_PROPS)){

Review comment:
   thank you for your quick reply. Happy to help :)






> RTG Translator with many-to-eng translation
> ---
>
> Key: TIKA-3329
> URL: https://issues.apache.org/jira/browse/TIKA-3329
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
>Priority: Major
>
> The existing translation services in tika-translate are either 
> commercial/paid engines (e.g. Google, Microsoft, etc.) or not state of the 
> art (such as Joshua, Moses, etc.). 
> Reader Translator Generator (RTG) is a neural machine translation toolkit 
> [https://isi-nlp.github.io/rtg/]
>  and has an implementation of the Transformer NMT model (current state of the 
> art). 
> It also has a massively multilingual pretrained NMT model (many-to-English 
> translation direction) 
> [https://hub.docker.com/repository/docker/tgowda/rtg-model] 
> in which about 500 source languages are represented, of which at least ~300 
> have good enough quality (for comparison, Google Translate has 
> ~106 languages, and Microsoft has 

[jira] [Commented] (TIKA-3329) RTG Translator with many-to-eng translation

2021-04-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319873#comment-17319873
 ] 

ASF GitHub Bot commented on TIKA-3329:
--

kieraCurtis commented on a change in pull request #419:
URL: https://github.com/apache/tika/pull/419#discussion_r612114393



##
File path: 
tika-translate/src/main/java/org/apache/tika/language/translate/RTGTranslator.java
##
@@ -0,0 +1,137 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.tika.language.translate;
+
+import com.fasterxml.jackson.jaxrs.json.JacksonJsonProvider;
+import org.apache.cxf.jaxrs.client.WebClient;
+import org.apache.tika.exception.TikaException;
+import org.json.simple.JSONObject;
+import org.json.simple.parser.JSONParser;
+import org.json.simple.parser.ParseException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.ws.rs.core.MediaType;
+import javax.ws.rs.core.Response;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+
+
+/**
+ * This translator is designed to work with a TCP-IP available
+ * RTG translation server, specifically the
+ * <a href="https://isi-nlp.github.io/rtg/#_rtg_serve">
+ * REST-based RTG server</a>.
+ * To get Docker image:
+ *   https://hub.docker.com/repository/docker/tgowda/rtg-model 
+ * 
+ * {code
+ * # without GPU
+ *   docker run --rm -i -p 6060:6060 tgowda/rtg-model:500toEng-v1
+ * # Or, with GPU device 0
+ *   docker run --rm -i -p 6060:6060 --gpus '"device=0"' 
tgowda/rtg-model:500toEng-v1
+ * }
+ * 
+ *
+ * If you were to interact with the server via curl a request
+ * would look as follows
+ *
+ * 
+ * {code
+ * curl --data "source=Comment allez-vous?" \
+ *  --data "source=Bonne journée" \
+ *  http://localhost:6060/translate
+ * }
+ * 
+ *
+ * RTG requires input to be pre-formatted into sentences, one per line,
+ * so this translation implementation takes care of that.
+ */
+public class RTGTranslator extends AbstractTranslator {
+
+public static final String RTG_TRANSLATE_URL_BASE = 
"http://localhost:6060";
+public static final String RTG_PROPS = "translator.rtg.properties";
+private static final Logger LOG = 
LoggerFactory.getLogger(RTGTranslator.class);
+private WebClient client;
+private boolean isAvailable = false;
+
+public RTGTranslator() {
+String rtgBaseUrl = RTG_TRANSLATE_URL_BASE;
+Properties config = new Properties();
+try (InputStream stream = getClass().getResourceAsStream(RTG_PROPS)){

Review comment:
   thank you for your quick reply. Happy






> RTG Translator with many-to-eng translation
> ---
>
> Key: TIKA-3329
> URL: https://issues.apache.org/jira/browse/TIKA-3329
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
>Priority: Major
>
> The existing translation services in tika-translate are either 
> commercial/paid engines (e.g. Google, Microsoft, etc.) or not state of the 
> art (such as Joshua, Moses, etc.). 
> Reader Translator Generator (RTG) is a neural machine translation toolkit 
> [https://isi-nlp.github.io/rtg/]
>  and has an implementation of the Transformer NMT model (current state of the 
> art). 
> It also has a massively multilingual pretrained NMT model (many-to-English 
> translation direction) 
> [https://hub.docker.com/repository/docker/tgowda/rtg-model] 
> in which about 500 source languages are represented, of which at least ~300 
> have good enough quality (for comparison, Google Translate has 
> ~106 languages, and Microsoft has about 80 

[jira] [Commented] (TIKA-3329) RTG Translator with many-to-eng translation

2021-04-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319843#comment-17319843
 ] 

ASF GitHub Bot commented on TIKA-3329:
--

kieraCurtis commented on a change in pull request #419:
URL: https://github.com/apache/tika/pull/419#discussion_r612097270



##
File path: 
tika-translate/src/main/java/org/apache/tika/language/translate/RTGTranslator.java
##
@@ -0,0 +1,137 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.tika.language.translate;
+
+import com.fasterxml.jackson.jaxrs.json.JacksonJsonProvider;
+import org.apache.cxf.jaxrs.client.WebClient;
+import org.apache.tika.exception.TikaException;
+import org.json.simple.JSONObject;
+import org.json.simple.parser.JSONParser;
+import org.json.simple.parser.ParseException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.ws.rs.core.MediaType;
+import javax.ws.rs.core.Response;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+
+
+/**
+ * This translator is designed to work with a TCP-IP available
+ * RTG translation server, specifically the
+ * <a href="https://isi-nlp.github.io/rtg/#_rtg_serve">
+ * REST-based RTG server</a>.
+ * To get Docker image:
+ *   https://hub.docker.com/repository/docker/tgowda/rtg-model 
+ * 
+ * {code
+ * # without GPU
+ *   docker run --rm -i -p 6060:6060 tgowda/rtg-model:500toEng-v1
+ * # Or, with GPU device 0
+ *   docker run --rm -i -p 6060:6060 --gpus '"device=0"' 
tgowda/rtg-model:500toEng-v1
+ * }
+ * 
+ *
+ * If you were to interact with the server via curl a request
+ * would look as follows
+ *
+ * 
+ * {code
+ * curl --data "source=Comment allez-vous?" \
+ *  --data "source=Bonne journée" \
+ *  http://localhost:6060/translate
+ * }
+ * 
+ *
+ * RTG requires input to be pre-formatted into sentences, one per line,
+ * so this translation implementation takes care of that.
+ */
+public class RTGTranslator extends AbstractTranslator {
+
+public static final String RTG_TRANSLATE_URL_BASE = 
"http://localhost:6060";
+public static final String RTG_PROPS = "translator.rtg.properties";
+private static final Logger LOG = 
LoggerFactory.getLogger(RTGTranslator.class);
+private WebClient client;
+private boolean isAvailable = false;
+
+public RTGTranslator() {
+String rtgBaseUrl = RTG_TRANSLATE_URL_BASE;
+Properties config = new Properties();
+try (InputStream stream = getClass().getResourceAsStream(RTG_PROPS)){

Review comment:
   I detect that this code is problematic according to [Bad practice 
(BAD_PRACTICE)](https://spotbugs.readthedocs.io/en/stable/bugDescriptions.html#bad-practice-bad-practice),
 [UI: Usage of GetResource may be unsafe if class is extended 
(UI_INHERITANCE_UNSAFE_GETRESOURCE)](https://spotbugs.readthedocs.io/en/stable/bugDescriptions.html#ui-usage-of-getresource-may-be-unsafe-if-class-is-extended-ui-inheritance-unsafe-getresource).
   Calling `this.getClass().getResource(...)` could give unexpected results 
if this class is extended by a class in another package.
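
One common way to address this SpotBugs pattern is to resolve the resource through a class literal instead of `getClass()`, so that the lookup stays relative to the declaring class's package even when a subclass lives elsewhere. The sketch below is illustrative only: `ResourceLookupSketch` and `loadConfig` are hypothetical names, not code from the PR, and silently falling back to defaults on a read error is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

final class ResourceLookupSketch {

    static final String PROPS = "translator.rtg.properties";

    // Loads the properties file relative to this class's package.
    // Returns empty Properties if the file is absent or unreadable.
    static Properties loadConfig() {
        Properties config = new Properties();
        // Class literal instead of getClass(): a subclass in another
        // package cannot change where the relative name is resolved.
        try (InputStream stream = ResourceLookupSketch.class.getResourceAsStream(PROPS)) {
            if (stream != null) {
                config.load(stream);
            }
        } catch (IOException e) {
            // Assumption: callers fall back to defaults on read errors.
        }
        return config;
    }
}
```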
   






> RTG Translator with many-to-eng translation
> ---
>
> Key: TIKA-3329
> URL: https://issues.apache.org/jira/browse/TIKA-3329
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
>Priority: Major
>
> The existing translation services in tika-translate are either 
> commercial/paid engines (e.g. Google, Microsoft, etc.) or not state of the 
> art (such as Joshua, Moses, etc.). 
> Reader Translator 

[jira] [Commented] (TIKA-3351) Make list of parsers in metadata unique

2021-04-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319572#comment-17319572
 ] 

ASF GitHub Bot commented on TIKA-3351:
--

peterkronenberg opened a new pull request #425:
URL: https://github.com/apache/tika/pull/425


   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Make list of parsers in metadata unique
> ---
>
> Key: TIKA-3351
> URL: https://issues.apache.org/jira/browse/TIKA-3351
> Project: Tika
>  Issue Type: Improvement
>Reporter: Peter Kronenberg
>Priority: Major
>
> The Parsed_By field in the metadata can have duplicates, since some parsers 
> can be called more than once. Make this field contain each parser only once.
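
One possible way to keep each parser name once while preserving first-seen order is to pass the values through a `LinkedHashSet`. This is only a sketch on plain strings: `UniqueParsedBySketch` and `unique` are hypothetical names, and how the values would be read from and written back to the metadata is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

final class UniqueParsedBySketch {

    // Drops repeated parser names, keeping the order of first occurrence.
    static List<String> unique(List<String> parsedBy) {
        return new ArrayList<>(new LinkedHashSet<>(parsedBy));
    }
}
```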





[jira] [Commented] (TIKA-3344) @POST methods does not accept same X-Tika headers as their @PUT counterpart

2021-04-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318114#comment-17318114
 ] 

ASF GitHub Bot commented on TIKA-3344:
--

tballison merged pull request #424:
URL: https://github.com/apache/tika/pull/424


   




> @POST methods does not accept same X-Tika headers as their @PUT counterpart
> ---
>
> Key: TIKA-3344
> URL: https://issues.apache.org/jira/browse/TIKA-3344
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.26
>Reporter: Subhajit Das
>Priority: Major
>
> Tika Server discards the X-Tika headers before processing starts.
> Thus, only PUT methods are capable of supporting X-Tika headers.





[jira] [Commented] (TIKA-3344) @POST methods does not accept same X-Tika headers as their @PUT counterpart

2021-04-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316745#comment-17316745
 ] 

ASF GitHub Bot commented on TIKA-3344:
--

Subhajitdas298 commented on pull request #424:
URL: https://github.com/apache/tika/pull/424#issuecomment-815295560


   *test failure, not build failure




> @POST methods does not accept same X-Tika headers as their @PUT counterpart
> ---
>
> Key: TIKA-3344
> URL: https://issues.apache.org/jira/browse/TIKA-3344
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.26
>Reporter: Subhajit Das
>Priority: Major
>
> Tika Server discards the X-Tika headers before processing starts.
> Thus, only PUT methods are capable of supporting X-Tika headers.





[jira] [Commented] (TIKA-3344) @POST methods does not accept same X-Tika headers as their @PUT counterpart

2021-04-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316742#comment-17316742
 ] 

ASF GitHub Bot commented on TIKA-3344:
--

Subhajitdas298 commented on pull request #424:
URL: https://github.com/apache/tika/pull/424#issuecomment-815292528


   Note: The build failure is not related to these changes. The build failures 
are related to the Tika app.




> @POST methods does not accept same X-Tika headers as their @PUT counterpart
> ---
>
> Key: TIKA-3344
> URL: https://issues.apache.org/jira/browse/TIKA-3344
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.26
>Reporter: Subhajit Das
>Priority: Major
>
> Tika Server discards the X-Tika headers before processing starts.
> Thus, only PUT methods are capable of supporting X-Tika headers.





[jira] [Commented] (TIKA-3344) @POST methods does not accept same X-Tika headers as their @PUT counterpart

2021-04-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316578#comment-17316578
 ] 

ASF GitHub Bot commented on TIKA-3344:
--

Subhajitdas298 opened a new pull request #424:
URL: https://github.com/apache/tika/pull/424


   main branch changes for [TIKA-3344] and [TIKA-3345]




> @POST methods does not accept same X-Tika headers as their @PUT counterpart
> ---
>
> Key: TIKA-3344
> URL: https://issues.apache.org/jira/browse/TIKA-3344
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.26
>Reporter: Subhajit Das
>Priority: Major
>
> Tika Server discards the X-Tika headers before processing starts.
> Thus, only PUT methods are capable of supporting X-Tika headers.





[jira] [Commented] (TIKA-3344) @POST methods does not accept same X-Tika headers as their @PUT counterpart

2021-04-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315050#comment-17315050
 ] 

ASF GitHub Bot commented on TIKA-3344:
--

tballison merged pull request #422:
URL: https://github.com/apache/tika/pull/422


   




> @POST methods does not accept same X-Tika headers as their @PUT counterpart
> ---
>
> Key: TIKA-3344
> URL: https://issues.apache.org/jira/browse/TIKA-3344
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.26
>Reporter: Subhajit Das
>Priority: Major
>
> Tika Server discards the X-Tika headers before processing starts.
> Thus, only PUT methods are capable of supporting X-Tika headers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3317) Tika Pipes - add a solr fetch iterator

2021-04-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314327#comment-17314327
 ] 

ASF GitHub Bot commented on TIKA-3317:
--

nddipiazza commented on pull request #412:
URL: https://github.com/apache/tika/pull/412#issuecomment-812903624


   Hey @tballison, I merged main into the branch. I still want to spend some 
time on the Update options, but I need to ask my Solr guru friends some 
questions to get their blessing. 




> Tika Pipes - add a solr fetch iterator
> --
>
> Key: TIKA-3317
> URL: https://issues.apache.org/jira/browse/TIKA-3317
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-pipes
>Affects Versions: 2.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> Add a solr-fetch-iterator to tika-pipes.





[jira] [Commented] (TIKA-3329) RTG Translator with many-to-eng translation

2021-04-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314220#comment-17314220
 ] 

ASF GitHub Bot commented on TIKA-3329:
--

thammegowda commented on pull request #419:
URL: https://github.com/apache/tika/pull/419#issuecomment-812836701


   Wiki page created:  https://cwiki.apache.org/confluence/display/TIKA/NMT-RTG




> RTG Translator with many-to-eng translation
> ---
>
> Key: TIKA-3329
> URL: https://issues.apache.org/jira/browse/TIKA-3329
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
>Priority: Major
>
> The existing translation services in tika-translate are either 
> commercial/paid engines (e.g. Google, Microsoft, etc.) or not state of the 
> art (such as Joshua, Moses, etc.). 
> Reader Translator Generator (RTG) is a neural machine translation toolkit 
> [https://isi-nlp.github.io/rtg/]
>  and has an implementation of the Transformer NMT model (current state of the 
> art). 
> It also has a massively multilingual pretrained NMT model (many-to-English 
> translation direction) 
> [https://hub.docker.com/repository/docker/tgowda/rtg-model] 
> in which about 500 source languages are represented, of which at least ~300 
> have good enough quality (for comparison, Google Translate has 
> ~106 languages, and Microsoft has about 80 languages). 
> This issue is for integrating RTG Translator into tika-translate.
>  





[jira] [Commented] (TIKA-3344) @POST methods does not accept same X-Tika headers as their @PUT counterpart

2021-04-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313913#comment-17313913
 ] 

ASF GitHub Bot commented on TIKA-3344:
--

Subhajitdas298 opened a new pull request #422:
URL: https://github.com/apache/tika/pull/422


   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> @POST methods does not accept same X-Tika headers as their @PUT counterpart
> ---
>
> Key: TIKA-3344
> URL: https://issues.apache.org/jira/browse/TIKA-3344
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.26
>Reporter: Subhajit Das
>Priority: Major
>
> Tika Server discards the X-Tika headers before processing starts.
> Thus, only PUT methods are capable of supporting X-Tika headers.





[jira] [Commented] (TIKA-3340) LanguageProfile for Myanmar

2021-03-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312520#comment-17312520
 ] 

ASF GitHub Bot commented on TIKA-3340:
--

kkrugler commented on pull request #421:
URL: https://github.com/apache/tika/pull/421#issuecomment-811185369


   @arky - re using UDHR text...that's fine, but as per the **Permissions** 
section on https://www.ohchr.org/EN/UDHR/Pages/Introduction.aspx,  you would 
need to add attribution to the end of the Tika top-level `LICENSE.txt` file 
(see other examples in that file of test data).




> LanguageProfile for Myanmar
> ---
>
> Key: TIKA-3340
> URL: https://issues.apache.org/jira/browse/TIKA-3340
> Project: Tika
>  Issue Type: Improvement
>  Components: languageidentifier
>Reporter: Arky
>Priority: Major
>
> A language profile for detecting Myanmar/Burmese (my).





[jira] [Commented] (TIKA-3340) LanguageProfile for Myanmar

2021-03-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312475#comment-17312475
 ] 

ASF GitHub Bot commented on TIKA-3340:
--

arky commented on pull request #421:
URL: https://github.com/apache/tika/pull/421#issuecomment-811151917


   @kkrugler Thanks for that information. I'll open a pull request to add an 
appropriate test case for Myanmar and the few other languages that were introduced. 
   
   Any technical objections to using the Burmese translation of the UDHR as the 
test case?
   
   https://www.ohchr.org/EN/UDHR/Pages/Language.aspx?LangID=bms




[jira] [Commented] (TIKA-3340) LanguageProfile for Myanmar

2021-03-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312435#comment-17312435
 ] 

ASF GitHub Bot commented on TIKA-3340:
--

kkrugler commented on pull request #421:
URL: https://github.com/apache/tika/pull/421#issuecomment-811120110


   Hi @arky, you also need to edit the `LanguageIdentifierTest.java` file to 
add `my` to the list of languages, like this:
   
   ```java
   private static final String[] languages = new String[] {
       // TODO - currently Estonian and Greek fail these tests.
       // Enable when language detection works better.
       "da", "de", /* "et", "el", */ "en", "es", "fi", "fr", "it",
       "lt", "my", "nl", "pt", "sv"
   };
   ```
   
   And then run `mvn clean test` from the `tika/tika-core` directory.
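
A minimal sketch of the convention discussed in this thread, assuming each language code in the list has a matching `<code>.test` file under `org/apache/tika/language`. The class and method names are hypothetical and this is not Tika's actual test code; it only shows how one might check that every listed language has a resource file on the classpath. The commented-out `et` and `el` entries are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: maps language codes to the
// test-resource paths mentioned in this thread.
class LanguageTestResources {
    // Classpath directory assumed from the my.test path given in the thread.
    static final String RESOURCE_DIR = "org/apache/tika/language/";

    // The enabled language codes from the snippet above, including "my".
    static final String[] LANGUAGES = {
        "da", "de", "en", "es", "fi", "fr", "it",
        "lt", "my", "nl", "pt", "sv"
    };

    // Returns the classpath location of the test text for one language code.
    static String resourcePath(String lang) {
        return RESOURCE_DIR + lang + ".test";
    }

    // Lists the language codes whose test file cannot be found by the loader.
    static List<String> missingResources(ClassLoader loader) {
        List<String> missing = new ArrayList<>();
        for (String lang : LANGUAGES) {
            if (loader.getResource(resourcePath(lang)) == null) {
                missing.add(lang);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        ClassLoader loader = LanguageTestResources.class.getClassLoader();
        System.out.println("Missing test files: " + missingResources(loader));
    }
}
```

Run outside the Tika test classpath, this reports every language as missing; the result is only meaningful when the test resources directory is on the classpath.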




[jira] [Commented] (TIKA-3340) LanguageProfile for Myanmar

2021-03-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312187#comment-17312187
 ] 

ASF GitHub Bot commented on TIKA-3340:
--

arky commented on pull request #421:
URL: https://github.com/apache/tika/pull/421#issuecomment-810887814


   @kkrugler I'll be happy to contribute test cases for Myanmar. Can you please 
tell me more about how to do this?
   
   Is it enough to just add a 'lang_code.test' file with 100 lines of Myanmar text? 
https://github.com/apache/tika/tree/main/tika-core/src/test/resources/org/apache/tika/language
   
   How do I verify this test case? Just 'mvn run tests...'?
   
   




[jira] [Commented] (TIKA-3340) LanguageProfile for Myanmar

2021-03-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311842#comment-17311842
 ] 

ASF GitHub Bot commented on TIKA-3340:
--

kkrugler commented on pull request #421:
URL: https://github.com/apache/tika/pull/421#issuecomment-810625585


   Hi @arky - thanks for the PR! Would it be possible to add `my` to the list 
of languages being tested in `LanguageIdentifierTest`? You'd also have to add a 
`tika-core/src/test/resources/org/apache/tika/language/my.test` file containing 
Burmese text.




[jira] [Commented] (TIKA-3340) LanguageProfile for Myanmar

2021-03-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311731#comment-17311731
 ] 

ASF GitHub Bot commented on TIKA-3340:
--

arky opened a new pull request #421:
URL: https://github.com/apache/tika/pull/421


   Adds Myanmar LanguageProfile for Apache Tika 
https://issues.apache.org/jira/browse/TIKA-3340
   



