[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856154#comment-16856154
 ] 

Tim Allison commented on TIKA-2790:
---

{noformat}
// d is the language detector under test
StringBuilder sb = new StringBuilder();
// a short English prefix...
for (int i = 0; i < 3; i++) {
    sb.append("four score and seven years ago ");
}
// ...followed by a much longer French body
for (int i = 0; i < 100; i++) {
    sb.append("La MEL réunit 90 communes sur un territoire de près de "
        + "650 km2 où résident plus de 1,1 million d’habitants. Située au centre d'une "
        + "aire géographique très densément peuplée, à l’extrême ouest de la plaine "
        + "d'Europe du Nord, elle est encadrée");
}
List<LangDetectResult> results = d.detect(sb.toString());
{noformat}

results in:
{noformat}
[LangDetectResult{lang='eng', confidence=1.0}]
{noformat}

When you get rid of the English loop, the result is 'fra' confidence=1.0.

To be clear, I think stopping short makes quite a bit of sense, and I'll see 
how we can do that in a modified version of OpenNLP.
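
As a rough illustration of what "stopping short" could look like on the caller's side, the sketch below caps the text handed to a detector at a fixed number of characters. The class name, the method name and the 1000-character budget in main are made up for this example; none of it is OpenNLP or tika-eval API, and a real budget would have to come from measurements.

```java
final class StopShortSketch {

    /**
     * Returns at most maxChars UTF-16 chars from the start of text,
     * backing off by one char if the cut would split a surrogate pair.
     */
    static String head(String text, int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be >= 0");
        }
        if (text.length() <= maxChars) {
            return text;
        }
        int end = maxChars;
        // Avoid cutting between a high and a low surrogate.
        if (end > 0 && Character.isHighSurrogate(text.charAt(end - 1))) {
            end--;
        }
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 3; i++) {
            sb.append("four score and seven years ago ");
        }
        for (int i = 0; i < 100; i++) {
            sb.append("La MEL réunit 90 communes sur un territoire. ");
        }
        // Only this prefix would be passed on to the detector.
        String prefix = head(sb.toString(), 1000);
        System.out.println(prefix.length());
    }
}
```

A prefix-only cut has the same weakness as the example above: if the first few hundred characters are not representative of the document, the answer will be wrong however confident it is.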

> Consider switching lang-detection in tika-eval to open-nlp
> --
>
> Key: TIKA-2790
> URL: https://issues.apache.org/jira/browse/TIKA-2790
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
> Attachments: fra_mixed_10_0.0_0.txt, hasEnough.png, 
> langid_20190509.zip, langid_20190510.zip, langid_20190514.zip, 
> langid_20190514_plus_minus_1.zip, timeVsLength.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856135#comment-16856135
 ] 

Tim Allison edited comment on TIKA-2790 at 6/4/19 9:38 PM:
---

[~kkrugler], I may be misunderstanding the code, but it looks like Yalder is 
stopping after normalizing 85 characters and identifying 330 known ngrams.  The 
original string is 3390 characters long.


was (Author: talli...@mitre.org):
[~kkrugler], I may be misunderstanding the code, but it looks like Yalder is 
stopping after normalizing 85 characters and identifying 330 known ngrams.



[jira] [Comment Edited] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856135#comment-16856135
 ] 

Tim Allison edited comment on TIKA-2790 at 6/4/19 9:34 PM:
---

[~kkrugler], I may be misunderstanding the code, but it looks like Yalder is 
stopping after normalizing 85 characters and identifying 330 known ngrams.


was (Author: talli...@mitre.org):
[~kkrugler], I may be misunderstanding the code, but it looks like Yalder is 
stopping after normalizing 84 characters and computing 330 ngrams.



[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856135#comment-16856135
 ] 

Tim Allison commented on TIKA-2790:
---

[~kkrugler], I may be misunderstanding the code, but it looks like Yalder is 
stopping after normalizing 84 characters and computing 330 ngrams.



[jira] [Updated] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2790:
--
Attachment: hasEnough.png



[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856117#comment-16856117
 ] 

Tim Allison commented on TIKA-2790:
---

[~kkrugler], k, I'll dig some more.  Thank you!



[jira] [Updated] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2790:
--
Attachment: timeVsLength.png



[jira] [Updated] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2790:
--
Attachment: (was: image-2019-06-04-16-46-55-286.png)



[jira] [Updated] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2790:
--
Attachment: image-2019-06-04-16-46-55-286.png



[jira] [Commented] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856109#comment-16856109
 ] 

Tim Allison commented on TIKA-2889:
---

the "file:" part is important; try: {{ 
-Dlog4j.configuration=file:"D:/log4j.xml"}}
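
Some background on why the prefix matters: log4j 1.x first tries to read the value of {{log4j.configuration}} as a URL and, as far as its documentation describes, only falls back to a classpath lookup when that fails. The sketch below simply runs both spellings through java.net.URL to show the difference. The class and method names are invented for the example, and this is not the code log4j itself runs.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class Log4jConfigUrlSketch {

    /** Describes how a log4j.configuration value parses as a URL. */
    static String describe(String value) {
        try {
            URL url = new URL(value);
            return "URL with protocol '" + url.getProtocol() + "'";
        } catch (MalformedURLException e) {
            // A bare Windows path such as D:\log4j.xml ends up here,
            // because the drive letter is read as an unknown protocol.
            return "not a URL";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("D:\\log4j.xml"));
        System.out.println(describe("file:D:/log4j.xml"));
    }
}
```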

> Tika Server keeps crashing
> --
>
> Key: TIKA-2889
> URL: https://issues.apache.org/jira/browse/TIKA-2889
> Project: Tika
> Issue Type: Bug
> Components: server
> Affects Versions: 1.18, 1.19, 1.19.1, 1.21
> Environment: Both Ubuntu and Windows have the same bug/issue
> Reporter: Thomas van Hesteren
> Priority: Minor
> Attachments: log4j.xml
>
>
> I have a document processor which sends documents to the Tika Server over 
> cUrl. However, the server crashes multiple times (not document specific). The 
> response I get from cUrl if it happens is as follows:
> Connection error: Couldn't connect to server
>  
> The Tika server is started when the script starts executing. For now, I fixed 
> the issue by making a watcher which restarts the tika server when it crashes. 
> It then processes a few other documents and crashes again (after a few 
> minutes, let's say 5 minutes tops).
>  
> Is there any possibility to catch the exception (if it throws any?)
>  
> A log which shows the crash of the server:
> 04-06-2019 15:49:25|Processing a file of: 52.3kB
> 04-06-2019 15:49:24|Processing a file of: 255.5kB
> 04-06-2019 15:49:24|Processing a file of: 241.6kB
> 04-06-2019 15:49:23|Processing a file of: 37.7kB
> 04-06-2019 15:49:22|Processing a file of: 1.27MB
> 04-06-2019 15:49:21|Processing a file of: 55.8kB
> 04-06-2019 15:49:17|Processing a file of: 114.5kB
> 04-06-2019 15:49:08|Server is not running. Restarting Server. Connection 
> error: Couldn't connect to server
> 04-06-2019 15:49:03|Processing a file of: 41.0kB
> 04-06-2019 15:49:00|Processing a file of: 38.0kB
> 04-06-2019 15:48:59|Processing a file of: 37.1kB
> 04-06-2019 15:48:59|Processing a file of: 60.2kB
> 04-06-2019 15:48:59|Processing a file of: 280.7kB
> 04-06-2019 15:48:59|Processing a file of: 3.30MB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856109#comment-16856109
 ] 

Tim Allison edited comment on TIKA-2889 at 6/4/19 8:43 PM:
---

the "file:" part is important; try: 
{{-Dlog4j.configuration=file:"D:/log4j.xml"}}


was (Author: talli...@mitre.org):
the "file:" part is important; try: {{ 
-Dlog4j.configuration=file:"D:/log4j.xml"}}



[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Ken Krugler (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856107#comment-16856107
 ] 

Ken Krugler commented on TIKA-2790:
---

[~talli...@apache.org] - I'd have to look at the code used to generate the 
timing table, because there's a pretty high overhead for creating the detector. 
So if you aren't re-using the detector, then for shorter strings you're mostly 
measuring the startup time. If I just consider the last four entries (length 
10K -> 100K) then it's a pretty good linear relationship (R^2 = .91).
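
For anyone wanting to redo that fit, a small sketch of an ordinary least-squares R^2 computation follows. The four (length, millis) pairs in main are a reading of the last four rows of the timing table, taking the lengths to be 10K, 20K, 50K and 100K; which column the .91 figure was based on is not stated, so the numbers need not match exactly. The class and method names are invented for this example.

```java
final class LinearFitSketch {

    /** Coefficient of determination (R^2) for a simple linear fit of y on x. */
    static double rSquared(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double sxx = 0;
        double syy = 0;
        double sxy = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxx += dx * dx;
            syy += dy * dy;
            sxy += dx * dy;
        }
        // With a single predictor, R^2 equals the squared Pearson correlation.
        return (sxy * sxy) / (sxx * syy);
    }

    public static void main(String[] args) {
        // Assumed lengths (10K..100K) and total millis from the timing table.
        double[] length = {10_000, 20_000, 50_000, 100_000};
        double[] millis = {18884, 20702, 21749, 23503};
        System.out.printf("R^2 = %.2f%n", rSquared(length, millis));
    }
}
```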



[jira] [Comment Edited] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856101#comment-16856101
 ] 

Tim Allison edited comment on TIKA-2790 at 6/4/19 8:38 PM:
---

In going down the path of sampling, or stopping short, I wanted to see how 
much text would be necessary for OpenNLP.  In other words: what's the minimum 
length/minimum confidence after which the detector is always correct?  To 
answer that, I measured the inverse: the maximum confidence, and the text 
length at which it occurred, when the detector incorrectly identified a language.


In the following table, I show the maximum wrong confidence for a given 
language, the incorrectly detected language, and the text length at which that 
was incorrectly detected.  For example, at text length of 230 characters, 
OpenNLP had a confidence of 0.43 that the text was 'hrv', but it was really 
'bos'.

As the original confusion matrix shows, some lang pairs are much harder and 
require more evidence, e.g. {{ekk}} and {{est}}, {{fas}} and {{pes}}, {{hrv}} 
and {{bos}}, {{ind}} and {{sun}}, {{pus}} and {{por}}, but many languages 
require a very small amount of text...

||Lang||WrongId||MaxWrongConf||MaxWrongLength||
|ast|nno|0.03|90|
|bak|tat|0.08|70|
|bos|hrv|0.43|230|
|cat|vol|0.01|10|
|ces|slk|0.01|10|
|cym|min|0.01|10|
|dan|war|0.02|30|
|deu|war|0.01|10|
|ekk|est|0.54|310|
|eng|nan|0.01|10|
|est|ekk|0.56|250|
|fas|pes|0.52|550|
|fin|min|0.01|10|
|fra|fin|0.01|10|
|gsw|lat|0.01|10|
|hrv|bos|0.64|1010|
|hun|nob|0.01|10|
|ind|sun|0.73|810|
|isl|fao|0.02|30|
|ita|fra|0.04|110|
|jav|afr|0.02|50|
|lav|lvs|0.35|170|
|lim|epo|0.02|30|
|ltz|vol|0.01|10|
|lvs|lav|0.03|30|
|mlt|eng|0.02|50|
|msa|ind|0.45|490|
|nan|tur|0.01|10|
|nds|plt|0.01|10|
|nep|san|0.01|10|
|nld|plt|0.01|10|
|nno|nob|0.12|130|
|nob|nno|0.62|290|
|oci|ita|0.01|10|
|pes|fas|0.57|730|
|pus|por|0.27|130|
|ron|lat|0.04|70|
|rus|mkd|0.02|10|
|slk|epo|0.01|10|
|slv|min|0.01|10|
|spa|vol|0.01|10|
|sqi|zul|0.01|30|
|sun|ind|0.60|790|
|swe|dan|0.02|30|
|tat|bak|0.03|30|
|tgl|ceb|0.01|10|
|tur|min|0.01|10|
|ukr|che|0.02|10|
|uzb|kir|0.02|10|
|vie|war|0.02|30|
|zul|swa|0.02|10|
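
The bookkeeping behind one row of this table can be sketched roughly as follows: given the detector's top answer at each text length for a document whose language is known, keep the wrong answer with the highest confidence. The Observation class, its fields and the sample numbers in main are invented for illustration (only the 230-character entry mirrors the 'bos'/'hrv' example above); this is not the code that produced the table.

```java
import java.util.Arrays;
import java.util.List;

final class MaxWrongSketch {

    /** One detector result for a text of a given length (hypothetical type). */
    static final class Observation {
        final int length;
        final String detectedLang;
        final double confidence;

        Observation(int length, String detectedLang, double confidence) {
            this.length = length;
            this.detectedLang = detectedLang;
            this.confidence = confidence;
        }
    }

    /**
     * Returns the incorrect observation with the highest confidence,
     * or null if every observation matched trueLang.
     */
    static Observation maxWrong(String trueLang, List<Observation> observations) {
        Observation worst = null;
        for (Observation o : observations) {
            if (o.detectedLang.equals(trueLang)) {
                continue; // correct answers are not of interest here
            }
            if (worst == null || o.confidence > worst.confidence) {
                worst = o;
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        // Made-up results for a Bosnian text at increasing lengths.
        List<Observation> obs = Arrays.asList(
                new Observation(10, "hrv", 0.05),
                new Observation(230, "hrv", 0.43),
                new Observation(250, "bos", 0.51));
        Observation worst = maxWrong("bos", obs);
        System.out.println(worst.detectedLang + " " + worst.confidence + " " + worst.length);
    }
}
```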


was (Author: talli...@mitre.org):
In going down the path of sampling, or stopping short...I wanted to see how 
much text would be necessary for OpenNLP.  In the following table, I show the 
maximum wrong confidence for a given language, the incorrectly detected 
language, and the text length at which that was incorrectly detected.  For 
example, at text length of 230 characters, OpenNLP had a confidence of 0.43 
that the text was 'hrv', but it was really 'bos'.

As the original confusion matrix shows, some lang pairs are much harder and 
require more evidence, e.g. {{ekk}} and {{est}}, {{fas}} and {{pes}}, {{hrv}} 
and {{bos}}, {{ind}} and {{sun}}, {{pus}} and {{por}}, but many languages 
require a very small amount of text...

||Lang||WrongId||MaxWrongConf||MaxWrongLength||
|ast|nno|0.03|90|
|bak|tat|0.08|70|
|bos|hrv|0.43|230|
|cat|vol|0.01|10|
|ces|slk|0.01|10|
|cym|min|0.01|10|
|dan|war|0.02|30|
|deu|war|0.01|10|
|ekk|est|0.54|310|
|eng|nan|0.01|10|
|est|ekk|0.56|250|
|fas|pes|0.52|550|
|fin|min|0.01|10|
|fra|fin|0.01|10|
|gsw|lat|0.01|10|
|hrv|bos|0.64|1010|
|hun|nob|0.01|10|
|ind|sun|0.73|810|
|isl|fao|0.02|30|
|ita|fra|0.04|110|
|jav|afr|0.02|50|
|lav|lvs|0.35|170|
|lim|epo|0.02|30|
|ltz|vol|0.01|10|
|lvs|lav|0.03|30|
|mlt|eng|0.02|50|
|msa|ind|0.45|490|
|nan|tur|0.01|10|
|nds|plt|0.01|10|
|nep|san|0.01|10|
|nld|plt|0.01|10|
|nno|nob|0.12|130|
|nob|nno|0.62|290|
|oci|ita|0.01|10|
|pes|fas|0.57|730|
|pus|por|0.27|130|
|ron|lat|0.04|70|
|rus|mkd|0.02|10|
|slk|epo|0.01|10|
|slv|min|0.01|10|
|spa|vol|0.01|10|
|sqi|zul|0.01|30|
|sun|ind|0.60|790|
|swe|dan|0.02|30|
|tat|bak|0.03|30|
|tgl|ceb|0.01|10|
|tur|min|0.01|10|
|ukr|che|0.02|10|
|uzb|kir|0.02|10|
|vie|war|0.02|30|
|zul|swa|0.02|10|



[jira] [Commented] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Thomas van Hesteren (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856103#comment-16856103
 ] 

Thomas van Hesteren commented on TIKA-2889:
---

I use the following code to start tika:
java -Dlog4j.configuration="D:\log4j.xml" -Xms128m -Xmx902m -jar 
"D:\tika-server.jar" -log debug --host=127.0.0.1 --port=12345

However, I get no logging at all...? What am I doing wrong?



[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856102#comment-16856102
 ] 

Tim Allison commented on TIKA-2790:
---

The above was with a single sample.  I wanted to gauge variance.  So I ran 100 
iterations of randomly selected sentences:

||Language||MaxWrongConfAvg||MaxWrongConfStdev||LengthAtMaxWrongConfAvg||LengthAtMaxWrongConfStdev||
|afr|0.02|0.01|23.5|20.9|
|ara|0.02|0.00|14.3|12.6|
|ast|0.04|0.05|51.5|45.7|
|aze|0.02|0.01|18.8|17.9|
|bak|0.24|0.18|133.5|91.8|
|bel|0.03|0.03|29.2|36.1|
|ben|0.02|0.01|94.4|62.3|
|bos|0.56|0.09|446.0|139.0|
|bre|0.01|0.01|15.9|18.1|
|bul|0.02|0.01|20.3|28.1|
|cat|0.03|0.03|54.0|57.3|
|ceb|0.04|0.05|43.3|53.8|
|ces|0.07|0.13|50.6|65.5|
|che|0.03|0.03|30.0|46.2|
|cmn|0.01|0.00|18.0|11.0|
|cym|0.02|0.01|17.9|17.1|
|dan|0.03|0.05|40.9|44.5|
|deu|0.02|0.02|23.5|24.4|
|ekk|0.30|0.21|189.8|117.3|
|ell|0.03|0.02|76.7|83.3|
|eng|0.02|0.01|24.0|22.0|
|epo|0.01|0.00|13.3|9.5|
|est|0.24|0.21|160.5|119.5|
|eus|0.02|0.01|24.0|23.9|
|fao|0.01|0.01|14.5|12.4|
|fas|0.46|0.14|359.0|136.3|
|fin|0.02|0.01|17.1|17.4|
|fra|0.02|0.01|21.1|20.7|
|fry|0.04|0.08|31.4|45.9|
|gle|0.01|0.00|12.8|11.6|
|glg|0.02|0.02|28.5|28.4|
|gsw|0.03|0.04|35.5|50.4|
|heb|0.04|0.02|115.0|34.2|
|hin|0.02|0.01|15.6|9.1|
|hrv|0.50|0.15|489.4|218.1|
|hun|0.01|0.01|17.8|16.5|
|hye|0.02|0.01|34.0|35.8|
|ind|0.73|0.08|1087.8|348.0|
|isl|0.02|0.01|19.7|18.8|
|ita|0.02|0.02|27.9|40.8|
|jav|0.04|0.05|51.0|55.3|
|jpn|0.02|0.01|19.1|24.3|
|kan|0.01|0.01|40.0|47.6|
|kat|0.03|0.01|63.3|46.2|
|kaz|0.02|0.01|19.7|17.4|
|kir|0.04|0.05|49.0|53.6|
|kor|0.02|0.00|70.0|0.0|
|lat|0.02|0.02|24.8|31.6|
|lav|0.21|0.17|138.2|94.6|
|lim|0.04|0.04|40.4|37.7|
|lit|0.01|0.01|15.4|16.4|
|ltz|0.02|0.01|22.4|19.8|
|lvs|0.18|0.18|121.9|114.7|
|mal|0.02|0.00|50.0|0.0|
|mar|0.02|0.01|18.1|23.2|
|min|0.03|0.03|33.6|39.3|
|mkd|0.03|0.02|27.3|34.9|
|mlt|0.01|0.01|16.6|16.6|
|mon|0.02|0.01|15.2|11.5|
|mri|0.04|0.03|81.1|76.5|
|msa|0.26|0.15|280.0|171.6|
|nan|0.01|0.00|10.7|3.6|
|nds|0.02|0.02|25.1|33.5|
|nep|0.02|0.01|14.6|10.7|
|nld|0.03|0.03|31.6|35.8|
|nno|0.10|0.13|98.8|108.1|
|nob|0.13|0.17|95.3|100.4|
|oci|0.02|0.02|25.9|31.4|
|pan|0.01|0.00|10.0|0.0|
|pes|0.67|0.12|939.5|436.5|
|plt|0.01|0.00|13.1|7.5|
|pnb|0.07|0.07|57.3|42.8|
|pol|0.01|0.00|12.0|6.1|
|por|0.02|0.02|31.8|34.4|
|pus|0.20|0.13|165.8|91.3|
|ron|0.02|0.00|19.1|12.6|
|rus|0.03|0.02|31.5|30.6|
|san|0.03|0.02|32.0|42.6|
|sin|0.03|0.02|53.3|53.1|
|slk|0.02|0.02|17.8|21.5|
|slv|0.01|0.01|17.8|17.9|
|som|0.02|0.01|22.3|28.5|
|spa|0.03|0.05|43.4|61.5|
|sqi|0.01|0.00|16.5|10.1|
|srp|0.02|0.01|15.3|11.6|
|sun|0.57|0.08|711.4|224.6|
|swa|0.02|0.01|21.2|27.0|
|swe|0.02|0.01|19.7|18.9|
|tam|0.02|0.00|23.3|23.1|
|tat|0.25|0.24|152.6|144.0|
|tgk|0.05|0.07|56.2|74.7|
|tgl|0.02|0.01|22.8|34.8|
|tha|0.03|0.01|40.0|25.8|
|tur|0.01|0.00|16.9|14.7|
|ukr|0.02|0.01|13.7|10.1|
|urd|0.10|0.12|80.7|98.5|
|uzb|0.02|0.01|19.9|20.0|
|vie|0.02|0.01|16.7|23.1|
|vol|0.01|0.00|14.7|19.4|
|war|0.02|0.01|22.5|31.2|
|zul|0.02|0.01|22.1|34.0|
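
The average and standard deviation columns are ordinary summary statistics over the 100 runs. A minimal sketch of that aggregation is below. It uses the sample (n - 1) standard deviation, which is an assumption, since the comment does not say which variant was used; the class name and the input values are invented.

```java
final class RunStatsSketch {

    /** Arithmetic mean of the values. */
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Sample standard deviation (divides by n - 1); needs at least two values. */
    static double sampleStdev(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double m = mean(values);
        double sumSq = 0;
        for (double v : values) {
            sumSq += (v - m) * (v - m);
        }
        return Math.sqrt(sumSq / (values.length - 1));
    }

    public static void main(String[] args) {
        // Made-up max-wrong confidences from five hypothetical iterations.
        double[] maxWrongConf = {0.52, 0.61, 0.47, 0.58, 0.62};
        System.out.printf("avg=%.2f stdev=%.2f%n",
                mean(maxWrongConf), sampleStdev(maxWrongConf));
    }
}
```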



[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856101#comment-16856101
 ] 

Tim Allison commented on TIKA-2790:
---

In going down the path of sampling, or stopping short...I wanted to see how 
much text would be necessary for OpenNLP.  In the following table, I show the 
maximum wrong confidence for a given language, the incorrectly detected 
language, and the text length at which that was incorrectly detected.  For 
example, at text length of 230 characters, OpenNLP had a confidence of 0.43 
that the text was 'hrv', but it was really 'bos'.

As the original confusion matrix shows, some lang pairs are much harder and 
require more evidence, e.g. {{ekk}} and {{est}}, {{fas}} and {{pes}}, {{hrv}} 
and {{bos}}, {{ind}} and {{sun}}, {{pus}} and {{por}}, but many languages 
require a very small amount of text...

||Lang||WrongId||MaxWrongConf||MaxWrongLength||
|ast|nno|0.03|90|
|bak|tat|0.08|70|
|bos|hrv|0.43|230|
|cat|vol|0.01|10|
|ces|slk|0.01|10|
|cym|min|0.01|10|
|dan|war|0.02|30|
|deu|war|0.01|10|
|ekk|est|0.54|310|
|eng|nan|0.01|10|
|est|ekk|0.56|250|
|fas|pes|0.52|550|
|fin|min|0.01|10|
|fra|fin|0.01|10|
|gsw|lat|0.01|10|
|hrv|bos|0.64|1010|
|hun|nob|0.01|10|
|ind|sun|0.73|810|
|isl|fao|0.02|30|
|ita|fra|0.04|110|
|jav|afr|0.02|50|
|lav|lvs|0.35|170|
|lim|epo|0.02|30|
|ltz|vol|0.01|10|
|lvs|lav|0.03|30|
|mlt|eng|0.02|50|
|msa|ind|0.45|490|
|nan|tur|0.01|10|
|nds|plt|0.01|10|
|nep|san|0.01|10|
|nld|plt|0.01|10|
|nno|nob|0.12|130|
|nob|nno|0.62|290|
|oci|ita|0.01|10|
|pes|fas|0.57|730|
|pus|por|0.27|130|
|ron|lat|0.04|70|
|rus|mkd|0.02|10|
|slk|epo|0.01|10|
|slv|min|0.01|10|
|spa|vol|0.01|10|
|sqi|zul|0.01|30|
|sun|ind|0.60|790|
|swe|dan|0.02|30|
|tat|bak|0.03|30|
|tgl|ceb|0.01|10|
|tur|min|0.01|10|
|ukr|che|0.02|10|
|uzb|kir|0.02|10|
|vie|war|0.02|30|
|zul|swa|0.02|10|



[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856099#comment-16856099
 ] 

Tim Allison commented on TIKA-2790:
---

[~kkrugler] I think I understand some of the string processing components that 
make it fast, but how does processing time not grow linearly if you aren't 
sampling?

||Detector||Length||Millis||Avg(ms)||Stdev||
|YalderDetector|10|4442|0.06|1.19|
|YalderDetector|50|13017|0.17|0.4|
|YalderDetector|100|14149|0.19|0.41|
|YalderDetector|200|14686|0.2|0.41|
|YalderDetector|500|14536|0.19|0.43|
|YalderDetector|1000|14993|0.2|0.41|
|YalderDetector|5000|16627|0.22|0.43|
|YalderDetector|10000|18884|0.25|0.46|
|YalderDetector|20000|20702|0.28|0.48|
|YalderDetector|50000|21749|0.29|0.48|
|YalderDetector|100000|23503|0.32|0.49|
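
One way a length-versus-time table like this could be produced is sketched below. The detector is stood in for by a plain Function<String, String> that is constructed once and reused for every call, so that setup cost is not counted per call. The class name, the warm-up and iteration counts, and the dummy detector in main are all assumptions for illustration, not Yalder or tika-eval code.

```java
import java.util.function.Function;

final class TimingSketch {

    /** Average milliseconds per call of detector on text, after a warm-up. */
    static double avgMillis(Function<String, String> detector, String text,
                            int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) {
            detector.apply(text); // give the JIT a chance to settle before timing
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            detector.apply(text);
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / iterations;
    }

    public static void main(String[] args) {
        // Dummy stand-in: touches every char, so its cost grows with length.
        Function<String, String> detector = s -> {
            int nonAscii = 0;
            for (int i = 0; i < s.length(); i++) {
                if (s.charAt(i) > 127) {
                    nonAscii++;
                }
            }
            return nonAscii > 0 ? "fra" : "eng";
        };
        StringBuilder sb = new StringBuilder();
        while (sb.length() < 100_000) {
            sb.append("four score and seven years ago ");
        }
        for (int length : new int[] {10, 100, 1000, 10_000, 100_000}) {
            String text = sb.substring(0, length);
            System.out.println(length + "\t" + avgMillis(detector, text, 100, 1000));
        }
    }
}
```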

 



[jira] [Created] (TIKA-2890) Critical security vulnerability in depedencies

2019-06-04 Thread Kyle DuPont (JIRA)
Kyle DuPont created TIKA-2890:
-

 Summary: Critical security vulnerability in depedencies
 Key: TIKA-2890
 URL: https://issues.apache.org/jira/browse/TIKA-2890
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.21
Reporter: Kyle DuPont


The parser dependency jackson-databind:2.9.8 has a critical vulnerability as 
per:

[https://ossindex.sonatype.org/vuln/5bbadb96-496f-4534-a513-7a6396f54029]

This should be bumped to >2.9.9 to resolve this vulnerability.





[jira] [Issue Comment Deleted] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Thomas van Hesteren (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas van Hesteren updated TIKA-2889:
--
Comment: was deleted

(was: Hmm, I already thought so... however, there are no log files in the root 
directory... I will do some error checking tomorrow)



[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Ken Krugler (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856052#comment-16856052
 ] 

Ken Krugler commented on TIKA-2790:
---

Yalder processes the entire string. I thought Optimaize's version of 
LangDetector does some sampling, not sure.
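
For illustration only, here is a minimal Java sketch of what "sampling" could mean for a long input: take a few evenly spaced windows and pass only those to the detector. This is not Optimaize's actual strategy (which the comment above is itself unsure about); the class name `TextSampler` and the parameters `maxChars` and `windows` are hypothetical.

```java
final class TextSampler {

    /**
     * Returns the text unchanged when it is already short enough; otherwise
     * concatenates evenly spaced, non-overlapping windows whose combined
     * length is at most maxChars.
     */
    static String sample(String text, int maxChars, int windows) {
        if (maxChars <= 0 || windows <= 0) {
            throw new IllegalArgumentException("maxChars and windows must be positive");
        }
        if (text.length() <= maxChars) {
            return text;
        }
        int w = Math.min(windows, maxChars);   // never more windows than characters allowed
        int windowSize = maxChars / w;         // at least 1 because w <= maxChars
        StringBuilder sb = new StringBuilder(w * windowSize);
        for (int i = 0; i < w; i++) {
            // Spread window starts from the beginning of the text to its end.
            int start = (w == 1)
                    ? 0
                    : (int) ((long) i * (text.length() - windowSize) / (w - 1));
            sb.append(text, start, start + windowSize);
        }
        return sb.toString();
    }
}
```

Sampling across the whole string bounds the work per document while still looking at text from the beginning, middle, and end, which is one way to avoid deciding on the opening passage alone.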

> Consider switching lang-detection in tika-eval to open-nlp
> --
>
> Key: TIKA-2790
> URL: https://issues.apache.org/jira/browse/TIKA-2790
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Major
> Attachments: fra_mixed_10_0.0_0.txt, langid_20190509.zip, 
> langid_20190510.zip, langid_20190514.zip, langid_20190514_plus_minus_1.zip
>
>






[jira] [Comment Edited] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Thomas van Hesteren (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856041#comment-16856041
 ] 

Thomas van Hesteren edited comment on TIKA-2889 at 6/4/19 7:23 PM:
---

Hmm, I already thought so... however, there are no log files in the root 
directory... I will do some error checking tomorrow


was (Author: thomasvh):
Hmm, I already thought so... however, there are no log files in the root 
directory... 



[jira] [Commented] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Thomas van Hesteren (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856041#comment-16856041
 ] 

Thomas van Hesteren commented on TIKA-2889:
---

Hmm, I already thought so... however, there are no log files in the root 
directory... 



[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856039#comment-16856039
 ] 

Tim Allison commented on TIKA-2790:
---

I was able to get a 4x improvement in speed, which is still slower than Optimaize 
and far, far slower than Yalder.  IIUC, both Optimaize and Yalder do not 
process the full string.  Rather, they sample or have some kind of stopping 
criterion.  I think we can work towards that in our own wrapper of OpenNLP, 
and, hopefully, we can push that upstream back into OpenNLP.
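
A minimal sketch of the stopping-criterion idea, assuming a detector that can score text chunk by chunk. It is not OpenNLP, Optimaize, or Yalder code: the `ChunkScorer` interface, the `Result` class, and the `chunkSize`, `minChars`, and `threshold` parameters are hypothetical names used only to illustrate the control flow.

```java
import java.util.HashMap;
import java.util.Map;

final class EarlyStopLangDetect {

    /** Hypothetical stand-in for a real detector: per-language scores for one chunk. */
    interface ChunkScorer {
        Map<String, Double> score(String chunk);
    }

    /** Best language so far, its share of the accumulated score, and characters consumed. */
    static final class Result {
        final String lang;
        final double confidence;
        final int charsRead;

        Result(String lang, double confidence, int charsRead) {
            this.lang = lang;
            this.confidence = confidence;
            this.charsRead = charsRead;
        }
    }

    static Result detect(String text, ChunkScorer scorer,
                         int chunkSize, int minChars, double threshold) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        Map<String, Double> totals = new HashMap<>();
        String best = null;
        double confidence = 0.0;
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + chunkSize);
            // Accumulate this chunk's scores into the running totals.
            for (Map.Entry<String, Double> e : scorer.score(text.substring(pos, end)).entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Double::sum);
            }
            pos = end;

            // Confidence here is simply the leader's share of all accumulated score.
            double sum = 0.0;
            double bestScore = 0.0;
            best = null;
            for (Map.Entry<String, Double> e : totals.entrySet()) {
                sum += e.getValue();
                if (best == null || e.getValue() > bestScore) {
                    best = e.getKey();
                    bestScore = e.getValue();
                }
            }
            confidence = sum > 0.0 ? bestScore / sum : 0.0;

            // Stop once enough text has been seen and one language clearly leads.
            if (pos >= minChars && confidence >= threshold) {
                break;
            }
        }
        return new Result(best, confidence, pos);
    }
}
```

The trade-off is visible in the loop: a small `minChars` saves the most time, but a document that opens with a short passage in another language can then be labeled by that passage alone. Raising `minChars` or combining the stop rule with sampling reduces that risk at the cost of speed.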



[jira] [Commented] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856036#comment-16856036
 ] 

Tim Allison commented on TIKA-2889:
---

You should see a bunch of logging.  There should be a file called "tika.log" in 
the directory where you kicked off your process.  If you want to hardcode a 
path for it, modify "tika.log" in the log4j.xml file to, e.g.{{}}



[jira] [Commented] (TIKA-2886) StreamingZipContainerDetector fails on XLSX template workbook

2019-06-04 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856015#comment-16856015
 ] 

Hudson commented on TIKA-2886:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #200 (See 
[https://builds.apache.org/job/tika-branch-1x/200/])
TIKA-2886 -- improve streamingzipcontainerdetector (tallison: 
[https://github.com/apache/tika/commit/dbf8ef696336ba3fc8e435834de484e785a4fdd0])
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/pkg/ZipContainerDetectorTest.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
* (add) 
tika-parsers/src/test/resources/test-documents/testEXCEL_macro_enabled_template.xltm
* (add) tika-parsers/src/test/resources/test-documents/testEXCEL_template.xltx
* (add) tika-parsers/src/test/resources/test-documents/testEXCEL_template.xlt
* (edit) CHANGES.txt
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/StreamingZipContainerDetector.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetectorBase.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java


> StreamingZipContainerDetector fails on XLSX template workbook
> -
>
> Key: TIKA-2886
> URL: https://issues.apache.org/jira/browse/TIKA-2886
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.22
>
>
> Reported by Tucker B barbct5 on the user list.





[jira] [Commented] (TIKA-2886) StreamingZipContainerDetector fails on XLSX template workbook

2019-06-04 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855994#comment-16855994
 ] 

Hudson commented on TIKA-2886:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #423 (See 
[https://builds.apache.org/job/tika-2.x-windows/423/])
TIKA-2886 -- improving streaming zip container detection for xltx, xltm 
(tallison: rev b1adfb9c67e6321264f4451af822f999081fc326)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetectorBase.java
* (add) tika-parsers/src/test/resources/test-documents/testEXCEL_template.xltx
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/StreamingZipContainerDetector.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
* (add) 
tika-parsers/src/test/resources/test-documents/testEXCEL_macro_enabled_template.xltm
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java
* (add) tika-parsers/src/test/resources/test-documents/testEXCEL_template.xlt
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/pkg/ZipContainerDetectorTest.java
* (edit) CHANGES.txt




[jira] [Commented] (TIKA-2886) StreamingZipContainerDetector fails on XLSX template workbook

2019-06-04 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855978#comment-16855978
 ] 

Hudson commented on TIKA-2886:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1668 (See 
[https://builds.apache.org/job/Tika-trunk/1668/])
TIKA-2886 -- improving streaming zip container detection for xltx, xltm 
(tallison: 
[https://github.com/apache/tika/commit/b1adfb9c67e6321264f4451af822f999081fc326])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/pkg/ZipContainerDetectorTest.java
* (edit) CHANGES.txt
* (add) tika-parsers/src/test/resources/test-documents/testEXCEL_template.xlt
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetectorBase.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
* (add) tika-parsers/src/test/resources/test-documents/testEXCEL_template.xltx
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/StreamingZipContainerDetector.java
* (add) 
tika-parsers/src/test/resources/test-documents/testEXCEL_macro_enabled_template.xltm




[jira] [Commented] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Thomas van Hesteren (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855955#comment-16855955
 ] 

Thomas van Hesteren commented on TIKA-2889:
---

Can you let me know where those files should be written and when? For now, I 
have updated the software with your file and no log files are written (no 
crashes have been detected yet by my watchdog). So, will it only log when errors 
occur? 



[jira] [Commented] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Thomas van Hesteren (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855919#comment-16855919
 ] 

Thomas van Hesteren commented on TIKA-2889:
---

Thanks, I will modify my code so I hopefully get some debug logging



[jira] [Updated] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2889:
--
Attachment: log4j.xml



[jira] [Updated] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2889:
--
Attachment: (was: original-tika-server-2.0.0-SNAPSHOT.jar)



[jira] [Commented] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855911#comment-16855911
 ] 

Tim Allison commented on TIKA-2889:
---

{{java -Dlog4j.configuration=file:log4j.xml -jar tika-server-2.0.0-SNAPSHOT.jar 
-log debug}}



[jira] [Updated] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2889:
--
Attachment: original-tika-server-2.0.0-SNAPSHOT.jar

> Tika Server keeps crashing
> --
>
> Key: TIKA-2889
> URL: https://issues.apache.org/jira/browse/TIKA-2889
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.18, 1.19, 1.19.1, 1.21
> Environment: Both Ubuntu and Windows have the same bug/issue
>Reporter: Thomas van Hesteren
>Priority: Minor
> Attachments: original-tika-server-2.0.0-SNAPSHOT.jar
>
>
> I have a document processor which sends documents to the Tika Server over 
> cUrl. However, the server crashes multiple times (not document specific). The 
> response I get from cUrl if it happens is as follows:
> Connection error: Couldn't connect to server
>  
> The Tika server is started when the script starts executing. For now, I fixed 
> the issue by making a watcher which restarts the tika server when it crashes. 
> It then processes a few other documents and crashes again (after a few 
> minutes, let's say 5 minutes tops).
>  
> Is there any possibility to catch the exception (if it throws any?)
>  
> A log which shows the crash of the server:
> 04-06-2019 15:49:25|Processing a file of: 52.3kB
> 04-06-2019 15:49:24|Processing a file of: 255.5kB
> 04-06-2019 15:49:24|Processing a file of: 241.6kB
> 04-06-2019 15:49:23|Processing a file of: 37.7kB
> 04-06-2019 15:49:22|Processing a file of: 1.27MB
> 04-06-2019 15:49:21|Processing a file of: 55.8kB
> 04-06-2019 15:49:17|Processing a file of: 114.5kB
> 04-06-2019 15:49:08|Server is not running. Restarting Server. Connection 
> error: Couldn't connect to server
> 04-06-2019 15:49:03|Processing a file of: 41.0kB
> 04-06-2019 15:49:00|Processing a file of: 38.0kB
> 04-06-2019 15:48:59|ProcesPsing a file of: 37.1kB
> 04-06-2019 15:48:59|Processing a file of: 60.2kB
> 04-06-2019 15:48:59|Processing a file of: 280.7kB
> 04-06-2019 15:48:59|Processing a file of: 3.30MB





[jira] [Resolved] (TIKA-2886) StreamingZipContainerDetector fails on XLSX template workbook

2019-06-04 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2886.
---
   Resolution: Fixed
 Assignee: Tim Allison
Fix Version/s: 1.22

Thank you, Tucker B!



[jira] [Commented] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Thomas van Hesteren (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855892#comment-16855892
 ] 

Thomas van Hesteren commented on TIKA-2889:
---

Thanks for your reply. I can't replicate the problem with certain files. If I 
re-run the files, it won't 'crash' on the same file over and over. So, it's not 
the file which is breaking the server. It's some other variable (so, when I run 
the file with tika-app it works fine).

Could you indicate how I could configure the logging of the TIKA-server to a 
certain directory? Then I'm able to monitor the TIKA logs.

I'm aware of the spawnChild mode, however this is not ideal for my situation. 
My watchdog works fine and respawns TIKA straight away, but I would like to 
prevent it from crashing obviously
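
For readers following along, a watchdog of the kind described here can be sketched in Java roughly as below. This is an assumption-laden illustration, not the reporter's script and not part of Tika: the class and method names are hypothetical, and the launch command is supplied by the caller. It probes a TCP port and starts the given command only when nothing answers.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

final class ServerWatchdog {

    /** Returns true if something accepts TCP connections on host:port within timeoutMillis. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /**
     * Probes the port and, if nothing answers, launches the given command.
     * Returns the new process, or null when the server was already up.
     */
    static Process restartIfDown(String host, int port, List<String> command) throws IOException {
        if (isListening(host, port, 1000)) {
            return null;
        }
        // inheritIO() forwards the child's stdout and stderr to the watchdog's
        // console, so anything the JVM prints before exiting is not lost.
        return new ProcessBuilder(command).inheritIO().start();
    }
}
```

A caller might poll `restartIfDown("localhost", 9998, command)` every few seconds, where 9998 is the Tika server's default port and `command` is whatever launches the server locally (for example a `java -jar` invocation of the server jar; the exact jar name is an assumption). A watchdog like this only recovers from a crash; forwarding or capturing the child's output is what helps explain why it happened.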

> Tika Server keeps crashing
> --
>
> Key: TIKA-2889
> URL: https://issues.apache.org/jira/browse/TIKA-2889
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.18, 1.19, 1.19.1, 1.21
> Environment: Both Ubuntu and Windows have the same bug/issue
>Reporter: Thomas van Hesteren
>Priority: Minor
>
> I have a document processor which sends documents to the Tika Server over 
> cUrl. However, the server crashes multiple times (not document specific). The 
> response I get from cUrl if it happens is as follows:
> Connection error: Couldn't connect to server
>  
> The Tika server is started when the script starts executing. For now, I fixed 
> the issue by making a watcher which restarts the tika server when it crashes. 
> It then processes a few other documents and crashes again (after a few 
> minutes, let's say 5 minutes tops).
>  
> Is there any possibility to catch the exception (if it throws any?)
>  
> A log which shows the crash of the server:
> 04-06-2019 15:49:25|Processing a file of: 52.3kB
> 04-06-2019 15:49:24|Processing a file of: 255.5kB
> 04-06-2019 15:49:24|Processing a file of: 241.6kB
> 04-06-2019 15:49:23|Processing a file of: 37.7kB
> 04-06-2019 15:49:22|Processing a file of: 1.27MB
> 04-06-2019 15:49:21|Processing a file of: 55.8kB
> 04-06-2019 15:49:17|Processing a file of: 114.5kB
> 04-06-2019 15:49:08|Server is not running. Restarting Server. Connection 
> error: Couldn't connect to server
> 04-06-2019 15:49:03|Processing a file of: 41.0kB
> 04-06-2019 15:49:00|Processing a file of: 38.0kB
> 04-06-2019 15:48:59|ProcesPsing a file of: 37.1kB
> 04-06-2019 15:48:59|Processing a file of: 60.2kB
> 04-06-2019 15:48:59|Processing a file of: 280.7kB
> 04-06-2019 15:48:59|Processing a file of: 3.30MB





[jira] [Comment Edited] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Thomas van Hesteren (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855892#comment-16855892
 ] 

Thomas van Hesteren edited comment on TIKA-2889 at 6/4/19 4:38 PM:
---

Thanks for your reply. I can't reproduce the problem with certain files. If I 
re-run the files, it won't 'crash' on the same file over and over. So, it's not 
the file which is breaking the server. It's some other variable (so, when I run 
the file with tika-app it works fine).

Could you indicate how I could configure the logging of the TIKA-server to a 
certain directory? Then I'm able to monitor the TIKA logs.

I'm aware of the spawnChild mode, however this is not ideal for my situation. 
My watchdog works fine and respawns TIKA straight away, but I would like to 
prevent it from crashing obviously
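
Independently of tika-server's own logging configuration, a watchdog can keep whatever the server prints before it dies. The following is only a minimal sketch, assuming the watchdog is written in Java and launches the server itself; the class name, the jar name {{tika-server.jar}} and the log path are hypothetical placeholders, not anything taken from this issue.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

final class TikaServerLauncher {

    /** Builds a launcher that appends the child's stdout and stderr to logFile. */
    static ProcessBuilder withLogFile(List<String> command, File logFile) {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout
        pb.redirectOutput(ProcessBuilder.Redirect.appendTo(logFile));
        return pb;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical jar name and log location; adjust both to the local setup.
        List<String> command = Arrays.asList("java", "-jar", "tika-server.jar");
        File logFile = new File("logs", "tika-server-console.log");

        // The redirect target's directory must exist before the process starts.
        File logDir = logFile.getParentFile();
        if (!logDir.isDirectory() && !logDir.mkdirs()) {
            throw new IOException("Could not create log directory " + logDir);
        }

        Process server = withLogFile(command, logFile).start();
        // Blocks until the server process ends; the exit code is a first hint
        // about why it stopped (for example, killed by the OS versus a JVM error).
        int exitCode = server.waitFor();
        System.err.println("Server process exited with code " + exitCode);
    }
}
```

Because the file is opened in append mode, output from successive restarts accumulates in one place, so the last lines before each restart can be compared.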


was (Author: thomasvh):
Thanks for your reply. I can't replicate the problem with certain files. If I 
re-run the files, it won't 'crash' on the same file over and over. So, it's not 
the file which is breaking the server. It's some other variable (so, when I run 
the file with tika-app it works fine).

Could you indicate how I could configure the logging of the TIKA-server to a 
certain directory? Then I'm able to monitor the TIKA logs.

I'm aware of the spawnChild mode, however this is not ideal for my situation. 
My watchdog works fine and respawns TIKA straight away, but I would like to 
prevent it from crashing obviously

> Tika Server keeps crashing
> --
>
> Key: TIKA-2889
> URL: https://issues.apache.org/jira/browse/TIKA-2889
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.18, 1.19, 1.19.1, 1.21
> Environment: Both Ubuntu and Windows have the same bug/issue
>Reporter: Thomas van Hesteren
>Priority: Minor
>
> I have a document processor which sends documents to the Tika Server over 
> cUrl. However, the server crashes multiple times (not document specific). The 
> response I get from cUrl if it happens is as follows:
> Connection error: Couldn't connect to server
>  
> The Tika server is started when the script starts executing. For now, I fixed 
> the issue by making a watcher which restarts the tika server when it crashes. 
> It then processes a few other documents and crashes again (after a few 
> minutes, let's say 5 minutes tops).
>  
> Is there any possibility to catch the exception (if it throws any?)
>  
> A log which shows the crash of the server:
> 04-06-2019 15:49:25|Processing a file of: 52.3kB
> 04-06-2019 15:49:24|Processing a file of: 255.5kB
> 04-06-2019 15:49:24|Processing a file of: 241.6kB
> 04-06-2019 15:49:23|Processing a file of: 37.7kB
> 04-06-2019 15:49:22|Processing a file of: 1.27MB
> 04-06-2019 15:49:21|Processing a file of: 55.8kB
> 04-06-2019 15:49:17|Processing a file of: 114.5kB
> 04-06-2019 15:49:08|Server is not running. Restarting Server. Connection 
> error: Couldn't connect to server
> 04-06-2019 15:49:03|Processing a file of: 41.0kB
> 04-06-2019 15:49:00|Processing a file of: 38.0kB
> 04-06-2019 15:48:59|ProcesPsing a file of: 37.1kB
> 04-06-2019 15:48:59|Processing a file of: 60.2kB
> 04-06-2019 15:48:59|Processing a file of: 280.7kB
> 04-06-2019 15:48:59|Processing a file of: 3.30MB





[jira] [Commented] (TIKA-2886) StreamingZipContainerDetector fails on XLSX template workbook

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855882#comment-16855882
 ] 

Tim Allison commented on TIKA-2886:
---

I made a slight modification to the TestContainerAwareDetector so that it runs 
the StreamingZipContainerDetector against all children of {{application/zip}}.  
This revealed several other formats that need improvement, including Visio, 
among others.

It turns out that we didn't have any xltm or xltx files among our unit tests.  I 
added examples for those.

> StreamingZipContainerDetector fails on XLSX template workbook
> -
>
> Key: TIKA-2886
> URL: https://issues.apache.org/jira/browse/TIKA-2886
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> Reported by Tucker B barbct5 on the user list.





[jira] [Resolved] (TIKA-2887) Build fails with CVE warnings

2019-06-04 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2887.
---
   Resolution: Duplicate
Fix Version/s: 1.22

Thank you for raising this.  Those were fixed in branch_1x and master shortly 
after the 1.21 release.

> Build fails with CVE warnings
> -
>
> Key: TIKA-2887
> URL: https://issues.apache.org/jira/browse/TIKA-2887
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.21
> Environment: Win 10 64bit
> JDK 8
>Reporter: T Craig
>Priority: Major
>  Labels: newbie, security, windows
> Fix For: 1.22
>
>
> When running the initial build script it fails with two CVE warnings.
> [INFO] Apache Tika OSGi bundle 1.21 ... FAILURE [  9.932 s]
> [INFO] BUILD FAILURE
> [ERROR] Failed to execute goal 
> org.sonatype.ossindex.maven:ossindex-maven-plugin:3.0.4:audit 
> (audit-dependencies) on project tika-bundle: Detected 2 vulnerable components:
> [ERROR] com.fasterxml.jackson.core:jackson-databind:jar:2.9.8:compile; 
> https://ossindex.sonatype.org/component/pkg:maven/com.fasterxml.jackson.core/jackson-databind@2.9.8
> [ERROR] * [CVE-2019-12086] Information Exposure (7.5); 
> https://ossindex.sonatype.org/vuln/5bbadb96-496f-4534-a513-7a6396f54029
> [ERROR] c3p0:c3p0:jar:0.9.1.1:compile; 
> https://ossindex.sonatype.org/component/pkg:maven/c3p0/c3p0@0.9.1.1
> [ERROR] * [CVE-2019-5427] Resource Management Errors (7.5); 
> https://ossindex.sonatype.org/vuln/d25f4c21-9e76-4fc2-9d73-3770aa3aec56





[jira] [Commented] (TIKA-2861) MP4Parser not getting gps metadata

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855789#comment-16855789
 ] 

Tim Allison commented on TIKA-2861:
---

No...sorry...no news.  Thank you for the link.  I'll take a look!

> MP4Parser not getting gps metadata
> --
>
> Key: TIKA-2861
> URL: https://issues.apache.org/jira/browse/TIKA-2861
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Reporter: Junran
>Priority: Major
> Attachments: 1.mp4, 2.MOV
>
>
> Hello, MP4Parser is not extracting GPS metadata from videos, although it is 
> extracted for images such as JPEG. I have checked both MP4 and MOV files; all 
> of the files I checked have GPS Exif data embedded in the same fields as in 
> images. Any ideas? Thanks!





[jira] [Commented] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855787#comment-16855787
 ] 

Tim Allison commented on TIKA-2889:
---

tika-server does log quite a bit.  Can you see and/or configure logging so that 
you can see what's going on?

Also, there's a -spawnChild mode that spawns a child process and restarts the 
child if there are problems.

Let's figure out what's going on with straight tika-server first, though.  

If you run tika-app.jar against the files, are you able to replicate the 
problem?  

> Tika Server keeps crashing
> --
>
> Key: TIKA-2889
> URL: https://issues.apache.org/jira/browse/TIKA-2889
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.18, 1.19, 1.19.1, 1.21
> Environment: Both Ubuntu and Windows have the same bug/issue
>Reporter: Thomas van Hesteren
>Priority: Minor
>
> I have a document processor which sends documents to the Tika Server over 
> cUrl. However, the server crashes multiple times (not document specific). The 
> response I get from cUrl if it happens is as follows:
> Connection error: Couldn't connect to server
>  
> The Tika server is started when the script starts executing. For now, I fixed 
> the issue by making a watcher which restarts the tika server when it crashes. 
> It then processes a few other documents and crashes again (after a few 
> minutes, let's say 5 minutes tops).
>  
> Is there any possibility to catch the exception (if it throws any?)
>  
> A log which shows the crash of the server:
> 04-06-2019 15:49:25|Processing a file of: 52.3kB
> 04-06-2019 15:49:24|Processing a file of: 255.5kB
> 04-06-2019 15:49:24|Processing a file of: 241.6kB
> 04-06-2019 15:49:23|Processing a file of: 37.7kB
> 04-06-2019 15:49:22|Processing a file of: 1.27MB
> 04-06-2019 15:49:21|Processing a file of: 55.8kB
> 04-06-2019 15:49:17|Processing a file of: 114.5kB
> 04-06-2019 15:49:08|Server is not running. Restarting Server. Connection 
> error: Couldn't connect to server
> 04-06-2019 15:49:03|Processing a file of: 41.0kB
> 04-06-2019 15:49:00|Processing a file of: 38.0kB
> 04-06-2019 15:48:59|ProcesPsing a file of: 37.1kB
> 04-06-2019 15:48:59|Processing a file of: 60.2kB
> 04-06-2019 15:48:59|Processing a file of: 280.7kB
> 04-06-2019 15:48:59|Processing a file of: 3.30MB





[jira] [Closed] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2019-06-04 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov closed TIKA-879.
--

> Detection problem: message/rfc822 file is detected as text/plain.
> -
>
> Key: TIKA-879
> URL: https://issues.apache.org/jira/browse/TIKA-879
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, mime
>Affects Versions: 1.0, 1.1, 1.2
> Environment: linux 3.2.9
> oracle jdk7, openjdk7, sun jdk6
>Reporter: Konstantin Gribov
>Priority: Major
>  Labels: new-parser
> Fix For: 2.0, 1.18
>
> Attachments: TIKA-879-thunderbird.eml, mbox_email_section.txt, 
> mime_diffs_A_to_B.html
>
>
> When using {{DefaultDetector}}, the detected mime type for {{.eml}} files is 
> different from what is expected (you can test it on {{testRFC822}} and 
> {{testRFC822_base64}} in {{tika-parsers/src/test/resources/test-documents/}}).
> The main reason for this behavior is that only the magic detector really 
> works for such files, even if you set {{CONTENT_TYPE}} in the metadata or an 
> {{.eml}} file name in {{RESOURCE_NAME_KEY}}.
> As I found, {{MediaTypeRegistry.isSpecializationOf("message/rfc822", 
> "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
> works only by magic.





[jira] [Closed] (TIKA-2209) Update PDFBox to 2.0.4

2019-06-04 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov closed TIKA-2209.
---

> Update PDFBox to 2.0.4
> --
>
> Key: TIKA-2209
> URL: https://issues.apache.org/jira/browse/TIKA-2209
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Trivial
> Fix For: 2.0, 1.15
>
>






[jira] [Closed] (TIKA-2681) Upgrade to PDFBox 2.0.11

2019-06-04 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov closed TIKA-2681.
---

> Upgrade to PDFBox 2.0.11
> 
>
> Key: TIKA-2681
> URL: https://issues.apache.org/jira/browse/TIKA-2681
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.18
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Major
> Fix For: 2.0, 1.19
>
>






[jira] [Closed] (TIKA-2622) Upgrade to PDFBox 2.0.10 when available

2019-06-04 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov closed TIKA-2622.
---

> Upgrade to PDFBox 2.0.10 when available
> ---
>
> Key: TIKA-2622
> URL: https://issues.apache.org/jira/browse/TIKA-2622
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Konstantin Gribov
>Priority: Major
>






[jira] [Updated] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Thomas van Hesteren (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas van Hesteren updated TIKA-2889:
--
Description: 
I have a document processor which sends documents to the Tika server using cURL. 
However, the server crashes repeatedly, and the crashes are not tied to a 
specific document. When it happens, the response I get from cURL is:

Connection error: Couldn't connect to server

 

The Tika server is started when the script starts executing. For now, I have 
worked around the issue with a watcher which restarts the Tika server when it 
crashes. The server then processes a few more documents and crashes again 
(after a few minutes, 5 minutes at most).

 

Is there any way to catch the exception (if it throws one)?

 
A log which shows the crash of the server:

04-06-2019 15:49:25|Processing a file of: 52.3kB
04-06-2019 15:49:24|Processing a file of: 255.5kB
04-06-2019 15:49:24|Processing a file of: 241.6kB
04-06-2019 15:49:23|Processing a file of: 37.7kB
04-06-2019 15:49:22|Processing a file of: 1.27MB
04-06-2019 15:49:21|Processing a file of: 55.8kB
04-06-2019 15:49:17|Processing a file of: 114.5kB
04-06-2019 15:49:08|Server is not running. Restarting Server. Connection error: 
Couldn't connect to server
04-06-2019 15:49:03|Processing a file of: 41.0kB
04-06-2019 15:49:00|Processing a file of: 38.0kB
04-06-2019 15:48:59|ProcesPsing a file of: 37.1kB
04-06-2019 15:48:59|Processing a file of: 60.2kB
04-06-2019 15:48:59|Processing a file of: 280.7kB
04-06-2019 15:48:59|Processing a file of: 3.30MB

  was:
I have a document processor which sends documents to the Tika server using cURL. 
However, the server crashes repeatedly, and the crashes are not tied to a 
specific document. When it happens, the response I get from cURL is:

Connection error: Couldn't connect to server

 

The Tika server is started when the script starts executing. For now, I have 
worked around the issue with a watcher which restarts the Tika server when it 
crashes. The server then processes a few more documents and crashes again 
(after a few minutes, 5 minutes at most).

 

Is there any way to catch the exception (if it throws one)?


> Tika Server keeps crashing
> --
>
> Key: TIKA-2889
> URL: https://issues.apache.org/jira/browse/TIKA-2889
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.18, 1.19, 1.19.1, 1.21
> Environment: Both Ubuntu and Windows have the same bug/issue
>Reporter: Thomas van Hesteren
>Priority: Minor
>
> I have a document processor which sends documents to the Tika Server over 
> cUrl. However, the server crashes multiple times (not document specific). The 
> response I get from cUrl if it happens is as follows:
> Connection error: Couldn't connect to server
>  
> The Tika server is started when the script starts executing. For now, I fixed 
> the issue by making a watcher which restarts the tika server when it crashes. 
> It then processes a few other documents and crashes again (after a few 
> minutes, let's say 5 minutes tops).
>  
> Is there any possibility to catch the exception (if it throws any?)
>  
> A log which shows the crash of the server:
> 04-06-2019 15:49:25|Processing a file of: 52.3kB
> 04-06-2019 15:49:24|Processing a file of: 255.5kB
> 04-06-2019 15:49:24|Processing a file of: 241.6kB
> 04-06-2019 15:49:23|Processing a file of: 37.7kB
> 04-06-2019 15:49:22|Processing a file of: 1.27MB
> 04-06-2019 15:49:21|Processing a file of: 55.8kB
> 04-06-2019 15:49:17|Processing a file of: 114.5kB
> 04-06-2019 15:49:08|Server is not running. Restarting Server. Connection 
> error: Couldn't connect to server
> 04-06-2019 15:49:03|Processing a file of: 41.0kB
> 04-06-2019 15:49:00|Processing a file of: 38.0kB
> 04-06-2019 15:48:59|ProcesPsing a file of: 37.1kB
> 04-06-2019 15:48:59|Processing a file of: 60.2kB
> 04-06-2019 15:48:59|Processing a file of: 280.7kB
> 04-06-2019 15:48:59|Processing a file of: 3.30MB





[jira] [Created] (TIKA-2889) Tika Server keeps crashing

2019-06-04 Thread Thomas van Hesteren (JIRA)


Thomas van Hesteren created TIKA-2889:
-

 Summary: Tika Server keeps crashing
 Key: TIKA-2889
 URL: https://issues.apache.org/jira/browse/TIKA-2889
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.21, 1.19.1, 1.19, 1.18
 Environment: Both Ubuntu and Windows have the same bug/issue
Reporter: Thomas van Hesteren


I have a document processor which sends documents to the Tika server using cURL. 
However, the server crashes repeatedly, and the crashes are not tied to a 
specific document. When it happens, the response I get from cURL is:

Connection error: Couldn't connect to server

 

The Tika server is started when the script starts executing. For now, I have 
worked around the issue with a watcher which restarts the Tika server when it 
crashes. The server then processes a few more documents and crashes again 
(after a few minutes, 5 minutes at most).

 

Is there any way to catch the exception (if it throws one)?
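
The watcher described above needs some way to decide that the server is gone. One common approach is a plain TCP connect with a timeout. This is a rough sketch, not the reporter's actual watcher: the class and method names are hypothetical, and port 9998 is assumed to be the port the server listens on (it is tika-server's usual default, but check the local setup).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class PortProbe {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, or unreachable host: treat all as "down".
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed host and port; adjust to wherever the server is bound.
        boolean up = isListening("localhost", 9998, 2000);
        System.out.println(up ? "server is accepting connections"
                              : "server is not reachable");
    }
}
```

A probe like this only shows that something accepts connections on the port. It says nothing about why a crash happened, so it complements the server logs rather than replacing them.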


