[jira] [Commented] (TIKA-3510) tika-parser-scientific-module seems to embbed many dependencies

2021-08-11 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397680#comment-17397680
 ] 

Hudson commented on TIKA-3510:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #308 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/308/])
TIKA-3510 -- further fixes (tallison: 
[https://github.com/apache/tika/commit/a2b21f85817333c4e8396713069e6b389899af82])
* (edit) 
tika-parsers/tika-parsers-extended/tika-parsers-extended-integration-tests/pom.xml
* (edit) tika-parsers/tika-parsers-extended/tika-parser-sqlite3-package/pom.xml
* (edit) 
tika-parsers/tika-parsers-extended/tika-parser-scientific-package/pom.xml
* (edit) tika-parsers/tika-parsers-extended/tika-parser-sqlite3-module/pom.xml
* (edit) 
tika-parsers/tika-parsers-extended/tika-parser-scientific-module/pom.xml


> tika-parser-scientific-module seems to embbed many dependencies
> ---
>
> Key: TIKA-3510
> URL: https://issues.apache.org/jira/browse/TIKA-3510
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.0.0
> Reporter: Thomas Mortagne
> Priority: Major
>
> tika-parser-scientific-module 2.0.0 contains many files from other artifacts:
> * joda-time
> * slf4j
> * commons-io
> * ...
> Is that really expected?
> tika-parser-sqlite3-module seems to be affected too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3510) tika-parser-scientific-module seems to embbed many dependencies

2021-08-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397641#comment-17397641
 ] 

Tim Allison commented on TIKA-3510:
---

[~tmortagne], please take a look and see if this will meet your needs.  If you 
can recommend a more elegant solution, I'd appreciate it.  I'm not thrilled 
with it as is.

Thank you [~kkrugler] for your feedback!  I went with 1/2 of it for now.  
Onwards!



[jira] [Commented] (TIKA-3502) General upgrades for 2.0.1

2021-08-11 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397535#comment-17397535
 ] 

Hudson commented on TIKA-3502:
--

FAILURE: Integrated in Jenkins build Tika » tika-main-jdk8 #306 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/306/])
TIKA-3502 -- general upgrades for the next 2.x version. (tallison: 
[https://github.com/apache/tika/commit/9e02bbfe3234bc8d31e4f20fe195ada54163514b])
* (edit) tika-parent/pom.xml


> General upgrades for 2.0.1
> --
>
> Key: TIKA-3502
> URL: https://issues.apache.org/jira/browse/TIKA-3502
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>






[jira] [Commented] (TIKA-3522) Reduce calls to TikaConfig.getDefaultConfig

2021-08-11 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397534#comment-17397534
 ] 

Hudson commented on TIKA-3522:
--

FAILURE: Integrated in Jenkins build Tika » tika-main-jdk8 #306 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/306/])
TIKA-3522 -- reduce use of TikaConfig.getDefaultConfig (tallison: 
[https://github.com/apache/tika/commit/0139e71e3fb27ee70d73c92764d6b7a3fdb56462])
* (edit) 
tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparer.java
* (edit) 
tika-eval/tika-eval-core/src/test/java/org/apache/tika/eval/core/util/MimeUtilTest.java
* (edit) tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-jdbc-commons/src/main/java/org/apache/tika/parser/jdbc/JDBCTableReader.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/TestParsers.java
* (edit) 
tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/AbstractProfiler.java
* (edit) 
tika-parsers/tika-parsers-extended/tika-parser-scientific-module/src/main/java/org/apache/tika/parser/isatab/ISATabUtils.java
* (edit) 
tika-parsers/tika-parsers-extended/tika-parser-scientific-module/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java


> Reduce calls to TikaConfig.getDefaultConfig
> ---
>
> Key: TIKA-3522
> URL: https://issues.apache.org/jira/browse/TIKA-3522
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
>
> This is expensive, and we can reduce calls to it in some of our unit tests.
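
One common way to cut down on repeated expensive construction is to build the object once and share it. The following is only a minimal sketch of that idea using the initialization-on-demand holder idiom; `ExpensiveConfig`, `SharedConfigSketch`, and the `LOADS` counter are hypothetical stand-ins and are not Tika classes or the change made for this issue.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SharedConfigSketch {

    /** Counts how often the expensive constructor ran; for demonstration only. */
    static final AtomicInteger LOADS = new AtomicInteger();

    /** Hypothetical stand-in for a configuration object that is costly to build. */
    static final class ExpensiveConfig {
        ExpensiveConfig() {
            LOADS.incrementAndGet();
        }
    }

    /** The JVM initializes Holder, and therefore INSTANCE, once on first access. */
    private static final class Holder {
        static final ExpensiveConfig INSTANCE = new ExpensiveConfig();
    }

    /** Returns the single shared instance instead of building a new one per call. */
    static ExpensiveConfig shared() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        shared();
        shared();
        System.out.println("constructor calls: " + LOADS.get());
    }
}
```

Tests that only need a default configuration could then call `shared()` rather than constructing their own copy, assuming the object is safe to share between them.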





[jira] [Commented] (TIKA-3521) Move checkActive out of fetchemitworkers within AsyncProcessor

2021-08-11 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397533#comment-17397533
 ] 

Hudson commented on TIKA-3521:
--

FAILURE: Integrated in Jenkins build Tika » tika-main-jdk8 #306 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/306/])
TIKA-3521 -- move check active outside of the parse threads (tallison: 
[https://github.com/apache/tika/commit/76458ffddd984b699bad59a838fdc239546bdb69])
* (edit) tika-core/src/main/java/org/apache/tika/pipes/async/AsyncProcessor.java


> Move checkActive out of fetchemitworkers within AsyncProcessor
> --
>
> Key: TIKA-3521
> URL: https://issues.apache.org/jira/browse/TIKA-3521
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Trivial
> Fix For: 2.0.1
>
>
> The heartbeat check in AsyncProcessor is carried out by the parse threads.  
> However, this check should continue through the life of the object.  The 
> parse threads may complete, but the emitter threads may still be active.  
> Let's move this to a separate thread.
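
The idea described above, a periodic check that is owned by its own thread rather than by the workers, can be sketched with the JDK's `ScheduledExecutorService`. This is an illustration under assumptions, not the AsyncProcessor code: the class name, the empty worker, and the 10 ms period are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class HeartbeatSketch {

    /**
     * Starts a heartbeat on its own scheduler thread, lets a short-lived worker
     * finish, and reports whether the heartbeat kept firing afterwards.
     */
    static boolean heartbeatOutlivesWorker() {
        CountDownLatch beatsAfterWorker = new CountDownLatch(3);
        AtomicBoolean workerDone = new AtomicBoolean(false);
        ScheduledExecutorService heartbeat = Executors.newSingleThreadScheduledExecutor();
        try {
            // The check belongs to the scheduler, so it does not stop when a worker exits.
            // Note: if the task ever throws, scheduleAtFixedRate suppresses later runs.
            heartbeat.scheduleAtFixedRate(() -> {
                if (workerDone.get()) {
                    beatsAfterWorker.countDown();
                }
            }, 0, 10, TimeUnit.MILLISECONDS);

            Thread worker = new Thread(() -> { /* pretend to parse, then exit */ });
            worker.start();
            worker.join();
            workerDone.set(true);

            // Wait for three beats that happened after the worker was gone.
            return beatsAfterWorker.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            heartbeat.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("heartbeat outlived worker: " + heartbeatOutlivesWorker());
    }
}
```

Tying the scheduler's lifetime to the owning object (here, the `finally` block) is what lets the check run for the life of that object rather than the life of any one worker.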





[jira] [Commented] (TIKA-3515) Tika CLI -t should use UTF-8 as default output encoding

2021-08-11 Thread Jira


[ 
https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397523#comment-17397523
 ] 

Luís Filipe Nassif commented on TIKA-3515:
--

??I think we should also deprecate the initialization of WriteOutContentHandler 
and ToTextContentHandler with only an OutputStream because these call 
Charset.defaultCharset().??

Agreed, thank you [~tallison]!
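
To illustrate the concern with default charsets: in the JDK, `new OutputStreamWriter(out)` falls back to `Charset.defaultCharset()`, which depends on the platform, whereas passing a charset makes the output encoding explicit. The sketch below uses only JDK classes; `ExplicitCharsetSketch` and its `write` helper are hypothetical names and not part of Tika.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetSketch {

    /** Writes text through a Writer that uses the given charset, not the platform default. */
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String seoul = "\uC11C\uC6B8"; // "Seoul" in Hangul
        byte[] utf8 = write(seoul, StandardCharsets.UTF_8);
        byte[] platform = write(seoul, Charset.defaultCharset());
        System.out.println("UTF-8 bytes: " + utf8.length);
        System.out.println(Charset.defaultCharset() + " bytes: " + platform.length);
    }
}
```

On a JVM whose default charset cannot represent Hangul, the second call can silently lose the characters, which is why an explicit UTF-8 default is the safer choice for text output.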

> Tika CLI -t should use UTF-8 as default output encoding
> ---
>
> Key: TIKA-3515
> URL: https://issues.apache.org/jira/browse/TIKA-3515
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 2.0.0, 1.27
> Environment: Windows 10, Liberica OpenJDK FULL x64 1.8.0_302
> Reporter: Luís Filipe Nassif
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 2.0.1
>
> Attachments: Korean lessons_ Lesson 2 – Learnkorean.com.pdf, 
> LIVE-Seoul-ntfs-utf-16-be.txt, LIVE-Seoul-ntfs-utf-16-le.txt, 
> LIVE-Seoul-ntfs-utf-8.txt, LIVE-Seoul-ntfs-utf-8.txt_-x_output.xml, 
> LIVE-Seoul-ntfs-utf-8_-t_output.txt, Screen Shot 2021-08-06 at 5.50.04 
> PM.png, Screen Shot 2021-08-06 at 5.50.21 PM.png, Screen Shot tika-app.png, 
> image-2021-08-09-14-37-30-552.png, image-2021-08-09-14-38-26-763.png
>
>
> Some Korean characters are extracted as squares. The encodings of the 
> plain-text files are detected correctly. Maybe this is related to the content 
> handler (just a guess). I'll attach the triggering files.





[jira] [Commented] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-08-11 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397511#comment-17397511
 ] 

Hudson commented on TIKA-3489:
--

UNSTABLE: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #144 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/144/])
TIKA-3489 -- detect robots.txt files as text/x-robots (tallison: 
[https://github.com/apache/tika/commit/27e7eac5fc7c2122076237c191a2bd0aa2748aa4])
* (add) tika-parsers/src/test/resources/test-documents/testRobots.txt
* (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> Robots.txt files frequently identified as message/rfc822
> 
>
> Key: TIKA-3489
> URL: https://issues.apache.org/jira/browse/TIKA-3489
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 2.0.0, 1.25, 1.26, 1.27
> Reporter: Sebastian Nagel
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 2.0.1, 1.27.1
>
> Attachments: robots.txt
>
>
> The Tika MIME detector recognizes a robots.txt file as "message/rfc822" if 
> the file starts with a "User-Agent" rule and also contains a second rule not 
> too far from the beginning, e.g.:
> {noformat}
> User-Agent: goodbot
> Disallow:
> User-Agent: badbot
> Disallow: /
> {noformat}
> The change 
> [7769a2b|https://github.com/apache/tika/commit/7769a2b4fba2b4af7127eba0c7694f663fd97a13]
>  requires that two different clauses are matched. However, the two 
> occurrences of "User-Agent:" (initial and after a new line) are treated as 
> different instead of equivalent matches.
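
The difference between counting matches and counting distinct clauses can be shown with a small sketch. It is not Tika's magic-matching code: the class, its methods, and the short `MAIL_HEADERS` list are hypothetical, and the list is illustrative rather than the set of headers Tika actually checks.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class DistinctHeaderSketch {

    // Illustrative list only; not the headers used in tika-mimetypes.xml.
    private static final Set<String> MAIL_HEADERS = new HashSet<>(Arrays.asList(
            "from", "to", "subject", "received", "message-id", "user-agent"));

    /** Counts every line that starts with a known mail header, repeats included. */
    static int countMatches(String text) {
        int count = 0;
        for (String line : text.split("\n")) {
            if (mailHeaderName(line) != null) {
                count++;
            }
        }
        return count;
    }

    /** Counts distinct mail header names, so a repeated header counts once. */
    static int countDistinctMatches(String text) {
        Set<String> seen = new HashSet<>();
        for (String line : text.split("\n")) {
            String name = mailHeaderName(line);
            if (name != null) {
                seen.add(name);
            }
        }
        return seen.size();
    }

    /** Treats text as mail only if two different headers are present. */
    static boolean looksLikeMail(String text) {
        return countDistinctMatches(text) >= 2;
    }

    /** Returns the lower-cased header name, or null if the line is not a known mail header. */
    private static String mailHeaderName(String line) {
        int colon = line.indexOf(':');
        if (colon <= 0) {
            return null;
        }
        String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
        return MAIL_HEADERS.contains(name) ? name : null;
    }

    public static void main(String[] args) {
        String robots = "User-Agent: goodbot\nDisallow:\nUser-Agent: badbot\nDisallow: /\n";
        System.out.println("matches: " + countMatches(robots));
        System.out.println("distinct: " + countDistinctMatches(robots));
    }
}
```

With the robots.txt sample from the report, the two "User-Agent" lines give two matches but only one distinct header, so a rule that requires two different clauses would no longer fire.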





[jira] [Resolved] (TIKA-3521) Move checkActive out of fetchemitworkers within AsyncProcessor

2021-08-11 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3521.
---
Fix Version/s: 2.0.1
Assignee: Tim Allison
Resolution: Fixed



[jira] [Created] (TIKA-3522) Reduce calls to TikaConfig.getDefaultConfig

2021-08-11 Thread Tim Allison (Jira)
Tim Allison created TIKA-3522:
-

Summary: Reduce calls to TikaConfig.getDefaultConfig
Key: TIKA-3522
URL: https://issues.apache.org/jira/browse/TIKA-3522
Project: Tika
Issue Type: Task
Reporter: Tim Allison


This is expensive, and we can reduce calls to it in some of our unit tests.





[jira] [Commented] (TIKA-3515) Tika CLI -t should use UTF-8 as default output encoding

2021-08-11 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397490#comment-17397490
 ] 

Hudson commented on TIKA-3515:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #305 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/305/])
TIKA-3515 -- Tika CLI -t should use UTF-8 as default output encoding (tallison: 
[https://github.com/apache/tika/commit/c792036e618f71fca851fd2ec90e8d23aaffd3d5])
* (edit) tika-core/src/main/java/org/apache/tika/sax/ToTextContentHandler.java
* (edit) tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* (edit) 
tika-parsers/tika-parsers-ml/tika-dl/src/main/java/org/apache/tika/dl/imagerec/DL4JInceptionV3Net.java
* (edit) 
tika-core/src/test/java/org/apache/tika/sax/RichTextContentHandlerTest.java
* (edit) 
tika-parsers/tika-parsers-ml/tika-parser-nlp-module/src/test/java/org/apache/tika/parser/ner/NamedEntityParserTest.java
* (edit) CHANGES.txt
* (edit) tika-core/src/main/java/org/apache/tika/sax/WriteOutContentHandler.java
* (edit) 
tika-parsers/tika-parsers-ml/tika-age-recogniser/src/test/java/org/apache/tika/parser/recognition/AgeRecogniserTest.java
* (edit) 
tika-parsers/tika-parsers-ml/tika-parser-nlp-module/src/test/java/org/apache/tika/parser/sentiment/SentimentAnalysisParserTest.java




[jira] [Commented] (TIKA-3510) tika-parser-scientific-module seems to embbed many dependencies

2021-08-11 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397488#comment-17397488
 ] 

Hudson commented on TIKA-3510:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #305 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/305/])
TIKA-3510 -- separate out modules/packages for tika-parsers-extended (tallison: 
[https://github.com/apache/tika/commit/509748b336a2ddc368a237d026fefeee57300325])
* (edit) 
tika-parsers/tika-parsers-extended/tika-parser-scientific-module/pom.xml
* (edit) tika-parsers/tika-parsers-extended/pom.xml
* (add) tika-parsers/tika-parsers-extended/tika-parser-sqlite3-package/pom.xml
* (edit) CHANGES.txt
* (edit) tika-parsers/tika-parsers-extended/tika-parser-sqlite3-module/pom.xml
* (edit) pom.xml
* (add) 
tika-parsers/tika-parsers-extended/tika-parser-scientific-package/pom.xml




[jira] [Commented] (TIKA-3520) Revert rendering only non-text elements in auto mode for PDFs

2021-08-11 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397489#comment-17397489
 ] 

Hudson commented on TIKA-3520:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #305 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/305/])
TIKA-3520 -- change default rendering option to ALL (tallison: 
[https://github.com/apache/tika/commit/0bf273a0b3635f9399c027dce7c031088abfb0e9])
* (edit) CHANGES.txt
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java


> Revert rendering only non-text elements in auto mode for PDFs
> -
>
> Key: TIKA-3520
> URL: https://issues.apache.org/jira/browse/TIKA-3520
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Major
> Fix For: 2.0.1
>
>
> In Tika 2.0.0, we changed the default behavior for the AUTO mode to render 
> only non-text elements.  I now think we should revert this and render the 
> full page, including text elements, until we can come up with a better 
> decision process for automatically determining whether it would be better to 
> render the full page or only non-text elements.





Re: versions?

2021-08-11 Thread Nick Burch

On Wed, 11 Aug 2021, Tim Allison wrote:
> A)  I think we should maintain the 1.x branch and continue to put out
> bug fixes for a bit. Any objections to nominally calling the next
> release 1.27.1 on JIRA at least?

I agree we should probably try to keep 1.x going for at least a few 
months, to give people a chance to upgrade and make the associated updates 
for the breaking changes. If nothing else, there are bound to be 
dependencies that need updates for security issues!

I'm +0 on not backporting new features or even new mime types, just bug 
and security fixes.

I don't mind whether we call the next release 1.28 or 1.27.1.

> B) We've made quite a few changes in the main branch since the release
> of 2.0.0.  Would there be any objections to incrementing the MINOR
> version for the next release: 2.1.0?

I think 2.1.0 is probably worth using, given that most users will need to 
read the release notes, and some (but not all) users will need to make 
changes for the changed defaults, etc.


Nick


versions?

2021-08-11 Thread Tim Allison
All,
Two questions:

A)  I think we should maintain the 1.x branch and continue to put out
bug fixes for a bit. Any objections to nominally calling the next
release 1.27.1 on JIRA at least?

B) We've made quite a few changes in the main branch since the release
of 2.0.0.  Would there be any objections to incrementing the MINOR
version for the next release: 2.1.0?

Thank you.

Best,

  Tim

P.S. Apologies for the delay in the release of the next 2.x.  I've
been busy with other items.  I should have time to start the release
process tomorrow or so.


[jira] [Resolved] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-08-11 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3489.
---
Fix Version/s: 2.0.1, 1.27.1
Assignee: Tim Allison
Resolution: Fixed

Thank you, all!



[jira] [Resolved] (TIKA-3520) Revert rendering only non-text elements in auto mode for PDFs

2021-08-11 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3520.
---
Fix Version/s: 2.0.1
Assignee: Tim Allison
Resolution: Fixed



[jira] [Resolved] (TIKA-3515) Tika CLI -t should use UTF-8 as default output encoding

2021-08-11 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3515.
---
Fix Version/s: 2.0.1
Assignee: Tim Allison
Resolution: Fixed



[jira] [Created] (TIKA-3521) Move checkActive out of fetchemitworkers within AsyncProcessor

2021-08-11 Thread Tim Allison (Jira)
Tim Allison created TIKA-3521:
-

Summary: Move checkActive out of fetchemitworkers within AsyncProcessor
Key: TIKA-3521
URL: https://issues.apache.org/jira/browse/TIKA-3521
Project: Tika
Issue Type: Task
Reporter: Tim Allison


The heartbeat check in AsyncProcessor is carried out by the parse threads.  
However, this check should continue through the life of the object.  The parse 
threads may complete, but the emitter threads may still be active.  Let's move 
this to a separate thread.





[jira] [Resolved] (TIKA-3483) Implement a network policy for Helm Chart

2021-08-11 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved TIKA-3483.

Resolution: Fixed

> Implement a network policy for Helm Chart
> -
>
> Key: TIKA-3483
> URL: https://issues.apache.org/jira/browse/TIKA-3483
> Project: Tika
> Issue Type: Improvement
> Components: helm
> Reporter: Lewis John McGibbney
> Priority: Major
> Fix For: 2.0.1
>
>
> See https://github.com/apache/tika-helm/pull/5 for context





[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397440#comment-17397440
 ] 

Tim Allison commented on TIKA-3519:
---

Can you share an example file with me?

> Wonder if you can add a feature for Tika parser to stop reading  metadata and 
> body content if certain amount of memory or body content has reached
> --
>
> Key: TIKA-3519
> URL: https://issues.apache.org/jira/browse/TIKA-3519
> Project: Tika
> Issue Type: Wish
> Components: detector
> Affects Versions: 1.25, 1.26
> Environment: Linux
> Reporter: Xiaohong Yang
> Priority: Major
>
> We use org.apache.tika.parser.AutoDetectParser to get the metadata and body 
> content of MS Office files.  We encountered the following exception with some 
> files:
>  
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
> array of length 14523048, but 500 is the maximum for this record type. If 
> the file is not corrupt, please open an issue on bugzilla to request 
> increasing the maximum allowable size for this record type. As a temporary 
> workaround, consider setting a higher override value with 
> IOUtils.setByteArrayMaxOverride()
>  
> To resolve the problem we set byteArrayMaxOverride in the tika-config.xml 
> file as follows
>  
>   
>  
>     type="int">2000
>  
>   
>  
> This helped to parse some files that had failed previously, but other files 
> still failed, so we then increased the value to 200 MB and later to 500 MB.
>
> Some files may still fail with byteArrayMaxOverride set to 500 MB, so we 
> wonder whether you could add a feature that makes the Tika parser stop 
> reading metadata and body content once a certain amount of memory or body 
> content has been reached.  The parser would return the metadata and body 
> content obtained so far, along with a warning message to the caller.  This 
> would help us get the metadata and body content from files that require a 
> lot of memory.  Without this feature we may not be able to parse some files 
> at all: once byteArrayMaxOverride is set high enough to avoid the failure 
> above, they fail elsewhere with an out-of-memory error.  With this feature we 
> would get truncated body content for some files, which is better than 
> getting nothing.  We already truncate the body content ourselves when it is 
> too large, so we do not mind if it is truncated once it reaches a certain 
> amount.





[jira] [Updated] (TIKA-3483) Implement a network policy for Helm Chart

2021-08-11 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-3483:
---
Fix Version/s: 2.0.1 (was: 2.0.0-BETA)



[jira] [Commented] (TIKA-3483) Implement a network policy for Helm Chart

2021-08-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397439#comment-17397439
 ] 

ASF GitHub Bot commented on TIKA-3483:
--

lewismc merged pull request #5:
URL: https://github.com/apache/tika-helm/pull/5


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement a network policy for Helm Chart
> -
>
> Key: TIKA-3483
> URL: https://issues.apache.org/jira/browse/TIKA-3483
> Project: Tika
> Issue Type: Improvement
> Components: helm
> Reporter: Lewis John McGibbney
> Priority: Major
> Fix For: 2.0.0-BETA
>
>
> See https://github.com/apache/tika-helm/pull/5 for context





[GitHub] [tika-helm] lewismc merged pull request #5: [TIKA-3483] Implement a network policy for Helm Chart

2021-08-11 Thread GitBox


lewismc merged pull request #5:
URL: https://github.com/apache/tika-helm/pull/5


   






[jira] [Commented] (TIKA-3483) Implement a network policy for Helm Chart

2021-08-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397438#comment-17397438
 ] 

ASF GitHub Bot commented on TIKA-3483:
--

lewismc commented on pull request #5:
URL: https://github.com/apache/tika-helm/pull/5#issuecomment-896946612


   @bynare apologies I just ended up doing other things... I wasn't ignoring 
this. Thanks for your patience.
   LGTM






[GitHub] [tika-helm] lewismc commented on pull request #5: [TIKA-3483] Implement a network policy for Helm Chart

2021-08-11 Thread GitBox


lewismc commented on pull request #5:
URL: https://github.com/apache/tika-helm/pull/5#issuecomment-896946612


   @bynare apologies I just ended up doing other things... I wasn't ignoring 
this. Thanks for your patience.
   LGTM






[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-11 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397427#comment-17397427
 ] 

Xiaohong Yang commented on TIKA-3519:
-

Can you check whether you can catch the above-mentioned ByteArrayMaxOverride 
error (Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate 
an array of length 14523048, but 500 is the maximum for this record type…), 
stop parsing, and then write the available body content to the ContentHandler 
so that we can have the body content parsed so far?
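
The behaviour requested here, keeping whatever text was produced before a failure or a size limit, can be sketched in plain Java. This is only an illustration of the idea and does not use or describe Tika's API: `ChunkSource`, `Result`, and `extract` are hypothetical names, and the sketch handles runtime exceptions only, not `OutOfMemoryError`.

```java
import java.util.function.Consumer;

final class PartialContentSketch {

    /** Text collected so far plus an optional warning; warning is null on normal completion. */
    static final class Result {
        final String text;
        final String warning;

        Result(String text, String warning) {
            this.text = text;
            this.warning = warning;
        }
    }

    /** Hypothetical stand-in for a parser that streams text chunks and may fail midway. */
    interface ChunkSource {
        void emitTo(Consumer<String> sink);
    }

    /** Collects at most maxChars characters and keeps them even if the source fails. */
    static Result extract(ChunkSource source, int maxChars) {
        StringBuilder buffer = new StringBuilder();
        try {
            source.emitTo(chunk -> {
                int room = maxChars - buffer.length();
                if (chunk.length() > room) {
                    // Keep the part that still fits, then stop the source.
                    buffer.append(chunk, 0, Math.max(room, 0));
                    throw new IllegalStateException(
                            "character limit of " + maxChars + " reached");
                }
                buffer.append(chunk);
            });
            return new Result(buffer.toString(), null);
        } catch (RuntimeException e) {
            // Return what was gathered before the limit or the parser failure.
            return new Result(buffer.toString(), e.getMessage());
        }
    }

    public static void main(String[] args) {
        Result r = extract(sink -> {
            sink.accept("hello ");
            sink.accept("world");
        }, 8);
        System.out.println(r.text + " | warning: " + r.warning);
    }
}
```

Returning the warning alongside the text lets the caller tell a complete extraction from a truncated one without losing the partial result.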

> Wonder if you can add a feature for Tika parser to stop reading  metadata and 
> body content if certain amount of memory or body content has reached
> --
>
> Key: TIKA-3519
> URL: https://issues.apache.org/jira/browse/TIKA-3519
> Project: Tika
>  Issue Type: Wish
>  Components: detector
>Affects Versions: 1.25, 1.26
> Environment: Linux
>Reporter: Xiaohong Yang
>Priority: Major
>
> We use  org.apache.tika.parser.AutoDetectParser to get the metadata and body 
> content of MS office files.  We encountered the following exception with some 
> files
>  
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
> array of length 14523048, but 500 is the maximum for this record type. If 
> the file is not corrupt, please open an issue on bugzilla to request 
> increasing the maximum allowable size for this record type. As a temporary 
> workaround, consider setting a higher override value with 
> IOUtils.setByteArrayMaxOverride()
>  
> To resolve the problem we set byteArrayMaxOverride in the tika-config.xml 
> file as follows
>  
>   
>  
>     type="int">2000
>  
>   
>  
> This helped to parse some files that failed previously, but some other files 
> still failed, so we increased the value to 200 MB and then 500 MB.
>  
> Some files may still fail with byteArrayMaxOverride set to 500 MB, so 
> we wonder if you can add a feature to the Tika parser that makes it stop reading 
> metadata and body content once a certain amount of memory or body content has 
> been reached. The parser would return the metadata and body content obtained so 
> far, and a warning message would be returned to the caller if this happens. This 
> would help us get the metadata and body content from files that 
> require a lot of memory. Without this feature we may not be able to parse some 
> files at all: once we set byteArrayMaxOverride to very high values and 
> the above-mentioned failure no longer happens, those files fail somewhere else with an 
> out-of-memory error. With this feature we would get 
> truncated body content for some files, but that is better than getting nothing. 
> We truncate the body content ourselves if it is too large anyway, so 
> we do not mind if the body content is truncated once it reaches a certain amount.





[jira] [Created] (TIKA-3520) Revert rendering only non-text elements in auto mode for PDFs

2021-08-11 Thread Tim Allison (Jira)
Tim Allison created TIKA-3520:
-

 Summary: Revert rendering only non-text elements in auto mode for 
PDFs
 Key: TIKA-3520
 URL: https://issues.apache.org/jira/browse/TIKA-3520
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


In Tika 2.0.0, we changed the default behavior for the AUTO mode to render only 
non-text elements.  I now think we should revert this and render the full page, 
including text elements, until we can come up with a better decision process for 
automatically determining whether it would be better to render the full page or 
only non-text elements.





[jira] [Updated] (TIKA-3515) Tika CLI -t should use UTF-8 as default output encoding

2021-08-11 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3515:
--
Affects Version/s: (was: 2.0.0-BETA)
   2.0.0

> Tika CLI -t should use UTF-8 as default output encoding
> ---
>
> Key: TIKA-3515
> URL: https://issues.apache.org/jira/browse/TIKA-3515
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 1.27
> Environment: Windows 10, Liberica OpenJDK FULL x64 1.8.0_302
>Reporter: Luís Filipe Nassif
>Priority: Minor
> Attachments: Korean lessons_ Lesson 2 – Learnkorean.com.pdf, 
> LIVE-Seoul-ntfs-utf-16-be.txt, LIVE-Seoul-ntfs-utf-16-le.txt, 
> LIVE-Seoul-ntfs-utf-8.txt, LIVE-Seoul-ntfs-utf-8.txt_-x_output.xml, 
> LIVE-Seoul-ntfs-utf-8_-t_output.txt, Screen Shot 2021-08-06 at 5.50.04 
> PM.png, Screen Shot 2021-08-06 at 5.50.21 PM.png, Screen Shot tika-app.png, 
> image-2021-08-09-14-37-30-552.png, image-2021-08-09-14-38-26-763.png
>
>
> Some Korean chars are extracted as squares. The encodings of plain texts are 
> detected correctly. Maybe this is related with the content handler (just a 
> guess). I'll attach the triggering files.





[jira] [Commented] (TIKA-3515) Tika CLI -t should use UTF-8 as default output encoding

2021-08-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397391#comment-17397391
 ] 

Tim Allison commented on TIKA-3515:
---

If we're going to make this change in tika-app, I think we should also 
deprecate the initialization of WriteOutContentHandler and ToTextContentHandler 
with only an OutputStream because these call Charset.defaultCharset().

We can also clean up the default charset usage in some of our unit tests.  I'm concerned 
about what might happen if we try to change them in the translators...I'll 
leave those alone.

If anyone has objections to any of the above, let me know.
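
As an illustration of why relying on the platform default charset is a concern, here is a minimal sketch that uses only the JDK and no Tika classes. The class name ExplicitCharsetDemo and the helper encode are hypothetical and exist only for this example. It writes the same Hangul text through an OutputStreamWriter with two different charsets: with UTF-8 the text round-trips, while with a charset that cannot represent Hangul the characters are replaced on output.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Hypothetical helper: writes text to an in-memory stream using the given charset.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String seoul = "\uC11C\uC6B8"; // "Seoul" written in Hangul

        // Explicit UTF-8: the output does not depend on the machine's settings.
        byte[] utf8 = encode(seoul, StandardCharsets.UTF_8);
        System.out.println("UTF-8 round trip ok: "
                + new String(utf8, StandardCharsets.UTF_8).equals(seoul));

        // A charset without Hangul: unmappable characters are replaced when written.
        byte[] latin1 = encode(seoul, StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1 round trip ok: "
                + new String(latin1, StandardCharsets.ISO_8859_1).equals(seoul));

        // This is what a writer created without an explicit charset would use.
        System.out.println("Platform default: " + Charset.defaultCharset());
    }
}
```

The same reasoning would apply to any handler that wraps an OutputStream: passing the charset explicitly keeps the output identical across platforms.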






[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397365#comment-17397365
 ] 

Tim Allison commented on TIKA-3519:
---

If the underlying parser (Apache POI in this case) writes content to the 
contenthandler before a writelimitexception, you _should_ be able to retrieve 
that text and metadata.

If the underlying parser needs to parse the full file and hits this exception 
before writing to the contenthandler, then there's not much we can do.
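
The distinction above can be sketched with the SAX classes that ship with the JDK. This is not Tika or POI code: fakeParse is a hypothetical stand-in for a parser that streams some text and then fails, and TextCollector is a hypothetical handler that accumulates character events. The point is only that whatever reached the handler before the failure remains available to the caller afterwards, and that nothing is available if the failure comes first.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class PartialContentDemo {
    // Hypothetical handler: accumulates the text of all character events.
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Hypothetical stand-in for a parser: emits chunks in order and fails
    // when it reaches index failAt (a negative failAt means it never fails).
    static void fakeParse(String[] chunks, int failAt, ContentHandler handler)
            throws SAXException {
        for (int i = 0; i < chunks.length; i++) {
            if (i == failAt) {
                throw new IllegalStateException("simulated allocation limit hit");
            }
            char[] chars = chunks[i].toCharArray();
            handler.characters(chars, 0, chars.length);
        }
    }

    // Runs the fake parse and returns whatever text reached the handler,
    // even if the parse ended with an exception.
    static String parseKeepingPartial(String[] chunks, int failAt) {
        TextCollector handler = new TextCollector();
        try {
            fakeParse(chunks, failAt, handler);
        } catch (IllegalStateException | SAXException e) {
            // The handler still holds everything written before the failure.
        }
        return handler.toString();
    }
}
```

With chunks {"a", "b", "c"}, a failure at index 2 leaves "ab" in the handler, while a failure at index 0 leaves an empty string, which corresponds to the second case described above.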






[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-11 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397345#comment-17397345
 ] 

Xiaohong Yang commented on TIKA-3519:
-

I tried org.apache.tika.sax.WriteOutContentHandler with writeLimit in a test 
program and found that this is one of the features we want. However, I 
noticed that this approach (setting writeLimit) does not help to avoid the 
ByteArrayMaxOverride error mentioned in the ticket (Caused by: 
org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 
14523048, but 500 is the maximum for this record type…).  I also noticed 
that if the ByteArrayMaxOverride error happens we do not get any body text, 
regardless of the value of writeLimit.

When the ByteArrayMaxOverride error happens we can catch the exception and get 
the required override value from the stack trace,  and then set the required 
override value with IOUtils.setByteArrayMaxOverride() and try the parse method 
again (it will probably succeed if the machine has enough memory).

However, we wonder if you can add a feature so that the body text is still 
available when the ByteArrayMaxOverride error happens. We could then decide whether to 
try again or to use the available body text (and metadata), depending on the 
required override value, because a very high value may not be feasible, for 
example when there is not enough memory available on the machine.
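
The retry decision described above could look roughly like the following sketch, which uses only the JDK. The class OverrideRetryDemo, its methods, and the memory budget parameter are hypothetical; the regular expression assumes the message wording quoted in this ticket, which may differ between POI versions, so parsing the message text should be treated as fragile.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverrideRetryDemo {
    // Assumption: the message contains "array of length <number>", as quoted in this ticket.
    private static final Pattern REQUESTED =
            Pattern.compile("array of length ([0-9][0-9,]*)");

    // Extracts the requested array length from the exception message, if present.
    static OptionalLong requestedLength(String message) {
        if (message == null) {
            return OptionalLong.empty();
        }
        Matcher m = REQUESTED.matcher(message);
        if (!m.find()) {
            return OptionalLong.empty();
        }
        // Strip grouping commas (and the comma that follows the number in the sentence).
        return OptionalLong.of(Long.parseLong(m.group(1).replace(",", "")));
    }

    // Retry only when the requested length is known and fits the caller's budget.
    static boolean shouldRetry(String message, long maxAffordableBytes) {
        OptionalLong requested = requestedLength(message);
        return requested.isPresent() && requested.getAsLong() <= maxAffordableBytes;
    }

    public static void main(String[] args) {
        String message = "Tried to allocate an array of length 14523048, "
                + "but 500 is the maximum for this record type";
        System.out.println("requested: " + requestedLength(message));
        System.out.println("retry with 100 MB budget: " + shouldRetry(message, 100_000_000L));
        System.out.println("retry with 1 MB budget: " + shouldRetry(message, 1_000_000L));
    }
}
```

When shouldRetry returns true, the caller would pass the extracted value to IOUtils.setByteArrayMaxOverride(), the POI method named in the error message, and call parse again; otherwise it would fall back to whatever text and metadata it already has.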



