[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134097#comment-17134097 ] Tim Allison commented on TIKA-3111: --- Not sure I follow. Text extraction seems to be the

[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134100#comment-17134100 ] Tim Allison commented on TIKA-3111: --- Sorry, to clarify, we don’t get character counts fo

[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134264#comment-17134264 ] Tilman Hausherr commented on TIKA-3111: --- Ignore my comment, it isn't helpful here, I

[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134275#comment-17134275 ] Tilman Hausherr commented on TIKA-3111: --- Got it. PDFStreamEngine calls the (new) 4 p

[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134289#comment-17134289 ] Tim Allison commented on TIKA-3111: --- Thank you! So, we should switch to PDFStreamEngine

[jira] [Comment Edited] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134264#comment-17134264 ] Tilman Hausherr edited comment on TIKA-3111 at 6/12/20, 3:09 PM: ---

[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134303#comment-17134303 ] Tilman Hausherr commented on TIKA-3111: --- No, I got it to work with several changes i

[jira] [Commented] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Stefan Bodewig (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134328#comment-17134328 ] Stefan Bodewig commented on TIKA-3110: -- The short answer is: yes. The longer version

[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Andreas Lehmkühler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134345#comment-17134345 ] Andreas Lehmkühler commented on TIKA-3111: -- [~tilman] Yes, you're right the contr

[jira] [Commented] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134371#comment-17134371 ] Tim Allison commented on TIKA-3110: --- [~bodewig], I'm so very, very grateful for your exp

[jira] [Commented] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Christoph Läubrich (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134381#comment-17134381 ] Christoph Läubrich commented on TIKA-3110: -- [~tallison] from an API point of view

[jira] [Commented] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134385#comment-17134385 ] Tim Allison commented on TIKA-3110: --- Y, I completely agree with that. I made the decisi

[jira] [Commented] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Christoph Läubrich (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134389#comment-17134389 ] Christoph Läubrich commented on TIKA-3110: -- BTW: Commons.io has a forceMkdir maybe

[jira] [Commented] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134396#comment-17134396 ] Tim Allison commented on TIKA-3110: --- Hahaha, y, I was going to point out something simil

[jira] [Comment Edited] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134396#comment-17134396 ] Tim Allison edited comment on TIKA-3110 at 6/12/20, 5:16 PM: -

[jira] [Comment Edited] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134396#comment-17134396 ] Tim Allison edited comment on TIKA-3110 at 6/12/20, 5:17 PM: -

[jira] [Commented] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134403#comment-17134403 ] Tim Allison commented on TIKA-3110: --- We've been bitten by FileInputStream being wrong...

[jira] [Comment Edited] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Andreas Lehmkühler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134345#comment-17134345 ] Andreas Lehmkühler edited comment on TIKA-3111 at 6/12/20, 5:42 PM:

[jira] [Commented] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Christoph Läubrich (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134426#comment-17134426 ] Christoph Läubrich commented on TIKA-3110: -- If you are only concerned about FileI

[jira] [Commented] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134475#comment-17134475 ] Tim Allison commented on TIKA-3110: --- Reverted in master; will cherry-pick to branch_1x e

[jira] [Comment Edited] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134475#comment-17134475 ] Tim Allison edited comment on TIKA-3110 at 6/12/20, 7:42 PM: -

[jira] [Commented] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134482#comment-17134482 ] Tim Allison commented on TIKA-3110: --- The next issue that the file points to is that we a

[jira] [Created] (TIKA-3115) Detect parquet files

2020-06-12 Thread Tim Allison (Jira)
Tim Allison created TIKA-3115: - Summary: Detect parquet files Key: TIKA-3115 URL: https://issues.apache.org/jira/browse/TIKA-3115 Project: Tika Issue Type: Task Reporter: Tim Allison

[jira] [Commented] (TIKA-3115) Detect parquet files

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134493#comment-17134493 ] Tim Allison commented on TIKA-3115: --- https://parquet.apache.org/documentation/latest/ L

[jira] [Commented] (TIKA-3115) Detect parquet files

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134506#comment-17134506 ] Tim Allison commented on TIKA-3115: --- {{application/x-parquet}}? > Detect parquet files

[jira] [Commented] (TIKA-3115) Detect parquet files

2020-06-12 Thread Kenneth William Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134511#comment-17134511 ] Kenneth William Krugler commented on TIKA-3115: --- Sadly, 'PAR1' is about all

[jira] [Commented] (TIKA-3115) Detect parquet files

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134527#comment-17134527 ] Tim Allison commented on TIKA-3115: --- Thank you [~kkrugler]! If anyone wants to add a pa

[jira] [Commented] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134530#comment-17134530 ] Hudson commented on TIKA-3110: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1822 (See [

[jira] [Commented] (TIKA-3115) Detect parquet files

2020-06-12 Thread Kenneth William Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134535#comment-17134535 ] Kenneth William Krugler commented on TIKA-3115: --- What would be the data you'

[jira] [Commented] (TIKA-3115) Detect parquet files

2020-06-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134538#comment-17134538 ] Tim Allison commented on TIKA-3115: --- My guess would be textify everything. The prob obv

[jira] [Commented] (TIKA-3115) Detect parquet files

2020-06-12 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134571#comment-17134571 ] Hudson commented on TIKA-3115: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1823 (See [

[jira] [Commented] (TIKA-3114) Error reading transcript from document

2020-06-12 Thread Dushyanth Balasubramanian (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134585#comment-17134585 ] Dushyanth Balasubramanian commented on TIKA-3114: - [~kkrugler] It's a pdf

[jira] [Commented] (TIKA-3114) Error reading transcript from document

2020-06-12 Thread Kenneth William Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134590#comment-17134590 ] Kenneth William Krugler commented on TIKA-3114: --- [~dbalasub] - unfortunately

[jira] [Commented] (TIKA-3114) Error reading transcript from document

2020-06-12 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134638#comment-17134638 ] Tilman Hausherr commented on TIKA-3114: --- [~dbalasub] Your stack trace does not conta