[jira] [Comment Edited] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-26 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045119#comment-14045119 ] Tilman Hausherr edited comment on TIKA-1300 at 6/27/14 6:18 AM: -

[jira] [Commented] (TIKA-1332) Create "eval" code

2014-06-26 Thread Matthias Krueger (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045219#comment-14045219 ] Matthias Krueger commented on TIKA-1332: It might be good to distinguish between th

[jira] [Comment Edited] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-26 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045119#comment-14045119 ] Tilman Hausherr edited comment on TIKA-1300 at 6/26/14 9:08 PM: -

[jira] [Commented] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-26 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045119#comment-14045119 ] Tilman Hausherr commented on TIKA-1300: --- My impression was that the NSP had better re

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-06-26 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044891#comment-14044891 ] Lewis John McGibbney commented on TIKA-1302: I would love to work with [~tpalsu

[jira] [Commented] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates

2014-06-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044881#comment-14044881 ] Tim Allison commented on TIKA-1233: --- Hindsight and current eval methodology turn out to b

[jira] [Closed] (TIKA-1298) testEmbeddedPDFEmbeddingAnotherDocument fails with PDFBox 1.8.5 and java 1.6

2014-06-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1298. Resolution: Fixed. Turned test back on in PDFParser test. Thank you [~tilman]! > testEmbeddedPDFEmbedding

[jira] [Updated] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1300: -- Attachment: tika_1_6_ClassicsVsNonSeq.zip The attached shows the results of running Tika 1.6 trunk with

Julia wrapper around Tika

2014-06-26 Thread Mattmann, Chris A (3980)
Hey Guys, The Julia programming language folks at MIT have created a Julia wrapper around Tika called Taro.jl: https://github.com/aviks/Taro.jl Woot. Tika is now available in the Julia programming language! Cheers, Chris ++ Chris Ma

[jira] [Commented] (TIKA-1332) Create "eval" code

2014-06-26 Thread Hong-Thai Nguyen (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044706#comment-14044706 ] Hong-Thai Nguyen commented on TIKA-1332: What you are describing is something alike

[jira] [Commented] (TIKA-1332) Create "eval" code

2014-06-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044682#comment-14044682 ] Tim Allison commented on TIKA-1332: --- To my mind, there are three families of things that

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-06-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044668#comment-14044668 ] Tim Allison commented on TIKA-1302: --- Agreed. If there's a grad student with some time on

RE: Question re installing Tika

2014-06-26 Thread Richard
Thanks very much Chris ... it's all working now. Have you by any chance programmatically looped through a directory full of PDFs and used Tika to extract the contents of each into separate text or XML files? If so, what do you recommend for the extraction? Kind regards Richar
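
One possible approach to the question above (looping over a directory of PDFs and writing one text file per PDF) is sketched here. It is not taken from the thread. It assumes a Tika app jar is available locally and is invoked through its command-line `--text` mode; the jar path `tika-app.jar`, the function names `tika_text` and `extract_directory`, and the output layout are all hypothetical choices made for illustration.

```python
import subprocess
from pathlib import Path


def tika_text(pdf_path, tika_jar="tika-app.jar"):
    """Return the plain text Tika extracts from one file.

    Assumption: `tika_jar` points at a Tika app jar on disk and `java`
    is on the PATH. The jar name here is a placeholder.
    """
    result = subprocess.run(
        ["java", "-jar", str(tika_jar), "--encoding=UTF-8", "--text", str(pdf_path)],
        capture_output=True,
        check=True,  # raise CalledProcessError if Tika exits with an error
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_directory(src_dir, out_dir, extract=tika_text):
    """Write one .txt file per *.pdf found directly inside src_dir.

    `extract` is any callable mapping a Path to a string, so the loop
    can be exercised without Java or Tika installed.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(src.glob("*.pdf")):
        target = out / (pdf.stem + ".txt")
        target.write_text(extract(pdf), encoding="utf-8")
        written.append(target)
    return written
```

Calling `extract_directory("pdfs", "texts")` would then produce `texts/<name>.txt` for each `pdfs/<name>.pdf`. Starting a new JVM per file is slow for large batches, so this sketch favours simplicity over throughput.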

[jira] [Commented] (TIKA-1288) Epub's content extracted partially

2014-06-26 Thread Jelle Kastelein (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044591#comment-14044591 ] Jelle Kastelein commented on TIKA-1288: --- Quite possibly the same issue: I'm not getti

[jira] [Commented] (TIKA-1358) Add support for newer iWork file formats

2014-06-26 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044468#comment-14044468 ] Nick Burch commented on TIKA-1358: -- First thing we'd probably want is to re-create the cur