[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-12-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229752#comment-14229752 ] Tim Allison commented on TIKA-1302: --- Thank you, [~jnioche]! I'm unpacking and staging no

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228305#comment-14228305 ] Julien Nioche commented on TIKA-1302: - FYI have extracted data from the CommonCrawl dat

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-26 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226418#comment-14226418 ] Chris A. Mattmann commented on TIKA-1302: - Sure Tim I'll help to get the scientific

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-26 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226415#comment-14226415 ] Andrew Jackson commented on TIKA-1302: -- We have two more sets of data. One is the same

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226397#comment-14226397 ] Julien Nioche commented on TIKA-1302: - Sure, will get back to you re-details of scp whe

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226393#comment-14226393 ] Tim Allison commented on TIKA-1302: --- Looks like I'll need to rm govdocs1 zips to clear so

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-26 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226368#comment-14226368 ] Chris A. Mattmann commented on TIKA-1302: - how about images and scientific data? we

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226336#comment-14226336 ] Julien Nioche commented on TIKA-1302: - Hi [~talli...@apache.org] It would be easy to do

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219371#comment-14219371 ] Tim Allison commented on TIKA-1302: --- HPC is way beyond current status of tika-batch, whic

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-19 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218825#comment-14218825 ] Tyler Palsulich commented on TIKA-1302: --- I just got access to an HPC cluster at NYU.

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-13 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209757#comment-14209757 ] Andrew Jackson commented on TIKA-1302: -- [~talli...@apache.org] I've created a download

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194543#comment-14194543 ] Tim Allison commented on TIKA-1302: --- [~anjackson], the google docs link is down at the mo

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-01 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193663#comment-14193663 ] Chris A. Mattmann commented on TIKA-1302: - I'd say extract the errors, we'd appreci

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-28 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186718#comment-14186718 ] Andrew Jackson commented on TIKA-1302: -- Shall I go ahead and extract the XML errors? O

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-21 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178361#comment-14178361 ] Andrew Jackson commented on TIKA-1302: -- Okay, so the c. 300,000 exceptions are here: h

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177877#comment-14177877 ] Chris A. Mattmann commented on TIKA-1302: - [~anjackson] thanks for sharing. [~goste

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread William Palmer (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177058#comment-14177058 ] William Palmer commented on TIKA-1302: -- I have left the British Library (as of 20th

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177054#comment-14177054 ] Tim Allison commented on TIKA-1302: --- That would be a fantastic resource. Thank you for s

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176934#comment-14176934 ] Andrew Jackson commented on TIKA-1302: -- I have 2,358,167 errors from one collection (2

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176900#comment-14176900 ] Ken Krugler commented on TIKA-1302: --- Andrew - that sounds amazing! Could you provide an e

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176892#comment-14176892 ] Andrew Jackson commented on TIKA-1302: -- At the UK Web Archive we run Apache Tika over

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-10 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166931#comment-14166931 ] Tim Allison commented on TIKA-1302: --- I just transitioned development on TIKA-1302 subtask

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-06-27 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045781#comment-14045781 ] Tyler Palsulich commented on TIKA-1302: --- Hi [~lewismc] and [~talli...@apache.org]. I'

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-06-26 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044891#comment-14044891 ] Lewis John McGibbney commented on TIKA-1302: I would love to work with [~tpalsu

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-06-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044668#comment-14044668 ] Tim Allison commented on TIKA-1302: --- Agreed. If there's a grad student with some time on

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-06-25 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043561#comment-14043561 ] Lewis John McGibbney commented on TIKA-1302: [~tpalsulich] bq. So, we get the n

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-06-25 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043524#comment-14043524 ] Tyler Palsulich commented on TIKA-1302: --- Are there any updates with this? We have the

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-06-06 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020362#comment-14020362 ] Hudson commented on TIKA-1302: -- SUCCESS: Integrated in tika-trunk-jdk1.6 #29 (See [https://bu

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-06-06 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020346#comment-14020346 ] Hudson commented on TIKA-1302: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #29 (See [https://bu

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-06-04 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018448#comment-14018448 ] Chris A. Mattmann commented on TIKA-1302: - +1 this sounds good to me, Tim. > Let's

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007348#comment-14007348 ] Tim Allison commented on TIKA-1302: --- Y, that's an important question. All depends on siz

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-23 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007199#comment-14007199 ] Chris A. Mattmann commented on TIKA-1302: - [~talli...@apache.org] this is a good qu

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007132#comment-14007132 ] Tim Allison commented on TIKA-1302: --- [~chrismattmann], [~gagravarr], [~lewismc] and All,

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006118#comment-14006118 ] Tim Allison commented on TIKA-1302: --- Ok, I think we might be talking about different thin

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-22 Thread Giuseppe Totaro (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006015#comment-14006015 ] Giuseppe Totaro commented on TIKA-1302: --- Hi Tim, I refer to metadata schema of each

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-21 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004592#comment-14004592 ] Tim Allison commented on TIKA-1302: --- [~jnioche], very cool corpus. My dream would be to r

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-20 Thread Giuseppe Totaro (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004103#comment-14004103 ] Giuseppe Totaro commented on TIKA-1302: --- Thank you Chris. I'm working with Tika agai

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-20 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004069#comment-14004069 ] Chris A. Mattmann commented on TIKA-1302: - GovDocs - Giuseppe Totaro from the Unive

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-20 Thread William Palmer (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002918#comment-14002918 ] William Palmer commented on TIKA-1302: -- Ross Spencer has made the openplanets format-c

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001612#comment-14001612 ] Julien Nioche commented on TIKA-1302: - How large do you want that batch to be? If we ar

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-19 Thread William Palmer (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001532#comment-14001532 ] William Palmer commented on TIKA-1302: -- This one might be worth a look - https://githu