[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001612#comment-14001612 ]

Julien Nioche commented on TIKA-1302:
-------------------------------------

How large do you want that batch to be? If we are talking millions of pages, 
one option would be to use the Tika module of Behemoth on the CommonCrawl 
dataset. See 
[http://digitalpebble.blogspot.co.uk/2011/05/processing-enron-dataset-using-behemoth.html]
 for comparable work we did some time ago on the Enron dataset. Behemoth 
already has a module for ingesting data from CommonCrawl. This does, of 
course, mean having Hadoop up and running.

Alternatively, it would be simple to extract the documents from the CC dataset 
onto the server's filesystem and use the TikaServer without Hadoop. Not sure 
what the legal implications of using these documents would be, though.
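
For the non-Hadoop route, here is a rough sketch of what a driver could look 
like, assuming a TikaServer already running on its default port 9998 and a 
local directory of documents pulled out of CC (the directory path and class 
name below are just placeholders, and it needs Java 11+ for 
java.net.http.HttpClient):

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Walks a directory of documents and PUTs each file to the /tika endpoint
// of a running TikaServer, printing the HTTP status and the length of the
// extracted plain text for each document.
public class TikaServerBatch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Placeholder path; point this at the documents extracted from CC.
        Path root = Paths.get(args.length > 0 ? args[0] : "/data/cc-docs");
        HttpClient client = HttpClient.newHttpClient();
        try (Stream<Path> files = Files.walk(root)) {
            for (Path p : (Iterable<Path>) files.filter(Files::isRegularFile)::iterator) {
                HttpRequest req = HttpRequest
                        .newBuilder(URI.create("http://localhost:9998/tika"))
                        .header("Accept", "text/plain")            // ask for extracted text
                        .PUT(HttpRequest.BodyPublishers.ofFile(p)) // stream the raw document
                        .build();
                HttpResponse<String> resp =
                        client.send(req, HttpResponse.BodyHandlers.ofString());
                System.out.printf("%s -> HTTP %d, %d chars extracted%n",
                        p, resp.statusCode(), resp.body().length());
            }
        }
    }
}

Swapping /tika for /meta would return document metadata instead of text, which 
might be more useful for reporting metrics.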

The beauty of using the CommonCrawl dataset is that, apart from its volume, it 
is a good sample of the web, with all the weird and beautiful things the web 
contains (broken documents, large ones, etc.).





> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite] ;) 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
