[jira] [Comment Edited] (TIKA-1302) Let's run Tika against a large batch of docs nightly

Tim Allison (JIRA) Fri, 23 May 2014 09:53:45 -0700

    [ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007348#comment-14007348 ]


Tim Allison edited comment on TIKA-1302 at 5/23/14 4:52 PM:
------------------------------------------------------------

Yes, that's an important question.  It all depends on the size of the corpus 
and what we want for processing time.

Let's assume we start with govdocs1 or a sample of it.

Complete back of envelope...

On my laptop (4 cores with -Xmx1g), it takes a multithreaded indexer ~40 
seconds to index 1000 files from govdocs1 (let's assume the time to index is 
roughly equivalent to the time it'll take to write out the diagnostic stuff 
we'll want to record for each file).

That would be 10k files in about 6.7 minutes, 100k files in a bit more than an 
hour, and 1M files in about 11 hours.
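
For anyone who wants to plug in their own throughput, the extrapolation above 
can be sketched as follows.  This is only a linear scaling of the ~40 seconds 
per 1000 files measured on the laptop; the function name and its parameter are 
made up for illustration, and real runs will not scale perfectly linearly.

```python
def estimate_seconds(num_files, secs_per_1000=40.0):
    """Linearly extrapolate run time from a measured rate.

    secs_per_1000 defaults to the ~40 s per 1000 govdocs1 files
    quoted in the comment; swap in your own measurement.
    """
    return num_files / 1000 * secs_per_1000


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        secs = estimate_seconds(n)
        print(f"{n:>9,} files: {secs / 60:8.1f} min = {secs / 3600:6.2f} h")
```
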

So, if we wanted to start small, we could start with 100k.  The full govdocs1 
takes up 470GB.  A 100k sample would take up roughly 47GB.

We'd probably want (ballpark) 10x the input corpus size to store the output so 
that we can compare different versions of Tika.  So, for the 100k sample, 
about 0.5 TB.  Let's double that for some growth: 1 TB.
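
The storage arithmetic can be written out the same way.  This is a rough 
sketch, not a sizing tool: the 10% sample fraction is the one implied by 
scaling 470GB down to 47GB, the 10x output multiplier and 2x growth factor are 
the ballpark guesses from this comment, and the function and parameter names 
are made up for illustration.

```python
def storage_estimate_gb(full_corpus_gb=470.0, sample_fraction=0.1,
                        output_multiplier=10.0, growth_factor=2.0):
    """Return (sample_gb, output_gb, provisioned_gb) for a corpus sample.

    All defaults are the ballpark figures from the comment, not measurements.
    """
    sample_gb = full_corpus_gb * sample_fraction   # input files in the sample
    output_gb = sample_gb * output_multiplier      # stored output across Tika versions
    provisioned_gb = output_gb * growth_factor     # headroom for growth
    return sample_gb, output_gb, provisioned_gb


if __name__ == "__main__":
    sample, output, provisioned = storage_estimate_gb()
    print(f"sample: {sample:.0f} GB, output: {output:.0f} GB, "
          f"provision: {provisioned:.0f} GB")
```

With the defaults this comes to a little under 1 TB, which is where the round 
1 TB figure comes from.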

So, with a modest 4 cores, let's say 4 GB RAM, and 1 TB of storage, we could 
run Tika against 100k files in a bit more than an hour.  Add a few more 
minutes to compute comparison statistics on the output.

***These numbers are based on a purely in-memory run.  We'll probably want to 
run against a server (not the public one, of course), so that will add some 
time.

Do these numbers jibe with what others are experiencing?

The big gotcha, of course, is that we'll want to harden the server and/or 
create a server daemon to restart the server(s) on OOM errors and infinite 
hangs.  But I think those features are badly needed, and this project will 
give good motivation for these improvements.
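
To make the restart idea concrete, here is a minimal sketch of a supervising 
process.  It is not Tika code and does not use any Tika API: it treats the 
work as a generic child command that should exit on its own, restarts it when 
it exits with a non-zero code (as a JVM typically does after an uncaught 
OutOfMemoryError) or when it runs past a timeout (a hang), and gives up after 
a fixed number of restarts.  The function name, its parameters, and the 
placeholder command are all hypothetical; a watchdog for a long-running server 
would need a health check rather than a simple timeout.

```python
import subprocess
import sys


def run_with_restarts(cmd, timeout_secs, max_restarts=3):
    """Run cmd, restarting it after a crash or a hang.

    Returns the number of restarts that were needed before a clean exit.
    Raises RuntimeError if the command never exits cleanly.
    """
    for attempt in range(max_restarts + 1):
        try:
            # subprocess.run kills the child itself if the timeout expires.
            result = subprocess.run(cmd, timeout=timeout_secs)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt}: no exit within {timeout_secs}s, "
                  "treating as a hang", file=sys.stderr)
            continue
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt}: exited with code {result.returncode}",
              file=sys.stderr)
    raise RuntimeError(f"gave up after {max_restarts} restarts: {cmd!r}")


if __name__ == "__main__":
    # Placeholder command; a real run would launch the batch job here.
    restarts = run_with_restarts([sys.executable, "-c", "pass"],
                                 timeout_secs=30)
    print(f"finished after {restarts} restart(s)")
```
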




> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
