[
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007348#comment-14007348
]
Tim Allison edited comment on TIKA-1302 at 5/23/14 4:52 PM:
------------------------------------------------------------
Yes, that's an important question. It all depends on the size of the corpus and
what we want for processing time.
Let's assume we start with govdocs1 or a sample of it.
Complete back of envelope...
On my laptop (4 cores with -Xmx1g), it takes a multithreaded indexer ~40
seconds to index 1000 files from govdocs1 (let's assume the time to index is
roughly equivalent to the time it'll take to write out the diagnostic stuff
we'll want to record for each file).
That would be 10k files in about 6.7 minutes, 100k files in a bit more than an
hour, and 1M files in about 11 hours.
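For anyone who wants to plug in their own timings, here is a minimal sketch of the extrapolation above. It assumes throughput stays linear at the measured ~40 seconds per 1000 files; the function name and default are made up for illustration.

```python
def estimated_seconds(num_files, secs_per_1000=40.0):
    """Linear extrapolation from a measured time per 1000 files.

    Assumption: throughput does not change with corpus size.
    """
    return num_files / 1000 * secs_per_1000


for n in (10_000, 100_000, 1_000_000):
    secs = estimated_seconds(n)
    print(f"{n:>9,} files: {secs / 60:8.1f} min = {secs / 3600:5.2f} h")
```

Swapping in a different `secs_per_1000` from another machine gives a quick comparison against these numbers.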
So, if we wanted to start small, we could start with 100k. The full govdocs1
takes up 470GB. A 100k sample would take up roughly 47GB.
We'd probably want (ballpark) 10x the input corpus size to store the output so
that we can compare different versions of Tika. So, roughly 0.5 TB. Let's
double that for some growth: 1 TB.
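The storage estimate can be sketched the same way. This assumes the full corpus holds roughly 1M files (implied by 470GB shrinking to 47GB for a 100k sample) and that files in a sample are about average size; the 10x output factor and 2x growth factor are the ballpark guesses from above, and the function name is hypothetical.

```python
def storage_estimate_gb(sample_files=100_000,
                        full_corpus_gb=470,
                        full_corpus_files=1_000_000,  # assumption, see lead-in
                        output_factor=10,
                        growth_factor=2):
    """Return (sample_gb, output_gb, provisioned_gb) for a corpus sample."""
    sample_gb = full_corpus_gb * sample_files / full_corpus_files
    output_gb = sample_gb * output_factor
    provisioned_gb = output_gb * growth_factor
    return sample_gb, output_gb, provisioned_gb


sample_gb, output_gb, provisioned_gb = storage_estimate_gb()
print(f"sample: {sample_gb:.0f} GB, output: {output_gb:.0f} GB, "
      f"provision: {provisioned_gb:.0f} GB")
```

With the defaults, the provisioned figure comes out a little under the 1 TB rounded to above.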
So, with a modest 4 cores, let's say 4 GB RAM, and 1 TB of storage, we could
run Tika against 100k files in a bit more than an hour. Add another few
minutes to compute comparison statistics on the output.
***These numbers are based on a purely in-memory run. We'll probably want to
run against a server (not the public one, of course), so that'll add some to
the time.
Do these numbers jibe with what others are experiencing?
The big gotcha, of course, is that we'll want to harden the server and/or
create a server daemon to restart the server(s) for OOM and infinite hangs.
But I think those features are badly needed and this project will give good
motivation for these improvements.
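To make the restart idea concrete, here is a rough sketch of a watchdog loop. It is not anything that exists in Tika today: the function name, the outcome strings, and the stand-in command are all hypothetical. It assumes the work runs as a child process, that a hang shows up as a timeout, and that a fatal error such as an OOM shows up as a non-zero exit code.

```python
import subprocess
import sys

# Stand-in for whatever command launches the server or a batch run;
# here it is just a child process that sleeps, to simulate a hang.
HANGING_CMD = [sys.executable, "-c", "import time; time.sleep(30)"]


def run_with_watchdog(cmd, timeout_s, max_attempts=3):
    """Run cmd, retrying if it hangs past timeout_s or exits non-zero.

    Returns one outcome per attempt: "ok", "timeout", or "exit:<code>".
    """
    outcomes = []
    for _ in range(max_attempts):
        try:
            # On timeout, subprocess.run kills the child and raises.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            outcomes.append("timeout")
            continue
        if result.returncode == 0:
            outcomes.append("ok")
            break
        outcomes.append(f"exit:{result.returncode}")
    return outcomes
```

Calling `run_with_watchdog(HANGING_CMD, timeout_s=0.5, max_attempts=2)` kills and retries the sleeping child twice. A real daemon would also want to record which input file triggered each restart so the offending documents can be reported.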
> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and
> running again, it might be fun to run Tika regularly against a large set of
> docs and report metrics.
> One excellent candidate corpus is govdocs1:
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?
> [~willp-bl], have anything handy you'd like to contribute?
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
> ;)