Yes, indeed.  It may not be necessary initially, but we could support
XPath or something down the road to let us specify which parts of the
documents to pull out.
> I wouldn't worry about generalizing too much to start with.  Once we
> have a couple collections then we can go that route.

My thoughts, too.

I've been looking at the Reuters stuff. It uncompresses the distribution
and then creates per-article files. I can't decide whether that's a
good idea for Wikipedia. It's big (about 10G uncompressed) and has about
1.2M articles (so I've heard; unverified), which would mean that many
per-article files.
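
For concreteness, the per-article prep step for Wikipedia could look
roughly like the sketch below.  The StAX-based parse, the output
naming, and writing only the <text> body of each page are my
placeholders here, not the existing Reuters code.

// Sketch only: walk an (already uncompressed) Wikipedia XML dump and
// write the text of each <page> to its own file.
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;

public class ExtractArticles {
    public static void main(String[] args) throws Exception {
        File outDir = new File(args[1]);
        outDir.mkdirs();
        XMLStreamReader xml = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(args[0]));
        StringBuilder body = null;
        int count = 0;
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "text".equals(xml.getLocalName())) {
                body = new StringBuilder();
            } else if (event == XMLStreamConstants.CHARACTERS && body != null) {
                body.append(xml.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "text".equals(xml.getLocalName())) {
                // This write, repeated ~1.2M times, is the extra disk i/o
                // and directory churn I'm worried about below.
                try (FileWriter w = new FileWriter(
                        new File(outDir, count++ + ".txt"))) {
                    w.write(body.toString());
                }
                body = null;
            }
        }
        xml.close();
    }
}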

On the one hand, creating separate per-article files is "clean" in that
when you then ingest, the only thing affecting ingest performance is
disk i/o (as opposed to, say, uncompressing/parsing on the fly). On the
other hand, that's a lot of disk i/o (the dump compresses by about 5X,
so the ~10G gets written out and read back instead of ~2G read once)
and a lot of directory lookups.
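
For comparison, the no-intermediate-files route would be something
like the sketch below.  I'm assuming a bzip2 dump and Apache Commons
Compress for decompression, and indexArticle() is just a stand-in for
whatever the real ingest step is, so treat it as an illustration of
the shape of the thing rather than working benchmark code.

// Sketch only: decompress, parse, and ingest in a single pass, so the
// only disk reads are of the compressed dump itself.
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.BufferedInputStream;
import java.io.FileInputStream;

public class StreamingIngest {

    // Hypothetical stand-in for the real ingest step (e.g. building a
    // document and handing it to the indexer).
    static void indexArticle(String wikiText) { /* ... */ }

    public static void main(String[] args) throws Exception {
        XMLStreamReader xml = XMLInputFactory.newInstance()
                .createXMLStreamReader(new BZip2CompressorInputStream(
                        new BufferedInputStream(new FileInputStream(args[0]))));
        StringBuilder body = null;
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "text".equals(xml.getLocalName())) {
                body = new StringBuilder();
            } else if (event == XMLStreamConstants.CHARACTERS && body != null) {
                body.append(xml.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "text".equals(xml.getLocalName())) {
                // No per-article file, no directory entry.
                indexArticle(body.toString());
                body = null;
            }
        }
        xml.close();
    }
}

The parse is identical to the previous sketch; the only difference is
that each article goes straight to the indexer instead of to its own
file.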

Anybody have any opinions/relevant past experience?
