I think the fastest solution is to pre-process the XML file into a
"one doc per line" file (see the example createLineFile.alg).  That's
how I run my perf tests on Wikipedia.

Then, put the line file on a different drive from your index, if you
can.  Indexing a line file off an independent IO system should be
about as low-overhead as we can hope for...

But if you don't want to do that, I think changing the tuple to a
List of tuples (using a LinkedList) may be the best bet here (your
option 2)?

I would make that change, and e.g. throttle the producer thread when
the length of that list is > N, then resume it once the length drops
to N/2?  Then, run indexing w/ multiple threads and see if the
producer thread can keep up.  If it can keep up, then I think you're
done.  If it can't, then explore either option 1 or 3?
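
Roughly what I'm picturing (just a sketch; the class name, the
watermark constants, and the assumption that the tuple stays a
String[] are all made up, not actual EnwikiDocMaker code):

    import java.util.LinkedList;

    // Sketch of the high/low watermark throttle; names and numbers are
    // illustrative.  Real code would also need an end-of-input flag so
    // consumers don't block forever once the producer is done.
    public class TupleBuffer {
      private static final int HIGH = 1000;     // N: throttle producer here
      private static final int LOW = HIGH / 2;  // N/2: resume producer here

      private final LinkedList<String[]> tuples = new LinkedList<String[]>();

      // Called by the background parsing thread for each parsed doc.
      public synchronized void add(String[] tuple)
          throws InterruptedException {
        while (tuples.size() >= HIGH) {
          wait();                                // producer throttled
        }
        tuples.addLast(tuple);
        notifyAll();                             // wake waiting consumers
      }

      // Called from makeDocument() by each indexing thread.
      public synchronized String[] next() throws InterruptedException {
        while (tuples.isEmpty()) {
          wait();                                // nothing buffered yet
        }
        String[] tuple = tuples.removeFirst();
        if (tuples.size() == LOW) {
          notifyAll();                           // drained to N/2; resume producer
        }
        return tuple;
      }
    }

If contrib can rely on Java 5, java.util.concurrent's
ArrayBlockingQueue gives you most of this for free, though without
the N/2 hysteresis (it wakes the producer as soon as one slot frees
up).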

Mike

Grant Ingersoll wrote:

As one can probably guess, I have been looking at the EnwikiDocMaker a bit and using it outside of the benchmark suite, in relation to the new contrib/wikipedia stuff. I just wanted to make sure I have a good basic understanding of what it is doing, because I am looking for ways to speed it up, so correct me if I am wrong, please:

The basic gist of it is, there is a background thread that gets kicked off by the first next() call and is responsible for parsing and loading the tuples one at a time, right? Thus, the main makeDocument() method waits until that thread signals that a tuple is available, and then returns it, right?
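
In other words, I'm picturing the handoff as morally equivalent to
this (a sketch of my reading, with made-up names, not the actual
source):

    // Sketch of the one-tuple handoff as I understand it; names are
    // made up and this is not the actual EnwikiDocMaker source.
    public class SingleTupleHandoff {
      private String[] tuple;            // the single buffered document
      private boolean finished;          // parser hit end of input

      // The background parsing thread publishes one tuple at a time.
      public synchronized void setTuple(String[] t)
          throws InterruptedException {
        while (tuple != null) {
          wait();                        // previous tuple not consumed yet
        }
        tuple = t;
        notifyAll();
      }

      // makeDocument() blocks here until the parser hands one over.
      public synchronized String[] getTuple()
          throws InterruptedException {
        while (tuple == null && !finished) {
          wait();
        }
        String[] t = tuple;              // null here means end of input
        tuple = null;
        notifyAll();                     // let the parser load the next one
        return t;
      }

      public synchronized void setFinished() {
        finished = true;
        notifyAll();
      }
    }

If that reading is right, any makeDocument() caller beyond the first
just sits in that wait() until the single slot is refilled, which
would explain the bottleneck below.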

As we've discussed in the past, the EnwikiDocMaker is a bottleneck in the benchmark when it comes to running multiple indexing threads. So, I was thinking of a few different options and wanted to get an opinion on which seems the most worthwhile to pursue:

1. Implement some sort of splitting version of the DocMaker that has multiple threads, each responsible for parsing a certain section of the file. This would require us to know the number of documents ahead of time, but that isn't a big deal, as one could either set it statically or write a little utility that counts the docs. Thus, one could either hide this in the doc maker or construct multiple doc makers, each with their own range. Taking this a step further, the utility could output the file pointers where each range of documents starts, so that each thread could skip ahead to that point (possibly; I'm not sure how that would work with an XML parser). See the rough sketch of such a utility after this list.

2. Implement some sort of tuple buffering, whereby the reading thread reads multiple documents at a time and buffers them; makeDocument can then consume the buffer and only has to wait/exit when the buffer is empty. The producer thread could just work to fill the buffer at all times unless it receives a quit message.

3. Split the large XML file into X smaller files and run them independently. Thus, if you have 4 threads, split the file into 4 files and treat them separately. This is an easier-to-get-right version of #1.
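
For #1, the "little utility" could be as simple as scanning the raw
dump bytes for "<page>" starts and printing the offset where each
thread's range would begin. A rough sketch (the class name and the
argument handling are made up):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Rough sketch: record the byte offset of every <page> element in
    // the dump, then print the offset starting each thread's range.
    // Matching raw bytes against an ASCII tag is safe in UTF-8.
    public class PageOffsets {
      private static final String TAG = "<page>";

      public static void main(String[] args) throws IOException {
        BufferedInputStream in =
            new BufferedInputStream(new FileInputStream(args[0]));
        List<Long> offsets = new ArrayList<Long>();
        long pos = 0;
        int matched = 0;                // chars of TAG matched so far
        int b;
        while ((b = != -1) {
          if (b == TAG.charAt(matched)) {
            if (++matched == TAG.length()) {
              offsets.add(pos - TAG.length() + 1);  // offset of '<'
              matched = 0;
            }
          } else {
            matched = (b == TAG.charAt(0)) ? 1 : 0;
          }
          pos++;
        }
        in.close();

        int splits = Integer.parseInt(args[1]);  // e.g. 4, one per thread
        int docsPerSplit = offsets.size() / splits;
        for (int i = 0; i < splits; i++) {
          System.out.println(offsets.get(i * docsPerSplit));
        }
      }
    }

The open question from #1 still stands, though: each thread's parser
would have to cope with starting mid-file, e.g. by wrapping its byte
range in a synthetic <mediawiki> root element before handing it to
SAX.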

Thoughts?

-Grant
