I think the fastest solution is to pre-process the XML file into a
"one doc per line" file (see the example createLineFile.alg).  That's
how I run my perf tests on Wikipedia.

Then, put the line file on a different drive from your index, if you
can.  Indexing a line file off an independent IO system should be
about as low-overhead as we can hope for...

But if you don't want to do that, I think changing the tuple to a
List of tuples (using a LinkedList) may be the best bet here (your
option 2)?

I would make that change, and e.g. throttle the producer thread when
the length of that list is > N, then resume it once the length drops
to N/2?  Then, run indexing w/ multiple threads and see if the
producer thread can keep up.  If it can keep up, then I think you're
done.  If it can't, then explore either option 1 or 3?
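
Roughly what I'm picturing (just a sketch; the class name, the
watermark constants, and the assumption that the tuple stays a
String[] are all made up, not actual EnwikiDocMaker code):

    import java.util.LinkedList;

    // Sketch of the high/low watermark throttle; names and numbers are
    // illustrative.  Real code would also need an end-of-input flag so
    // consumers don't block forever once the producer is done.
    public class TupleBuffer {
      private static final int HIGH = 1000;     // N: throttle producer here
      private static final int LOW = HIGH / 2;  // N/2: resume producer here

      private final LinkedList<String[]> tuples = new LinkedList<String[]>();

      // Called by the background parsing thread for each parsed doc.
      public synchronized void add(String[] tuple)
          throws InterruptedException {
        while (tuples.size() >= HIGH) {
          wait();                                // producer throttled
        }
        tuples.addLast(tuple);
        notifyAll();                             // wake waiting consumers
      }

      // Called from makeDocument() by each indexing thread.
      public synchronized String[] next() throws InterruptedException {
        while (tuples.isEmpty()) {
          wait();                                // nothing buffered yet
        }
        String[] tuple = tuples.removeFirst();
        if (tuples.size() == LOW) {
          notifyAll();                           // drained to N/2; resume producer
        }
        return tuple;
      }
    }

If contrib can rely on Java 5, java.util.concurrent's
ArrayBlockingQueue gives you most of this for free, though without
the N/2 hysteresis (it wakes the producer as soon as one slot frees
up).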

Mike

Grant Ingersoll wrote:

As one can probably guess, I have been looking at the EnwikiDocMaker a bit and using it outside of the benchmark suite, in relation to the new contrib/wikipedia stuff. I just wanted to make sure I have a good basic understanding of what it is doing, because I am looking for ways to speed it up, so correct me if I am wrong, please:

The basic gist of it is, there is a background thread that gets kicked off by the first next() call and is responsible for parsing and loading the tuples one at a time, right? Thus, the main makeDocument() method waits until that thread signals that a tuple is available, and then returns it, right?
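
In other words, I'm picturing the handoff as morally equivalent to
this (a sketch of my reading, with made-up names, not the actual
source):

    // Sketch of the one-tuple handoff as I understand it; names are
    // made up and this is not the actual EnwikiDocMaker source.
    public class SingleTupleHandoff {
      private String[] tuple;            // the single buffered document
      private boolean finished;          // parser hit end of input

      // The background parsing thread publishes one tuple at a time.
      public synchronized void setTuple(String[] t)
          throws InterruptedException {
        while (tuple != null) {
          wait();                        // previous tuple not consumed yet
        }
        tuple = t;
        notifyAll();
      }

      // makeDocument() blocks here until the parser hands one over.
      public synchronized String[] getTuple()
          throws InterruptedException {
        while (tuple == null && !finished) {
          wait();
        }
        String[] t = tuple;              // null here means end of input
        tuple = null;
        notifyAll();                     // let the parser load the next one
        return t;
      }

      public synchronized void setFinished() {
        finished = true;
        notifyAll();
      }
    }

If that reading is right, any makeDocument() caller beyond the first
just sits in that wait() until the single slot is refilled, which
would explain the bottleneck below.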

As we've discussed in the past, the EnwikiDocMaker is a bottleneck in the benchmark when it comes to running multiple indexing threads. So, I was thinking of a few different options and wanted to get an opinion on which seems the most worthwhile to pursue:

1. Implement some sort of splitting version of the DocMaker that has multiple threads, each responsible for parsing a certain section of the file. This would require us to know the number of documents ahead of time, but that isn't a big deal, as one could either set it statically or write a little utility that counts the docs. Thus, one could either hide this in the doc maker or construct multiple doc makers, each with their own range. Taking this a step further, the utility could output the file pointers where each range of documents starts, so that each thread could skip ahead to that point (possibly; I'm not sure how that would work with an XML parser). See the rough sketch of such a utility after this list.

2. Implement some sort of tuple buffering, whereby the reading thread reads multiple documents at a time and buffers them; makeDocument can then consume the buffer and only has to wait/exit when the buffer is empty. The producer thread could just work to fill the buffer at all times unless it receives a quit message.

3. Split the large XML file into X smaller files and run them independently. Thus, if you have 4 threads, split the file into 4 files and treat them separately. This is an easier-to-get-right version of #1.
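
For #1, the "little utility" could be as simple as scanning the raw
dump bytes for "<page>" starts and printing the offset where each
thread's range would begin. A rough sketch (the class name and the
argument handling are made up):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Rough sketch: record the byte offset of every <page> element in
    // the dump, then print the offset starting each thread's range.
    // Matching raw bytes against an ASCII tag is safe in UTF-8.
    public class PageOffsets {
      private static final String TAG = "<page>";

      public static void main(String[] args) throws IOException {
        BufferedInputStream in =
            new BufferedInputStream(new FileInputStream(args[0]));
        List<Long> offsets = new ArrayList<Long>();
        long pos = 0;
        int matched = 0;                // chars of TAG matched so far
        int b;
        while ((b = != -1) {
          if (b == TAG.charAt(matched)) {
            if (++matched == TAG.length()) {
              offsets.add(pos - TAG.length() + 1);  // offset of '<'
              matched = 0;
            }
          } else {
            matched = (b == TAG.charAt(0)) ? 1 : 0;
          }
          pos++;
        }
        in.close();

        int splits = Integer.parseInt(args[1]);  // e.g. 4, one per thread
        int docsPerSplit = offsets.size() / splits;
        for (int i = 0; i < splits; i++) {
          System.out.println(offsets.get(i * docsPerSplit));
        }
      }
    }

The open question from #1 still stands, though: each thread's parser
would have to cope with starting mid-file, e.g. by wrapping its byte
range in a synthetic <mediawiki> root element before handing it to
SAX.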

Thoughts?

-Grant
