Grant,

Since there is normally only a single disk to read from, I think
the disk should be read by a single thread, as much as possible
in the order in which the data is stored on disk.

Parsing into Lucene docs and adding these docs could be done
in parallel. The last time I tried using multiple threads for
IndexWriter.add(), the optimal number of threads was 3, but that
was on a machine with a single disk and a single CPU.
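
For reference, a minimal sketch of that kind of multi-threaded adding,
assuming a current Lucene API (the IndexWriter constructor has changed
across versions, but addDocument() has long been safe to call from
multiple threads); the index path, field name, and document content are
placeholders, not the exact test I ran:

  import java.nio.file.Paths;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.FSDirectory;

  public class ParallelAdd {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter(
          FSDirectory.open(Paths.get("index")),        // placeholder path
          new IndexWriterConfig(new StandardAnalyzer()));
      int numThreads = 3;  // the optimum I measured, one disk / one CPU
      Thread[] adders = new Thread[numThreads];
      for (int i = 0; i < numThreads; i++) {
        adders[i] = new Thread(() -> {
          try {
            Document doc = new Document();
            doc.add(new TextField("body", "some text", Field.Store.NO));
            writer.addDocument(doc);  // IndexWriter does its own locking
          } catch (Exception e) {
            throw new RuntimeException(e);
          }
        });
        adders[i].start();
      }
      for (Thread t : adders) t.join();
      writer.close();
    }
  }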

So I'd start with a design that has a single disk-reading thread
feeding a queue of docs to be parsed and added to the index.
This queue could then be drained by a configurable number of
threads, probably 2-6.
With multiple disks, one could feed the queue from multiple
threads, one per independent disk.
For even more speed, one could also try putting the index on a
different disk.
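
Roughly, in plain Java (java.util.concurrent only); the
one-record-per-line reading and the parseAndIndex() stub below are
placeholders for whatever the real reader and document-construction
code would be, and the queue capacity and thread count are just
starting points to tune:

  import java.io.BufferedReader;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.concurrent.ArrayBlockingQueue;
  import java.util.concurrent.BlockingQueue;

  public class PipelinedIndexer {
    // Unique instance, compared by reference, so an empty input
    // line can never be mistaken for the end-of-input marker.
    private static final String POISON = new String("*eof*");
    private static final BlockingQueue<String> QUEUE =
        new ArrayBlockingQueue<>(1000);  // raw records awaiting parsing

    public static void main(String[] args) throws Exception {
      int numConsumers = 4;  // somewhere in the 2-6 range
      Thread[] consumers = new Thread[numConsumers];
      for (int i = 0; i < numConsumers; i++) {
        consumers[i] = new Thread(() -> {
          try {
            for (String rec; (rec = QUEUE.take()) != POISON; ) {
              parseAndIndex(rec);  // parse into a doc, then add it
            }
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
        consumers[i].start();
      }
      // The single producer: read sequentially, in on-disk order.
      try (BufferedReader in =
          Files.newBufferedReader(Paths.get(args[0]))) {
        for (String line; (line = in.readLine()) != null; ) {
          QUEUE.put(line);  // blocks when the queue is full
        }
      }
      for (int i = 0; i < numConsumers; i++) {
        QUEUE.put(POISON);  // one end marker per consumer
      }
      for (Thread t : consumers) t.join();
    }

    private static void parseAndIndex(String record) {
      // placeholder for the real parse-and-addDocument step
    }
  }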

I did not look at the code of EnwikiDocMaker, but I hope this helps
nonetheless.

Regards,
Paul Elschot


On Wednesday 09 January 2008 14:55:05 Grant Ingersoll wrote:
> As one can probably guess, I have been looking at the EnwikiDocMaker a  
> bit and using it outside of the benchmark suite, as related to the new  
> contrib/wikipedia stuff.   Just wanted to make sure I have a good  
> basic understanding of what it is doing, because I am looking for ways  
> to speed it up, so correct me if I am wrong, please:
> 
> The basic gist of it is, there is a background thread that gets kicked  
> off by the first next() call and is responsible for parsing and  
> loading the tuples one at a time, right?  Thus, the main  
> makeDocument() method waits until this thread notifies it that a
> tuple is available, and then returns that tuple, right?
> 
> As we've discussed in the past, the EnwikiDocMaker is a bottleneck in  
> the benchmark when it comes to running multiple indexing threads.  So,  
> I was thinking of a couple of different options and wanted to get an  
> opinion on what seems the most worthwhile to pursue:
> 
> 1. Implement some sort of splitting version of the DocMaker that has
> multiple threads, each responsible for parsing a certain section of  
> the file.  This would require us to know the number of documents ahead  
> of time, but that isn't a big deal, as one could either statically set  
> it, or write a little utility that counts the docs.  Thus, one could  
> either hide this in the doc maker or construct multiple doc makers,  
> each with their own range.   Taking this a step further, the utility  
> could output the file pointers where each range of documents starts,  
> so that each thread could skip ahead to that point (possibly; not sure
> how that would work with an XML parser).
> 
> 2. Implement some sort of tuple buffering, whereby the reading thread  
> reads multiple documents at a time and buffers them, then makeDocument  
> can consume the buffer and only has to wait/exit when the buffer is  
> empty.  The producer thread could just work to fill the buffer at all  
> times unless it receives a quit message.
> 
> 3.  Split the large XML file into X smaller files and run them  
> independently.  Thus, if you have 4 threads, split the file into 4  
> files and treat them separately.  This is an easier-to-get-right
> version of #1.
> 
> Thoughts?
> 
> -Grant


