I am having some difficulty migrating our Solr indexing scripts from Solr 3.5 
to Solr 4.0. In particular, I am trying to track down why indexing documents 
is about 5-10 times slower in Solr 4.0. Querying is still quite fast.

The code builds documents in groups of 1000 and adds each group to Solr in 
its own thread. The documents are somewhat large, including maybe 30-40 
different field types, mostly multivalued. Here are some snippets of the code 
we used with 3.5.


 MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
 HttpClient client = new HttpClient(mgr);
 CommonsHttpSolrServer server = new CommonsHttpSolrServer("some url for our index", client);
 server.setRequestWriter(new BinaryRequestWriter());


Then we delete the index and proceed to generate documents, loading each 
group in a thread that looks roughly like this. I've omitted some overhead for 
handling exceptions and retry attempts.


class DocWriterThread implements Runnable
{
    CommonsHttpSolrServer server;
    Collection<SolrInputDocument> docs;
    private int commitWithin = 50000; // 50 seconds

    public DocWriterThread(CommonsHttpSolrServer server, Collection<SolrInputDocument> docs)
    {
        this.server = server;
        this.docs = docs;
    }

    public void run()
    {
        // set the commitWithin feature
        server.add(docs, commitWithin);
    }
}
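
To make the grouping step concrete, here is a minimal, self-contained sketch of 
how the documents are split into groups and handed to worker threads. It is 
not our actual code: BatchSink is a hypothetical stand-in for the call to 
server.add(docs, commitWithin) so that the sketch runs without SolrJ, the 
class and method names are made up for illustration, and it assumes Java 8 for 
the lambdas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BatchIndexingSketch {

    /** Hypothetical stand-in for server.add(docs, commitWithin). */
    public interface BatchSink<T> {
        void add(List<T> batch) throws Exception;
    }

    /** Splits items into consecutive groups of at most batchSize elements. */
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /** Sends each group to the sink on a fixed-size pool; returns the number of failed groups. */
    public static <T> int indexAll(List<T> items, int batchSize, int threads, BatchSink<T> sink) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger failures = new AtomicInteger();
        for (List<T> batch : partition(items, batchSize)) {
            pool.execute(() -> {
                try {
                    sink.add(batch);
                } catch (Exception e) {
                    failures.incrementAndGet(); // real code would retry or log here
                }
            });
        }
        pool.shutdown();
        try {
            // Wait for all submitted groups to finish (or give up after an hour).
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return failures.get();
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            docs.add(i);
        }
        AtomicInteger sent = new AtomicInteger();
        int failed = indexAll(docs, 1000, 4, batch -> sent.addAndGet(batch.size()));
        System.out.println("sent=" + sent.get() + " failedBatches=" + failed);
    }
}
```

In the real loader the sink would wrap the SolrJ server, and the thread count 
(4 here) is an arbitrary value chosen only for the example.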


Now, I've had to change some things to get this to compile with the Solr 4.0 
libraries. Here is what I converted the above code to. I don't know if these 
are the correct equivalents, as I am not familiar with Apache HttpComponents.



 ThreadSafeClientConnManager mgr = new ThreadSafeClientConnManager();
 DefaultHttpClient client = new DefaultHttpClient(mgr);
 HttpSolrServer server = new HttpSolrServer("some url for our solr index", client);
 server.setRequestWriter(new BinaryRequestWriter());




The thread method is the same, but uses HttpSolrServer instead of 
CommonsHttpSolrServer.

We also had an old solrconfig.xml (not sure what version, but it is pre-3.x 
and had mostly default values) that I had to replace with a 4.0-style 
solrconfig.xml. I don't want to post the entire file (it is large), but I 
copied one from the Solr 4.0 examples and made a couple of changes. First, I 
wanted to turn off transaction logging, so essentially I have a line like this 
(everything inside is commented out):


<updateHandler class="solr.DirectUpdateHandler2"></updateHandler>


And I added a handler for javabin:


<requestHandler name="/update/javabin" class="solr.BinaryUpdateRequestHandler">
  <lst name="defaults">
    <str name="stream.contentType">application/javabin</str>
  </lst>
</requestHandler>

I'm not sure what other configuration I should look at. I would expect there 
to be one big, obvious reason why indexing performance would drop nearly 
10-fold.

Against our 3.5 instance I timed our index load, and it adds roughly 40,000 
documents every 3-8 seconds.

Against our 4.0 instance it adds 40,000 documents every 70-75 seconds.
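
As a rough check on those timings, the sketch below converts them into 
documents per second and a slowdown factor. It only does arithmetic on the 
numbers quoted above; the class and method names are made up for 
illustration. Taken at face value, the timings work out to roughly 
5,000-13,333 docs/s on 3.5 versus roughly 533-571 docs/s on 4.0, i.e. a 
slowdown somewhere between about 8.75x (8 s vs 70 s) and 25x (3 s vs 75 s).

```java
public class ThroughputSketch {

    /** Documents indexed per second when docCount documents take the given number of seconds. */
    public static double docsPerSecond(int docCount, double seconds) {
        return docCount / seconds;
    }

    /** How many times longer the new load takes compared with the old one. */
    public static double slowdown(double oldSeconds, double newSeconds) {
        return newSeconds / oldSeconds;
    }

    public static void main(String[] args) {
        int docs = 40000; // documents per timed interval, from the measurements above

        // Solr 3.5: 3-8 seconds per 40,000 documents
        System.out.println("3.5 docs/s, slowest case: " + docsPerSecond(docs, 8));
        System.out.println("3.5 docs/s, fastest case: " + docsPerSecond(docs, 3));

        // Solr 4.0: 70-75 seconds per 40,000 documents
        System.out.println("4.0 docs/s, slowest case: " + docsPerSecond(docs, 75));
        System.out.println("4.0 docs/s, fastest case: " + docsPerSecond(docs, 70));

        // Best case compares the slowest 3.5 run with the fastest 4.0 run, and vice versa.
        System.out.println("slowdown, best case:  " + slowdown(8, 70));
        System.out.println("slowdown, worst case: " + slowdown(3, 75));
    }
}
```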

This isn't the end of the world, and I would love to use the new join feature 
in Solr 4.0. However, we have many different indexes with millions of 
documents, and this kind of increase in load time is troubling.


Thanks for your help.


-Kevin

