I'd like some suggestions on how to improve indexing performance in the following scenario.
I'm uploading 1M docs to Solr.

Every doc has
        id: sequential number
        title:  small string
        date: date
        body: 1kb of text

Here are my benchmarks (each is a single run, not an average over multiple runs):

1)      using the UpdateRequestHandler,
        streaming docs from a CSV file on the same disk as Solr,
        auto commit every 15s with openSearcher=false, and a commit after the last document

        total time: 143035ms

1.1)    using the UpdateRequestHandler,
        streaming docs from a CSV file on the same disk as Solr,
        auto commit every 15s with openSearcher=false, and a commit after the last document,
        with:
        <ramBufferSizeMB>500</ramBufferSizeMB>
        <maxBufferedDocs>100000</maxBufferedDocs>
        
        total time: 134493ms

1.2)    using the UpdateRequestHandler,
        streaming docs from a CSV file on the same disk as Solr,
        auto commit every 15s with openSearcher=false, and a commit after the last document,
        with:
        <mergeFactor>30</mergeFactor>

        total time: 143134ms
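
In case it helps, here is a minimal SolrJ sketch of the kind of CSV upload used in runs 1) to 1.2). Class names assume SolrJ 4.x, and the URL and file path are placeholders; in my actual runs the CSV sits on the Solr host's own disk, while this sketch streams it from the client, so it is only meant to show the handler and commit parameters (autoCommit, ramBufferSizeMB, maxBufferedDocs and mergeFactor all stay in solrconfig.xml):

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvUpload {
    public static void main(String[] args) throws Exception {
        // placeholder core URL
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        // push the CSV through the CSV update handler
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
        req.addFile(new File("docs.csv"), "text/csv;charset=utf-8"); // id,title,date,body columns
        req.setParam("commit", "true");                              // explicit commit after the last document
        server.request(req);
        server.shutdown();
    }
}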

2)      using a SolrJ client from another PC on the LAN (100Mbps),
        with HttpSolrServer,
        with the javabin format,
        adding documents to the server in batches of 1k docs ( server.add( <collection> ) ),
        auto commit every 15s with openSearcher=false, and a commit after the last document

        total time: 139022ms
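
This is roughly what the client does in run 2), as a sketch; the URL and field values are placeholders, not the real ones:

import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class HttpBatchIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://solr-host:8983/solr/collection1"); // placeholder URL
        server.setRequestWriter(new BinaryRequestWriter()); // send updates as javabin instead of XML

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 1; i <= 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", i);                       // sequential id
            doc.addField("title", "title " + i);         // small string
            doc.addField("date", new Date());            // date
            doc.addField("body", "...");                 // ~1kb of text in the real run
            batch.add(doc);
            if (batch.size() == 1000) {                  // batches of 1k docs
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();     // commit after the last document; autoCommit every 15s is in solrconfig.xml
        server.shutdown();
    }
}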

3)      using a SolrJ client from another PC on the LAN (100Mbps),
        with ConcurrentUpdateSolrServer,
        with the javabin format,
        adding documents to the server in batches of 1k docs ( server.add( <collection> ) ),
        server queue size=20k,
        server threads=4,
        no auto commit, and a commit every 100k docs

        total time: 167301ms
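
And this is the shape of the run 3) client; queue size, thread count and commit interval are the ones above, the URL and field values are placeholders, and the javabin request-writer setup is left out:

import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ConcurrentIndexer {
    public static void main(String[] args) throws Exception {
        // placeholder URL; a 20k-request queue drained by 4 background threads
        ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://solr-host:8983/solr/collection1", 20000, 4);

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 1; i <= 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", i);
            doc.addField("title", "title " + i);
            doc.addField("date", new Date());
            doc.addField("body", "...");                 // ~1kb of text in the real run
            batch.add(doc);
            if (batch.size() == 1000) {                  // batches of 1k docs
                server.add(batch);
                batch.clear();
            }
            if (i % 100000 == 0) {
                server.commit();                         // explicit commit every 100k docs, no autoCommit
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.blockUntilFinished();                     // wait for the internal queue to drain
        server.commit();
        server.shutdown();
    }
}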


--On the Solr server--
CPU     averages 25%
        at best 100% on one core
IO      is still far from being saturated
        iostat gives a pattern like this (every 5s):

        time(s)    %util
        100        45.20
        105         1.68
        110        17.44
        115        76.32
        120         2.64
        125        68.00
        130         1.28

I thought that by using ConcurrentUpdateSolrServer I would be able to max out the CPU or the IO, but I wasn't.
With ConcurrentUpdateSolrServer I can't rely on auto commit, otherwise I get an OutOfMemory error,
and I found that committing every 100k docs gives worse performance than auto committing every 15s
(the benchmark 3 setup with HttpSolrServer took 193515ms).

I'd really like to understand why I can't max out the resources on the server hosting Solr (the disk above all),
and what I'm doing wrong with ConcurrentUpdateSolrServer.

thanks
