I'd like some suggestions on how to improve indexing performance in the following scenario.
I'm uploading 1M docs to Solr.

Every doc has
        id: sequential number
        title:  small string
        date: date
        body: 1kb of text

Here are my benchmarks (each is a single run, not an average over multiple runs):

1)      using the UpdateRequestHandler,
        streaming docs from a CSV file on the same disk as Solr,
        auto commit every 15s with openSearcher=false, and a commit after the last document

        total time: 143035ms

1.1)    using the UpdateRequestHandler,
        streaming docs from a CSV file on the same disk as Solr,
        auto commit every 15s with openSearcher=false, and a commit after the last document,
        with:
        <ramBufferSizeMB>500</ramBufferSizeMB>
        <maxBufferedDocs>100000</maxBufferedDocs>
        
        total time: 134493ms

1.2)    using the UpdateRequestHandler,
        streaming docs from a CSV file on the same disk as Solr,
        auto commit every 15s with openSearcher=false, and a commit after the last document,
        with:
        <mergeFactor>30</mergeFactor>

        total time: 143134ms
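
In case it helps, here is a minimal SolrJ sketch of the kind of CSV upload used in runs 1) to 1.2). Class names assume SolrJ 4.x, and the URL and file path are placeholders; in my actual runs the CSV sits on the Solr host's own disk, while this sketch streams it from the client, so it is only meant to show the handler and commit parameters (autoCommit, ramBufferSizeMB, maxBufferedDocs and mergeFactor all stay in solrconfig.xml):

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvUpload {
    public static void main(String[] args) throws Exception {
        // placeholder core URL
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        // push the CSV through the CSV update handler
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
        req.addFile(new File("docs.csv"), "text/csv;charset=utf-8"); // id,title,date,body columns
        req.setParam("commit", "true");                              // explicit commit after the last document
        server.request(req);
        server.shutdown();
    }
}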

2)      using a SolrJ client from another PC on the LAN (100Mbps),
        with HttpSolrServer,
        with the javabin format,
        adding documents to the server in batches of 1k docs ( server.add( <collection> ) ),
        auto commit every 15s with openSearcher=false, and a commit after the last document

        total time: 139022ms
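
This is roughly what the client does in run 2), as a sketch; the URL and field values are placeholders, not the real ones:

import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class HttpBatchIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://solr-host:8983/solr/collection1"); // placeholder URL
        server.setRequestWriter(new BinaryRequestWriter()); // send updates as javabin instead of XML

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 1; i <= 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", i);                       // sequential id
            doc.addField("title", "title " + i);         // small string
            doc.addField("date", new Date());            // date
            doc.addField("body", "...");                 // ~1kb of text in the real run
            batch.add(doc);
            if (batch.size() == 1000) {                  // batches of 1k docs
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();     // commit after the last document; autoCommit every 15s is in solrconfig.xml
        server.shutdown();
    }
}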

3)      using a SolrJ client from another PC on the LAN (100Mbps),
        with ConcurrentUpdateSolrServer,
        with the javabin format,
        adding documents to the server in batches of 1k docs ( server.add( <collection> ) ),
        server queue size=20k,
        server threads=4,
        no auto commit, and a commit every 100k docs

        total time: 167301ms
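
And this is the shape of the run 3) client; queue size, thread count and commit interval are the ones above, the URL and field values are placeholders, and the javabin request-writer setup is left out:

import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ConcurrentIndexer {
    public static void main(String[] args) throws Exception {
        // placeholder URL; a 20k-request queue drained by 4 background threads
        ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://solr-host:8983/solr/collection1", 20000, 4);

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 1; i <= 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", i);
            doc.addField("title", "title " + i);
            doc.addField("date", new Date());
            doc.addField("body", "...");                 // ~1kb of text in the real run
            batch.add(doc);
            if (batch.size() == 1000) {                  // batches of 1k docs
                server.add(batch);
                batch.clear();
            }
            if (i % 100000 == 0) {
                server.commit();                         // explicit commit every 100k docs, no autoCommit
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.blockUntilFinished();                     // wait for the internal queue to drain
        server.commit();
        server.shutdown();
    }
}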


--On the Solr server--
CPU     averages 25%
        at best 100% on one core
IO      is still far from being saturated
        iostat gives a pattern like this (every 5s):

        time(s)    %util
        100        45.20
        105         1.68
        110        17.44
        115        76.32
        120         2.64
        125        68.00
        130         1.28

I thought that by using ConcurrentUpdateSolrServer I would be able to max out the CPU or the IO, but I wasn't.
With ConcurrentUpdateSolrServer I can't rely on auto commit, otherwise I get an OutOfMemory error,
and I found that committing every 100k docs gives worse performance than auto committing every 15s
(the benchmark 3 setup with HttpSolrServer took 193515ms).

I'd really like to understand why I can't max out the resources on the server hosting Solr (the disk above all),
and what I'm doing wrong with ConcurrentUpdateSolrServer.

thanks
