Just skimmed, but the usual reason you can't max out the server is that the
client can't feed it fast enough. Very quick experiment: comment out the
server.add line in your client and run it again. Does the client speed up
substantially? If not, then the time is being spent on the client side rather
than in Solr.
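
If it helps, the experiment looks roughly like this. It's only a sketch: the
file name, URL, field names and CSV parsing are placeholder assumptions, not
your actual code; the point is just timing the loop with server.add disabled.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ClientOnlyTest {
        public static void main(String[] args) throws Exception {
            boolean sendToSolr = false;  // flip to true to restore normal indexing
            HttpSolrServer server =
                    new HttpSolrServer("http://solrhost:8983/solr/collection1");
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();

            long start = System.currentTimeMillis();
            BufferedReader in = new BufferedReader(new FileReader("docs.csv"));
            for (String line; (line = in.readLine()) != null; ) {
                String[] f = line.split(",", 4);  // id,title,date,body -- adapt to your CSV
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", f[0]);
                doc.addField("title", f[1]);
                doc.addField("date", f[2]);
                doc.addField("body", f[3]);
                batch.add(doc);
                if (batch.size() == 1000) {
                    if (sendToSolr) {
                        server.add(batch);        // <-- the line to comment out
                    }
                    batch.clear();
                }
            }
            in.close();
            // leftover partial batch ignored in this sketch
            System.out.println("client-only time: "
                    + (System.currentTimeMillis() - start) + "ms");
        }
    }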

Or split your csv file into, say, 5 parts and run it from 5 different
PCs in parallel.

bq: I can't rely on auto commit, otherwise I get an OutOfMemory error
This shouldn't be happening; I'd get to the bottom of it. It may be as simple
as allocating more memory to the JVM running Solr (a larger -Xmx).

bq: committing every 100k docs gives worse performance
BTW, it's best to specify openSearcher=false on your hard commits for maximum
indexing throughput. With that set you should be able to commit quite
frequently; every 15 seconds seems quite reasonable.
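
In solrconfig.xml that's the <autoCommit> block; with the 15 second interval
you're already using it would look something like this (the value is just your
15s expressed in ms):

    <autoCommit>
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>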

Best,
Erick

On Sun, Oct 6, 2013 at 12:19 PM, Matteo Grolla <matteo.gro...@gmail.com> wrote:
> I'd like some suggestions on how to improve the indexing performance in the
> following scenario.
> I'm uploading 1M docs to Solr,
>
> every doc has
>         id: sequential number
>         title:  small string
>         date: date
>         body: 1kb of text
>
> Here are my benchmarks (they are all single executions, not averages from 
> multiple executions):
>
> 1)      using the updaterequesthandler
>         and streaming docs from a CSV file on the same disk as Solr
>         auto commit every 15s with openSearcher=false and commit after last 
> document
>
>         total time: 143035ms
>
> 1.1)    using the updaterequesthandler
>         and streaming docs from a CSV file on the same disk as Solr
>         auto commit every 15s with openSearcher=false and commit after last 
> document
>         <ramBufferSizeMB>500</ramBufferSizeMB>
>         <maxBufferedDocs>100000</maxBufferedDocs>
>
>         total time: 134493ms
>
> 1.2)    using the updaterequesthandler
>         and streaming docs from a CSV file on the same disk as Solr
>         auto commit every 15s with openSearcher=false and commit after last 
> document
>         <mergeFactor>30</mergeFactor>
>
>         total time: 143134ms
>
> 2)      using a solrj client from another pc in the lan (100Mbps)
>         with httpsolrserver
>         with javabin format
>         add documents to the server in batches of 1k docs       ( server.add( 
> <collection> ) )
>         auto commit every 15s with openSearcher=false and commit after last 
> document
>
>         total time: 139022ms
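>
>         For reference, the client in benchmark 2 is set up roughly like this
>         (the URL and variable names are placeholders, not the exact code):
>
>         HttpSolrServer server =
>                 new HttpSolrServer("http://solrhost:8983/solr/collection1");
>         server.setRequestWriter(new BinaryRequestWriter());  // javabin for updates
>
>         List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
>         // ... fill batch with 1k docs from the CSV, then:
>         server.add(batch);
>         batch.clear();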
>
> 3)      using a solrj client from another pc in the lan (100Mbps)
>         with concurrentupdatesolrserver
>         with javabin format
>         add documents to the server in batches of 1k docs       ( server.add( 
> <collection> ) )
>         server queue size=20k
>         server threads=4
>         no auto-commit and commit every 100k docs
>
>         total time: 167301ms
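>
>         For reference, the ConcurrentUpdateSolrServer is constructed roughly
>         like this (again the URL is a placeholder; 20000 is the queue size
>         and 4 the number of background threads):
>
>         ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
>                 "http://solrhost:8983/solr/collection1", 20000, 4);
>         // docs are added in the same 1k batches; server.commit() is called
>         // explicitly every 100k docs instead of relying on auto commit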
>
>
> --On the solr server--
> cpu averages    25%
>         at best 100% for 1 core
> IO      is still far from being saturated
>         iostat gives a pattern like this (sampled every 5s):
>
>         time(s)   %util
>         100       45.20
>         105        1.68
>         110       17.44
>         115       76.32
>         120        2.64
>         125       68.00
>         130        1.28
>
> I thought that by using concurrentupdatesolrserver I would be able to max out
> CPU or IO, but I wasn't.
> With concurrentupdatesolrserver I can't rely on auto commit, otherwise I get
> an OutOfMemory error,
> and I found that committing every 100k docs gives worse performance than auto
> commit every 15s (benchmark 3 with httpsolrserver took 193515ms).
>
> I'd really like to understand why I can't max out the resources on the server
> hosting Solr (especially the disk),
> and I'd really like to understand what I'm doing wrong with
> concurrentupdatesolrserver.
>
> thanks
>
