On 8/4/2011 12:38 AM, Bernd Fehling wrote:
Hi Shawn,

the 0.05 seconds for search time at peak times (3 qps) is my target for Solr. The numbers for Solr are from Solr's statistics report page. So 39.5 seconds average per request is definitely too long and I have to change to sharding.

Solr reports all query times in milliseconds.  39.5 would be 0.0395 seconds.

For FAST system the numbers for the search dispatcher are:
     0.042 sec elapsed per normal search, on avg.
     0.053 sec average uncached normal search time (last 100 queries).
     99.898% of searches using < 1 sec
     99.999% of searches using < 3 sec
     0.000% of all requests timed out
     22454567.577 sec time up (that is 259 days)

Is there a report page for those numbers for Solr?

The Solr statistics page normally reports averages, but not percentile statistics. You can add percentile-based statistics (on a limited subset of your queries) to a 3.x or trunk (4.0) version with SOLR-1972; I am using this patch in production. Alternatively, you can turn on INFO logging in Solr and crawl the logfiles to gather statistics (there is a sketch of that approach after the list below). In the list below (the "standard" section on the stats page), the entries that start with "rolling" are provided by the patch; the others are included by default. Remember that all these times are in milliseconds.

handlerStart : 1312433464327
requests : 24112
errors : 547
timeouts : 0
totalTime : 2565584
avgTimePerRequest : 106.40279
avgRequestsPerSecond : 0.7097045
rollingRequests : 16384
rollingTotalTime : 1594420
rollingAvgTimePerRequest : 97.315674
rollingAvgRequestsPerSecond : 0.74394274
rollingMedian : 16
rolling75thPercentile : 35
rolling95thPercentile : 225
rolling99thPercentile : 2202
rollingMax : 9397
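
For the logfile route, here is a rough sketch (not my actual script, Python) that pulls QTime values out of a Solr INFO request log and prints average and percentile numbers. The log path is a placeholder for wherever your servlet container writes Solr's logs.

import re
import sys

# Each Solr INFO request log line ends with something like "QTime=5",
# the query time in milliseconds.
qtime_pattern = re.compile(r'QTime=(\d+)')

times = []
with open('/var/log/solr/solr.log') as logfile:   # placeholder path
    for line in logfile:
        match = qtime_pattern.search(line)
        if match:
            times.append(int(match.group(1)))

if not times:
    sys.exit('no QTime entries found')

times.sort()

def percentile(sorted_values, pct):
    # Nearest-rank percentile, good enough for rough monitoring.
    index = min(len(sorted_values) - 1,
                int(round(len(sorted_values) * pct / 100.0)))
    return sorted_values[index]

print('requests : %d' % len(times))
print('average  : %.2f ms' % (sum(times) / float(len(times))))
for pct in (50, 75, 95, 99):
    print('%2dth pct : %d ms' % (pct, percentile(times, pct)))
print('max      : %d ms' % times[-1])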

About the RAM: the 32GB RAM is physical for each VM and the 20GB RAM is -Xmx for Java. Yesterday I noticed that we are running out of heap during replication, so I have to increase -Xmx to about 22g.

That doesn't leave much RAM for the OS disk cache, which is the primary way to speed things up with Solr. You should check how long it takes to warm your caches when you commit; you can find that (warmupTime) on the stats page, and a small sketch for pulling it out follows. It's probably a good idea to lower your autowarmCount values.
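
A rough (untested) Python 2 sketch of scripting that check against a 3.x stats page. It assumes the XML output of /admin/stats.jsp with <entry>/<stats>/<stat name="warmupTime"> elements; host, port, and core name are placeholders.

import urllib2
import xml.dom.minidom

# Fetch the stats page (XML in 3.x) and print every warmupTime stat:
# the searcher and any cache with autowarming enabled report one.
# Host, port, and core name are placeholders.
url = 'http://localhost:8983/solr/corename/admin/stats.jsp'
doc = xml.dom.minidom.parseString(urllib2.urlopen(url).read())

for entry in doc.getElementsByTagName('entry'):
    names = entry.getElementsByTagName('name')
    label = names[0].firstChild.data.strip() if names else '?'
    for stat in entry.getElementsByTagName('stat'):
        if stat.getAttribute('name') == 'warmupTime':
            print('%s warmupTime: %s ms' % (label, stat.firstChild.data.strip()))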

If you sharded, you could drop your Java heap size and get more of your index into RAM. I have a heap size of 3GB for an 18.25GB index (the total of all shards is about 110GB) and do not expect to increase that unless we have problems when we start using faceting, spellchecking, and suggestions. I have made particular tweaks to garbage collection and have written about my experiences on this list. My memory-related Java parameters:

-Xms3072M -Xmx3072M
-XX:NewSize=2048M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled

The reported 0.6 average requests per second seems right to me, because the Solr system isn't under full load yet. The FAST system is still taking most of the load. I plan to switch completely to Solr after sharding is up and running stably. So there will be an additional 3 qps to Solr at peak times.

When I read that before, I thought you were saying it was 0.6 seconds per request, not requests per second. My apologies. A qps of 3 is quite low. I've seen numbers mentioned here above 30000 qps, and I'm sure some of the list veterans have seen much higher.

I don't know if a controlling master like FAST makes any sense for Solr.
The small VMs with heartbeat and haproxy sound great; that must go on my todo list.

If you don't create a core to automatically add the shards parameter (the master server idea), your application will have to include the parameter on every request, which means it must be aware of how you have sharded your index. If that's acceptable to you, there's no problem. In my case, every single Solr instance has a copy of this broker core. I only use it on two of them, the two that the load balancer knows about.
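
To make the trade-off concrete, this is the kind of request the application has to build itself when there is no broker core carrying the shards parameter in its request handler defaults. All host and core names below are made up; the shards syntax (host:port/path, no http://) is the part that matters.

import urllib
import urllib2

# A distributed query issued directly by the application: it has to
# know every shard core. All host and core names are placeholders.
shards = ','.join([
    'idxa1.example.com:8983/solr/s0live',
    'idxa2.example.com:8983/solr/s1live',
    'idxa3.example.com:8983/solr/s2live',
    'idxb1.example.com:8983/solr/inclive',
])

params = urllib.urlencode({
    'q': 'title:solr',
    'rows': 10,
    'shards': shards,
})

url = 'http://idxa1.example.com:8983/solr/s0live/select?' + params
print(urllib2.urlopen(url).read())

With a broker core, those same shard entries live in a request handler's defaults in solrconfig.xml, and the application just sends its normal query to the broker through the load balancer.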

But the biggest problem currently is how to configure the DIH to split up the content across several indexers. Is there an indexing distributor?

There is currently no way to have Solr figure out distributed indexing. Solr doesn't know how you have sharded your data, and it cannot keep track of primary/secondary indexers. Your build system must figure these things out. My dih-config.xml accepts variables via the URL, which I use to tailor my SQL queries.

        SELECT * FROM ${dataimporter.request.dataView}
        WHERE (
          (
            did &gt; ${dataimporter.request.minDid}
            AND did &lt;= ${dataimporter.request.maxDid}
          )
          ${dataimporter.request.extraWhere}
        ) AND (crc32(did) % ${dataimporter.request.numShards})
          IN (${dataimporter.request.modVal})
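
Those ${dataimporter.request.*} variables are just URL parameters on the import request, so the build system decides what goes to which shard. A hedged sketch of kicking off one shard's import with the parameter names from the config above (host, core name, and values are placeholders):

import urllib
import urllib2

# Start a DIH full-import on one shard core. Everything besides
# 'command', 'clean', and 'commit' is exposed to the config as
# ${dataimporter.request.<name>}. All values here are placeholders.
params = urllib.urlencode({
    'command': 'full-import',
    'clean': 'false',
    'commit': 'true',
    'dataView': 'docpointer',     # which DB view/table to read
    'minDid': 0,                  # lower document id boundary
    'maxDid': 250000000,          # upper document id boundary
    'extraWhere': '',             # optional extra WHERE clause
    'numShards': 6,               # modulus for crc32(did)
    'modVal': '0',                # hash bucket(s) this shard owns
})

url = 'http://idxa1.example.com:8983/solr/s0live/dataimport?' + params
print(urllib2.urlopen(url).read())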

I index all new content to a smaller index which I have called the incremental. The updates that run every two minutes include a modVal for the above query of "0 1 2 3 4 5". Once a night, I figure out which content is older than one week. I index that content into the larger static shards and then delete it from the incremental. This ensures that commits happen quickly on the index where new content goes. Other processes that hit all shards (like deletes) run every ten minutes and check for document presence before they operate, so I will often go an hour or more between actual updates to the larger indexes, giving the Solr caches a longer lifetime.
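
The nightly move is the same style of DIH call as above, pointed at the static shards with minDid/maxDid set to the week-old boundary, followed by a delete on the incremental core. A simplified sketch of one way to do just the delete step (cutoff value, host, core, and field name are placeholders):

import urllib2

# Remove the migrated documents from the incremental core by query,
# then commit. The cutoff and all names here are placeholders.
CUTOFF_DID = 250000000   # highest document id that is over a week old
delete_xml = '<delete><query>did:[* TO %d]</query></delete>' % CUTOFF_DID

req = urllib2.Request(
    'http://idxb1.example.com:8983/solr/inclive/update?commit=true',
    data=delete_xml,
    headers={'Content-Type': 'text/xml'},
)
print(urllib2.urlopen(req).read())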

Thanks,
Shawn
