On 8/4/2011 12:38 AM, Bernd Fehling wrote:
Hi Shawn,

the 0.05 seconds for search time at peak times (3 qps) is my target for Solr. The numbers for Solr are from Solr's statistics report page. So 39.5 seconds average per request is definitely too long and I have to change to sharding.

Solr reports all query times in milliseconds.  39.5 would be 0.0395 seconds.

For FAST system the numbers for the search dispatcher are:
     0.042 sec elapsed per normal search, on avg.
     0.053 sec average uncached normal search time (last 100 queries).
     99.898% of searches using < 1 sec
     99.999% of searches using < 3 sec
     0.000% of all requests timed out
     22454567.577 sec time up (that is 259 days)

Is there a report page for those numbers for Solr?

The Solr statistics page normally reports averages, but not percentile statistics. You can add percentile-based statistics (on a limited subset of your queries) to a 3.x or trunk (4.0) version with SOLR-1972; I am using this patch in production. Alternatively, you can turn on INFO logging in Solr and crawl the logfiles to gather statistics (there is a sketch of that approach after the list below). In the list below (the "standard" section on the stats page), the entries that start with "rolling" are provided by the patch; the others are included by default. Remember that all these times are in milliseconds.

handlerStart : 1312433464327
requests : 24112
errors : 547
timeouts : 0
totalTime : 2565584
avgTimePerRequest : 106.40279
avgRequestsPerSecond : 0.7097045
rollingRequests : 16384
rollingTotalTime : 1594420
rollingAvgTimePerRequest : 97.315674
rollingAvgRequestsPerSecond : 0.74394274
rollingMedian : 16
rolling75thPercentile : 35
rolling95thPercentile : 225
rolling99thPercentile : 2202
rollingMax : 9397
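
For the logfile route, here is a rough sketch (not my actual script, Python) that pulls QTime values out of a Solr INFO request log and prints average and percentile numbers. The log path is a placeholder for wherever your servlet container writes Solr's logs.

import re
import sys

# Each Solr INFO request log line ends with something like "QTime=5",
# the query time in milliseconds.
qtime_pattern = re.compile(r'QTime=(\d+)')

times = []
with open('/var/log/solr/solr.log') as logfile:   # placeholder path
    for line in logfile:
        match = qtime_pattern.search(line)
        if match:
            times.append(int(match.group(1)))

if not times:
    sys.exit('no QTime entries found')

times.sort()

def percentile(sorted_values, pct):
    # Nearest-rank percentile, good enough for rough monitoring.
    index = min(len(sorted_values) - 1,
                int(round(len(sorted_values) * pct / 100.0)))
    return sorted_values[index]

print('requests : %d' % len(times))
print('average  : %.2f ms' % (sum(times) / float(len(times))))
for pct in (50, 75, 95, 99):
    print('%2dth pct : %d ms' % (pct, percentile(times, pct)))
print('max      : %d ms' % times[-1])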

About the RAM: the 32GB RAM is physical for each VM and the 20GB RAM is -Xmx for Java. Yesterday I noticed that we are running out of heap during replication, so I have to increase -Xmx to about 22g.

That doesn't leave much RAM for the OS disk cache, which is the primary way to speed things up with Solr. You should check how long it takes to warm your caches when you commit; you can find that (warmupTime) on the stats page, and a small sketch for pulling it out follows. It's probably a good idea to lower your autowarmCount values.
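
A rough (untested) Python 2 sketch of scripting that check against a 3.x stats page. It assumes the XML output of /admin/stats.jsp with <entry>/<stats>/<stat name="warmupTime"> elements; host, port, and core name are placeholders.

import urllib2
import xml.dom.minidom

# Fetch the stats page (XML in 3.x) and print every warmupTime stat:
# the searcher and any cache with autowarming enabled report one.
# Host, port, and core name are placeholders.
url = 'http://localhost:8983/solr/corename/admin/stats.jsp'
doc = xml.dom.minidom.parseString(urllib2.urlopen(url).read())

for entry in doc.getElementsByTagName('entry'):
    names = entry.getElementsByTagName('name')
    label = names[0].firstChild.data.strip() if names else '?'
    for stat in entry.getElementsByTagName('stat'):
        if stat.getAttribute('name') == 'warmupTime':
            print('%s warmupTime: %s ms' % (label, stat.firstChild.data.strip()))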

If you sharded, you could drop your Java heap size and get more of your index into RAM. I have a heap size of 3GB for an 18.25GB index (the total of all shards is about 110GB) and do not expect to increase that unless we have problems when we start using faceting, spellchecking, and suggestions. I have made particular tweaks to garbage collection and have written about my experiences on this list. My memory-related Java parameters:

-Xms3072M -Xmx3072M
-XX:NewSize=2048M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled

The reported 0.6 average requests per second seems right to me, because the Solr system isn't under full load yet. The FAST system is still taking most of the load. I plan to switch completely to Solr after sharding is up and running stably. So there will be an additional 3 qps to Solr at peak times.

When I read that before, I thought you were saying it was 0.6 seconds per request, not requests per second. My apologies. A qps of 3 is quite low. I've seen numbers mentioned here above 30000 qps, and I'm sure some of the list veterans have seen much higher.

I don't know if a controlling master like FAST makes any sense for Solr.
The small VMs with heartbeat and haproxy sound great; that must go on my todo list.

If you don't create a core to automatically add the shards parameter (the master server idea), your application will have to include the parameter on every request, which means it must be aware of how you have sharded your index. If that's acceptable to you, there's no problem. In my case, every single Solr instance has a copy of this broker core. I only use it on two of them, the two that the load balancer knows about.
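
To make the trade-off concrete, this is the kind of request the application has to build itself when there is no broker core carrying the shards parameter in its request handler defaults. All host and core names below are made up; the shards syntax (host:port/path, no http://) is the part that matters.

import urllib
import urllib2

# A distributed query issued directly by the application: it has to
# know every shard core. All host and core names are placeholders.
shards = ','.join([
    'idxa1.example.com:8983/solr/s0live',
    'idxa2.example.com:8983/solr/s1live',
    'idxa3.example.com:8983/solr/s2live',
    'idxb1.example.com:8983/solr/inclive',
])

params = urllib.urlencode({
    'q': 'title:solr',
    'rows': 10,
    'shards': shards,
})

url = 'http://idxa1.example.com:8983/solr/s0live/select?' + params
print(urllib2.urlopen(url).read())

With a broker core, those same shard entries live in a request handler's defaults in solrconfig.xml, and the application just sends its normal query to the broker through the load balancer.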

But the biggest problem currently is how to configure the DIH to split up the content across several indexers. Is there an indexing distributor?

There is currently no way to have Solr figure out distributed indexing. Solr doesn't know how you have sharded your data, and it cannot keep track of primary/secondary indexers. Your build system must figure these things out. My dih-config.xml accepts variables via the URL, which I use to tailor my SQL queries.

        SELECT * FROM ${dataimporter.request.dataView}
        WHERE (
          (
            did &gt; ${dataimporter.request.minDid}
            AND did &lt;= ${dataimporter.request.maxDid}
          )
          ${dataimporter.request.extraWhere}
        ) AND (crc32(did) % ${dataimporter.request.numShards})
          IN (${dataimporter.request.modVal})
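
Those ${dataimporter.request.*} variables are just URL parameters on the import request, so the build system decides what goes to which shard. A hedged sketch of kicking off one shard's import with the parameter names from the config above (host, core name, and values are placeholders):

import urllib
import urllib2

# Start a DIH full-import on one shard core. Everything besides
# 'command', 'clean', and 'commit' is exposed to the config as
# ${dataimporter.request.<name>}. All values here are placeholders.
params = urllib.urlencode({
    'command': 'full-import',
    'clean': 'false',
    'commit': 'true',
    'dataView': 'docpointer',     # which DB view/table to read
    'minDid': 0,                  # lower document id boundary
    'maxDid': 250000000,          # upper document id boundary
    'extraWhere': '',             # optional extra WHERE clause
    'numShards': 6,               # modulus for crc32(did)
    'modVal': '0',                # hash bucket(s) this shard owns
})

url = 'http://idxa1.example.com:8983/solr/s0live/dataimport?' + params
print(urllib2.urlopen(url).read())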

I index all new content to a smaller index which I have called the incremental. The updates that run every two minutes include a modVal for the above query of "0 1 2 3 4 5". Once a night, I figure out which content is older than one week. I index that content into the larger static shards and then delete it from the incremental. This ensures that commits happen quickly on the index where new content goes. Other processes that hit all shards (like deletes) run every ten minutes and check for document presence before they operate, so I will often go an hour or more between actual updates to the larger indexes, giving the Solr caches a longer lifetime.
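
The nightly move is the same style of DIH call as above, pointed at the static shards with minDid/maxDid set to the week-old boundary, followed by a delete on the incremental core. A simplified sketch of one way to do just the delete step (cutoff value, host, core, and field name are placeholders):

import urllib2

# Remove the migrated documents from the incremental core by query,
# then commit. The cutoff and all names here are placeholders.
CUTOFF_DID = 250000000   # highest document id that is over a week old
delete_xml = '<delete><query>did:[* TO %d]</query></delete>' % CUTOFF_DID

req = urllib2.Request(
    'http://idxb1.example.com:8983/solr/inclive/update?commit=true',
    data=delete_xml,
    headers={'Content-Type': 'text/xml'},
)
print(urllib2.urlopen(req).read())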

Thanks,
Shawn
