On 8/4/2011 12:38 AM, Bernd Fehling wrote:
Hi Shawn,
the 0.05 seconds for search time at peak times (3 qps) is my target
for Solr.
The numbers for Solr are from Solr's statistics report page. So 39.5
seconds average per request is definitely too long, and I have to
change to sharding.
Solr reports all query times in milliseconds. 39.5 would be 0.0395 seconds.
For FAST system the numbers for the search dispatcher are:
0.042 sec elapsed per normal search, on avg.
0.053 sec average uncached normal search time (last 100 queries).
99.898% of searches using < 1 sec
99.999% of searches using < 3 sec
0.000% of all requests timed out
22454567.577 sec time up (that is 259 days)
Is there a report page for those numbers for Solr?
The Solr statistics page normally reports averages, but not percentile
statistics. You can add percentile-based statistics (on a limited
subset of your queries) to a 3.X or trunk (4.0) version with SOLR-1972.
I am using this patch in production. Alternatively, you can use INFO
logging in Solr and crawl the logfiles to gather statistics; a
log-crawling sketch follows the list. In the list below (the
"standard" section on the stats page), the entries that start with
"rolling" are provided by the patch; the others are included by
default. Remember that all these times are in milliseconds.
handlerStart : 1312433464327
requests : 24112
errors : 547
timeouts : 0
totalTime : 2565584
avgTimePerRequest : 106.40279
avgRequestsPerSecond : 0.7097045
rollingRequests : 16384
rollingTotalTime : 1594420
rollingAvgTimePerRequest : 97.315674
rollingAvgRequestsPerSecond : 0.74394274
rollingMedian : 16
rolling75thPercentile : 35
rolling95thPercentile : 225
rolling99thPercentile : 2202
rollingMax : 9397
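If you go the log-crawling route, here is a minimal sketch in Python.
It assumes each query line in the request log carries a QTime=<millis>
field (the default INFO log format does) and computes nearest-rank
percentiles over whatever lines you feed it on stdin:

import re, sys

# Collect QTime values (milliseconds) from Solr request log lines.
times = []
for line in sys.stdin:
    m = re.search(r'QTime=(\d+)', line)
    if m:
        times.append(int(m.group(1)))
times.sort()

def pct(p):
    # nearest-rank percentile over the sorted QTime list
    return times[max(0, (len(times) * p + 99) // 100 - 1)]

if times:
    print("requests: %d" % len(times))
    for p in (50, 75, 95, 99):
        print("%dth percentile: %d ms" % (p, pct(p)))
    print("max: %d ms" % times[-1])

Feed it just the query lines, e.g.
grep '/select' solr.log | python qtime-pct.py
(the script name is made up).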
About the RAM: the 32GB RAM is physical for each VM and the 20GB RAM
is -Xmx for Java.
Yesterday I noticed that we are running out of heap during
replication, so I have to increase -Xmx to about 22g.
That doesn't leave much RAM for the OS disk cache, which is the primary
way to speed things up with Solr. You should check how long it takes
to warm your caches when you commit; you can find that on the stats
page. It's probably a good idea to lower your autowarmCount values.
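autowarmCount is set per cache in solrconfig.xml. A hypothetical
example of what lowered values might look like (the cache sizes here
are illustrative, not a recommendation):

<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="32"/>
<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="16"/>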
If you sharded, you could drop your Java heap size and get more of your
index into RAM. I have a heap size of 3GB for an 18.25GB index (total
of all shards is about 110GB) and do not expect to be increasing that
unless we have problems when we start using faceting, spellchecking, and
suggestions. I made particular tweaks to garbage collection and wrote
about my experiences on this list. My memory-related Java parameters:
-Xms3072M -Xmx3072M            # fixed 3GB heap (minimum = maximum)
-XX:NewSize=2048M              # 2GB young generation
-XX:+UseParNewGC               # parallel young-generation collector
-XX:+UseConcMarkSweepGC        # CMS concurrent old-generation collector
-XX:+CMSParallelRemarkEnabled  # multi-threaded CMS remark phase
The reported 0.6 average requests per second seems right to me because
the Solr system isn't under full load yet. The FAST system is still
taking most of the load. I plan to switch completely to Solr after
sharding is up and running stably, so there will be an additional
3 qps to Solr at peak times.
When I read that before, I thought you were saying it was 0.6 seconds
per request, not requests per second. My apologies. A qps of 3 is
quite low. I've seen numbers mentioned here above 30000 qps, and I'm
sure some of the list veterans have seen much higher.
I don't know whether a controlling master like FAST's makes any sense
for Solr. The small VMs with heartbeat and haproxy sound great; that
must go on my todo list.
If you don't create a core to automatically add the shards parameter
(the master server idea), your application will have to include the
parameter on every request, which means it must be aware of how you have
sharded your index. If that's acceptable to you, there's no problem.
In my case, every single Solr instance has a copy of this broker core.
I only use it on two of them, the two that the load balancer knows about.
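If your application adds the parameter itself, the request looks
something like this. A minimal sketch with made-up host and core
names; the shards parameter itself is standard Solr distributed
search:

import json
import urllib.parse, urllib.request

# Hypothetical shard layout -- the application has to know this.
SHARDS = ",".join([
    "idx1.example.com:8983/solr/s0",
    "idx2.example.com:8983/solr/s1",
    "idx3.example.com:8983/solr/inc",
])

params = urllib.parse.urlencode({
    "q": "title:solr",
    "shards": SHARDS,  # fan the query out to every shard
    "wt": "json",
})
resp = urllib.request.urlopen(
    "http://idx1.example.com:8983/solr/s0/select?" + params)
print(json.loads(resp.read().decode("utf-8"))["response"]["numFound"])

The broker core approach just moves that shards list into the request
handler defaults in solrconfig.xml, so the application doesn't have to
know the layout.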
But the biggest problem currently is how to configure the DIH to split
the content across several indexers. Is there an indexing distributor?
There is currently no way to have Solr figure out distributed indexing.
Solr doesn't know how you have sharded your data, and it cannot keep
track of primary/secondary indexers. Your build system must figure
these things out. My dih-config.xml accepts variables via the URL,
which I use to tailor my SQL queries.
SELECT * FROM ${dataimporter.request.dataView}
WHERE (
    (
        did > ${dataimporter.request.minDid}
        AND did <= ${dataimporter.request.maxDid}
    )
    ${dataimporter.request.extraWhere}
) AND (crc32(did) % ${dataimporter.request.numShards})
    IN (${dataimporter.request.modVal})
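To give an idea of how a build system might drive this, here is a
rough sketch. The host names, core names, did range, and modVal
assignments are all made up; DIH's command and the custom variables
are simply passed as URL parameters:

import urllib.parse, urllib.request

# Which modVal buckets each shard owns (hypothetical layout). The
# values are comma-separated here so they drop into the SQL IN clause.
SHARDS = [
    ("idx1.example.com:8983/solr/s0", "0,3"),
    ("idx2.example.com:8983/solr/s1", "1,4"),
    ("idx3.example.com:8983/solr/s2", "2,5"),
]

for host_core, mod_val in SHARDS:
    params = urllib.parse.urlencode({
        "command": "full-import",
        "clean": "true",
        "dataView": "documents",  # hypothetical view name
        "minDid": 0,
        "maxDid": 50000000,
        "extraWhere": "",
        "numShards": 6,
        "modVal": mod_val,
    })
    urllib.request.urlopen(
        "http://%s/dataimport?%s" % (host_core, params))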
I index all new content into a smaller index, which I call the
incremental. The updates that run every two minutes include a modVal
for the above query of "0 1 2 3 4 5". Once a night, I figure out which
content is older than one week. I index that content into the larger
static shards and then delete it from the incremental. This ensures
that commits happen quickly on the index where new content goes. Other
processes that hit all shards (like deletes) run every ten minutes and
check for document presence before they operate, so I will often go an
hour or more between actual updates to the larger indexes, giving the
Solr caches a longer lifetime.
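The delete half of that nightly move can be a plain delete-by-query
against the incremental core. This assumes the documents carry an
indexed date field; "indexed_at" is a made-up name, and NOW-7DAY is
standard Solr date math:

import urllib.request

# Remove everything older than one week from the incremental core,
# after it has been safely reindexed into the static shards.
body = "<delete><query>indexed_at:[* TO NOW-7DAY]</query></delete>"
req = urllib.request.Request(
    "http://idx3.example.com:8983/solr/inc/update?commit=true",
    data=body.encode("utf-8"),
    headers={"Content-Type": "text/xml"})
urllib.request.urlopen(req)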
Thanks,
Shawn