Thank you Toke - yes - the data is indexed throughout the day. We are handling very few searches, probably 50 a day; this is an R&D system. I believe our HDFS cache is too small at 10 GB per shard. That comes out to 20 GB of HDFS cache per physical machine, plus about 10 GB of heap for each of the 2 JVMs running the shards. Each of those machines is also running other services, which leaves very little RAM available for the OS file system cache.

Current parameters for running each shard are:
JAVA_OPTS="-XX:MaxDirectMemorySize=10g -XX:+UseLargePages -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC -XX:+CMSScavengeBeforeRemark -XX:PretenureSizeThreshold=64m -XX:CMSFullGCsBeforeCompaction=1 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:CMSTriggerPermRatio=80 -XX:CMSMaxAbortablePrecleanTime=6000 -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+AggressiveOpts -XX:ParallelGCThreads=7 -Xmx10752m"
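
To put rough numbers on the memory picture described above, here is a minimal back-of-the-envelope sketch. It uses only figures quoted in this thread (2 shards per box, -XX:MaxDirectMemorySize=10g, -Xmx10752m, 2.9TB over 22 shards). It assumes that the direct memory is used entirely as HDFS block cache, that the 2.9TB is the total index size spread evenly across shards, and that TB means decimal terabytes; the function name memory_budget is made up for illustration.

```python
# Rough per-machine memory budget, using figures quoted in this thread.
# Assumptions (not confirmed by the thread): direct memory is used entirely
# as HDFS block cache, the index is spread evenly over all shards, and
# 2.9TB is decimal terabytes of index data.

SHARDS_PER_MACHINE = 2      # 22 shards on 11 boxes
DIRECT_MEMORY_GB = 10       # -XX:MaxDirectMemorySize=10g, per shard JVM
HEAP_MB = 10752             # -Xmx10752m, per shard JVM
INDEX_TB = 2.9              # total index size from the thread summary
TOTAL_SHARDS = 22


def memory_budget():
    """Return the estimated memory use per machine and the cache coverage."""
    heap_gb = HEAP_MB / 1024
    hdfs_cache_gb = SHARDS_PER_MACHINE * DIRECT_MEMORY_GB
    jvm_heap_gb = SHARDS_PER_MACHINE * heap_gb
    index_per_shard_gb = INDEX_TB * 1000 / TOTAL_SHARDS
    return {
        "hdfs_cache_gb": hdfs_cache_gb,
        "jvm_heap_gb": jvm_heap_gb,
        "solr_total_gb": hdfs_cache_gb + jvm_heap_gb,
        "index_per_shard_gb": index_per_shard_gb,
        # Fraction of one shard's index that fits in that shard's block cache.
        "cached_fraction": DIRECT_MEMORY_GB / index_per_shard_gb,
    }


if __name__ == "__main__":
    for name, value in memory_budget().items():
        print(f"{name}: {value:.2f}")
```

Under those assumptions, the two shard JVMs together claim roughly 41 GB per machine, while each shard's block cache covers well under a tenth of that shard's index. That would be consistent with the suspicion that the cache is too small, but it is only an estimate.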

I'd love to try SSDs, but don't have the budget at present to go that route. I'd really like to get the HDFS option to work well, as it reduces system complexity. It seems to me that if our HDFS cluster has enough spindles, performance should be relatively good, as long as the OS can actually do some caching. We will be adding more HDFS nodes in the future, increasing spindle count, and reducing the amount of data stored in Solr. When we redo our SolrCloud setup, we will run only one shard per box and give it more HDFS cache.

-Joe

On 1/7/2015 3:50 PM, Toke Eskildsen wrote:
Joseph Obernberger [j...@lovehorsepower.com] wrote:

[HDFS, 9M docs, 2.9TB, 22 shards, 11 bare metal boxes]

A typical query takes about 7 seconds to run, but we also do faceting
and clustering.  Those can take in the 3 - 5 minute range depending on
what was queried, but can be as little as 10 seconds. The index contains
about 100 fields.

7 seconds without faceting seems like a long time. I am guessing your 3M daily
updates are spread throughout the day, instead of being a nightly batch job?
How many concurrent searches are you handling?

We have no experience with HDFS for Solr indexes, but a quick check indicates 
that it is not a good fit for Solr. At least not out of the box: 
http://hbase.apache.org/book.html#perf.hdfs.curr

We did at one point try to use networked storage for our index. That meant 1/3 
performance, compared to local storage, but of course your mileage will vary. 
As you are looking into ways of improving performance, what about testing the 
performance difference with local storage (SSD of course)?

- Toke Eskildsen

