Hi Erick,
We have the following configuration for our SolrCloud cluster:

   1. 10 Shards
   2. 15 replicas per shard
   3. 9 GB of index size per shard
   4. a total of around 90 mil documents
   5. 2 collections, viz. search1 serving live traffic and search2 for
   indexing. We swap the collections when indexing finishes
   6. Each of the 150 hosts runs 2 JVMs, one for the search1 collection and
   the other for the search2 collection
   7. Each jvm has 12 GB of heap assigned to it while the host has 50GB in
   8. Each host has 16 processors
   9. Linux XXXXXXX 2.6.32-431.5.1.el6.x86_64 #1 SMP Wed Feb 12 00:41:43
   UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
   10. We have two ways to index data.
      1. Bulk indexing. All 90 million docs are pumped in from 14 parallel
      processes (on 14 different client hosts). This is done on the
      collection that is not serving live traffic
      2. Incremental indexing. Only delta changes (ranging from 100K to 5
      mil docs) every two hours. This is done on the collection that is
      also serving live traffic
   11. The request per second count on live collection is around 300 TPS
   12. Hard commit is set to every 30 seconds with openSearcher=false and
   soft commit is set to every 15 minutes. We have tried a lot of different
   settings here, BTW.
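
To put those numbers together, here is a quick back-of-the-envelope calculation in Python. It is only a sketch: it assumes the 15 replicas per shard include the leader, that each JVM hosts exactly one core, and that the 50GB figure is the total RAM on the host. The variable names are just for illustration.

```python
# Rough sizing derived from the configuration listed above.
SHARDS = 10
REPLICAS_PER_SHARD = 15      # assumption: this count includes the leader
INDEX_GB_PER_SHARD = 9
HOSTS = 150
JVMS_PER_HOST = 2            # one JVM per collection (search1, search2)
HEAP_GB_PER_JVM = 12
HOST_RAM_GB = 50             # assumption: total RAM on the host

# Cores (shard replicas) in one collection, and how many land on each host.
cores_per_collection = SHARDS * REPLICAS_PER_SHARD
cores_per_host_per_collection = cores_per_collection // HOSTS

# Memory picture on a single host.
heap_gb_per_host = JVMS_PER_HOST * HEAP_GB_PER_JVM
index_gb_per_host = JVMS_PER_HOST * cores_per_host_per_collection * INDEX_GB_PER_SHARD
ram_outside_heap_gb = HOST_RAM_GB - heap_gb_per_host

print("cores per collection:", cores_per_collection)
print("cores per host per collection:", cores_per_host_per_collection)
print("heap per host (GB):", heap_gb_per_host)
print("index on disk per host (GB):", index_gb_per_host)
print("RAM outside the JVM heaps (GB):", ram_outside_heap_gb)
```

Under those assumptions, each host carries one core per collection, and the memory left outside the two heaps is larger than the two indexes on that host combined.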

Now we have two issues with indexing
1) Solr just cannot keep up with the bulk indexing when replicas are
also active. We concluded this by changing the number of replicas to
just 2, then to 4 and then to 15. As the number of replicas increases, the
bulk indexing time increases almost exponentially.
We seem to have encountered the same issue reported here
It gets to a point where the Solr cluster takes 300 seconds to index even
100 docs. It starts off indexing 100 docs in 55 milliseconds, slows down
steadily, and within an hour and a half just cannot keep up. We have a
workaround for this, i.e. we stop all the replicas, do the bulk indexing
and bring the replicas back up one by one. This sort of defeats the
purpose of SolrCloud, but we can still work with this workaround. We can
do this because bulk indexing happens on the collection that is not
serving live traffic. However, we would love to have a solution from
SolrCloud itself, such as an API to stop replication and start it again at
the end of indexing.

2) This issue is related to soft commit with incremental indexing. When
we do incremental indexing, it is done on the same collection that is
serving live traffic at 300 requests per second. Everything is fine except
when the soft commit happens. Each time the soft commit (autoSoftCommit in
solrconfig.xml) happens, which BTW is at almost the same time throughout
the cluster, there is a spike in response times and throughput drops to
almost 150 TPS. The spike continues for 2 minutes, and then it recurs at
exactly the soft commit interval. We have monitored the logs and found a
direct correlation between when the soft commit happens and when the
response time tanks.

Now the latter issue is quite disturbing, because this collection is
serving live traffic and we cannot sustain these periodic degradations. We
have played around with different soft commit settings: intervals ranging
from 2 minutes to 30 minutes; autowarming half the cache, the full cache,
or only 10%; running warm-up queries on every new searcher, and running no
warm-up queries on a new searcher. All the different settings yield the
same result. Whenever the soft commit happens, the response time tanks and
throughput decreases. The difference is almost 50% in response times
and 50% in throughput.

Our workaround for this issue is to also do the incremental delta indexing
on the collection that is not serving live traffic and swap when it is
done. As you can see, this also defeats the purpose of SolrCloud. We
cannot do bulk indexing because the replicas cannot keep up, and we cannot
do incremental indexing because of the soft commit performance.

Is there a way to make the cluster not do the soft commit everywhere at
the same time, or is there a way to make the soft commit not cause this
degradation?
We are open to any ideas at this point.
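
To make the first question concrete, this is roughly the kind of staggering we have in mind, sketched in Python. It is untested and only an illustration: the host names, the 2-minute spread and the function name are hypothetical, and it assumes the soft commit interval can be set per JVM, for example through a system property referenced from solrconfig.xml. We do not know whether different intervals alone would keep the commits from lining up again.

```python
import hashlib

BASE_INTERVAL_MS = 15 * 60 * 1000   # our current 15-minute soft commit interval
MAX_JITTER_MS = 2 * 60 * 1000       # hypothetical spread, picked to match the ~2 min spike


def staggered_soft_commit_ms(host: str) -> int:
    """Return a per-host soft commit interval in milliseconds.

    The offset is derived from a hash of the host name, so each host
    gets a stable value that differs from most of its neighbours.
    """
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    offset = int.from_bytes(digest[:4], "big") % MAX_JITTER_MS
    return BASE_INTERVAL_MS + offset


# Hypothetical host names, just to show the spread.
for host in ["host001", "host002", "host003"]:
    print(host, staggered_soft_commit_ms(host))
```

Each host would then be started with its own interval instead of the shared 15 minutes.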

Vijay Sekhri
