Hi,

We have the following environment setup for ZooKeeper/SolrCloud:

3-node ZooKeeper ensemble
2 SolrCloud servers

I am writing to ask about the interaction between Solr and ZooKeeper, in 
particular the transactions recorded in the ZooKeeper transaction logs. I 
have a script running that logs the number of transactions, and I am matching 
this log against snapshot timing and new log creation.
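
A minimal sketch of one way to derive such counts from the file names alone, 
not necessarily how the script above works. It assumes the usual ZooKeeper 
layout, where snapshot and log files are named snapshot.<zxid> and log.<zxid> 
with the zxid in hex, the high 32 bits being the epoch and the low 32 bits a 
per-epoch counter. The function names and the sample file names are 
hypothetical.

```python
def parse_zxid(filename):
    """Split a name such as 'snapshot.100004268' into (kind, epoch, counter)."""
    kind, _, suffix = filename.partition(".")
    zxid = int(suffix, 16)                      # suffix is the zxid in hex
    return kind, zxid >> 32, zxid & 0xFFFFFFFF  # high 32 bits: epoch, low 32: counter


def snapshot_gaps(filenames):
    """Return (epoch, transactions since the previous snapshot) pairs.

    Only consecutive snapshots within the same epoch are compared, because
    the counter restarts when a new leader epoch begins.
    """
    snaps = sorted(
        (epoch, counter)
        for kind, epoch, counter in (
            parse_zxid(name) for name in filenames if name.startswith("snapshot.")
        )
    )
    gaps = []
    for (prev_epoch, prev_count), (epoch, count) in zip(snaps, snaps[1:]):
        if prev_epoch == epoch:
            gaps.append((epoch, count - prev_count))
    return gaps
```

It could be fed with the output of os.listdir() on the version-2 directory 
under dataDir. A gap well below snapCount would point at a snapshot that was 
not triggered by the transaction count.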

After a problem arose in our PROD environment, I tracked it to an 
unrecommended configuration in which logs and data were kept on the same 
drive. Since then we have configured separate drives for logs and data in that 
environment. The behavior that caused the problem was that, while a snapshot 
was being taken, a Solr instance reported that it was unable to establish a ZK 
leader. Following that failure, during recovery, 4 more snapshots happened in 
short succession (10 minutes) on all 3 ZK servers, leaving the whole 
environment unresponsive for 1.5 hours, until it was restarted.

I am currently working to recreate the problem and gather more information on 
the cause and impact of snapshots. I have configured a DEV environment with the 
same number of servers, and I have changed the ZK configuration to again keep 
the logs and data on the same drive and in the same directory. I am seeing that 
snapshots cause a degradation in performance due to blocked I/O, but I would 
like more information on transactions and snapshots to confirm this behavior 
and our suspicions.

Here are the scenarios I would like more information about:

1. When the Solr server is restarted, I see a huge influx of transactions 
in the ZooKeeper transaction log. What Solr behavior is causing this, and is 
it normal?

2. There are scenarios where snapshots are created without reaching 
"snapCount" (snapCount=100000) transactions. I have documented snapshots at 17k 
and 45k transactions. In what scenarios would a snapshot be created other than 
on reaching "snapCount" transactions?

3. Since ZK won't respond before writing to the transaction log, is it 
possible that at snapshot time (blocked I/O) the Solr server waits for a 
response from ZK, causing all other writes to be buffered and resulting in a 
full heap and therefore an out-of-memory failure on the Solr node?

a. Referring back to question #1: when a Solr node recovers, the influx 
of transactions plus the continuing writes seems to be enough to trigger 
another snapshot, resulting in further downtime. Is this case plausible?

Thanks,
Jacek
