Hey Solr users,
I'd really appreciate some community advice if somebody can spare the time to assist me. My question relates to initially deleting a large amount of unwanted data from a SolrCloud collection, and then to the best patterns for managing delete operations on a regular basis.

We have a situation where data in our index can be 're-mastered', and as a result orphan records are left dormant and unneeded in the index (think of a scenario similar to client resolution, where an entity can switch between golden records depending on the information available at the time). I'm considering removing these dormant records with a large initial bulk delete, and then running a delete process on a regular maintenance basis. The initial record backlog is ~50 million records in a ~1.2 billion document index (~4%), and the maintenance deletes are small in comparison, ~20,000/week.

With this scenario in mind, I'm wondering what my best approach is for the initial bulk delete:

1. Do nothing with the initial backlog and remove the unwanted documents during the next large reindexing process?
2. Delete by query (DBQ) with a specific delete query using the document IDs?
3. Delete by id (DBID)?

Are there any significant performance advantages to using DBID over a specific DBQ? If I take this approach, should I break the delete operations up into batches of, say, 1,000, 10,000, 100,000, N DOC_IDs at a time?

The Solr Reference Guide mentions that DBQ ignores the commitWithin parameter, but that you can specify multiple documents to remove with an OR (||) clause in a DBQ, i.e.
Option 1 – Delete by id:

    {"delete":["<id1>","<id2>"]}

Option 2 – Delete by query (commitWithin ignored):

    {"delete":{"query":"DOC_ID:(<id1> || <id2>)"}}

Shawn also provides a great explanation of the DBQ process in this user group post from 2015:
https://lucene.472066.n3.nabble.com/SolrCloud-delete-by-query-performance-td4206726.html

I follow the Solr release notes fairly closely and also noticed this excellent addition and discussion from Hossman and the committers in the Solr 8.5 release, and it looks ideal for this scenario:
https://issues.apache.org/jira/browse/SOLR-14241

Unfortunately we're still on the 7.7.2 branch, so we're unable to take advantage of the streaming deletes feature.

If I do implement a weekly delete maintenance regime, is there any advice the community can offer from experience? I'll definitely want to avoid times of heavy indexing, but how do deletes affect query performance? Will users notice decreased performance during delete operations, such that they should be avoided during peak query windows as well?

As always, any advice is greatly appreciated.

Thanks,
Dwane

Environment:
SolrCloud 7.7.2, 30 shards, 2 replicas
~3 qps during peak times
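P.S. For reference, here's a rough sketch of the batched DBID maintenance job I have in mind, posting JSON delete bodies to the update handler with a single commit at the end. The host, collection name, and batch size are illustrative placeholders, not our real values:

```python
import json
import urllib.request

# Illustrative values only -- adjust host, collection, and batch size
# for your own environment.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"
BATCH_SIZE = 10_000

def chunked(ids, size):
    """Yield successive batches of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def post_json(payload):
    """POST one JSON body to the Solr update handler."""
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def run_maintenance(ids):
    """Delete IDs in batches (delete-by-id, not DBQ), then issue one
    explicit commit at the end to limit searcher churn on the replicas."""
    for batch in chunked(ids, BATCH_SIZE):
        post_json({"delete": batch})
    post_json({"commit": {}})
```

The idea behind committing once at the end, rather than per batch, is to avoid reopening searchers repeatedly while the job runs.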