Hey Solr users,


I'd really appreciate some community advice if somebody can spare some time to 
assist me.  My question relates to initially deleting a large amount of 
unwanted data from a SolrCloud collection, and then advice on best patterns 
for managing delete operations on a regular basis.  We have a situation where 
data in our index can be 're-mastered', and as a result orphan records are left 
dormant and unneeded in the index (think of a scenario similar to client 
resolution, where an entity can switch between golden records depending on the 
information available at the time).  I'm considering removing these dormant 
records with a large initial bulk delete, and then running a delete process on 
a regular maintenance basis.  The initial record backlog is ~50 million records 
in a ~1.2 billion document index (~4%), and the maintenance deletes are small in 
comparison at ~20,000/week.



So with this scenario in mind I'm wondering what my best approach is for the 
initial bulk delete:

  1.  Do nothing with the initial backlog and remove the unwanted documents 
during the next large reindexing process?
  2.  Delete by query (DBQ) with a specific delete query using the document 
IDs?
  3.  Delete by id (DBID)?

Are there any significant performance advantages to using DBID over a 
specific DBQ?  Should I break the delete operations up into batches of, say, 
1,000, 10,000, 100,000, or N DOC_IDs at a time if I take this approach?
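For what it's worth, here's a minimal sketch (Python; the helper name and batch sizes are just illustration, not a tested recipe) of how I'm picturing the batching: chunk the backlog of IDs and emit one delete-by-id JSON body per batch, so each POST to the update handler stays a manageable size.

```python
import json

def batched_delete_payloads(doc_ids, batch_size=10000):
    """Chunk doc_ids and yield one Solr delete-by-id JSON body per batch.

    batch_size is the knob I'd experiment with (1,000 / 10,000 / 100,000).
    Each payload has the form {"delete": ["<id1>", "<id2>", ...]}.
    """
    for start in range(0, len(doc_ids), batch_size):
        yield json.dumps({"delete": doc_ids[start:start + batch_size]})

# Each payload would then be POSTed to the collection's /update handler
# with a commitWithin parameter (endpoint and interval are assumptions
# for my setup, not shown here).
```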



The Solr Reference Guide mentions that DBQ ignores the commitWithin parameter, 
but you can specify multiple documents to remove with an OR (||) clause in a 
DBQ, e.g.:


Option 1 – Delete by id

{"delete":["<id1>","<id2>"]}



Option 2 – Delete by query (commitWithin ignored)

{"delete":{"query":"DOC_ID:(<id1> || <id2>)"}}
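To make the two options concrete, here's a small sketch that builds both request bodies (Python; DOC_ID is the field from my schema, the function names are just illustration):

```python
import json

def delete_by_id_body(ids):
    # Option 1: {"delete": ["<id1>", "<id2>"]} -- honours commitWithin
    return json.dumps({"delete": list(ids)})

def delete_by_query_body(ids, field="DOC_ID"):
    # Option 2: {"delete": {"query": "DOC_ID:(<id1> || <id2>)"}}
    # commitWithin is ignored for this form per the Reference Guide
    clause = " || ".join(ids)
    return json.dumps({"delete": {"query": f"{field}:({clause})"}})
```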



Shawn also provides a great explanation in this user group post from 2015 of 
the DBQ process 
(https://lucene.472066.n3.nabble.com/SolrCloud-delete-by-query-performance-td4206726.html)



I follow the Solr release notes fairly closely and also noticed an excellent 
addition and discussion from Hossman and committers in the Solr 8.5 release; it 
looks ideal for this scenario 
(https://issues.apache.org/jira/browse/SOLR-14241).  Unfortunately we're still 
on the 7.7.2 branch and are unable to take advantage of the streaming deletes 
feature.



If I do implement a weekly delete maintenance regime, is there any advice the 
community can offer from experience?  I'll definitely want to avoid times of 
heavy indexing, but how do deletes affect query performance?  Will users notice 
decreased performance during delete operations, such that they should be 
avoided during peak query windows as well?



As always, any advice is greatly appreciated.



Thanks,



Dwane



Environment

SolrCloud 7.7.2, 30 shards, 2 replicas

~3 qps during peak times
