Fairly frequently (about once a month) we need to reindex our data, using
DIH to copy the data from one index to another.
Because the index is large, this can take 12 to 24 hours to complete. During
that time the old index is still being queried by users.
Sometimes DIH is interrupted midway by an unexpected exception, caused by an
OutOfMemory error or something else (many times it has failed when more than
90% was complete).
On top of that, almost every time some items are missing from the new index,
and they are very hard to find.
At that point I can't tell exactly which documents were missed, so I have to
run the whole import again and wait many more hours. Meanwhile, the old
index constantly receives new items.

I would like to suggest the following way to solve the problem:
•       Get the list of all item ids (by calling the Lucene API, as CLUE
does, for example).
•       Start DIH, which will iterate over those ids and query for n items
at a time.
1.      Of course, the original DIH class would have to be changed to
support this.
•       This would give the following advantages:
1.      I would know exactly which items failed.
2.      I could restart the process from any point and, after a DIH failure,
restart it from the point of failure.
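
To make the two advantages concrete, here is a rough Python sketch of the
loop I have in mind. It is not DIH code: import_batch is a hypothetical
callback standing in for whatever would actually copy one batch of ids, and
the checkpoint is just a plain text file (my assumption) holding the offset
of the next batch to process.

```python
from pathlib import Path
from typing import Callable, List, Sequence


def run_import(
    ids: Sequence[str],
    batch_size: int,
    import_batch: Callable[[List[str]], None],  # hypothetical: copies one batch
    checkpoint: Path,                           # hypothetical progress file
) -> List[str]:
    """Import ids batch by batch and return the ids of the batch that failed.

    An empty list means every batch succeeded. After a failure the checkpoint
    still points at the failed batch, so calling the function again resumes
    there instead of starting from the beginning.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    # Resume from the recorded offset, or from 0 on the first run.
    start = int(checkpoint.read_text()) if checkpoint.exists() else 0

    for offset in range(start, len(ids), batch_size):
        chunk = list(ids[offset:offset + batch_size])
        try:
            import_batch(chunk)
        except Exception:
            # We know exactly which ids were in flight when it broke.
            return chunk
        # Record progress only after the batch went through.
        checkpoint.write_text(str(offset + len(chunk)))

    return []
```

With something like this, a crash at 90% would cost one batch rather than the
whole run, and the returned list names the items to look at.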


So the main difference is that DIH currently runs on a *:* query, and I
suggest running it on a list of ids instead.

For example, if I have 1000 docs and want this new DIH to take 100 docs at a
time, it will run 10 queries, each containing 100 ids (like
id:(1 2 3 ... 100), then id:(101 102 ... 200), etc.).
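
A minimal Python sketch of how those queries could be built. batch_queries
is a hypothetical helper, not part of Solr or DIH, and it assumes that the
ids are plain tokens that need no escaping and that the default query
operator is OR.

```python
from typing import Iterator, Sequence


def batch_queries(
    ids: Sequence[str], batch_size: int, field: str = "id"
) -> Iterator[str]:
    """Yield one query string per batch, e.g. 'id:(1 2 3)'.

    Assumes ids contain no query-syntax characters (no escaping is done)
    and that space-separated terms are OR-ed by the query parser.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        yield f"{field}:({' '.join(chunk)})"


# 1000 docs taken 100 at a time give 10 queries of 100 ids each.
all_ids = [str(i) for i in range(1, 1001)]
queries = list(batch_queries(all_ids, 100))
```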

The question is: what do you think about it? Or can all of this be done
another way, and I am trying to reinvent the wheel?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589.html
Sent from the Solr - User mailing list archive at Nabble.com.
