Personally, I much prefer indexing from an independent SolrJ client over using DIH when I have to take explicit control of error handling and the like. Here's an example: https://lucidworks.com/blog/indexing-with-solrj/
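For reference, a minimal sketch of that pattern (not code from the linked post: the core URL, field names, and the loop standing in for a real JDBC/source iterator are all illustrative, and HttpSolrClient.Builder assumes SolrJ 6+):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import java.util.ArrayList;
import java.util.List;

public class SimpleIndexer {
    public static void main(String[] args) throws Exception {
        // Index straight to Solr so every failure surfaces in our own code.
        try (SolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build()) {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {   // stand-in for a JDBC ResultSet loop
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("title", "document " + i);
                batch.add(doc);
                if (batch.size() == 100) {
                    sendBatch(client, batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) sendBatch(client, batch);
            client.commit();
        }
    }

    // Failures are visible per batch, so a bad batch can be logged and retried
    // instead of silently killing (or silently surviving) a whole DIH run.
    private static void sendBatch(SolrClient client, List<SolrInputDocument> batch) {
        try {
            client.add(batch);
        } catch (Exception e) {
            System.err.println("Batch of " + batch.size() + " docs failed: " + e.getMessage());
            // retry, dead-letter, or abort, per whatever error policy you want
        }
    }
}

The whole point is sendBatch(): that is where errors become visible and where your own retry policy lives, which is exactly the control DIH doesn't give you.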
In your example, you seem to be assuming that the Lucene IDs (and here I'm assuming you're not talking about the internal Lucene ID) correspond to some kind of primary key in your database table. But the correspondence isn't necessarily straightforward; how would it handle composite keys? I'll leave actual comments on DIH's internals to people who, you know, actually understand the code ;)...

Erick

On Fri, Feb 20, 2015 at 2:32 AM, SolrUser1543 <osta...@gmail.com> wrote:
> Relatively frequently (about once a month) we need to reindex the data by
> using DIH to copy the data from one index to another.
> Because we have a large index, this can take 12 to 24 hours to complete. At
> the same time the old index is being queried by users.
> Sometimes DIH gets interrupted in the middle by some unexpected exception,
> such as an OutOfMemoryError (many times it has failed when more than 90%
> was complete).
> On top of that, almost every time some items are missing from the new
> index, and it is very complicated to find them.
> At that stage I can't be sure exactly which documents were missed, so I
> have to do it all again and wait for many hours. Meanwhile, the old index
> constantly receives new items.
>
> I want to suggest the following way to solve the problem:
> • Get a list of all item IDs (by calling the Lucene API, like CLUE does,
> for example).
> • Start DIH, which will iterate over those IDs, each time making a query
> for n items.
> 1. Of course, the original DIH class would have to be changed to support
> this.
> • This gives the following advantages:
> 1. I will know exactly which items failed.
> 2. I can restart the process from any point, and in case of a DIH failure,
> resume from the point of failure.
>
> So the main difference is that DIH currently runs on a *:* query, and I
> suggest running it on a list of IDs.
>
> For example, if I have 1000 docs and want this new DIH to take 100 docs at
> a time, it will do 10 queries, each with 100 IDs (like id:(1 2 3 ... 100),
> then id:(101 102 ... 200), etc.).
>
> The question is: what do you think about it? Or could all of this be done
> another way, and am I trying to reinvent the wheel?
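For what it's worth, the batched-ID copy described above can also be driven from a standalone SolrJ client rather than a modified DIH. A sketch only, under several assumptions: the core URLs and batch size are made up, id is the uniqueKey (so cursorMark paging can sort on it), the id values need no query escaping, and every field you care about is stored.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.CursorMarkParams;
import java.util.ArrayList;
import java.util.List;

public class BatchedReindexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient source = new HttpSolrClient.Builder("http://localhost:8983/solr/old_core").build();
             SolrClient target = new HttpSolrClient.Builder("http://localhost:8983/solr/new_core").build()) {

            List<String> allIds = fetchAllIds(source);   // one pass over the old index
            int batchSize = 100;
            for (int i = 0; i < allIds.size(); i += batchSize) {
                List<String> ids = allIds.subList(i, Math.min(i + batchSize, allIds.size()));
                try {
                    copyBatch(source, target, ids);
                } catch (Exception e) {
                    // Exactly which documents failed is now known, and only this
                    // batch needs rerunning, not the whole 12-24 hour job.
                    System.err.println("Batch starting at " + i + " failed: " + e.getMessage());
                }
            }
            target.commit();
        }
    }

    // Collect every id by paging through the old index; cursorMark avoids
    // the deep-paging cost of start/rows.
    private static List<String> fetchAllIds(SolrClient source) throws Exception {
        List<String> ids = new ArrayList<>();
        SolrQuery q = new SolrQuery("*:*").setFields("id").setRows(1000)
                                          .setSort("id", SolrQuery.ORDER.asc);
        String cursor = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = source.query(q);
            for (SolrDocument d : rsp.getResults()) ids.add((String) d.getFieldValue("id"));
            String next = rsp.getNextCursorMark();
            if (next.equals(cursor)) break;   // no more results
            cursor = next;
        }
        return ids;
    }

    // Fetch one batch with an id:(...) query and re-add it to the new index.
    private static void copyBatch(SolrClient source, SolrClient target, List<String> ids) throws Exception {
        SolrQuery q = new SolrQuery("id:(" + String.join(" ", ids) + ")").setRows(ids.size());
        List<SolrInputDocument> out = new ArrayList<>();
        for (SolrDocument d : source.query(q).getResults()) {
            SolrInputDocument doc = new SolrInputDocument();
            for (String f : d.getFieldNames()) {
                if (!"_version_".equals(f)) doc.addField(f, d.getFieldValue(f));
            }
            out.add(doc);
        }
        target.add(out);
    }
}

Note the usual caveat with index-to-index copying: only stored fields can be reconstructed this way, and _version_ has to be dropped so the target index assigns its own versions.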