Sorry to bump this; I have the same issue and was curious whether it's sane to try to work around it.
* I have a constant stream of realtime documents I need to continually index. Sometimes they even overwrite very old documents (by reusing the same unique ID).
* I also have a *huge* backlog of documents I'd like to get into a SolrCloud cluster via Hadoop.

I understand that the MERGEINDEXES operation expects the documents to already be unique, but is it at all reasonable for me to try to change that? (I've appended a sketch of the exact merge call at the bottom of this mail for reference.) In a plain Solr instance I can add doc1, then add doc1 again with new fields, and the new update "wins"; I assume the old version is eventually removed during segment merges. Does that mean it's possible for me to override a merge policy (or something like that?) to do effectively what my Hadoop conflict-resolver does? I already have logic there that knows how to (1) decide which of two duplicate documents to keep and (2) respect and "keep" deletes over anything else.

I'd love some pointers to which Solr/Lucene classes to look at if I wanted to try my hand at this. I'm down in Lucene's SegmentMerger right now, but it seems too low level to understand whatever Solr "knows" about enforcing a single unique ID at merge (and search...? or update...?) time.

Thanks!

On Tue, Jun 11, 2013 at 11:10 AM, Mark Miller <markrmil...@gmail.com> wrote:
> Right - but that sounds a little different than what we were talking about.
>
> You had brought up the core admin merge cmd that lets you merge an index into a running Solr cluster.
>
> We are calling that the golive option in the map reduce indexing code. It has the limitations we have discussed.
>
> However, if you are only using map reduce to build indexes, there are facilities for dealing with duplicate IDs - as you see in the documentation. The merges involved in that are different though - these are merges that happen as the final index is being constructed by the map reduce job. The final step is the golive step, where the indexes will be deployed to the running Solr cluster - this is what uses the core admin merge command, and if you are doing updates or adds outside of map reduce, you will face the issues we have discussed.
>
> - Mark
>
> On Jun 11, 2013, at 11:57 AM, James Thomas <jtho...@camstar.com> wrote:
>
> > FWIW, the Solr included with Cloudera Search, by default, "ignores all but the most recent document version" during merges.
> > The conflict resolution is configurable, however. See the documentation for details.
> > http://www.cloudera.com/content/support/en/documentation/cloudera-search/cloudera-search-documentation-v1-latest.html
> > -- see the user guide PDF, "update-conflict-resolver" parameter
> >
> > James
> >
> > -----Original Message-----
> > From: anirudh...@gmail.com [mailto:anirudh...@gmail.com] On Behalf Of Anirudha Jadhav
> > Sent: Tuesday, June 11, 2013 10:47 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: index merge question
> >
> > From my experience, the Lucene mergeTool and the one invoked by CoreAdmin are pure Lucene implementations and do not understand the concept of a unique key (a Solr-land concept).
> >
> > http://wiki.apache.org/solr/MergingSolrIndexes has a cautionary note at the end.
> >
> > We do frequent index merges, for which we externally run map/reduce jobs (Java code using Lucene APIs) to merge & validate the merged indices against their sources.
> > -Ani
> >
> > On Tue, Jun 11, 2013 at 10:38 AM, Mark Miller <markrmil...@gmail.com> wrote:
> >> Yeah, you have to carefully manage things if you are map/reduce building indexes *and* updating documents in other ways.
> >>
> >> If your 'source' data for MR index building is the 'truth', you also have the option of not doing incremental index merging, and you could simply rebuild the whole thing every time - of course, depending on your cluster size, that could be quite expensive.
> >>
> >> - Mark
> >>
> >> On Jun 10, 2013, at 8:36 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>
> >>> Thanks Mark. My question is stemming from the new Cloudera Search stuff.
> >>> My concern is that if someone updates a doc while the index is being rebuilt, that update could be lost from a Solr perspective. I guess what would need to happen to ensure the correct information was indexed would be to record the start time and reindex the information that changed since then?
> >>> On Jun 8, 2013 2:37 PM, "Mark Miller" <markrmil...@gmail.com> wrote:
> >>>
> >>>>
> >>>> On Jun 8, 2013, at 12:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>
> >>>>> When merging through the core admin (http://wiki.apache.org/solr/MergingSolrIndexes), what is the policy for conflicts during the merge? For instance, if I am merging core 1 and core 2 into core 0 (first example), what happens if core 1 and core 2 both have a document with the same key, say core 1 has a newer version than core 2? Does the merge fail, or does the newer document remain?
> >>>>
> >>>> You end up with both documents, both with that ID - not generally a situation you want to end up in. You need to ensure unique IDs in the input data or replace the index rather than merging into it.
> >>>>
> >>>>>
> >>>>> Also, if using the srcCore method, what happens if a document with key 1 is written while an index also containing key 1 is being merged?
> >>>>
> >>>> It depends on the order, I think - if the doc is written after the merge and it's an update, it will update the doc that was just merged in. If the merge comes second, you have the doc twice and it's a problem.
> >>>>
> >>>> - Mark
> >>
> >
> > --
> > Anirudha P. Jadhav
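
P.S. For reference, here is the core admin merge call I'm referring to, per the MergingSolrIndexes wiki page linked above. The host, port, core names, and index paths below are only placeholders, and the target core needs a commit afterwards before the merged docs show up:

  # merge two on-disk indexes into core0 (indexDir variant)
  curl 'http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&indexDir=/opt/solr/core1/data/index&indexDir=/opt/solr/core2/data/index'

  # or merge from other live cores on the same node (srcCore variant)
  curl 'http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&srcCore=core1&srcCore=core2'

  # make the merged documents visible
  curl 'http://localhost:8983/solr/core0/update?commit=true'

As discussed above, this is a raw Lucene segment merge: it does not deduplicate on the uniqueKey, so any ID that exists in both the target and a source index ends up in the result twice.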
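
And here is roughly how I'm driving the MapReduce side with the Cloudera Search MapReduceIndexerTool, in case it helps frame the question. This is only a sketch: the jar path, HDFS paths, ZooKeeper address, and collection name are placeholders, and the --update-conflict-resolver value is what I believe the user guide lists as the default "retain most recent" resolver (worth double-checking against the docs James linked); my own conflict logic would plug in there as a custom class.

  # build index shards with MapReduce, resolving duplicate IDs during the
  # job's own merges, then "go live" by merging the result into the
  # running SolrCloud cluster (the golive step Mark mentions);
  # resolver class name taken from the Cloudera docs from memory - verify it
  hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool \
    --morphline-file morphline.conf \
    --output-dir hdfs://namenode:8020/tmp/outdir \
    --zk-host zk1:2181/solr \
    --collection collection1 \
    --update-conflict-resolver org.apache.solr.hadoop.dedup.RetainMostRecentUpdateConflictResolver \
    --go-live \
    hdfs://namenode:8020/indir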