Yes, this is a known issue. Repeatedly running the MapReduceIndexerTool on the same set of input files can result in duplicate entries in the Solr collection. This happens because the tool currently can only insert documents; it cannot update or delete documents that already exist in the index.
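As you observed, deleting the stale documents first and then re-running the job does work, and that delete step can be scripted with SolrJ rather than done by hand. Below is a minimal sketch; the ZooKeeper address, collection name, and document IDs are placeholders you'd replace with your own:

    import java.util.Arrays;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;

    public class DeleteBeforeReindex {
      public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble address; point this at your cluster.
        CloudSolrServer solr = new CloudSolrServer("zk01.example.com:2181/solr");
        solr.setDefaultCollection("collection1"); // placeholder collection name

        // Delete the documents whose IDs are about to be re-ingested, then
        // commit so the deletes are visible before the MapReduce job runs.
        solr.deleteById(Arrays.asList("id-1", "id-2", "id-3"));
        solr.commit();
        solr.shutdown();
      }
    }

After the commit, re-running the MapReduceIndexerTool job will insert the new versions of those documents.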
Wolfgang

On May 6, 2014, at 3:08 PM, Costi Muraru <costimur...@gmail.com> wrote:

> Hi guys,
>
> I've used the MapReduceIndexerTool [1] to import data into SOLR
> and seem to have stumbled upon something. I followed the tutorial [2] and
> managed to import data into a SolrCloud cluster using the MapReduce job.
> I ran the job a second time in order to update some of the existing
> documents. The job itself was successful, but the documents kept the
> same field values as before.
> In order to update some fields for the existing IDs, I decompiled the
> AVRO sample file
> (examples/test-documents/sample-statuses-20120906-141433-medium.avro),
> updated some of the fields with new values while keeping the same IDs,
> and packaged the AVRO back up. After this I ran the MapReduceIndexerTool
> and, although it was successful, the records were not updated.
> I've tried this several times. Even with a few documents the result is
> the same - the documents are not updated with the new values; instead,
> the old field values are kept.
> If I manually delete the old document from SOLR and then run the job,
> the document is inserted with the new values.
>
> Do you guys have any experience with this tool? Is this behavior by
> design, or am I missing something? Can it be overridden to force an
> update? Any feedback is gladly appreciated.
>
> Thanks,
> Constantin
>
> [1]
> http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_mapreduceindexertool.html#csug_topic_6_1
>
> [2]
> http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_batch_index_to_solr_servers_using_golive.html