Thanks for the info, James.
I failed to mention in my original message that we're on Solr 3.5 and that we are combining the deletes with our add/updates in the same DIH run.

In searching through the archives of this mailing list, I actually found a thread which described my problem exactly and led me to a solution:

DIH - deleting documents, high performance (delta) imports, and passing parameters
http://lucene.472066.n3.nabble.com/DIH-deleting-documents-high-performance-delta-imports-and-passing-parameters-td1388349.html

The point of confusion for me was that I assumed $deleteDocById would both delete the document already in the index and prevent that document from being re-added. In fact, it only deletes the document in the index; it does not stop the current row from being indexed. So the document does get deleted, but the same import operation then adds it right back.

As suggested by the thread I referenced above, I was able to solve the problem nicely by just issuing both a $deleteDocById command and a $skipDoc command together, with $deleteDocById deleting the document already in the index, and $skipDoc preventing it from being re-added by the current row in the import.
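
For reference, here is a minimal sketch of the combined approach, adapted from the test transformer in the original message below. It is an illustration rather than the exact code in use: it assumes the same hard-coded test ID and the same 'unique_id' column, and it sets $skipDoc to the string 'true', which DIH should interpret as a boolean flag.

```javascript
// Sketch only: the test transformer from the original message, extended
// with $skipDoc. Inside DIH, `row` is a java.util.Map, so get/put exist.
function deleteBadDocs(row) {
    var uniqueID = row.get('unique_id');
    // Hard-coded test ID, as in the original message (an assumption here).
    if (uniqueID == '1-devpeter-1') {
        // Delete the copy of the document that is already in the index...
        row.put('$deleteDocById', uniqueID);
        // ...and keep the current row from being added back by this import.
        row.put('$skipDoc', 'true');
    }
    return row;
}
```

In the DIH config, the function would sit inside the <script><![CDATA[ ... ]]></script> block and be referenced from the entity with transformer="script:deleteBadDocs".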

Thanks again for taking the time to respond.
It's great how helpful this list is.

- Peter

-----Original Message-----
From: Dyer, James
Sent: Saturday, March 10, 2012 12:16 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr DIH and $deleteDocById

This (almost) sounds like https://issues.apache.org/jira/browse/SOLR-2492, which was fixed in Solr 3.4. Are you on an earlier version?

But maybe not, because you're seeing the # deleted documents increment, and prior to this bug fix (I think) the deleted counter wasn't getting incremented either.

Perhaps this is a related bug that only happens when the deletes are added via a transformer? Try a query like this without a transformer:

select uniqueID as '$deleteDocById' from table where uniqueID = '1-devpeter-1';

Does this work? If so, you've probably stumbled on a new bug related to SOLR-2492.

In any case, the workaround (probably) is to manually issue a commit after doing your deletes. Or, combine your deletes with add/updates in the same DIH run and it should commit automatically as configured.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Peter Boudreau [mailto:pe...@makeshop.jp]
Sent: Friday, March 09, 2012 2:22 AM
To: solr-user@lucene.apache.org
Subject: Solr DIH and $deleteDocById

Hello everyone,

I've got Solr DIH up and running with no problems as far as importing data, but I'm now trying to add some functionality to our delta import to delete invalid records.

The special command $deleteDocById seems to provide what I'm looking for. Just for testing purposes, until I get things working, I set up a simple transformer to delete just one document with a specific ID:

<script>
<![CDATA[
   function deleteBadDocs(row) {
       var uniqueID = row.get('unique_id');
       if(uniqueID == '1-devpeter-1') {
           row.put('$deleteDocById', uniqueID);
       }
       return row;
   }
]]>
</script>

When I run DIH with this, sure enough, it tells me that 1 document was deleted:

Indexing completed. Added/Updated: 4755 documents. Deleted 1 documents.

But then when I search the index, the document is still there. I've been googling this for a while now and found a number of references saying that you need to commit or optimize for the deletes to take effect. I was under the impression that DIH both commits and optimizes by default, so shouldn't that be happening automatically? I even tried explicitly setting the commit= and optimize= flags to true, but the deleted document was still in the index when I searched. I also tried restarting Solr, but the document was still there.

Could anyone help me understand why this document which is being reported as deleted still shows up in the index?

Also, there is one thing which I'm unclear on after reading the Solr wiki:

$deleteDocById : Delete a doc from Solr with this id. The value has to be the uniqueKey value of the document. Note that this command can only delete docs already committed to the index.

I was starting to think that maybe $deleteDocById was only preventing documents from entering the index, and not deleting existing documents which were already in the index, but if I understand this correctly, $deleteDocById should be able to delete a document which was already in the index *before* running DIH, right?

Any help would be very much appreciated.

Thanks in advance,

Peter
