Thanks for the info, James.
I failed to mention in my original message that we're on Solr 3.5 and we are
combining the deletes with our add/updates in the same DIH run.
In searching through the archives of this mailing list, I actually found a
thread which described my problem exactly and led me to a solution:
DIH - deleting documents, high performance (delta) imports, and passing
parameters
http://lucene.472066.n3.nabble.com/DIH-deleting-documents-high-performance-delta-imports-and-passing-parameters-td1388349.html
The point of confusion for me was that I assumed $deleteDocById would both
delete the document already in the index and prevent that document from
being re-added. In fact it only deletes the document already in the index;
it does not stop the current row from being added. So the document does get
deleted, but the same import operation then adds it right back.
As suggested by the thread I referenced above, I was able to solve the
problem nicely by just issuing both a $deleteDocById command and a $skipDoc
command together, with $deleteDocById deleting the document already in the
index, and $skipDoc preventing it from being re-added by the current row in
the import.
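For anyone who finds this thread later, here is a minimal sketch of what that
looks like, assuming the same test transformer from my original message. The
'unique_id' column and the ID value are just my test data, and passing the
string 'true' for $skipDoc follows the wiki description of that command, so
adjust as needed for your own setup:

```javascript
// Sketch of a DIH ScriptTransformer function (goes inside the <script>
// CDATA block of the data-config). Column name and ID are test values.
function deleteBadDocs(row) {
    var uniqueID = row.get('unique_id');
    if (uniqueID == '1-devpeter-1') {
        // Delete the copy of the document that is already in the index...
        row.put('$deleteDocById', uniqueID);
        // ...and keep the current row from being added again by this import.
        row.put('$skipDoc', 'true');
    }
    return row;
}
```

Rows that don't match are returned untouched, so they get added/updated as
usual.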
Thanks again for taking the time to respond.
It's great how helpful this list is.
- Peter
-----Original Message-----
From: Dyer, James
Sent: Saturday, March 10, 2012 12:16 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr DIH and $deleteDocById
This (almost) sounds like https://issues.apache.org/jira/browse/SOLR-2492
which was fixed in Solr 3.4. Are you on an earlier version?
But maybe not, because you're seeing the # deleted documents increment, and
prior to this bug fix (I think) the deleted counter wasn't getting
incremented either.
Perhaps this is a related bug that only happens when the deletes are added
via a transformer? Try a query like this without a transformer:
select uniqueID as '$deleteDocById' from table where uniqueID =
'1-devpeter-1';
Does this work? If so, you've probably stumbled on a new bug related to
SOLR-2492.
In any case, the workaround (probably) is to manually issue a commit after
doing your deletes. Or, combine your deletes with add/updates in the same
DIH run and it should commit automatically as configured.
James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
-----Original Message-----
From: Peter Boudreau [mailto:pe...@makeshop.jp]
Sent: Friday, March 09, 2012 2:22 AM
To: solr-user@lucene.apache.org
Subject: Solr DIH and $deleteDocById
Hello everyone,
I've got Solr DIH up and running with no problems as far as importing data,
but I'm now trying to add some functionality to our delta import to delete
invalid records.
The special command $deleteDocById seems to provide what I'm looking for,
and just for testing purposes until I get things working, I set up a simple
transformer to delete just one document with a specific ID:
<script>
<![CDATA[
function deleteBadDocs(row) {
    var uniqueID = row.get('unique_id');
    if (uniqueID == '1-devpeter-1') {
        row.put('$deleteDocById', uniqueID);
    }
    return row;
}
]]>
</script>
When I run DIH with this, sure enough, it tells me that 1 document was
deleted:
Indexing completed. Added/Updated: 4755 documents. Deleted 1 documents.
But then when I search the index, the document is still there. I've been
googling this for a while now, and found a number of references saying that
you need to commit or optimize after this in order for the deletes to take
effect, but I was under the impression that DIH both commits and optimizes
by default, so shouldn't it be getting committed and optimized automatically
by DIH? I even tried explicitly setting the commit= and optimize= flags to
true, but the deleted document was still in the index when I
searched. I also tried restarting Solr, but the deleted document was still
there.
Could anyone help me understand why this document which is being reported as
deleted still shows up in the index?
Also, there is one thing which I'm unclear on after reading the Solr wiki:
$deleteDocById : Delete a doc from Solr with this id. The value has to be
the uniqueKey value of the document. Note that this command can only delete
docs already committed to the index.
I was starting to think that maybe $deleteDocById was only preventing
documents from entering the index, and not deleting existing documents which
were already in the index, but if I understand this correctly,
$deleteDocById should be able to delete a document which was already in the
index *before* running DIH, right?
Any help would be very much appreciated.
Thanks in advance,
Peter