We're indexing a potentially large collection of documents divided into
smaller subgroups we call "collections".  Each document
has a field that identifies the collection it belongs to, in addition
to a unique document id field:

<add>
   <doc>
      <field name="id">foo-1</field>
      <field name="collection">foo</field>
      ......
   </doc>
   <doc>
      <field name="id">foo-2</field>
      <field name="collection">foo</field>
      .....
   </doc>

   ..... etc.
</add>

"collection" and "id" are defined in schema.xml as string fields.

When a collection is being added to the index, it's possible that
there is an existing "foo" collection in the index that needs to be
replaced.  The ids in the new collection will reuse many of the ids
in the old collection, but the replacement is not a document-for-document
replacement process -- there may be more or fewer documents
in the new collection.

So the replacement operation goes as follows:

<delete>
   <query>collection:foo</query>
</delete>
<commit waitFlush="true" waitSearcher="true" />
<add>
   <doc>
      .....
   </doc>
</add>
<commit waitFlush="true" waitSearcher="true" />

Each of these XML commands happens on a separate HTTP connection.
If the collection doesn't already exist in the index, then the delete
is essentially a noop.
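
To make the client-side sequencing concrete, here is a minimal Python
sketch of the procedure above. It is not our actual client code: the
update URL and the function names are assumptions for illustration.
The point is only that each command is sent as its own request, and
the next one is not sent until the previous response has been read.

```python
import urllib.request
from xml.sax.saxutils import escape

# Assumed update URL; adjust to the actual Solr install.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"

COMMIT = '<commit waitFlush="true" waitSearcher="true" />'


def http_post(xml, url=SOLR_UPDATE_URL):
    """POST one XML command as its own request and return the response body."""
    req = urllib.request.Request(
        url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    # urlopen raises HTTPError on a 4xx/5xx status, so a failed command
    # cannot pass silently at the HTTP level.
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def replace_commands(collection, add_xml):
    """Return the four commands in the order they are sent."""
    delete = "<delete><query>collection:%s</query></delete>" % escape(collection)
    return [delete, COMMIT, add_xml, COMMIT]


def replace_collection(collection, add_xml, post=http_post):
    """Send the commands one at a time, each after the previous response."""
    return [post(cmd) for cmd in replace_commands(collection, add_xml)]
```

The post argument is only there so the ordering can be exercised
without a running Solr instance.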

Finally, here's the behavior we're seeing.  In some cases, usually when
the index is starting to get larger (approaching 500,000 documents),
the above procedure will fail to add anything to the index.  That is, none
of the commands returns an error code, there is no indication of a problem
in the log files, and the process DOES take some amount of time to
complete.  But at the end of the process, there are no documents in
the index whose collection is "foo".  This can happen whether or not
there is an existing "foo" collection already in the index -- in fact, the
typical case is that there is not.

So my question is:  Is there any chance that the delete, commit, and add
commands are interacting in such a way as to cause the add to happen
before the delete, so that the add just replaces the existing "foo"
documents and then the delete comes along and deletes everything?

My understanding is that the wait attributes on the commit command should
flush the delete out to the index before the add can start, but I have
no knowledge of the true sequencing of events in either Solr or Lucene.

If this is happening, how can I know when the delete has been processed
before initiating the add process?
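
One check I could imagine running between the first commit and the add
is a count query for the collection, proceeding only once it reports
zero documents. The Python sketch below is an assumption on my part and
not something we run today: the select URL and function names are made
up for illustration, and it assumes the standard XML response, where
the result element carries a numFound attribute.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed select URL; adjust to the actual Solr install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def http_get(url):
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def count_in_collection(collection, fetch=http_get, base_url=SOLR_SELECT_URL):
    """Return how many documents the searcher currently sees in a collection."""
    params = urllib.parse.urlencode(
        {"q": "collection:%s" % collection, "rows": 0, "wt": "xml"}
    )
    body = fetch(base_url + "?" + params)
    # The XML response has the shape:
    #   <response> ... <result name="response" numFound="N" start="0"/> </response>
    result = ET.fromstring(body).find("result")
    return int(result.get("numFound"))
```

Whether a zero count here is enough to guarantee that a later add
cannot be affected by the earlier delete is exactly what I don't know.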

Thanks,

Patrick Johnstone
