On 28/02/2017 20:17, Serkan Mulayim wrote:
> So as I see it:
> 1. When we do an indexing operation on an existing index, a new
> segment is created, but it is not added to the index until it is
> committed. When it is committed, the segment is kept as a separate
> set of files and the snapshot.json file is updated to include the
> new segment.

That's right, but segments are merged occasionally.
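For concreteness, here's a minimal sketch of one add-and-commit cycle through the C bindings. The index path and field name are placeholders, and the Schema is assumed to be built elsewhere; the point is that the new segment only becomes visible once the commit writes a fresh snapshot that references it.

    #define CFISH_USE_SHORT_NAMES
    #define LUCY_USE_SHORT_NAMES

    #include "Clownfish/String.h"
    #include "Lucy/Document/Doc.h"
    #include "Lucy/Index/Indexer.h"
    #include "Lucy/Plan/Schema.h"

    static void
    add_and_commit(Schema *schema, const char *text) {
        String  *path    = Str_newf("/path/to/index");   /* placeholder */
        Indexer *indexer = Indexer_new(schema, (Obj*)path, NULL,
                                       Indexer_CREATE);
        String  *field   = Str_newf("content");          /* placeholder */
        String  *value   = Str_newf("%s", text);
        Doc     *doc     = Doc_new(NULL, 0);

        Doc_Store(doc, field, (Obj*)value);
        Indexer_Add_Doc(indexer, doc, 1.0f);

        /* Writes the new segment's files and updates the snapshot
         * to include them. */
        Indexer_Commit(indexer);

        DECREF(doc);
        DECREF(value);
        DECREF(field);
        DECREF(indexer);
        DECREF(path);
    }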

> 2. Lock files are generated and kept separate based on the PID (no
> shared-filesystem adjustments).

> What I would like to do is index thousands of documents in batches
> with asynchronous calls to the library. The asynchronous calls would
> try to update the newly created segment from different callers. If
> the PIDs are the same, it seems like the system will crash because
> write.lock contains the PIDs.

This has nothing to do with PIDs (they're only used to remove stale lock files). You'll receive a LockErr exception if an Indexer can't acquire the write lock after several retries, regardless of the process ID.
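In the C bindings, that failure surfaces as a Clownfish exception. A sketch of trapping it, assuming the Clownfish runtime's Err_trap(); note that the write lock is acquired when the Indexer is constructed, so the whole cycle has to run inside the trap (path is a placeholder):

    #define CFISH_USE_SHORT_NAMES
    #define LUCY_USE_SHORT_NAMES

    #include <stdbool.h>
    #include "Clownfish/Err.h"
    #include "Clownfish/String.h"
    #include "Lucy/Index/Indexer.h"
    #include "Lucy/Plan/Schema.h"

    /* Runs inside Err_trap() so a LockErr is caught rather than fatal. */
    static void
    index_batch_cb(void *context) {
        Schema  *schema  = (Schema*)context;
        String  *path    = Str_newf("/path/to/index");  /* placeholder */
        Indexer *indexer = Indexer_new(schema, (Obj*)path, NULL, 0);
        /* ... Indexer_Add_Doc() for each document in the batch ... */
        Indexer_Commit(indexer);
        DECREF(indexer);
        DECREF(path);
    }

    static bool
    try_index_batch(Schema *schema) {
        Err *error = Err_trap(index_batch_cb, schema);
        if (error != NULL) {
            /* Most likely a LockErr: another process held the write
             * lock past the timeout.  Back off and retry, or bail. */
            DECREF(error);
            return false;
        }
        return true;
    }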

> Do you think there is a way to make this work with calls from
> different PIDs, with the addition of a commit.lock file? I hope this
> makes sense :( :)

Parallel indexing isn't supported by Lucy. We only support background merging, which is mostly geared toward interactive applications that index only a few documents at a time. Non-interactive batch jobs that index thousands of documents in parallel aren't handled well by Lucy, although this could probably be improved. Your only options right now are:

- If it's OK for your indexing processes to potentially wait for a long
  time, increase the write lock timeout to a huge value (see the sketch
  after this list), or catch LockErrs and implement your own retry
  logic, as in the Err_trap sketch above.

- Implement your own document queue where multiple processes can add
  documents and a single indexing process removes them.
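For the first option, the timeout is raised through an IndexManager. A sketch, assuming the C bindings' IxManager_new() with a NULL host and lock factory; the five-minute value is arbitrary:

    #define CFISH_USE_SHORT_NAMES
    #define LUCY_USE_SHORT_NAMES

    #include "Clownfish/String.h"
    #include "Lucy/Index/IndexManager.h"
    #include "Lucy/Index/Indexer.h"
    #include "Lucy/Plan/Schema.h"

    static void
    index_batch_patiently(Schema *schema) {
        String       *path    = Str_newf("/path/to/index"); /* placeholder */
        IndexManager *manager = IxManager_new(NULL, NULL);

        /* Wait up to five minutes for the write lock instead of the
         * default of one second. */
        IxManager_Set_Write_Lock_Timeout(manager, 300000);

        Indexer *indexer = Indexer_new(schema, (Obj*)path, manager, 0);
        /* ... Indexer_Add_Doc() for each queued document ... */
        Indexer_Commit(indexer);

        DECREF(indexer);
        DECREF(manager);
        DECREF(path);
    }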

> One more question: when I index documents and commit each time (let's
> say 5,000 batches of commits in a synchronous way), I see that the
> indexing works fine. How are the segments being handled? I do not see
> 5,000 different segments created. Is it because, after a certain
> number of segments (say 32), the segments are merged and optimized?

Yes, that's how it works. The FastUpdates cookbook entry contains more details:

    https://lucy.apache.org/docs/c/Lucy/Docs/Cookbook/FastUpdates.html
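In the C bindings, a merge pass from that recipe looks roughly like this (a sketch; the path is a placeholder, and the second constructor argument is an optional IndexManager):

    #define CFISH_USE_SHORT_NAMES
    #define LUCY_USE_SHORT_NAMES

    #include "Clownfish/String.h"
    #include "Lucy/Index/BackgroundMerger.h"

    /* Consolidate segments in a separate process while indexers keep
     * committing small ones. */
    static void
    merge_pass(void) {
        String *path = Str_newf("/path/to/index");  /* placeholder */
        BackgroundMerger *merger = BGMerger_new((Obj*)path, NULL);
        BGMerger_Commit(merger);
        DECREF(merger);
        DECREF(path);
    }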

But I don't think background merging would help much in your case.

Nick
