Re: Concurrent Indexing

Umashanker, Srividhya Fri, 20 Jun 2014 22:50:21 -0700

We do have a way to recover partially, using a version number for each
transaction. The same version is maintained in Lucene as a single document.
During startup, these numbers determine what has to be synced. Unfortunately,
Lucene is used in a webapp, so this happens only during a Jetty restart.
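
A minimal sketch of this kind of version-based recovery, using only the JDK.
The class and method names are hypothetical, the TreeMap merely stands in for
the system of record, and the committed version is passed in as a plain long;
in practice it would be read back from the single version document kept in
Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: decides what must be re-indexed.
class WatermarkRecovery {

    /**
     * Returns the records whose transaction version is strictly greater than
     * the version last committed to the index, in ascending version order.
     */
    static <T> List<T> pendingSince(NavigableMap<Long, T> sorByVersion,
                                    long committedVersion) {
        return new ArrayList<>(
                sorByVersion.tailMap(committedVersion, false).values());
    }

    public static void main(String[] args) {
        // Assumption: the system of record can be read keyed by version.
        NavigableMap<Long, String> sor = new TreeMap<>();
        sor.put(1L, "rec-a");
        sor.put(2L, "rec-b");
        sor.put(3L, "rec-c");
        sor.put(4L, "rec-d");

        long committedVersion = 2L; // assumed to come from the index
        System.out.println(pendingSince(sor, committedVersion));
        // prints [rec-c, rec-d]
    }
}
```

Only the records above the committed version are replayed, so the amount of
re-indexing is bounded by whatever was written since the last commit.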


- Vidhya


> On 21-Jun-2014, at 11:08 am, "Vitaly Funstein" <vfunst...@gmail.com> wrote:
> 
> This is a better idea than what you had before, but I don't think there's
> any point in doing any commits manually at all unless you have a way of
> detecting and recovering exactly the data that hasn't been committed. In
> other words, what difference does it make whether you lost 1 index record
> or 1M, if you can't determine which records were lost and need to reindex
> everything from the start anyway, to ensure consistency between SOR and
> Lucene?
> 
> 
> 
> 
> On Fri, Jun 20, 2014 at 10:20 PM, Umashanker, Srividhya <
> srividhya.umashan...@hp.com> wrote:
> 
>> Let me try NRT with a periodic commit, say every 5 minutes, done in a
>> committer thread and only when there is something to commit.
>> 
>> Is there a limit on how long we can go without committing? I think the
>> buffers get flushed to disk, though not in a crash-proof way. So we should
>> be fine on memory.
>> 
>> I should also verify whether commit() takes longer when more data has
>> piled up to commit. But it should definitely be better than committing
>> from every thread.
>> 
>> Will post back after tests.
>> 
>> - Vidhya
>> 
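
The periodic, on-demand commit described in the message above could be
sketched roughly as follows. This is an illustration under assumptions, not a
tested design: PeriodicCommitter, Committable and markDirty are hypothetical
names, and Committable stands in for the index (with Lucene it would simply
delegate to IndexWriter.commit()).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: commits on a timer, but only if something was written.
class PeriodicCommitter implements AutoCloseable {

    /** Stand-in for the index; an assumption made for this sketch. */
    interface Committable {
        void commit() throws Exception;
    }

    private final Committable index;
    private final AtomicBoolean dirty = new AtomicBoolean(false);
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    PeriodicCommitter(Committable index, long period, TimeUnit unit) {
        this.index = index;
        scheduler.scheduleWithFixedDelay(this::commitIfDirty, period, period, unit);
    }

    /** Writer threads call this after each add/update. */
    void markDirty() {
        dirty.set(true);
    }

    /** Commits only if there are uncommitted writes; returns true if it did. */
    synchronized boolean commitIfDirty() {
        if (!dirty.compareAndSet(true, false)) {
            return false; // nothing written since the last commit
        }
        try {
            index.commit();
            return true;
        } catch (Exception e) {
            dirty.set(true); // leave the flag set so the next tick retries
            return false;
        }
    }

    /** Stops the timer and performs one last commit if needed. */
    @Override
    public void close() {
        scheduler.shutdown();
        commitIfDirty();
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger commits = new AtomicInteger();
        try (PeriodicCommitter committer = new PeriodicCommitter(
                commits::incrementAndGet, 100, TimeUnit.MILLISECONDS)) {
            committer.markDirty();
            Thread.sleep(350);
        }
        System.out.println("commits: " + commits.get());
    }
}
```

The dirty flag keeps idle periods from triggering empty commits, and the
writer threads never block on commit() themselves.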
>> 
>>> On 21-Jun-2014, at 10:28 am, "Vitaly Funstein" <vfunst...@gmail.com> wrote:
>>> 
>>> Hmm, I might have actually given you a slightly incorrect explanation wrt
>>> what happens when internal buffers fill up. There will definitely be a
>>> flush of the buffer, and segment files will be written to, but it's not
>>> actually considered a full commit, i.e. an external reader will not see
>>> these changes (yet). The exact details elude me but there are quite a few
>>> threads here on what happens during a commit (vs a flush). However, when
>>> you call IndexWriter.close() a commit will definitely happen.
>>> 
>>> But in any event, if you use an NRT reader to search, then it shouldn't
>>> matter to you when the commit actually takes place. Such readers search
>>> uncommitted changes as well as those already on disk. If data durability
>>> is not a requirement for you, i.e. if you can (and probably do) reindex
>>> your data from SOR on startup, then not doing commits yourself may be the
>>> way to go. Or perhaps you could reduce the amount of data you need to
>>> reindex and still call commit() yourself periodically, though not for
>>> every write transaction, by introducing some watermarking logic whereby
>>> you detect the highest watermark committed to Lucene. Then reindex only
>>> the data from the DB from that point onward (meaning only uncommitted
>>> data is lost and needs to be recovered, but you can figure out exactly
>>> where that point is).
>>> 
>>> 
>>> 
>>> On Fri, Jun 20, 2014 at 8:02 PM, Umashanker, Srividhya <
>>> srividhya.umashan...@hp.com> wrote:
>>> 
>>>> It is non-transactional. We first write the same data to the database in
>>>> a transaction and then call the writer's addDocument. If Lucene fails, we
>>>> still hold the data to recover.
>>>> 
>>>> I can avoid the commit if we use an NRT reader. We do need this to be
>>>> searchable immediately.
>>>> 
>>>> Another question. I tried removing commit() from each thread and waiting
>>>> for Lucene to auto-commit, with maxBufferedDocs set to 100 and
>>>> ramBufferSizeMB set to a high value so that the document count triggers
>>>> first. But I did not see the first 100 docs in Lucene even after 500
>>>> docs.
>>>> 
>>>> Is there a way for me to see when Lucene auto-commits?
>>>> 
>>>> If we tune the auto-commit parameters appropriately, do I still need the
>>>> committer thread? Its only job is to call commit; add/updateDocument is
>>>> already done in my writer threads.
>>>> 
>>>> Thanks for your time and your suggestions!
>>>> 
>>>> - Vidhya
>>>> 
>>>> 
>>>>> On 21-Jun-2014, at 12:09 am, "Vitaly Funstein" <vfunst...@gmail.com> wrote:
>>>>> 
>>>>> You could just avoid calling commit() altogether if your application's
>>>>> semantics allow this (i.e. it's non-transactional in nature). This way,
>>>>> Lucene will do commits when appropriate, based on the buffering settings
>>>>> you chose. It's generally unnecessary and undesirable to call commit at
>>>>> the end of each write, unless you seek to provide strict durability
>>>>> guarantees in your system.
>>>>> 
>>>>> If you must acknowledge every write after it's been committed, set up a
>>>>> single committer thread that does this whenever there are work tasks in
>>>>> the queue. Then add to that queue from your writer threads...
>>>>> 
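
The single committer thread with a work queue suggested in the message above
might look something like the sketch below. It is only an illustration:
GroupCommitter, Committable and requestCommit are hypothetical names, and
Committable stands in for the index (with Lucene it would delegate to
IndexWriter.commit()). Writers enqueue an acknowledgement request after their
write, and one commit acknowledges every request queued so far.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical helper: one thread commits on behalf of all writer threads.
class GroupCommitter implements Runnable {

    /** Stand-in for the index; an assumption made for this sketch. */
    interface Committable {
        void commit() throws Exception;
    }

    private final Committable index;
    private final BlockingQueue<CompletableFuture<Void>> pending =
            new LinkedBlockingQueue<>();

    GroupCommitter(Committable index) {
        this.index = index;
    }

    /**
     * Called by a writer thread after its add/update. The returned future
     * completes once a commit that started after this call has finished.
     */
    CompletableFuture<Void> requestCommit() {
        CompletableFuture<Void> ack = new CompletableFuture<>();
        pending.add(ack);
        return ack;
    }

    @Override
    public void run() {
        List<CompletableFuture<Void>> batch = new ArrayList<>();
        try {
            while (true) {
                batch.add(pending.take()); // block until there is work
                pending.drainTo(batch);    // also take everything queued so far
                try {
                    index.commit();        // one commit for the whole batch
                    for (CompletableFuture<Void> ack : batch) {
                        ack.complete(null);
                    }
                } catch (Exception e) {
                    for (CompletableFuture<Void> ack : batch) {
                        ack.completeExceptionally(e);
                    }
                }
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // interrupted while idle: stop
        }
    }

    public static void main(String[] args) {
        GroupCommitter committer =
                new GroupCommitter(() -> System.out.println("commit"));
        Thread thread = new Thread(committer, "committer");
        thread.setDaemon(true);
        thread.start();
        // A writer would perform its add/update first, then wait for the ack.
        committer.requestCommit().join();
        thread.interrupt();
    }
}
```

Because requests that arrive while a commit is running are batched into the
next one, the number of commits grows with commit latency rather than with the
number of writer threads.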
>>>>> 
>>>>> On Fri, Jun 20, 2014 at 8:47 AM, Umashanker, Srividhya <
>>>>> srividhya.umashan...@hp.com> wrote:
>>>>> 
>>>>>> Lucene Experts -
>>>>>> 
>>>>>> Recently we upgraded to Lucene 4. We want to make use of the concurrent
>>>>>> flushing feature of Lucene.
>>>>>> 
>>>>>> Indexing for us includes certain DB operations and writing to Lucene,
>>>>>> ended by a commit. There may be multiple concurrent calls to the
>>>>>> Indexer to publish single or multiple records.
>>>>>> 
>>>>>> So far, with the older version of Lucene, we had our indexing
>>>>>> synchronized (1 thread indexing), which means more waiting time,
>>>>>> depending on concurrency and execution time.
>>>>>> 
>>>>>> We are moving away from synchronized indexing in order to cut down the
>>>>>> waiting period, and are trying to find out whether we have to limit the
>>>>>> number of threads that add documents and commit.
>>>>>> 
>>>>>> Below are the tests, publishing just 1000 records with 3 text fields.
>>>>>> 
>>>>>> Java 7, JVM config: -XX:MaxPermSize=384M
>>>>>> -XX:+HeapDumpOnOutOfMemoryError -Xmx400m -Xms50m -XX:MaxNewSize=100m
>>>>>> -Xss256k -XX:-UseParallelOldGC -XX:-UseSplitVerifier
>>>>>> -Djsse.enableSNIExtension=false
>>>>>> 
>>>>>> The index configuration is the default. We also tried changing
>>>>>> maxThreadStates, maxBufferedDocs and ramBufferSizeMB - no impact.
>>>>>> 
>>>>>>                             Min time (ms)  Max time (ms)  Avg time (ms)
>>>>>> 1 thread - commit                      65            267             85
>>>>>> 1 thread - updateDocument               0             40              1
>>>>>> 6 threads - commit                     83           1449         552.42
>>>>>> 6 threads - updateDocument              0            175            1.5
>>>>>> 10 threads - commit                   154           2429            874
>>>>>> 10 threads - updateDocument             0            243            1.9
>>>>>> 20 threads - commit                    76           4351           1622
>>>>>> 20 threads - updateDocument             0            326            2.1
>>>>>> 
>>>>>> The more threads try to write to Lucene, the more updateDocument and
>>>>>> commit() become bottlenecks. In the above table, commits with 10 and
>>>>>> 20 threads average about 0.9 sec and 1.6 sec respectively, over 1000
>>>>>> commits.
>>>>>> 
>>>>>> Is there some configuration, or are there suggestions, to tune the
>>>>>> performance of the two methods so that our service performs better
>>>>>> with more concurrency?
>>>>>> 
>>>>>> -vidhya
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>> 
>>>> 
>> 
