Re: batch-update-pattern, NoMergeScheduler?

2014-12-23 Thread Ian Lea
Hi


I can't give an exact answer to your question, but my experience has
been that it's best to leave all the merge/buffer/etc. settings alone.
If you are doing a bulk update of a large number of docs then it's no
surprise that you are seeing a heavy IO load.  If you can, it's likely
to be worth giving Lucene a dedicated disk, or at least making sure
there's as little contention as possible - that's just general advice
for any workload.  There is always going to be a limiting factor
somewhere.
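
A toy model can make the heavy IO load less surprising. The sketch below is
not Lucene's merge policy: it assumes equal-sized flushed segments and a
hypothetical rule that merges segments whenever a fixed number of them
(mergeFactor, an illustrative parameter) exist at the same level. It only
counts how often the same data gets written again.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of merge write volume. NOT Lucene's actual merge policy. */
class MergeWriteModel {

    /**
     * Simulates `flushes` equal-sized flushed segments, merging whenever
     * `mergeFactor` segments exist at the same level. Returns the total
     * number of units written: one per flush plus every merge rewrite.
     */
    static long totalUnitsWritten(int flushes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        // counts.get(level) = segments currently at that level; a segment
        // at level k holds mergeFactor^k flushed units.
        List<Integer> counts = new ArrayList<>();
        long written = 0;
        for (int i = 0; i < flushes; i++) {
            written += 1; // the flush itself writes one unit
            int level = 0;
            long size = 1;
            addSegment(counts, level);
            while (counts.get(level) == mergeFactor) {
                counts.set(level, 0);
                size *= mergeFactor;
                written += size; // a merge rewrites every unit it covers
                level++;
                addSegment(counts, level);
            }
        }
        return written;
    }

    private static void addSegment(List<Integer> counts, int level) {
        while (counts.size() <= level) {
            counts.add(0);
        }
        counts.set(level, counts.get(level) + 1);
    }

    public static void main(String[] args) {
        int flushes = 1000;
        long written = totalUnitsWritten(flushes, 10);
        System.out.println("flushed units: " + flushes
                + ", units written including merges: " + written);
    }
}
```

In this model, 1000 flushes with a merge factor of 10 produce 4000 units of
writes: each unit is written once at flush time and rewritten three more
times by merges. Real numbers depend on the merge policy and document sizes.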

You could also experiment with multiple threads, or multiple jobs
writing to separate indexes with a standalone merge at the end.  In my
experience these have generally been more trouble than they're worth,
but the occasions when I do bulk loads of a large number of docs are
sufficiently rare that I'm not too bothered how long it takes.
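
The shape of the "separate indexes, standalone merge at the end" idea can be
sketched without Lucene. In the example below, plain in-memory maps from term
to document ids stand in for real indexes, and all class and method names are
hypothetical. With Lucene itself, the final step would correspond to
IndexWriter.addIndexes(...) on the per-job directories.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Toy stand-in for "build separate indexes in parallel, merge once at the end". */
class ShardedIndexBuild {

    /** Builds term -> doc ids for docs[from, to); a doc id is its array position. */
    static Map<String, List<Integer>> buildShard(String[] docs, int from, int to) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = from; docId < to; docId++) {
            for (String term : docs[docId].toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> ids = postings.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document at most once per term.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                    ids.add(docId);
                }
            }
        }
        return postings;
    }

    /** Builds `shards` partial indexes concurrently, then merges them in one pass. */
    static Map<String, List<Integer>> buildParallel(String[] docs, int shards) {
        ExecutorService pool = Executors.newFixedThreadPool(shards);
        try {
            List<CompletableFuture<Map<String, List<Integer>>>> parts = new ArrayList<>();
            int chunk = (docs.length + shards - 1) / shards;
            for (int s = 0; s < shards; s++) {
                final int from = Math.min(docs.length, s * chunk);
                final int to = Math.min(docs.length, from + chunk);
                parts.add(CompletableFuture.supplyAsync(() -> buildShard(docs, from, to), pool));
            }
            // Standalone merge: shards cover ascending doc-id ranges, so
            // appending in shard order keeps every posting list sorted.
            Map<String, List<Integer>> merged = new TreeMap<>();
            for (CompletableFuture<Map<String, List<Integer>>> part : parts) {
                for (Map.Entry<String, List<Integer>> e : part.join().entrySet()) {
                    merged.computeIfAbsent(e.getKey(), t -> new ArrayList<>())
                          .addAll(e.getValue());
                }
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] docs = {
            "lucene merge policy", "merge scheduler", "lucene index writer", "index merge"
        };
        System.out.println(buildParallel(docs, 2));
    }
}
```

The extra moving parts (a thread pool, shard boundaries, a separate merge
step) are what makes this approach more work to operate than a single writer.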


--
Ian.



On Mon, Dec 22, 2014 at 9:45 AM, Clemens Wyss DEV clemens...@mysign.ch wrote:
 One of our indexes is completely updated quite frequently (batch update
 or re-index).
 When that happens, more than 2 million documents are added to or updated in
 that index. This creates an immense IO load on our system. Does it make sense
 to set the merge scheduler to NoMergeScheduler (and/or the MergePolicy to
 NoMergePolicy)? Or is merging not relevant, as the commit is done only at the
 very end?

 Context information:
 At the moment the writer's config consists only of setRAMBufferSizeMB:
 IndexWriterConfig config = new IndexWriterConfig( IndexManager.CURRENT_LUCENE_VERSION, analyzer );
 config.setMergePolicy( NoMergePolicy.NO_COMPOUND_FILES );
 //config.setMergeScheduler( NoMergeScheduler.INSTANCE );
 config.setRAMBufferSizeMB( 20 );

 The update logic is as follows:
 indexWriter.deleteAll();
 ...
 for all elements do {
     ...
     indexWriter.updateDocument( term, doc ); // in order to omit duplicate entries
     ...
 }
 indexWriter.commit();

 What is the proposed way to perform such a batch update?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: batch-update-pattern, NoMergeScheduler?

2014-12-23 Thread nnagarajayya
You can try out the TimedSerialMergeScheduler. It lets you defer merges to a
set time, for example in the evening, or until after n merge requests ...

http://rankingalgorithm.1050964.n5.nabble.com/TimedSerialMergerScheduler-java-allows-merges-to-be-deferred-to-a-known-time-like-11pm-or-1am-td5706350.html

It is based on the SerialMergeScheduler, so it will block until the merges
complete. As Ian mentioned, a parallel build with a final merge may do the
trick for you, especially if you can build your index on very fast storage
such as an SSD.
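
To illustrate the "after n merge requests" idea, here is a minimal sketch.
It is hypothetical and is not the TimedSerialMergeScheduler source: the class
name, the threshold parameter, and the use of Runnable for a merge are all
assumptions. It only shows the deferral and the blocking, serial execution.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch of deferred, serial merges. NOT TimedSerialMergeScheduler. */
class DeferredSerialMerges {
    private final int threshold;
    private final Queue<Runnable> pending = new ArrayDeque<>();

    DeferredSerialMerges(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /**
     * Queues a merge request. Once `threshold` requests are pending, runs
     * all of them one after another on the calling thread, so the caller
     * blocks until they finish. Returns the number of merges executed.
     */
    synchronized int request(Runnable merge) {
        pending.add(merge);
        if (pending.size() < threshold) {
            return 0; // still deferring
        }
        int ran = 0;
        Runnable next;
        while ((next = pending.poll()) != null) {
            next.run();
            ran++;
        }
        return ran;
    }

    /** Number of merge requests currently deferred. */
    synchronized int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        DeferredSerialMerges gate = new DeferredSerialMerges(3);
        for (int i = 1; i <= 3; i++) {
            final int id = i;
            int ran = gate.request(() -> System.out.println("running merge " + id));
            System.out.println("request " + id + " executed " + ran + " merge(s)");
        }
    }
}
```

Deferring in this way moves the merge IO to a later point but does not remove
it; the deferred merges still have to run, and the caller waits while they do.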

Warm Regards

-Nagendra Nagarajayya
http://solr-ra.tgels.org
http://elasticsearch-ra.tgels.org
http://rankingalgorithm.tgels.org



batch-update-pattern, NoMergeScheduler?

2014-12-22 Thread Clemens Wyss DEV
One of our indexes is completely updated quite frequently (batch update or
re-index).
When that happens, more than 2 million documents are added to or updated in
that index. This creates an immense IO load on our system. Does it make sense
to set the merge scheduler to NoMergeScheduler (and/or the MergePolicy to
NoMergePolicy)? Or is merging not relevant, as the commit is done only at the
very end?

Context information:
At the moment the writer's config consists only of setRAMBufferSizeMB:
IndexWriterConfig config = new IndexWriterConfig( IndexManager.CURRENT_LUCENE_VERSION, analyzer );
config.setMergePolicy( NoMergePolicy.NO_COMPOUND_FILES );
//config.setMergeScheduler( NoMergeScheduler.INSTANCE );
config.setRAMBufferSizeMB( 20 );

The update logic is as follows:
indexWriter.deleteAll();
...
for all elements do {
    ...
    indexWriter.updateDocument( term, doc ); // in order to omit duplicate entries
    ...
}
indexWriter.commit();

What is the proposed way to perform such a batch update?