Re: batch-update-pattern, NoMergeScheduler?
Hi

I can't give an exact answer to your question, but my experience has been that it's best to leave all the merge/buffer/etc. settings alone.

If you are doing a bulk update of a large number of docs then it's no surprise that you are seeing a heavy IO load. If you can, it's likely to be worth giving Lucene a dedicated disk, or at least making sure there's as little contention as possible, but that's just general advice for any workload. There is always going to be a limiting factor somewhere.

You could also experiment with multiple threads, or multiple jobs writing to separate indexes with a standalone merge at the end. In my experience these have generally been more trouble than they're worth, but the occasions when I do bulk loads of a large number of docs are sufficiently rare that I'm not too bothered how long it takes.

--
Ian.

On Mon, Dec 22, 2014 at 9:45 AM, Clemens Wyss DEV clemens...@mysign.ch wrote:

> One of our indexes is updated completely quite frequently ("batch update" or "re-index"). When that happens, more than 2 million documents are added to or updated in that index. This creates an immense IO load on our system. Does it make sense to set the merge scheduler to NoMergeScheduler (and/or the MergePolicy to NoMergePolicy)? Or is merging not relevant, as the commit is done only at the very end?
>
> Context information: at the moment the writer's config consists only of setRAMBufferSizeMB:
>
>     IndexWriterConfig config = new IndexWriterConfig( IndexManager.CURRENT_LUCENE_VERSION, analyzer );
>     config.setMergePolicy( NoMergePolicy.NO_COMPOUND_FILES );
>     //config.setMergeScheduler( NoMergeScheduler.INSTANCE );
>     config.setRAMBufferSizeMB( 20 );
>
> The update logic is as follows:
>
>     indexWriter.deleteAll();
>     ...
>     for all elements do {
>         ...
>         indexWriter.updateDocument( term, doc ); // in order to omit duplicate entries
>         ...
>     }
>     indexWriter.commit();
>
> What is the proposed way to perform such a batch update?
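As a rough illustration of the "multiple threads" variant Ian mentions: IndexWriter is documented as thread-safe, so several workers can share one instance and call updateDocument concurrently. The sketch below only shows that feeding pattern. So that it runs without Lucene on the classpath, the writer is replaced by a hypothetical DocumentSink interface; the class name, the interface, and the placeholder document body are all assumptions made for illustration and are not part of the original setup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelFeed {

    /**
     * Hypothetical stand-in for the one call made per element,
     * e.g. indexWriter.updateDocument(term, doc) in the real job.
     * Implementations must be safe to call from several threads.
     */
    interface DocumentSink {
        void update(String id, String body) throws Exception;
    }

    /** Feeds all ids to one shared sink from a fixed number of worker threads. */
    static void feed(List<String> ids, DocumentSink sink, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Void>> workers = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int offset = t;
                // Worker t handles elements t, t + threads, t + 2 * threads, ...
                Callable<Void> task = () -> {
                    for (int i = offset; i < ids.size(); i += threads) {
                        String id = ids.get(i);
                        sink.update(id, "body of " + id); // placeholder document content
                    }
                    return null;
                };
                workers.add(pool.submit(task));
            }
            for (Future<Void> worker : workers) {
                worker.get(); // waits for the worker and rethrows any failure
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            ids.add("doc-" + i);
        }
        Set<String> seen = ConcurrentHashMap.newKeySet();
        feed(ids, (id, body) -> seen.add(id), 4);
        System.out.println("fed " + seen.size() + " documents");
    }
}
```

In the real job the single commit would still happen once, after feed(...) returns. Whether this helps at all depends on where the bottleneck is: if the disk is already saturated, more threads are unlikely to make the load lighter.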
- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
RE: batch-update-pattern, NoMergeScheduler?
You can try out the TimedSerialMergeScheduler. It allows you to defer merges to a set time in the evening, or until after n merge requests ...

http://rankingalgorithm.1050964.n5.nabble.com/TimedSerialMergerScheduler-java-allows-merges-to-be-deferred-to-a-known-time-like-11pm-or-1am-td5706350.html

It is based on the SerialMergeScheduler, so it will block until the merges complete.

Like Ian said, parallel merge may do the trick for you, especially if you can build your index on very fast IO such as an SSD.

Warm Regards

-Nagendra Nagarajayya
http://solr-ra.tgels.org
http://elasticsearch-ra.tgels.org
http://rankingalgorithm.tgels.org

-----Original Message-----
From: Ian Lea [mailto:ian@gmail.com]
Sent: Tuesday, December 23, 2014 2:54 AM
To: java-user@lucene.apache.org
Subject: Re: batch-update-pattern, NoMergeScheduler?
batch-update-pattern, NoMergeScheduler?
One of our indexes is updated completely quite frequently ("batch update" or "re-index"). When that happens, more than 2 million documents are added to or updated in that index. This creates an immense IO load on our system. Does it make sense to set the merge scheduler to NoMergeScheduler (and/or the MergePolicy to NoMergePolicy)? Or is merging not relevant, as the commit is done only at the very end?

Context information: at the moment the writer's config consists only of setRAMBufferSizeMB:

    IndexWriterConfig config = new IndexWriterConfig( IndexManager.CURRENT_LUCENE_VERSION, analyzer );
    config.setMergePolicy( NoMergePolicy.NO_COMPOUND_FILES );
    //config.setMergeScheduler( NoMergeScheduler.INSTANCE );
    config.setRAMBufferSizeMB( 20 );

The update logic is as follows:

    indexWriter.deleteAll();
    ...
    for all elements do {
        ...
        indexWriter.updateDocument( term, doc ); // in order to omit duplicate entries
        ...
    }
    indexWriter.commit();

What is the proposed way to perform such a batch update?
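On whether merging is irrelevant because the commit comes only at the end: IndexWriter flushes a new segment whenever the buffered documents fill the RAM buffer, independently of commit, so a large batch produces many segments long before commit is called. With a merge policy enabled, those segments then become candidates for merging. The back-of-the-envelope sketch below estimates the number of flushes for the 2 million documents and the 20 MB buffer from the post. The 2 KB of buffered memory per document is purely a hypothetical figure (the real footprint depends on the fields and the analyzer and has to be measured), and the class and method names are made up for illustration.

```java
class FlushEstimate {

    /**
     * Rough number of segment flushes for a batch: total buffered bytes
     * divided by the RAM buffer size, rounded up. This deliberately ignores
     * per-thread buffers and other flush triggers, so treat it as an order
     * of magnitude only.
     */
    static long estimatedFlushes(long docCount, long avgBufferedBytesPerDoc, double ramBufferSizeMB) {
        double bufferBytes = ramBufferSizeMB * 1024 * 1024;
        return (long) Math.ceil(docCount * avgBufferedBytesPerDoc / bufferBytes);
    }

    public static void main(String[] args) {
        long docs = 2_000_000L;     // batch size mentioned in the post
        long bytesPerDoc = 2_048L;  // hypothetical average, measure on real data

        System.out.println(estimatedFlushes(docs, bytesPerDoc, 20));   // prints 196
        System.out.println(estimatedFlushes(docs, bytesPerDoc, 256));  // prints 16
    }
}
```

Under these assumed numbers, the RAM buffer size changes the number of flushed segments by roughly an order of magnitude, which makes setRAMBufferSizeMB one setting to experiment with before disabling merges outright.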