Lucene update performance

2017-05-09 Thread Kudrettin Güleryüz
Hi, For a 5.2.1 index that contains around 1.2 million documents, updating the index with 1.3 million files seems to take 3X longer than indexing from scratch. (Files are crawled over NFS; indexes are stored locally on a mechanical disk with Btrfs.) Is this expected for Lucene's update index logic

Re: Lucene update performance

2017-05-09 Thread Rob Audenaerde
Do you update each entire document (vs. updating numeric docvalues)? That is implemented as 'delete and add', so I guess it will be slower than clean-sheet indexing. Not sure if it is 3x slower; that seems a bit much?

Re: Lucene update performance

2017-05-09 Thread Kudrettin Güleryüz
I do update the entire document each time. Furthermore, this sometimes means deleting compressed archives, which are stored as multiple documents per archive file, and re-adding them. Is there an update method, and does it perform better than remove then add? I was simply removing modif

Re: Lucene update performance

2017-05-09 Thread Rob Audenaerde
As far as I know, the updateDocument method on the IndexWriter deletes and then adds. See also the javadoc: [..] Updates a document by first deleting the document(s) containing term and then adding the new document. The delete and then add are atomic as seen by a reader on the same index (fl
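The delete-then-add behaviour quoted from the javadoc can be illustrated with a small in-memory model. This is only a sketch and not Lucene code: `ToyIndex` and its methods are hypothetical names, documents are plain maps, and the term match is a simple field/value comparison. It shows that one update removes every document containing the term, which matters for the archive case above where one file maps to several documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy in-memory model, NOT Lucene: all names here are hypothetical.
final class ToyIndex {
    // Each "document" is a map of field name to value.
    private final List<Map<String, String>> docs = new ArrayList<>();

    // Plain add: append only, no lookup of existing documents.
    void addDocument(Map<String, String> doc) {
        docs.add(doc);
    }

    // Update: delete every document whose field equals value, then add the new one.
    // Returns how many documents were deleted.
    int updateDocument(String field, String value, Map<String, String> doc) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        int deleted = before - docs.size();
        docs.add(doc);
        return deleted;
    }

    int size() {
        return docs.size();
    }
}
```

In this model the update has to search the existing documents before it can add, while a plain add does not, which is one reason an update pass can cost more than indexing into an empty index.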

Re: Lucene update performance

2017-05-09 Thread Adrien Grand
addDocument can be a significant gain compared to updateDocument as doing a PK lookup on a unique field has a cost that is not negligible compared to indexing a document, especially if the indexing chain is simple (no large text fields with complex analyzers). Reindexing in place will also cause mo

Re: Lucene update performance

2017-05-09 Thread Kudrettin Güleryüz
Fair enough, however, I see this:

$ cat log
Tue May 9 07:19:45 EDT 2017: Indexing starts
Tue May 9 07:32:33 EDT 2017: Deletion starts with a list of 1278635 files
Tue May 9 07:49:47 EDT 2017: Deletion complete, Addition starts with 1272334 files
$ date
Tue May 9 13:12:58 EDT 2017

I am using t

Re: Lucene update performance

2017-05-10 Thread Michael McCandless
IndexWriter simply buffers that Query you passed to deleteDocuments, so that's very fast. Only later on will it (lazily) resolve that Query to the docIDs to delete, which is the costly part, when a merge wants to kick off, or a refresh, or a commit. What Query are you using to identify documents
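The buffering described here can be pictured with a toy model. This is a sketch under stated assumptions, not how IndexWriter is implemented: `LazyDeleteIndex` is a hypothetical class, a "query" is just a predicate over strings, and the model ignores how deletes interact with documents added afterwards. It only shows why the delete call itself returns quickly while the work surfaces later, at commit time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model of buffered deletes, NOT Lucene: all names here are hypothetical.
final class LazyDeleteIndex {
    private final List<String> docs = new ArrayList<>();
    private final List<Predicate<String>> pendingDeletes = new ArrayList<>();
    private long docsScanned = 0;

    void addDocument(String doc) {
        docs.add(doc);
    }

    // Cheap: only remembers the query, touches no documents.
    void deleteDocuments(Predicate<String> query) {
        pendingDeletes.add(query);
    }

    // Costly: every buffered query is resolved against the documents here.
    void commit() {
        for (Predicate<String> query : pendingDeletes) {
            docsScanned += docs.size();
            docs.removeIf(query);
        }
        pendingDeletes.clear();
    }

    int size() {
        return docs.size();
    }

    long docsScanned() {
        return docsScanned;
    }
}
```

With a log like the one earlier in the thread, this would mean the "Deletion complete" timestamp marks only the end of buffering, and the cost of resolving the deletes is paid during the addition phase that follows.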

Re: Lucene update performance

2017-05-10 Thread Kudrettin Güleryüz
I see, that makes more sense now. The query is a BooleanQuery. Here is what I do: https://gist.github.com/Kudret/56879bf30fa129e752895305e1db5a80