[jira] Commented: (LUCENE-1541) Trie range - make trie range indexing more flexible
[ https://issues.apache.org/jira/browse/LUCENE-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678045#action_12678045 ]

Ning Li commented on LUCENE-1541:
---------------------------------

An index size comparison would be great.

> Trie range - make trie range indexing more flexible
> ---------------------------------------------------
>
>                 Key: LUCENE-1541
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1541
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.9
>            Reporter: Ning Li
>            Assignee: Uwe Schindler
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1541.patch, LUCENE-1541.patch
>
>
> In the current trie range implementation, a single precision step is
> specified. With a large precision step (say 8), a value is indexed in fewer
> terms (8), but the number of terms for a range can be large. With a small
> precision step (say 2), the number of terms for a range is smaller, but a
> value is indexed in more terms (32).
> We want to add an option so that different precision steps can be set for
> different precisions. An expert can use this option to keep the number of
> terms for a range small and at the same time index a value in a small number
> of terms. See the discussion in LUCENE-1470 that resulted in this issue.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
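The trade-off in the issue description can be sketched with one line of arithmetic: a 64-bit value indexed at a uniform precision step yields one term per precision level. This is a minimal illustration of that arithmetic (the class and method names are illustrative, not Lucene API):

```java
// Illustrative sketch of the precision-step trade-off: a 64-bit value
// split uniformly produces ceil(64 / precisionStep) indexed terms.
public class PrecisionStepMath {
    public static int termsPerValue(int precisionStep) {
        // One term per precision level, rounding up for steps
        // that do not divide 64 evenly.
        return (64 + precisionStep - 1) / precisionStep;
    }
}
```

So step 8 gives 8 terms per value and step 2 gives 32, matching the numbers quoted above; the cost of a small step shows up here, while its benefit (fewer terms per range) shows up at query time.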
[jira] Commented: (LUCENE-1541) Trie range - make trie range indexing more flexible
[ https://issues.apache.org/jira/browse/LUCENE-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675390#action_12675390 ]

Ning Li commented on LUCENE-1541:
---------------------------------

When one precision step is given, it is converted to this representation, so no array creation is necessary. But something like TrieUtils.FieldConfiguration would be better. Besides the field name and the precision steps, it should either also contain a type (long/int) or have a subclass for each type. It can be used both at indexing time and at query time.
[jira] Commented: (LUCENE-1541) Trie range - make trie range indexing more flexible
[ https://issues.apache.org/jira/browse/LUCENE-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675212#action_12675212 ]

Ning Li commented on LUCENE-1541:
---------------------------------

If you are *really* concerned about the additional loop and the additional array allocations, a long can be used to represent the precision steps. For example, precision steps 2-2-2-2-8-8-8-8-8-16 are represented as 0x80008080808080aa. Then bitCount, shift, and numberOfTrailingZeros can be used to determine the length of the trie array and the individual precision steps. Hmm, do we still have to support Java 1.4?
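The packed representation in the comment above works out as follows: each set bit marks the cumulative end of one step, so 2-2-2-2-8-8-8-8-8-16 (ends at 2, 4, 6, 8, 16, 24, 32, 40, 48, 64) packs into 0x80008080808080aa. A hypothetical encoder/decoder pair (class and method names are mine, not part of any patch) could look like this:

```java
// Hypothetical codec for the packed precision-step long sketched in the
// comment: bit (cumulativeEnd - 1) is set for each step boundary.
public class TrieStepEncoding {
    public static long encodeSteps(int[] steps) {
        long encoded = 0L;
        int end = 0;
        for (int step : steps) {
            end += step;
            encoded |= 1L << (end - 1); // mark cumulative end of this step
        }
        return encoded;
    }

    public static int[] decodeSteps(long encoded) {
        int n = Long.bitCount(encoded);          // number of steps
        int[] steps = new int[n];
        int prevEnd = 0;
        for (int i = 0; i < n; i++) {
            int pos = Long.numberOfTrailingZeros(encoded);
            steps[i] = pos + 1 - prevEnd;        // width of this step
            prevEnd = pos + 1;
            encoded &= encoded - 1;              // clear lowest set bit
        }
        return steps;
    }
}
```

Note that Long.bitCount and Long.numberOfTrailingZeros were added in Java 5, which is exactly why the Java 1.4 question above matters.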
[jira] Created: (LUCENE-1541) Trie range - make trie range indexing more flexible
Trie range - make trie range indexing more flexible
---------------------------------------------------

                Key: LUCENE-1541
                URL: https://issues.apache.org/jira/browse/LUCENE-1541
            Project: Lucene - Java
         Issue Type: Improvement
         Components: contrib/*
           Reporter: Ning Li
           Priority: Minor

In the current trie range implementation, a single precision step is specified. With a large precision step (say 8), a value is indexed in fewer terms (8), but the number of terms for a range can be large. With a small precision step (say 2), the number of terms for a range is smaller, but a value is indexed in more terms (32).

We want to add an option so that different precision steps can be set for different precisions. An expert can use this option to keep the number of terms for a range small and at the same time index a value in a small number of terms. See the discussion in LUCENE-1470 that resulted in this issue.
[jira] Commented: (LUCENE-1470) Add TrieRangeFilter to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674248#action_12674248 ]

Ning Li commented on LUCENE-1470:
---------------------------------

Agree. Do you want to open a new issue? If you want, I can take a crack at it, but probably sometime next week.

> Add TrieRangeFilter to contrib
> ------------------------------
>
>                 Key: LUCENE-1470
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1470
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 2.4
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 2.9
>
>         Attachments: fixbuild-LUCENE-1470.patch, fixbuild-LUCENE-1470.patch,
> LUCENE-1470-readme.patch, LUCENE-1470-revamp.patch, LUCENE-1470-revamp.patch,
> LUCENE-1470-revamp.patch, LUCENE-1470.patch, LUCENE-1470.patch,
> LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch,
> LUCENE-1470.patch, trie.zip, TrieRangeFilter.java, TrieUtils.java,
> TrieUtils.java, TrieUtils.java, TrieUtils.java, TrieUtils.java
>
>
> According to the thread in java-dev
> (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and
> http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to
> include my fast numerical range query implementation in Lucene
> contrib-queries.
> I implemented (based on RangeFilter) another approach for faster
> RangeQueries, based on longs stored in the index in a special format.
> The idea behind this is to store the longs at different precisions in the
> index and to partition the query range in such a way that the outer
> boundaries are searched using terms from the highest precision, but the
> center of the search range with lower precision. The implementation stores
> the longs in 8 different precisions (using a class called TrieUtils). It
> also has support for Doubles, using the IEEE 754 floating-point "double
> format" bit layout with some bit mappings to make them binary sortable.
> The approach is used in rather big indexes; query times are, even on
> low-performance desktop computers, <<100 ms (!) for very big ranges on
> indexes with 50 docs.
> I called this RangeQuery variant and format "TrieRangeRange" query because
> the idea looks like the well-known trie structures (it is not identical to
> real tries, but the algorithms are related).
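The "binary sortable" IEEE 754 mapping mentioned in the description is commonly realized by flipping the 63 value bits of negative doubles, so that signed-long comparison on the mapped bits matches numeric comparison on the doubles. A sketch of that idea (this mirrors the general technique, not necessarily TrieUtils' exact code):

```java
// Sketch of a sortable-bits mapping for doubles: positive doubles already
// sort correctly by their raw IEEE 754 bits; negative doubles sort in
// reverse bit order, so their 63 value bits are flipped.
public class SortableDouble {
    public static long toSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        if (bits < 0) {
            bits ^= 0x7fffffffffffffffL; // flip value bits, keep sign bit
        }
        return bits;
    }
}
```

With this mapping, more-negative doubles map to more-negative longs, and all negatives compare below all positives, which is what lets range partitioning work directly on the term bytes.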
[jira] Commented: (LUCENE-1470) Add TrieRangeFilter to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674051#action_12674051 ]

Ning Li commented on LUCENE-1470:
---------------------------------

Hi Uwe, I had something similar in mind when I said we can "make things more flexible". Do you think it'll be too complex for users to specify? On the other hand, this is for experts, so let experts have all the flexibility. :) We can open a different JIRA issue if we decide to go for it.
[jira] Commented: (LUCENE-1470) Add TrieRangeFilter to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673912#action_12673912 ]

Ning Li commented on LUCENE-1470:
---------------------------------

Good stuff! Is it worth also having an option to specify the number of precisions used to index a value? With a large precision step (say 8), a value is indexed in fewer terms (8), but the number of terms for a range can be large. With a small precision step (say 2), the number of terms for a range is smaller, but a value is indexed in more terms (32). With precision step 2 and the number of precisions set to 24, the number of terms for a range is still quite small, but a value is indexed in 24 terms instead of 32. For applications that usually query small ranges, the number of precisions can be reduced further.

We can provide more options to make things more flexible, but we probably want to balance flexibility against the complexity of user options. Does this number-of-precisions option look like a good one?
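The number-of-precisions idea above amounts to dropping the coarsest levels: a long at step 2 normally yields 64/2 = 32 terms, and a cap of 24 indexes only 24 of them. A tiny illustrative sketch of that cap (names are mine, not Lucene API):

```java
// Illustrative sketch of capping the number of indexed precision levels:
// the full count is ceil(64 / precisionStep); the cap drops the coarsest
// levels beyond maxPrecisions.
public class PrecisionCount {
    public static int termsPerValue(int precisionStep, int maxPrecisions) {
        int full = (64 + precisionStep - 1) / precisionStep;
        return Math.min(full, maxPrecisions);
    }
}
```

The trade-off is that queries spanning ranges wider than the retained precision levels fall back to enumerating more fine-grained terms, which is why the cap suits applications that usually query small ranges.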
[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system
[ https://issues.apache.org/jira/browse/LUCENE-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628025#action_12628025 ]

Ning Li commented on LUCENE-532:
--------------------------------

Does the use of seek-and-write in ChecksumIndexOutput make Lucene less likely to support all-sequential writes (i.e. no seek-then-write)? ChecksumIndexOutput is currently used by SegmentInfos.

> [PATCH] Indexing on Hadoop distributed file system
> --------------------------------------------------
>
>                 Key: LUCENE-532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-532
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 1.9
>            Reporter: Igor Bolotin
>            Priority: Minor
>         Attachments: cfs-patch.txt, indexOnDFS.patch, SegmentTermEnum.patch,
> TermInfosWriter.patch
>
>
> In my current project we needed a way to create very large Lucene indexes on
> the Hadoop distributed file system. When we tried to do it directly on DFS
> using the Nutch FsDirectory class, we immediately found that indexing fails
> because the DfsIndexOutput.seek() method throws
> UnsupportedOperationException. The reason for this behavior is clear: DFS
> does not support random updates, and so the seek() method can't be supported
> (at least not easily).
> Well, if we can't support random updates, the question is: do we really need
> them? A search in the Lucene code revealed two places which call the
> IndexOutput.seek() method: one in TermInfosWriter and another in
> CompoundFileWriter. As we weren't planning to use CompoundFileWriter, the
> only place that concerned us was in TermInfosWriter.
> TermInfosWriter uses IndexOutput.seek() in its close() method to write the
> total number of terms in the file back into the beginning of the file. It
> was very simple to change the file format a little bit and write the number
> of terms into the last 8 bytes of the file instead of writing it into the
> beginning.
> The only other place that needs fixing for this to work is the
> SegmentTermEnum constructor, which must read this piece of information at
> position = file length - 8.
> With this format hack, we were able to use FsDirectory to write an index
> directly to DFS without any problems. Well, we still don't index directly to
> DFS for performance reasons, but at least we can build small local indexes
> and merge them into the main index on DFS without copying the big main index
> back and forth.
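The append-only trick described above is a general pattern: instead of seeking back to patch a count into the header, write it as the last 8 bytes of the file and have the reader fetch it at (file length - 8). A standalone sketch of the pattern over byte arrays (plain Java, not Lucene's IndexOutput/IndexInput API):

```java
// Sketch of the trailer pattern: the writer appends the term count as the
// final 8 big-endian bytes (sequential writes only, no seek()); the reader
// decodes those 8 bytes at position (length - 8).
public class TrailerCount {
    public static byte[] writeWithTrailer(byte[] body, long termCount) {
        byte[] out = new byte[body.length + 8];
        System.arraycopy(body, 0, out, 0, body.length);
        for (int i = 0; i < 8; i++) {
            // big-endian: most significant byte first
            out[body.length + i] = (byte) (termCount >>> (56 - 8 * i));
        }
        return out;
    }

    public static long readTrailer(byte[] file) {
        long v = 0;
        for (int i = 0; i < 8; i++) {
            v = (v << 8) | (file[file.length - 8 + i] & 0xffL);
        }
        return v;
    }
}
```

The same shape is what makes the format writable on append-only file systems like DFS, at the cost of a slightly less self-describing header.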
[jira] Commented: (LUCENE-1335) Correctly handle concurrent calls to addIndexes, optimize, commit
[ https://issues.apache.org/jira/browse/LUCENE-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626158#action_12626158 ]

Ning Li commented on LUCENE-1335:
---------------------------------

Maybe this should be a separate JIRA issue. In doWait(), the comment says "as a defense against thread timing hazards where notifyAll() fails to be called, we wait for at most 1 second..." In some cases, it seems that notifyAll() simply isn't called, such as some of the cases related to runningMerges. Maybe we should take a closer look at, and possibly simplify, the concurrency control in IndexWriter, especially when autoCommit is disabled?

> Correctly handle concurrent calls to addIndexes, optimize, commit
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1335
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1335
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.3, 2.3.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1335.patch, LUCENE-1335.patch, LUCENE-1335.patch,
> LUCENE-1335.patch, LUCENE-1335.patch
>
>
> Spinoff from here:
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200807.mbox/[EMAIL PROTECTED]
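The defensive wait being discussed is a standard pattern: wait with a bounded timeout inside a condition loop, so a missed notifyAll() costs at most one timeout period rather than a hang. A generic, self-contained sketch of the pattern (not IndexWriter's actual code):

```java
// Generic bounded-wait sketch: the condition is re-checked at least once
// per second, so a lost notifyAll() delays the waiter but never hangs it.
public class BoundedWaiter {
    private boolean done = false;

    public synchronized void doWait() {
        boolean interrupted = false;
        while (!done) {
            try {
                wait(1000); // defense against a missed notifyAll()
            } catch (InterruptedException e) {
                interrupted = true; // remember, keep waiting for the condition
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt(); // restore interrupt status
        }
    }

    public synchronized void finish() {
        done = true;
        notifyAll();
    }
}
```

The comment's point stands regardless: the timeout masks missing notifyAll() calls, which is why tightening the notification logic (e.g. around runningMerges) is still worthwhile.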
[jira] Commented: (LUCENE-1335) Correctly handle concurrent calls to addIndexes, optimize, commit
[ https://issues.apache.org/jira/browse/LUCENE-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625455#action_12625455 ]

Ning Li commented on LUCENE-1335:
---------------------------------

> I don't think so: with autoCommit=true, the merges call commit(long)
> after finishing, and I think we want those commit calls to run
> concurrently?

After we disable autoCommit, all commit calls will be serialized, right?

> What'll happen is the BG merge will hit an exception, roll itself
> back, and then the FG thread will pick up the merge and try again.
> Likely it'll hit the same exception, which is then thrown back to the
> caller. It may not hit an exception, eg say it was disk full: the BG
> merge was probably trying to merge 10 segments, whereas the FG merge
> is just copying over the 1 segment. So it may complete successfully
> too.

Back to the issue of running an external merge in the BG or the FG. In ConcurrentMergeScheduler.merge, an external merge is run in the FG, not in the BG. But in ConcurrentMergeScheduler.MergeThread.run, whether a merge is external is no longer checked. Why this difference?
[jira] Commented: (LUCENE-1335) Correctly handle concurrent calls to addIndexes, optimize, commit
[ https://issues.apache.org/jira/browse/LUCENE-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625078#action_12625078 ]

Ning Li commented on LUCENE-1335:
---------------------------------

> It's because commit() calls prepareCommit(), which throws a
> "prepareCommit was already called" exception if the commit was already
> prepared. Whereas commit(long) doesn't call prepareCommit (eg, it
> doesn't need to flush). Without this, I was hitting exceptions in one
> of the tests that calls commit() from multiple threads at the same
> time.

Would it be better to simplify things by serializing all commit()/commit(long) calls?

> This is to make sure any just-started addIndexes cleanly finishes or
> aborts before we enter the wait loop. I was seeing cases where the
> wait loop would think no more merges were pending, but in fact an
> addIndexes was just getting underway and was about to start merging.
> It's OK if a new addIndexes call starts up, because it'll be forced to
> check the stop conditions (closing=true or stopMerges=true) and then
> abort the merges. I'll add comments to this effect.

I wonder if we can simplify the logic... Currently in setMergeScheduler, merges can start between finishMerges and setting the merge scheduler. This one can be fixed by making setMergeScheduler synchronized.

> This method has always carried out merges in the FG, but it's in fact
> possible that a BG merge thread, on finishing a previous merge, may
> pull a merge involving external segments. So I changed this method to
> wait for all such BG merges to complete, because it's not allowed to
> return until there are no more external segments in the index.

Hmm... so merges involving external segments may be in the FG or the BG? So copyExternalSegments not only copies external segments but also waits for BG merges involving external segments to finish. Do we need a better name?

> It is tempting to fully schedule these external merges (ie allow them
> to run in the BG), but there is a problem: if there is some error in
> doing the merge, we need that error to be thrown in the FG thread
> calling copyExternalSegments (so the transaction above unwinds). (Ie
> we can't just stuff these external merges into the merge queue and
> then wait for their completion.)

Then what about those BG merges involving external segments?
[jira] Commented: (LUCENE-1335) Correctly handle concurrent calls to addIndexes, optimize, commit
[ https://issues.apache.org/jira/browse/LUCENE-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624998#action_12624998 ]

Ning Li commented on LUCENE-1335:
---------------------------------

I agree that we should not make any API promises about what it means when these methods (commit, close, rollback, optimize, addIndexes) are called concurrently from different threads. The discussion below is about their current behavior.

> Only one addIndexes can run at once, so a call to a 2nd or further
> addIndexes just blocks until the running one is done.

This is achieved by the read-write lock.

> close() and rollback() wait for any running addIndexes to finish
> and then block new addIndexes calls

Just to clarify: close(waitForMerges=false) and rollback() make an ongoing addIndexes[NoOptimize](dirs) abort, but wait for addIndexes(readers) to finish. It would be nice if they made any addIndexes* abort for a quick shutdown, but that's for later.

> commit() waits for any running addIndexes, or any already running
> commit, to finish, then quickly takes a snapshot of the segments
> and syncs the files referenced by that snapshot. While syncing is
> happening, addIndexes are then allowed to run again.

commit() and commit(long) use the read-write lock to wait for a running addIndexes. "committing" is used to serialize commit() calls. Why isn't it also used to serialize commit(long) calls?

> optimize() is allowed to run concurrently with addIndexes; the two
> simply wait for their respective merges to finish.

This is nice. More detailed comments:
- In finishMerges, acquireRead and releaseRead are both called. Isn't addIndexes allowed again?
- In copyExternalSegments, merges involving external segments are carried out in the foreground. So why the changes? To relax that assumption? But other parts still make the assumption.
- addIndexes(readers) should optimize before startTransaction, no?
- The newly added method segString(dir) in SegmentInfos is not used anywhere.
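The read-write-lock arrangement described above can be sketched in isolation: addIndexes holds the write lock, so only one runs at a time and close/commit can wait it out, while commit holds the read lock, so concurrent commits don't exclude each other. This is a generic sketch of that locking discipline, not IndexWriter's actual implementation:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Generic sketch of the locking discipline discussed above: addIndexes is
// exclusive (write lock); commit is shared (read lock), so it waits for a
// running addIndexes but runs concurrently with other commits.
public class AddIndexesGate {
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    private int addIndexesCalls = 0;
    private int commitCalls = 0;

    public void addIndexes(Runnable work) {
        rwLock.writeLock().lock(); // exclusive: one addIndexes at a time
        try {
            addIndexesCalls++;
            work.run();
        } finally {
            rwLock.writeLock().unlock();
        }
    }

    public void commit(Runnable work) {
        rwLock.readLock().lock(); // shared: blocks only while addIndexes holds the write lock
        try {
            commitCalls++;
            work.run();
        } finally {
            rwLock.readLock().unlock();
        }
    }

    public synchronized int getAddIndexesCalls() { return addIndexesCalls; }
    public synchronized int getCommitCalls() { return commitCalls; }
}
```

Under this discipline, serializing commit() against commit(long) would need an extra mutex on top of the read lock, which is exactly the "committing" flag question raised above.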
[jira] Commented: (LUCENE-1335) Correctly handle concurrent calls to addIndexes, optimize, commit
[ https://issues.apache.org/jira/browse/LUCENE-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624851#action_12624851 ]

Ning Li commented on LUCENE-1335:
---------------------------------

Hi Mike, could you update the patch? I cannot apply it. Thanks!
[jira] Resolved: (LUCENE-1338) With non-deprecated constructors, IndexWriter's autoCommit is always true
[ https://issues.apache.org/jira/browse/LUCENE-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Li resolved LUCENE-1338.
-----------------------------

    Resolution: Invalid

When the deprecated constructors are removed in 3.0, autoCommit will always be false.

> With non-deprecated constructors, IndexWriter's autoCommit is always true
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-1338
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1338
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>            Reporter: Ning Li
>            Priority: Minor
>
> With non-deprecated constructors, IndexWriter's autoCommit is always true.
[jira] Commented: (LUCENE-1338) With non-deprecated constructors, IndexWriter's autoCommit is always true
[ https://issues.apache.org/jira/browse/LUCENE-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614404#action_12614404 ]

Ning Li commented on LUCENE-1338:
---------------------------------

Or is the intention to make autoCommit always false after the deprecated constructors are removed?
[jira] Created: (LUCENE-1338) With non-deprecated constructors, IndexWriter's autoCommit is always true
With non-deprecated constructors, IndexWriter's autoCommit is always true
-------------------------------------------------------------------------

                Key: LUCENE-1338
                URL: https://issues.apache.org/jira/browse/LUCENE-1338
            Project: Lucene - Java
         Issue Type: Bug
         Components: Index
           Reporter: Ning Li
           Priority: Minor

With non-deprecated constructors, IndexWriter's autoCommit is always true.
[jira] Commented: (LUCENE-1228) IndexWriter.commit() does not update the index version
[ https://issues.apache.org/jira/browse/LUCENE-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578518#action_12578518 ]

Ning Li commented on LUCENE-1228:
---------------------------------

Does SegmentInfos really need both "version" and "generation"? Is "generation" sufficient?

> IndexWriter.commit() does not update the index version
> -------------------------------------------------------
>
>                 Key: LUCENE-1228
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1228
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>         Attachments: lucene-1228-commit-reopen.patch
>
>
> IndexWriter.commit() can update the index *version* and *generation*, but
> the update of *version* is lost. As a result, added documents are not seen
> by IndexReader.reopen(). (There might be other side effects that I am not
> aware of.)
> The fix is one line: also update the version in
> SegmentInfos.updateGeneration(). (Finding this line involved more lines
> though... :-) )
[jira] Updated: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance
[ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ning Li updated LUCENE-1035: Attachment: LUCENE-1035.contrib.patch Re-do as a contrib package. Creating BufferPooledDirectory with your customized file name filter for readers allows you to decide which files you want to use the caching layer for. The package includes some tests. I also modified and tested the core tests with the caching layer in a private setting and all tests passed. > Optional Buffer Pool to Improve Search Performance > -- > > Key: LUCENE-1035 > URL: https://issues.apache.org/jira/browse/LUCENE-1035 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Ning Li > Attachments: LUCENE-1035.contrib.patch, LUCENE-1035.patch > > > Index in RAMDirectory provides better performance over that in FSDirectory. > But many indexes cannot fit in memory or applications cannot afford to > spend that much memory on index. On the other hand, because of locality, > a reasonably sized buffer pool may provide good improvement over FSDirectory. > This issue aims at providing such an optional buffer pool layer. In cases > where it fits, i.e. a reasonable hit ratio can be achieved, it should provide > a good improvement over FSDirectory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1204) IndexWriter.deleteDocuments bug
[ https://issues.apache.org/jira/browse/LUCENE-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12575782#action_12575782 ] Ning Li commented on LUCENE-1204: - > I think this is a false alarm. I just found out the same thing. It's a good test though. > IndexWriter.deleteDocuments bug > --- > > Key: LUCENE-1204 > URL: https://issues.apache.org/jira/browse/LUCENE-1204 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Reporter: Yonik Seeley >Assignee: Michael McCandless > Attachments: LUCENE-1204.patch, LUCENE-1204.take2.patch > > > IndexWriter.deleteDocuments() fails random testing -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance
[ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574782#action_12574782 ] Ning Li commented on LUCENE-1035: - > It looks like this was never fully done. I wonder if this should be closed, > esp. since Ning might be working on slightly different problems now. Sorry for the delay. I'll spend some time later this week or early next week to update and make it a contrib patch. > Optional Buffer Pool to Improve Search Performance > -- > > Key: LUCENE-1035 > URL: https://issues.apache.org/jira/browse/LUCENE-1035 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Ning Li > Attachments: LUCENE-1035.patch > > > Index in RAMDirectory provides better performance over that in FSDirectory. > But many indexes cannot fit in memory or applications cannot afford to > spend that much memory on index. On the other hand, because of locality, > a reasonably sized buffer pool may provide good improvement over FSDirectory. > This issue aims at providing such an optional buffer pool layer. In cases > where it fits, i.e. a reasonable hit ratio can be achieved, it should provide > a good improvement over FSDirectory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1194) Add deleteByQuery to IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572957#action_12572957 ] Ning Li commented on LUCENE-1194: - > As of LUCENE-1044, when autoCommit=true, IndexWriter only commits on > committing a merge, not with every flush. I see. Interesting. > Hmmm ... but, there is actually the reverse problem now with my patch: > an auto commit can actually commit deletes before the corresponding > added docs are committed (from updateDocument calls). This is > because, when we commit we only sync & commit the merged segments (not > the flushed segments). Yep. > Though, autoCommit=true is deprecated; once we > remove that (in 3.0) this problem goes away. I'll have to ponder how > to fix that for now up until 3.0...it's tricky. Maybe before 3.0 > we'll just have to flush all deletes whenever we flush a new > segment I think flushing deletes when we flush a new segment is fine before 3.0. In 3.0, is the plan to default autoCommit to false or to disable autoCommit entirely? The latter, right? > Also, I don't think we need updateByQuery? Eg in 3.0 when autoCommit > is hardwired to false then you can deleteDocuments(Query) and then > addDocument(...) and it will be atomic. Agree. When autoCommit is disabled, we don't need any update method. > Add deleteByQuery to IndexWriter > > > Key: LUCENE-1194 > URL: https://issues.apache.org/jira/browse/LUCENE-1194 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > Attachments: LUCENE-1194.patch > > > This has been discussed several times recently: > http://markmail.org/message/awlt4lmk3533epbe > http://www.gossamer-threads.com/lists/lucene/java-user/57384#57384 > If we add deleteByQuery to IndexWriter then this is a big step towards > allowing IndexReader to be readonly. 
> I took the approach suggested in that first thread: I buffer delete > queries just like we now buffer delete terms, holding the max docID > that the delete should apply to. > Then, I also decoupled flushing deletes (mapping term or query --> > actual docIDs that need deleting) from flushing added documents, and > now I flush deletes only when a merge is started, or on commit() or > close(). SegmentMerger now exports the docID map it used when > merging, and I use that to renumber the max docIDs of all pending > deletes. > Finally, I turned off tracking of memory usage of pending deletes > since they now live beyond each flush. Deletes are now only > explicitly flushed if you set maxBufferedDeleteTerms to something > other than DISABLE_AUTO_FLUSH. Otherwise they are flushed at the > start of every merge. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
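The renumbering idea in the quoted text - mapping each buffered delete's max docID through the docID map that SegmentMerger exports - can be sketched as follows. This is a hypothetical standalone illustration; `PendingDelete`, `DeleteRenumberer` and `docMap` are made-up names, not the classes in the actual patch.

```java
import java.util.List;

// Hypothetical sketch: after a merge renumbers documents, remap the
// max docID of every buffered delete through the merger's docID map.
class PendingDelete {
    final String key;   // stands in for the buffered delete Term or Query
    int maxDocId;       // the max docID this delete should apply to
    PendingDelete(String key, int maxDocId) { this.key = key; this.maxDocId = maxDocId; }
}

class DeleteRenumberer {
    // docMap[oldDocId] = newDocId after merging; this sketch assumes the
    // docs at the recorded max docIDs survive the merge.
    static void renumber(List<PendingDelete> pending, int[] docMap) {
        for (PendingDelete d : pending) {
            d.maxDocId = docMap[d.maxDocId];
        }
    }
}
```

With bookkeeping like this, deletes buffered before a merge keep applying to exactly the documents that existed when they were buffered, which is why deletes no longer need to be flushed together with every flush of added documents.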
[jira] Commented: (LUCENE-1194) Add deleteByQuery to IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572576#action_12572576 ] Ning Li commented on LUCENE-1194: - Great to see deleteByQuery being added to IndexWriter! > Then, I also decoupled flushing deletes (mapping term or query --> > actual docIDs that need deleting) from flushing added documents, and > now I flush deletes only when a merge is started, or on commit() or > close(). When autoCommit is true, we have to flush deletes with added documents for update atomicity, don't we? UpdateByQuery can be added, if there is a need. > SegmentMerger now exports the docID map it used when merging, > and I use that to renumber the max docIDs of all pending deletes. Because of renumbering, we don't have to flush deletes at the start of every merge, right? But it is a good time to flush deletes. > Add deleteByQuery to IndexWriter > > > Key: LUCENE-1194 > URL: https://issues.apache.org/jira/browse/LUCENE-1194 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > Attachments: LUCENE-1194.patch > > > This has been discussed several times recently: > http://markmail.org/message/awlt4lmk3533epbe > http://www.gossamer-threads.com/lists/lucene/java-user/57384#57384 > If we add deleteByQuery to IndexWriter then this is a big step towards > allowing IndexReader to be readonly. > I took the approach suggested in that first thread: I buffer delete > queries just like we now buffer delete terms, holding the max docID > that the delete should apply to. > Then, I also decoupled flushing deletes (mapping term or query --> > actual docIDs that need deleting) from flushing added documents, and > now I flush deletes only when a merge is started, or on commit() or > close(). SegmentMerger now exports the docID map it used when > merging, and I use that to renumber the max docIDs of all pending > deletes. 
> Finally, I turned off tracking of memory usage of pending deletes > since they now live beyond each flush. Deletes are now only > explicitly flushed if you set maxBufferedDeleteTerms to something > other than DISABLE_AUTO_FLUSH. Otherwise they are flushed at the > start of every merge. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance
[ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538638 ] Ning Li commented on LUCENE-1035: - > The question is whether such situations are common enough to warrant adding > this to the core. Agree. > A way around that might be to layer it on top of FSDirectory and add it to > contrib. I'd be happy to do that. I'll also include the following in the javadoc, which hopefully is a fair assessment: "When will a buffer pool help:
- When an index is significantly larger than the file system cache, the hit ratio of a buffer pool is probably low, which means an insignificant performance improvement.
- When an index is about the size of the file system cache or smaller, a buffer pool with a good enough hit ratio will help if the IO system calls are the bottleneck. An example is if you have many "AND" queries, which cause a lot of large skips."
> Optional Buffer Pool to Improve Search Performance > -- > > Key: LUCENE-1035 > URL: https://issues.apache.org/jira/browse/LUCENE-1035 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Ning Li > Attachments: LUCENE-1035.patch > > > Index in RAMDirectory provides better performance over that in FSDirectory. > But many indexes cannot fit in memory or applications cannot afford to > spend that much memory on index. On the other hand, because of locality, > a reasonably sized buffer pool may provide good improvement over FSDirectory. > This issue aims at providing such an optional buffer pool layer. In cases > where it fits, i.e. a reasonable hit ratio can be achieved, it should provide > a good improvement over FSDirectory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance
[ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538129 ] Ning Li commented on LUCENE-1035: - > That seems like quite a few docs to retrieve--any particular reason why? I was guessing most applications won't want all 590K results, no? Lucene is used in so many different ways that no single use case represents them all. > I echo Hoss' comment--proximity searching is important even if it isn't used > much directly by users. Hmm, I agree with you and Hoss, especially in applications where proximity is used to rank the results of OR queries. > Optional Buffer Pool to Improve Search Performance > -- > > Key: LUCENE-1035 > URL: https://issues.apache.org/jira/browse/LUCENE-1035 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Ning Li > Attachments: LUCENE-1035.patch > > > Index in RAMDirectory provides better performance over that in FSDirectory. > But many indexes cannot fit in memory or applications cannot afford to > spend that much memory on index. On the other hand, because of locality, > a reasonably sized buffer pool may provide good improvement over FSDirectory. > This issue aims at providing such an optional buffer pool layer. In cases > where it fits, i.e. a reasonable hit ratio can be achieved, it should provide > a good improvement over FSDirectory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance
[ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538112 ] Ning Li commented on LUCENE-1035: - > I'll change to "OR" queries and see what happens.

Query set with average 590K results, retrieving docids for the first 5K:

Buffer Pool Size    Hit Ratio    Queries per second
0                   N/A          1.9
16M                 53%          1.9
32M                 68%          2.0
64M                 90%          2.3
128M/256M/512M      99%          2.3

As Yonik pointed out, in the previous "AND" tests, the bottleneck is the system call to move data from the file system cache to userspace. Here in the "OR" tests, far fewer such calls are made, so the speedup is less significant. Wish I could get a real query workload for this dataset. > Actually, phrase queries would be really interesting too since they hit the > term positions. Phrase queries are rare and term distribution is highly skewed according to the following study on the Excite query log: Spink, Amanda and Xu, Jack L. (2000) "Selected results from a large study of Web searching: the Excite study". Information Research, 6(1) Available at: http://InformationR.net/ir/6-1/paper90.html "4. Phrase Searching: Phrases (terms enclosed by quotation marks) were seldom used - only 1 in 16 queries contained a phrase - but used correctly. 5. Search Terms: Distribution: Jansen, et al., (2000) report the distribution of the frequency of use of terms in queries as highly skewed." I didn't find a good one on the AOL query log. In any case, this buffer pool is not intended for general purpose. I mentioned RAMDirectory earlier. This is more like an alternative to RAMDirectory (that's why it's per directory): you want persistent storage for the index, but you also want search performance close to RAMDirectory's. In addition, the entire index doesn't have to fit into memory, as long as the most queried part does. Hopefully, this benefits a subset of Lucene use cases. > did you compare it against MMAP? The index I experimented on didn't fit in memory...
> Optional Buffer Pool to Improve Search Performance > -- > > Key: LUCENE-1035 > URL: https://issues.apache.org/jira/browse/LUCENE-1035 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Ning Li > Attachments: LUCENE-1035.patch > > > Index in RAMDirectory provides better performance over that in FSDirectory. > But many indexes cannot fit in memory or applications cannot afford to > spend that much memory on index. On the other hand, because of locality, > a reasonably sized buffer pool may provide good improvement over FSDirectory. > This issue aims at providing such an optional buffer pool layer. In cases > where it fits, i.e. a reasonable hit ratio can be achieved, it should provide > a good improvement over FSDirectory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance
[ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537995 ] Ning Li commented on LUCENE-1035: - > most lucene usecases store much more than just the document id... that would > really affect locality. In the experiments, I was simulating the (Google) paradigm where you retrieve just the docids and go to document servers for other things. If store almost always negatively affects locality, we can make the buffer pool sit only on data/files for which we expect good locality (say posting lists), but not others. > It seems like a simple LRU cache could really be blown out of the water by > certain types of queries (retrieve a lot of stored fields, or do an expanding > term query) that would force out all previously cached hotspots. Most OS > level caching has protection against this (multi-level LRU or whatever). But > if our user-level LRU cache fails, we've also messed up the OS level cache > since we've been hiding page hits from it. That's a good point. We can improve the algorithm but hopefully still keep it simple and general. This buffer pool is not a fit-all solution. But hopefully it will benefit a number of use cases. That's why I say "optional". :) > I'd like to see single term queries, "OR" queries, and queries across > multiple fields (also a common usecase) that match more documents tested also. I'll change to "OR" queries and see what happens. The dataset is enwiki with four fields: docid, date (optional), title and body. Most terms are from title and body. > Optional Buffer Pool to Improve Search Performance > -- > > Key: LUCENE-1035 > URL: https://issues.apache.org/jira/browse/LUCENE-1035 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Ning Li > Attachments: LUCENE-1035.patch > > > Index in RAMDirectory provides better performance over that in FSDirectory. 
> But many indexes cannot fit in memory or applications cannot afford to > spend that much memory on index. On the other hand, because of locality, > a reasonably sized buffer pool may provide good improvement over FSDirectory. > This issue aims at providing such an optional buffer pool layer. In cases > where it fits, i.e. a reasonable hit ratio can be achieved, it should provide > a good improvement over FSDirectory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance
[ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537978 ] Ning Li commented on LUCENE-1035: - > Were the tests run using the same set of queries they were warmed for? Yes, the same set of queries was used. The warm-up and the real run are two separate runs, which means the file system cache is warmed, but not the buffer pool. Yes, it'd be much better if a real query log could be obtained. I'll take a look at the AOL query log. I used to have an intranet query log which has a lot of term locality. That's why I think this could provide a good improvement. > There are better ways to optimize for that, e.g., by caching hit lists, no? That's useful, but it's for exact query matches. If there are a lot of shared query terms but not exact query matches, caching hit lists won't help. This is sort of like caching posting lists, but simpler. > Optional Buffer Pool to Improve Search Performance > -- > > Key: LUCENE-1035 > URL: https://issues.apache.org/jira/browse/LUCENE-1035 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Ning Li > Attachments: LUCENE-1035.patch > > > Index in RAMDirectory provides better performance over that in FSDirectory. > But many indexes cannot fit in memory or applications cannot afford to > spend that much memory on index. On the other hand, because of locality, > a reasonably sized buffer pool may provide good improvement over FSDirectory. > This issue aims at providing such an optional buffer pool layer. In cases > where it fits, i.e. a reasonable hit ratio can be achieved, it should provide > a good improvement over FSDirectory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance
[ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537972 ] Ning Li commented on LUCENE-1035: - > I don't think this is any better than the NIOFileCache directory I had > already submitted. Are you referring to LUCENE-414? I just read it and yes, it's similar to the MemoryLRUCache part. Hopefully this is more general, not just for NioFile. > It was not really approved because the community felt that it did not offer much > over the standard OS file system cache. Well, it shows it has its value in cases where you can achieve a reasonable hit ratio, right? This is optional. An application can test with its query log first to see the hit ratio and then decide whether to use a buffer pool and, if so, how large. > Optional Buffer Pool to Improve Search Performance > -- > > Key: LUCENE-1035 > URL: https://issues.apache.org/jira/browse/LUCENE-1035 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Ning Li > Attachments: LUCENE-1035.patch > > > Index in RAMDirectory provides better performance over that in FSDirectory. > But many indexes cannot fit in memory or applications cannot afford to > spend that much memory on index. On the other hand, because of locality, > a reasonably sized buffer pool may provide good improvement over FSDirectory. > This issue aims at providing such an optional buffer pool layer. In cases > where it fits, i.e. a reasonable hit ratio can be achieved, it should provide > a good improvement over FSDirectory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance
[ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ning Li updated LUCENE-1035: Summary: Optional Buffer Pool to Improve Search Performance (was: ptional Buffer Pool to Improve Search Performance) > Optional Buffer Pool to Improve Search Performance > -- > > Key: LUCENE-1035 > URL: https://issues.apache.org/jira/browse/LUCENE-1035 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Ning Li > Attachments: LUCENE-1035.patch > > > Index in RAMDirectory provides better performance over that in FSDirectory. > But many indexes cannot fit in memory or applications cannot afford to > spend that much memory on index. On the other hand, because of locality, > a reasonably sized buffer pool may provide good improvement over FSDirectory. > This issue aims at providing such an optional buffer pool layer. In cases > where it fits, i.e. a reasonable hit ratio can be achieved, it should provide > a good improvement over FSDirectory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1035) ptional Buffer Pool to Improve Search Performance
[ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ning Li updated LUCENE-1035: Lucene Fields: [Patch Available] (was: [New]) > ptional Buffer Pool to Improve Search Performance > - > > Key: LUCENE-1035 > URL: https://issues.apache.org/jira/browse/LUCENE-1035 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Ning Li > Attachments: LUCENE-1035.patch > > > Index in RAMDirectory provides better performance over that in FSDirectory. > But many indexes cannot fit in memory or applications cannot afford to > spend that much memory on index. On the other hand, because of locality, > a reasonably sized buffer pool may provide good improvement over FSDirectory. > This issue aims at providing such an optional buffer pool layer. In cases > where it fits, i.e. a reasonable hit ratio can be achieved, it should provide > a good improvement over FSDirectory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1035) ptional Buffer Pool to Improve Search Performance
[ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ning Li updated LUCENE-1035: Attachment: LUCENE-1035.patch

Coding Changes
--------------
- New classes are localized to the store package, as are most of the changes.
- Two new interfaces: BareInput and BufferPool.
- BareInput takes a subset of IndexInput's methods such as readBytes (IndexInput now implements BareInput).
- BufferPoolLRU is a simple implementation of the BufferPool interface. It uses a doubly linked list for the LRU algorithm.
- BufferPooledIndexInput is a subclass of BufferedIndexInput. It takes a BareInput and a BufferPool. For BufferedIndexInput's readInternal, it reads from the BufferPool, and the BufferPool reads from its cache on a hit and from the BareInput on a miss.
- An FSDirectory object can optionally be created with a BufferPool, whose size is specified by a buffer size and a number of buffers. The BufferPool is shared among the IndexInputs of read-only files in the directory.

Unit tests
----------
- TestBufferPoolLRU.java is added.
- Minor changes were made to _TestHelper.java and TestCompoundFile.java because they made specific assumptions about the type of IndexInput returned by FSDirectory.openInput.
- All unit tests pass when I switch to always use a BufferPool.

Performance Results
-------------------
I ran some experiments using the enwiki dataset. The experiments were run on a dual 2.0GHz Intel Xeon server running Linux. The dataset has about 3.5M documents and the index built from it is more than 3G. The only stored field is a unique docid, which is retrieved for each query result. All queries are two-term AND queries generated from the dictionary. The first set of queries returns between 1 and 1000 results with an average of 40. The second set returns between 1 and 3000 with an average of 560. All tests were run warm. 
1. Query set with average 40 results

Buffer Pool Size    Hit Ratio    Queries per second
0                   N/A          230
16M                 55%          250
32M                 63%          282
64M                 73%          345
128M                85%          476
256M                95%          672
512M                98%          685

2. Query set with average 560 results

Buffer Pool Size    Hit Ratio    Queries per second
0                   N/A          27
16M                 56%          29
32M                 70%          37
64M                 89%          55
128M                97%          67
256M                98%          71
512M                99%          72

Of course, if the tests are run cold, or if the queried portion of the index is significantly larger than the file system cache, or if there is a lot of pre-processing of the queries and/or post-processing of the results, the speedup will be less. But where it applies, i.e. where a reasonable hit ratio can be achieved, it should provide a good improvement. > ptional Buffer Pool to Improve Search Performance > - > > Key: LUCENE-1035 > URL: https://issues.apache.org/jira/browse/LUCENE-1035 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Ning Li > Attachments: LUCENE-1035.patch > > > Index in RAMDirectory provides better performance over that in FSDirectory. > But many indexes cannot fit in memory or applications cannot afford to > spend that much memory on index. On the other hand, because of locality, > a reasonably sized buffer pool may provide good improvement over FSDirectory. > This issue aims at providing such an optional buffer pool layer. In cases > where it fits, i.e. a reasonable hit ratio can be achieved, it should provide > a good improvement over FSDirectory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
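The hit-or-miss read path and LRU policy measured above can be sketched minimally in Java. This is not the patch's BufferPoolLRU (which keeps an explicit doubly linked list over fixed buffers); it is a toy illustration using an access-ordered `LinkedHashMap`, with a made-up `(file, block)` key and a caller-supplied loader standing in for the underlying BareInput.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of an LRU buffer pool keyed by (file, block). On a hit the block is
// served from the pool; on a miss it is read through the loader and cached,
// possibly evicting the least recently used block.
class LruBufferPool {
    private final Map<String, byte[]> pool;
    private long hits, misses;

    LruBufferPool(int maxBuffers) {
        // accessOrder=true turns LinkedHashMap into an LRU list;
        // removeEldestEntry evicts the least recently used block on overflow.
        this.pool = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxBuffers;
            }
        };
    }

    byte[] read(String file, long blockNo, Function<Long, byte[]> loader) {
        String key = file + "@" + blockNo;
        byte[] block = pool.get(key);       // get() also refreshes LRU order
        if (block != null) { hits++; return block; }
        misses++;
        block = loader.apply(blockNo);      // miss: read through to the bare input
        pool.put(key, block);
        return block;
    }

    double hitRatio() { return hits + misses == 0 ? 0 : (double) hits / (hits + misses); }
}
```

A real pool would pre-allocate fixed-size buffers and copy into them rather than allocating per miss; the access-ordered map just makes the LRU bookkeeping explicit in a few lines.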
[jira] Created: (LUCENE-1035) ptional Buffer Pool to Improve Search Performance
ptional Buffer Pool to Improve Search Performance - Key: LUCENE-1035 URL: https://issues.apache.org/jira/browse/LUCENE-1035 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Ning Li Index in RAMDirectory provides better performance over that in FSDirectory. But many indexes cannot fit in memory or applications cannot afford to spend that much memory on index. On the other hand, because of locality, a reasonably sized buffer pool may provide good improvement over FSDirectory. This issue aims at providing such an optional buffer pool layer. In cases where it fits, i.e. a reasonable hit ratio can be achieved, it should provide a good improvement over FSDirectory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1007) Flexibility to turn on/off any flush triggers
[ https://issues.apache.org/jira/browse/LUCENE-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531513 ] Ning Li commented on LUCENE-1007: - One more thing about the approximation of the actual bytes used for buffered delete terms: just remember that Integer.SIZE returns the number of bits used; it should be converted to a number of bytes. > Flexibility to turn on/off any flush triggers > - > > Key: LUCENE-1007 > URL: https://issues.apache.org/jira/browse/LUCENE-1007 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Ning Li >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1007.patch, LUCENE-1007.take2.patch, > LUCENE-1007.take3.patch > > > See discussion at http://www.gossamer-threads.com/lists/lucene/java-dev/53186 > Provide the flexibility to turn on/off any flush triggers - ramBufferSize, > maxBufferedDocs and maxBufferedDeleteTerms. One of ramBufferSize and > maxBufferedDocs must be enabled. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
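The Integer.SIZE point can be made concrete with a one-line sketch (this is an illustration, not the patch's accounting code): Integer.SIZE is a bit count, so a bytes-used estimate must divide it by 8.

```java
// Integer.SIZE is the width of an int in bits (32), not bytes.
// An estimate of bytes used must divide the bit count by 8.
class IntBytes {
    static int intSizeInBytes() {
        return Integer.SIZE / 8;  // 32 / 8 = 4 bytes per int
    }
}
```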
[jira] Updated: (LUCENE-1007) Flexibility to turn on/off any flush triggers
[ https://issues.apache.org/jira/browse/LUCENE-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ning Li updated LUCENE-1007: Attachment: LUCENE-1007.take2.patch Take2 counts buffered delete terms towards ram buffer used. A test case for it is added. > Flexibility to turn on/off any flush triggers > - > > Key: LUCENE-1007 > URL: https://issues.apache.org/jira/browse/LUCENE-1007 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Ning Li >Assignee: Michael McCandless >Priority: Minor > Attachments: LUCENE-1007.patch, LUCENE-1007.take2.patch > > > See discussion at http://www.gossamer-threads.com/lists/lucene/java-dev/53186 > Provide the flexibility to turn on/off any flush triggers - ramBufferSize, > maxBufferedDocs and maxBufferedDeleteTerms. One of ramBufferSize and > maxBufferedDocs must be enabled. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1007) Flexibility to turn on/off any flush triggers
[ https://issues.apache.org/jira/browse/LUCENE-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ning Li updated LUCENE-1007: Attachment: LUCENE-1007.patch

Just got around to doing the patch:
- The patch includes changes to IndexWriter and DocumentsWriter to provide the flexibility to turn on/off any flush triggers.
- Necessary changes to a couple of unit tests.
- Also removed some unused imports.
- All unit tests pass.

One question: should we count buffered delete terms towards the RAM buffer used? It feels like we should. On the other hand, numBytesUsed only counts RAM space which can be recycled.
[jira] Created: (LUCENE-1007) Flexibility to turn on/off any flush triggers
Flexibility to turn on/off any flush triggers

Key: LUCENE-1007
URL: https://issues.apache.org/jira/browse/LUCENE-1007
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Ning Li
Priority: Minor

See discussion at http://www.gossamer-threads.com/lists/lucene/java-dev/53186

Provide the flexibility to turn on/off any flush triggers - ramBufferSize, maxBufferedDocs and maxBufferedDeleteTerms. One of ramBufferSize and maxBufferedDocs must be enabled.
[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527286 ] Ning Li commented on LUCENE-847:

> This was actually intentional: I thought it fine if the application is
> sending multiple threads into IndexWriter to allow merges to run
> concurrently. Because, the application can always back down to a
> single thread to get everything serialized if that's really required?

Today, applications use multiple threads on IndexWriter to get some concurrency on document parsing. With this patch, applications that want concurrent merges would simply use ConcurrentMergeScheduler, no? Or is a rename in order, since it doesn't really serialize merges?

> Factor merge policy out of IndexWriter
> --------------------------------------
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Steven Parkes
> Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.take7.patch, LUCENE-847.txt
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, making it possible for apps to choose a custom merge policy and for easier experimenting with merge policy variants.
[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527239 ] Ning Li commented on LUCENE-847:

Hmm, it's actually possible to have concurrent merges with SerialMergeScheduler. Making SerialMergeScheduler.merge synchronize on SerialMergeScheduler will serialize all merges; a merge can still be concurrent with a RAM flush. Making SerialMergeScheduler.merge synchronize on IndexWriter will serialize all merges and RAM flushes.
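To make the two lock scopes concrete, here is a minimal sketch with stub classes (not the real Lucene types): synchronizing on the scheduler serializes merges only with each other, while synchronizing on the writer also serializes merges against RAM flushes, because flushRam() holds the writer's monitor.

```java
import java.util.concurrent.atomic.AtomicInteger;

class IndexWriterStub {
    final AtomicInteger mergeCount = new AtomicInteger();
    void merge(String spec) { mergeCount.incrementAndGet(); }
    synchronized void flushRam() { /* holds the writer's monitor */ }
}

class SerialMergeSchedulerSketch {
    private final IndexWriterStub writer;
    SerialMergeSchedulerSketch(IndexWriterStub w) { writer = w; }

    // Option 1: lock the scheduler. Merges are serialized with each
    // other, but a merge can still overlap a RAM flush.
    synchronized void mergeLockingScheduler(String spec) {
        writer.merge(spec);
    }

    // Option 2: lock the writer. Merges are serialized with each other
    // AND with flushRam(), since both need the writer's monitor.
    void mergeLockingWriter(String spec) {
        synchronized (writer) {
            writer.merge(spec);
        }
    }
}
```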
[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527224 ] Ning Li commented on LUCENE-847:

Access of mergeThreads in ConcurrentMergeScheduler.merge() should be synchronized.
[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526628 ] Ning Li commented on LUCENE-847:

> OK, another rev of the patch (take6). I think it's close!

Yes, it's close! :)

> I made one simplification to the approach: IndexWriter now keeps track
> of "pendingMerges" (merges that mergePolicy has declared are necessary
> but have not yet been started), and "runningMerges" (merges currently
> in flight). Then MergeScheduler just asks IndexWriter for the next
> pending merge when it's ready to run it. This also cleaned up how
> cascading works.

I like this simplification.

> * Optimize: optimize is now fully concurrent (it can run multiple
> merges at once, new segments can be flushed during an optimize,
> etc). Optimize will optimize only those segments present when it
> started (newly flushed segments may remain separate).

These semantics do add a bit of complexity - segmentsToOptimize, OneMerge.optimize.

> Good idea! I took exactly this approach in the patch I just attached. I
> made a simple change: LogMergePolicy.findMergesForOptimize first
> checks if "normal merging" would want to do merges and returns them if
> so. Since "normal merging" exposes concurrent merges, this gains us
> concurrency for optimize in cases where the index has too many
> segments. I wasn't sure how otherwise to expose concurrency...

Another option is to schedule merges for the newest N segments, then the next newest N segments, and so on, where N is the merge factor.

A couple of other things:
- It seems you intended sync() to be part of the MergeScheduler interface?
- IndexWriter.close([true]), abort(): the behaviour should be the same whether or not the calling thread is the one that actually gets to do the closing. Right now, only the thread that actually does the closing waits for it; the others do not.
[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526029 ] Ning Li commented on LUCENE-847:

Comments on optimize():
- In the while loop of optimize(), LogMergePolicy.findMergesForOptimize returns a merge spec with one merge. If ConcurrentMergeScheduler is used, the one merge will be started in MergeScheduler.merge() and findMergesForOptimize will be called again. Before the one merge finishes, findMergesForOptimize will return the same spec, but that merge is already started. So only one concurrent merge is possible, and the main thread will spin on calling findMergesForOptimize and attempting to merge.
- One possible solution is to make LogMergePolicy.findMergesForOptimize return multiple merge candidates. It allows a higher level of concurrency, and it alleviates the main-thread spinning a bit. To solve the spinning fully, maybe we can check whether a merge has actually started, and sleep briefly if not (which means all merge candidates are in conflict)?

A comment on concurrent merge threads:
- One difference between the current approach to concurrent merge and the patch I posted a while back is that, in the current approach, a MergeThread object is created and started for every concurrent merge. In my old patch, maxThreadCount threads are created and started at the beginning and are used throughout. Both have pros and cons.
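The "newest N, next newest N, ..." idea can be sketched as follows (hypothetical helper, with N the merge factor): group segments from the newest end into mergeFactor-sized candidate merges, leaving any leftover prefix alone.

```java
import java.util.ArrayList;
import java.util.List;

public class OptimizeCandidates {
    // Segments are ordered oldest -> newest; return mergeFactor-sized
    // groups starting from the newest end, each an independent candidate.
    static List<List<String>> candidates(List<String> segments, int mergeFactor) {
        List<List<String>> specs = new ArrayList<>();
        for (int end = segments.size(); end - mergeFactor >= 0; end -= mergeFactor) {
            specs.add(new ArrayList<>(segments.subList(end - mergeFactor, end)));
        }
        return specs;
    }

    public static void main(String[] args) {
        List<String> segs = List.of("s1", "s2", "s3", "s4", "s5");
        // two independent candidates that could merge concurrently
        System.out.println(candidates(segs, 2)); // [[s4, s5], [s2, s3]]
    }
}
```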
[jira] Commented: (LUCENE-992) IndexWriter.updateDocument is no longer atomic
[ https://issues.apache.org/jira/browse/LUCENE-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525271 ] Ning Li commented on LUCENE-992:

The patch looks good! A few comments and/or observations:
- addDocument(Document doc, Analyzer analyzer, Term delTerm): would it be better to name it updateDocument?
- I didn't check all the variable accesses in DocumentsWriter, but it seems abort() should lock for some of the variables it accesses. Or make abort() a synchronized method.
- Observation: large documents will block small documents from being flushed if addDocument of large documents is called before that of small ones. This was not the case before LUCENE-843.

> I also slightly changed the exception semantics in IndexWriter:
> previously if a disk full (or other) exception was hit when flushing
> the buffered docs, the buffered deletes were retained but the
> partially flushed buffered docs (if any) were discarded.

- Observation: before LUCENE-843, both buffered docs and buffered deletes were retained when such an exception occurred. Now both buffered docs and buffered deletes would be discarded if an exception is hit.

> IndexWriter.updateDocument is no longer atomic
> ----------------------------------------------
>
> Key: LUCENE-992
> URL: https://issues.apache.org/jira/browse/LUCENE-992
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: 2.2
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-992.patch
>
> Spinoff from LUCENE-847.
> Ning caught that as of LUCENE-843, we lost the atomicity of the delete + add in IndexWriter.updateDocument.
> Ning suggested a simple fix: move the buffered deletes into DocumentsWriter and let it do the delete + add atomically. This has a nice side effect of also consolidating the "time to flush" logic in DocumentsWriter.
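The suggested fix (buffer the deletes in DocumentsWriter so the delete and the add commit together) can be sketched with hypothetical types; the point is that one monitor covers both buffers:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StagingWriter {
    private final List<String> bufferedDocs = new ArrayList<>();
    private final Set<String> bufferedDeleteTerms = new HashSet<>();

    // delete + add happen under one monitor, so a concurrent flush
    // (also synchronized) sees either both or neither
    synchronized void updateDocument(String delTerm, String doc) {
        bufferedDeleteTerms.add(delTerm);
        bufferedDocs.add(doc);
    }

    synchronized int bufferedCount() {
        return bufferedDeleteTerms.size() + bufferedDocs.size();
    }
}
```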
[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524084 ] Ning Li commented on LUCENE-847:

> Not quite following you here... not being eligible because the merge
> is in-progress in a thread is something I think any given MergePolicy
> should not have to track? Once I factor out CMPW as its own Merger
> subclass I think the eligibility check happens only in IndexWriter?

I was referring to the current patch: LogMergePolicy does not check for eligibility, but CMPW, a subclass of MergePolicy, checks for eligibility. Yes, the eligibility check will only happen in IndexWriter after we do the Merger class.

> Rename to/from what? (It is currently called MergePolicy.optimize).
> IndexWriter steps through the merges and only runs the ones that do
> not conflict (are eligible)?

Maybe rename it to MergePolicy.findMergesToOptimize?

> > The reason I asked is because none of them are used right now. So
> > they might be used in the future?
>
> Both of these methods are now called by IndexWriter (in the patch),
> upon flushing a new segment.

I was referring to the parameters. The parameters are not used.
[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523957 ] Ning Li commented on LUCENE-847:

> True, but I was thinking CMPW could be an exception to this rule. I
> guess I would change the rule to "simple merge policies don't have to
> run their own merges"? :)

Let's see if we have to make that exception.

> Good point... I think I could refactor this so that cascading logic
> lives entirely in one place (IndexWriter).

Another problem with the current cascading in CMPW.MergeThread is, if multiple candidate merges are found, all of them are added to IndexWriter.mergingSegments. But all but the first should be removed, because only the first merge is carried out (and thus removed from mergingSegments after the merge is done). How do you make cascading live entirely in IndexWriter?

Just removing cascading from CMPW.MergeThread has one drawback. For example, say the segment sizes of an index are 40, 20 and 10, the buffer size is 10, and the merge factor is 2. A buffer-full flush of 10 will trigger a merge of 10 & 10, then cascade to 20 & 20, then cascade to 40 & 40. CMPW without cascading will stop after 10 & 10, since IndexWriter.maybeMerge has already returned. Then we have to wait for the next flush to merge 20 & 20.

> How would this be used? Ie, how would one make an IndexWriter that
> uses the ConcurrentMerger? Would we add expert methods
> IndexWriter.set/getIndexMerger(...)? (And presumably the mergePolicy
> is now owned by IndexMerger so it would have the
> set/getMergePolicy(...)?)
>
> Also, how would you separate what remains in IW vs what would be in
> IndexMerger?

Maybe Merger does merging and only merging (so IndexWriter still owns MergePolicy)? Say, the base class Merger.merge simply calls IndexWriter.merge. ConcurrentMerger.merge creates a merge thread if possible; otherwise it calls super.merge, which does a non-concurrent merge. IndexWriter simply calls its merger's merge instead of its own merge.
Everything else remains in IndexWriter.

1.
> Hmm ... you're right. This is a separate issue from merge policy,
> right? Are you proposing buffering deletes in DocumentsWriter
> instead?

Yes, this is a separate issue. And yes, if we consider DocumentsWriter as the staging area.

2.
> Good catch! How to fix? One thing we could do is always use
> SegmentInfo.reset(...) and never swap in clones at the SegmentInfo
> level. This way using the default 'equals' (same instance) would
> work. Or we could establish identity (equals) of a SegmentInfo as
> checking if the directory plus segment name are equal? I think I'd
> lean to the 2nd option

I think the 2nd option is better.

3.
> Hmmm yes. In fact I think we can remove synchronized from optimize
> altogether since within it we are synchronizing(this) at the right
> places? If more than one thread calls optimize at once, externally,
> it is actually OK: they will each pick a merge that's viable
> (concurrently) and will run the merge, else return once there is no
> more concurrency left. I'll add a unit test that confirms this.

That seems to be the case. The fact that "the same merge spec will be returned without changes to segmentInfos" reminds me: MergePolicy.findCandidateMerges finds merges which may not be eligible, but CMPW checks for eligibility when looking for candidate merges. Maybe we should unify the behaviour? BTW, MergePolicy.optimize (a rename?) doesn't check for eligibility either.

4.
> Well, useCompoundFile(...) is given a single newly flushed segment and
> should decide whether it should be CFS. Whereas
> useCompoundDocStore(...) is called when doc stores are flushed. When
> autoCommit=false, segments can share a single set of doc stores, so
> there's no single SegmentInfo to pass down.

The reason I asked is because none of them are used right now. So they might be used in the future?
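The Merger/ConcurrentMerger split proposed above might look like this sketch (all names here are hypothetical stubs, not the real Lucene classes): the base class runs merges inline, and the concurrent subclass spins up a thread when capacity allows, falling back to the serial path otherwise.

```java
import java.util.concurrent.atomic.AtomicInteger;

class WriterStub {
    final AtomicInteger mergesRun = new AtomicInteger();
    void merge(String oneMerge) { mergesRun.incrementAndGet(); }
}

class Merger {
    protected final WriterStub writer;
    Merger(WriterStub w) { writer = w; }
    // base class: run the merge inline on the calling thread
    void merge(String oneMerge) { writer.merge(oneMerge); }
}

class ConcurrentMerger extends Merger {
    private final int maxThreadCount;
    private int activeThreads = 0;

    ConcurrentMerger(WriterStub w, int maxThreadCount) {
        super(w);
        this.maxThreadCount = maxThreadCount;
    }

    @Override
    synchronized void merge(String oneMerge) {
        if (activeThreads < maxThreadCount) {
            activeThreads++;                       // run in a merge thread
            new Thread(() -> {
                try { writer.merge(oneMerge); }
                finally { synchronized (ConcurrentMerger.this) { activeThreads--; } }
            }).start();
        } else {
            super.merge(oneMerge);                 // fall back to non-concurrent merge
        }
    }
}
```

With maxThreadCount set to 0 every merge takes the serial fallback path, which is the deterministic case.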
[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523621 ] Ning Li commented on LUCENE-847:

I include comments for both LUCENE-847 and LUCENE-870 here since they are closely related.

I like the stateless approach used for refactoring the merge policy. But modeling concurrent merge (ConcurrentMergePolicyWrapper) as a MergePolicy seems inconsistent with the MergePolicy interface:

1. As you pointed out, "the merge policy is no longer responsible for running the merges itself". MergePolicy.maybeMerge simply returns a merge specification. But ConcurrentMergePolicyWrapper.maybeMerge actually starts concurrent merge threads, thus doing the merges.
2. Related to 1, cascading is done in IndexWriter in the non-concurrent case. But in the concurrent case, cascading is also done in merge threads which are started by ConcurrentMergePolicyWrapper.maybeMerge.

MergePolicy.maybeMerge should continue to simply return a merge specification. (BTW, should we rename maybeMerge to, say, findCandidateMerges?) Can we carve the merge process out of IndexWriter into a Merger? IndexWriter still provides the building blocks - merge(OneMerge), mergeInit(OneMerge), etc. Merger uses these building blocks. A ConcurrentMerger extends Merger but starts concurrent merge threads as ConcurrentMergePolicyWrapper does.

Other comments:

1. updateDocument's and deleteDocument's bufferDeleteTerm are synchronized on different variables in this patch. However, the semantics of updateDocument have changed since LUCENE-843. Before LUCENE-843, updateDocument, which is a delete and an insert, guaranteed the delete and the insert are committed together (thus an update). Now it's possible that they are committed in different transactions. If we consider DocumentsWriter as the RAM staging area for IndexWriter, then deletes are also buffered in the RAM staging area and we can restore the previous semantics, right?
2. OneMerge.segments seems to rely on its segment infos' references to the segment infos of IndexWriter.segmentInfos. The use in commitMerge, which calls ensureContiguousMerge, is an example. However, segmentInfos can be a cloned copy because of exceptions, and thus the references broken.
3. Calling optimize on an IndexWriter with the current ConcurrentMergePolicyWrapper may cause deadlock: the one merge spec returned by MergePolicy.optimize may be in conflict with a concurrent merge (the same merge spec will be returned without changes to segmentInfos), but the concurrent merge cannot finish because optimize is holding the lock.
4. Finally, a couple of minor things:
- LogMergePolicy.useCompoundFile(SegmentInfos infos, SegmentInfo info) and useCompoundDocStore(SegmentInfos infos): why the parameters?
- Do we need doMergeClose in IndexWriter? Can we simply close a MergePolicy if not null?
[jira] Updated: (LUCENE-987) Deprecate IndexModifier
[ https://issues.apache.org/jira/browse/LUCENE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ning Li updated LUCENE-987: Attachment: deprecateIndexModifier.patch
[jira] Created: (LUCENE-987) Deprecate IndexModifier
Deprecate IndexModifier

Key: LUCENE-987
URL: https://issues.apache.org/jira/browse/LUCENE-987
Project: Lucene - Java
Issue Type: Test
Components: Index
Reporter: Ning Li
Priority: Minor

See discussion at http://www.gossamer-threads.com/lists/lucene/java-dev/52017?search_string=deprecating%20indexmodifier;#52017

This is to deprecate IndexModifier before 3.0 and remove it in 3.0. This patch includes:
1. IndexModifier and TestIndexModifier are deprecated.
2. TestIndexWriterModify is added. It is similar to TestIndexModifier but uses IndexWriter and has a few other changes. The changes are because of the differences between IndexModifier and IndexWriter.
3. TestIndexWriterLockRelease and TestStressIndexing are switched to use IndexWriter instead of IndexModifier.
[jira] Updated: (LUCENE-978) GC resources in TermInfosReader when exception occurs in its constructor
[ https://issues.apache.org/jira/browse/LUCENE-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ning Li updated LUCENE-978: Attachment: Readers.patch

Similar fixes are added for FieldsReader and TermVectorsReader as well.
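The fix pattern shared by all three readers can be sketched generically (hypothetical class; the real readers open several files): track success in the constructor and close what was opened before the exception propagates.

```java
import java.io.Closeable;
import java.io.IOException;

class SketchReader {
    private final Closeable input;

    // If anything after the open fails, close the handle before the
    // exception propagates, so no file is leaked.
    SketchReader(Closeable input, boolean failAfterOpen) throws IOException {
        boolean success = false;
        try {
            this.input = input;
            if (failAfterOpen) {
                throw new IOException("simulated failure in constructor");
            }
            success = true;
        } finally {
            if (!success) {
                input.close();
            }
        }
    }
}
```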
[jira] Commented: (LUCENE-978) GC resources in TermInfosReader when exception occurs in its constructor
[ https://issues.apache.org/jira/browse/LUCENE-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520286 ] Ning Li commented on LUCENE-978:

> Agreed. Actually, it also looks like we need to do something similar for
> FieldsReader/TermVectorsReader too?

That's right. I'll submit a new patch.
[jira] Updated: (LUCENE-978) GC resources in TermInfosReader when exception occurs in its constructor
[ https://issues.apache.org/jira/browse/LUCENE-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ning Li updated LUCENE-978: Lucene Fields: [Patch Available] (was: [New])
[jira] Updated: (LUCENE-978) GC resources in TermInfosReader when exception occurs in its constructor
[ https://issues.apache.org/jira/browse/LUCENE-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ning Li updated LUCENE-978: Attachment: TermInfosReader.patch
[jira] Created: (LUCENE-978) GC resources in TermInfosReader when exception occurs in its constructor
GC resources in TermInfosReader when exception occurs in its constructor

Key: LUCENE-978
URL: https://issues.apache.org/jira/browse/LUCENE-978
Project: Lucene - Java
Issue Type: Bug
Components: Index
Reporter: Ning Li
Priority: Minor
Attachments: TermInfosReader.patch

I replaced IndexModifier with IndexWriter in the test case TestStressIndexing and noticed the test failed from time to time because some .tis file was still open when MockRAMDirectory.close() was called. It turns out this is because the .tis file is not closed if an exception occurs in TermInfosReader's constructor.
[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518520 ] Ning Li commented on LUCENE-847: > Furthermore, I think this is all contained within IndexWriter, right? > Ie when we go to "replace/checkin" the newly merged segment, this > "merge newly flushed deletes" would execute at that time. And, I > think, we would block flushes while this is happening, but > addDocument/deleteDocument/updateDocument would still be allowed? Yes and yes. :-) > Couldn't we also just update the docIDs of pending deletes, and not > flush? Ie we know the mapping of old -> new docID caused by the > merge, so we can run through all deleted docIDs and remap? Hmm, I was worried quite a number of delete docIDs could be buffered, but I guess it's still better than having to do a flush. So yes, this is better! > Factor merge policy out of IndexWriter > -- > > Key: LUCENE-847 > URL: https://issues.apache.org/jira/browse/LUCENE-847 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Steven Parkes >Assignee: Steven Parkes > Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, > LUCENE-847.txt > > > If we factor the merge policy out of IndexWriter, we can make it pluggable, > making it possible for apps to choose a custom merge policy and for easier > experimenting with merge policy variants. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
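The "remap instead of flush" idea agreed above can be sketched as follows (assumed names, not Lucene's actual API): after a merge, the old-to-new docID mapping is known, so buffered delete-by-docID entries are simply translated through it.

```java
// Hypothetical sketch of remapping buffered delete docIDs after a merge.
// oldToNew[i] holds the post-merge docID of pre-merge doc i (a merge that
// only renumbers live docs gives every buffered delete a valid new id).
class DeleteRemapper {
    static int[] remap(int[] bufferedDeleteDocIds, int[] oldToNew) {
        int[] remapped = new int[bufferedDeleteDocIds.length];
        for (int i = 0; i < bufferedDeleteDocIds.length; i++) {
            remapped[i] = oldToNew[bufferedDeleteDocIds[i]];
        }
        return remapped;
    }
}
```

This is cheap relative to a flush: one array pass over the buffered ids, with no segment written to disk.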
[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518486 ] Ning Li commented on LUCENE-847: The following comments are about the impact on merge if we add "deleteDocument(int doc)" (and deprecate IndexModifier). Since it concerns the topic in this issue, I also post it here to get your opinions. I'm thinking about the impact of adding "deleteDocument(int doc)" on LUCENE-847, especially on concurrent merge. The semantics of "deleteDocument(int doc)" is that the document to delete is specified by the document id on the index at the time of the call. When a merge is finished and the result is being checked into IndexWriter's SegmentInfos, document ids may change. Therefore, it may be necessary to flush buffered delete doc ids (thus buffered docs and delete terms as well) before a merge result is checked in. The flush is not necessary if there is no buffered delete doc ids. I don't think it should be the reason not to support "deleteDocument(int doc)" in IndexWriter. But its impact on concurrent merge is a concern. > Factor merge policy out of IndexWriter > -- > > Key: LUCENE-847 > URL: https://issues.apache.org/jira/browse/LUCENE-847 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Steven Parkes >Assignee: Steven Parkes > Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, > LUCENE-847.txt > > > If we factor the merge policy out of IndexWriter, we can make it pluggable, > making it possible for apps to choose a custom merge policy and for easier > experimenting with merge policy variants. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518453 ] Ning Li commented on LUCENE-847: On 8/8/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote: > Actually I was talking about my idea (to "simplify MergePolicy.merge > API"). With the simplification (whereby MergePolicy.merge just > returns the MergeSpecification instead of driving the merge itself) I > believe it's simple to make a concurrency wrapper around any merge > policy, and, have all necessary locking for SegmentInfos inside > IndexWriter. I agree with Mike. In fact, MergeSelector.select, which is the counterpart of MergePolicy.merge in the patch I submitted for concurrent merge, simply returns a MergeSpecification. It's simple and sufficient to have all necessary locking for SegmentInfos in one class, say IndexWriter. For example, IndexWriter locks SegmentInfos when MergePolicy(MergeSelector) picks a merge spec. As another example, when a merge is finished, IndexWriter.checkin is called, which locks SegmentInfos and replaces the source segment infos with the target segment info. On 8/7/07, Steven Parkes (JIRA) <[EMAIL PROTECTED]> wrote: > The synchronization is still tricky, since parts of segmentInfos are > getting changed at various times and there are references and/or > copies of it in other places. And as Ning pointed out to me, we also > have to deal with buffered delete terms. I'd say I got about 80% of > the way there on the last go around. I'm hoping to get all the way > this time. It just occurred to me that there is a neat way to handle deletes that are flushed during a concurrent merge. For example, MergePolicy decides to merge segments B and C, with B's delete file 0001 and C's 100. When the concurrent merge finishes, B's delete file becomes 0011 and C's 110. We do a simple computation on the delete bit vectors and check in the merged segment with delete file 00110. 
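The "simple computation on the delete bit vectors" in the example above can be sketched like this (toy types, not Lucene's): the merge was started with the old deletes already applied, so the merged segment contains only docs live under the old vectors; deletes that arrived during the merge (set in the new vector but not the old) are translated to merged-segment positions by counting surviving docs.

```java
import java.util.BitSet;

// Hypothetical sketch of combining per-segment delete vectors into the
// merged segment's delete vector, matching the B (0001 -> 0011) and
// C (100 -> 110) example: the result marks merged docs 2 and 3 (00110).
class DeleteMerger {
    // olds[s]/news[s]: pre-/post-merge deletes of source segment s;
    // sizes[s]: doc count of segment s. Segments are concatenated in order.
    static BitSet mergeDeletes(BitSet[] olds, BitSet[] news, int[] sizes) {
        BitSet merged = new BitSet();
        int base = 0;                       // first merged docID of segment s
        for (int s = 0; s < sizes.length; s++) {
            int live = 0;                   // live docs seen so far in segment s
            for (int doc = 0; doc < sizes[s]; doc++) {
                if (olds[s].get(doc)) continue;              // gone before merge
                if (news[s].get(doc)) merged.set(base + live); // deleted during it
                live++;
            }
            base += live;
        }
        return merged;
    }
}
```

Only the newly set bits matter; the old deletions are already absent from the merged segment, which is why no flush of buffered state is needed.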
[jira] Commented: (LUCENE-938) I/O exceptions can cause loss of buffered deletes
[ https://issues.apache.org/jira/browse/LUCENE-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512271 ] Ning Li commented on LUCENE-938: I didn't make myself clear. Let me try again. The patch includes two parts of changes to IndexWriter: one adds localNumBufferedDeleteTerms and localBufferedDeleteTerms and uses them in startTransaction() and rollbackTransaction(); the other fixes loss of buffered deletes in flush() (and applyDeletes() which is used by flush()). The second part is good and that's where you had the comment on cloning. I was referring to the first part. In startTransaction(), "localBufferedDeleteTerms = bufferedDeleteTerms" reference-copies bufferedDeleteTerms. Then more delete terms are buffered into bufferedDeleteTerms... so localBufferedDeleteTerms would have the delete terms buffered between startTransaction() and the first flush()... > I/O exceptions can cause loss of buffered deletes > - > > Key: LUCENE-938 > URL: https://issues.apache.org/jira/browse/LUCENE-938 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Reporter: Steven Parkes >Assignee: Steven Parkes > Fix For: 2.3 > > Attachments: LUCENE-938.take2.patch, LUCENE-938.txt, LUCENE-938.txt > > > Some I/O exceptions that result in segmentInfos rollback operations can cause > buffered deletes that existed before the rollback creation point to be > incorrectly lost when the IOException triggers a rollback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
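The reference-copy problem described above can be shown with a toy sketch (assumed structure, not the actual IndexWriter fields): saving only a reference in startTransaction() means terms buffered after the save point also appear in the "saved" state, so rollback cannot restore it; a defensive copy is required.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the startTransaction()/rollbackTransaction() pitfall.
class BufferedDeletes {
    Map<String, Integer> bufferedDeleteTerms = new HashMap<>();
    private Map<String, Integer> saved;

    void startTransaction() {
        // WRONG: saved = bufferedDeleteTerms;   // reference copy - later
        //        buffered terms would leak into the saved state
        saved = new HashMap<>(bufferedDeleteTerms);   // defensive copy
    }

    void rollbackTransaction() {
        bufferedDeleteTerms = saved;   // restore the state at the save point
    }
}
```

With the reference copy, terms buffered between startTransaction() and the first flush() would survive a rollback, which is exactly the behavior questioned in the comment.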
[jira] Commented: (LUCENE-938) I/O exceptions can cause loss of buffered deletes
[ https://issues.apache.org/jira/browse/LUCENE-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12510422 ] Ning Li commented on LUCENE-938: Good catch, Steven! One thing though: I thought we had assumed that there wouldn't be any buffered docs or delete terms when startTransaction() is called, so no local copies are necessary. That means no change to startTransaction() and rollbackTransaction(). If there could be buffered docs and delete terms at startTransaction(), then local copies should be made for the buffered docs, and localBufferedDeleteTerms should clone bufferedDeleteTerms instead of just copying the reference.
[jira] Updated: (LUCENE-847) Factor merge policy out of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ning Li updated LUCENE-847: --- Attachment: concurrentMerge.patch Here is a patch for concurrent merge as discussed in: http://www.gossamer-threads.com/lists/lucene/java-dev/45651?search_string=concurrent%20merge;#45651 I put it under this issue because it helps design and verify a factored merge policy which would provide good support for concurrent merge. As described before, a merge thread is started when a writer is created and stopped when the writer is closed. The merge process consists of three steps: first, create a merge task/spec; then, carry out the actual merge; finally, "commit" the merged segment (replace segments it merged in segmentInfos), but only after appropriate deletes are applied. The first and last steps are fast and synchronous. The second step is where concurrency is achieved. Does it make sense to capture them as separate steps in the factored merge policy? As discussed in http://www.gossamer-threads.com/lists/lucene/java-dev/45651?search_string=concurrent%20merge;#45651: documents can be buffered while segments are merged, but no more than maxBufferedDocs can be buffered at any time. So this version provides limited concurrency. The main goal is to achieve short ingestion hiccups, especially when the ingestion rate is low. After the factored merge policy, we could provide different versions of concurrent merge policies which provide different levels of concurrency. :-) All unit tests pass. If IndexWriter is replaced with IndexWriterConcurrentMerge, all unit tests pass except the following: - TestAddIndexesNoOptimize and TestIndexWriter* This is because they check segment sizes expecting all merges are done. These tests pass if these checks are performed after the concurrent merges finish. The modified tests (with waits for concurrent merges to finish) are in TestIndexWriterConcurrentMerge*. 
- testExactFieldNames in TestBackwardCompatibility and testDeleteLeftoverFiles in TestIndexFileDeleter: in both cases, the file name segments_a is expected, but the actual is segments_7. This is because with concurrent merge, if compound file is used, only the compound version is "committed" (added to segmentInfos), not the non-compound version, thus the lower segments generation number. Cheers, Ning
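The three merge steps described in the patch comment above (select a spec, do the merge, check in the result) can be sketched with toy "segments" (just doc counts, not Lucene's SegmentInfos; all names here are assumptions): only steps 1 and 3 hold the lock, so step 2, the expensive one, runs concurrently with indexing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the three-step concurrent merge: spec selection
// and check-in are synchronized and fast; the merge itself is not locked.
class ConcurrentMergeSketch {
    final List<Integer> segmentInfos = new ArrayList<>();

    // Step 1: pick a contiguous range [start, end) of segments to merge.
    synchronized int[] selectMerge() {
        return segmentInfos.size() >= 2 ? new int[]{0, 2} : null;
    }

    // Step 2: do the merge work outside the lock, on a snapshot of the sources.
    static int doMerge(List<Integer> sources) {
        return sources.stream().mapToInt(Integer::intValue).sum();
    }

    // Step 3: check in - replace the source segments with the merged one.
    synchronized void checkin(int start, int end, int merged) {
        segmentInfos.subList(start, end).clear();
        segmentInfos.add(start, merged);
    }
}
```

A real implementation must also apply the deletes that accumulated during step 2 before step 3, as discussed in the earlier comments on this issue.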
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ https://issues.apache.org/jira/browse/LUCENE-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ning Li updated LUCENE-565: --- Lucene Fields: [Patch Available] > Supporting deleteDocuments in IndexWriter (Code and Performance Results > Provided) > - > > Key: LUCENE-565 > URL: https://issues.apache.org/jira/browse/LUCENE-565 > Project: Lucene - Java > Issue Type: Bug > Components: Index > Reporter: Ning Li > Attachments: NewIndexModifier.Jan2007.patch, > NewIndexModifier.Sept21.patch, perf-test-res.JPG, perf-test-res2.JPG, > perfres.log, TestBufferedDeletesPerf.java > > > Today, applications have to open/close an IndexWriter and open/close an > IndexReader directly or indirectly (via IndexModifier) in order to handle a > mix of inserts and deletes. This performs well when inserts and deletes > come in fairly large batches. However, the performance can degrade > dramatically when inserts and deletes are interleaved in small batches. > This is because the ramDirectory is flushed to disk whenever an IndexWriter > is closed, causing a lot of small segments to be created on disk, which > eventually need to be merged. > We would like to propose a small API change to eliminate this problem. We > are aware that this kind of change has come up in discussions before. See > http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049 > . The difference this time is that we have implemented the change and > tested its performance, as described below. > API Changes > --- > We propose adding a "deleteDocuments(Term term)" method to IndexWriter. > Using this method, inserts and deletes can be interleaved using the same > IndexWriter. > Note that, with this change, it would be very easy to add another method to > IndexWriter for updating documents, allowing applications to avoid a > separate delete and insert to update a document. 
> Also note that this change can co-exist with the existing APIs for deleting > documents using an IndexReader. But if our proposal is accepted, we think > those APIs should probably be deprecated. > Coding Changes > -- > Coding changes are localized to IndexWriter. Internally, the new > deleteDocuments() method works by buffering the terms to be deleted. > Deletes are deferred until the ramDirectory is flushed to disk, either > because it becomes full or because the IndexWriter is closed. Using Java > synchronization, care is taken to ensure that an interleaved sequence of > inserts and deletes for the same document are properly serialized. > We have attached a modified version of IndexWriter in Release 1.9.1 with > these changes. Only a few hundred lines of coding changes are needed. All > changes are commented by "CHANGE". We have also attached a modified version > of an example from Chapter 2.2 of Lucene in Action. > Performance Results > --- > To test the performance our proposed changes, we ran some experiments using > the TREC WT 10G dataset. The experiments were run on a dual 2.4 Ghz Intel > Xeon server running Linux. The disk storage was configured as RAID0 array > with 5 drives. Before indexes were built, the input documents were parsed > to remove the HTML from them (i.e., only the text was indexed). This was > done to minimize the impact of parsing on performance. A simple > WhitespaceAnalyzer was used during index build. > We experimented with three workloads: > - Insert only. 1.6M documents were inserted and the final > index size was 2.3GB. > - Insert/delete (big batches). The same documents were > inserted, but 25% were deleted. 1000 documents were > deleted for every 4000 inserted. > - Insert/delete (small batches). In this case, 5 documents > were deleted for every 20 inserted. 
>
>   Workload                        current IndexWriter   current IndexModifier   new IndexWriter
>   ---------------------------------------------------------------------------------------------
>   Insert only                     116 min               119 min                 116 min
>   Insert/delete (big batches)     --                    135 min                 125 min
>   Insert/delete (small batches)   --                    338 min                 134 min
>
> As the experiments show, with the proposed changes, the performance > improved by 60% when inserts and deletes were interleaved in small batches. > Regards, > Ning > Ning Li > Search Technologies > IBM Almaden Research Center > 650 Harry Road > San Jose, CA 95120 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
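The buffering scheme proposed in the description above can be sketched with a toy writer (assumed structure, not the actual patch): documents and delete terms are buffered with sequence numbers, so that at flush() a delete only removes documents added before it, which is the "proper serialization" the description mentions.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of buffered deleteDocuments: nothing touches the "index"
// until flush(), and interleaved add/delete order is preserved.
class BufferingWriter {
    private static final class Op {
        final String term; final long seq; final boolean delete;
        Op(String term, long seq, boolean delete) {
            this.term = term; this.seq = seq; this.delete = delete;
        }
    }

    private final List<Op> buffered = new ArrayList<>();
    private long seq = 0;
    final List<String> index = new ArrayList<>();   // stands in for the on-disk index

    void addDocument(String id)     { buffered.add(new Op(id, seq++, false)); }
    void deleteDocuments(String id) { buffered.add(new Op(id, seq++, true)); }

    void flush() {
        List<Op> docs = new ArrayList<>();
        for (Op op : buffered) {
            if (!op.delete) { docs.add(op); continue; }
            // a delete only affects docs buffered before it
            docs.removeIf(d -> d.term.equals(op.term) && d.seq < op.seq);
        }
        for (Op d : docs) index.add(d.term);
        buffered.clear();
    }
}
```

Deferring the deletes to flush() is what avoids the open/close cycle of IndexWriter/IndexReader that the description identifies as the cause of the small-batch slowdown.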
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ https://issues.apache.org/jira/browse/LUCENE-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ning Li updated LUCENE-565: --- Attachment: NewIndexModifier.Jan2007.patch The patch is updated because of the code committed to IndexWriter since the last patch. The high-level design is the same as before; see the comments of 18/Dec/06. Care has been taken to make sure that if the writer/modifier hits a disk-full error while trying to commit, it remains consistent and usable. A test case for this is added to TestNewIndexModifierDelete. All tests pass.
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12459506 ] Ning Li commented on LUCENE-565: Here is the design overview. Minor changes were made because of lock-less commits. In the current IndexWriter, newly added documents are buffered in ram in the form of one-doc segments. When a flush is triggered, all ram documents are merged into a single segment and written to disk. Further merges of disk segments may be triggered. NewIndexModifier extends IndexWriter and supports document deletion in addition to document addition. NewIndexModifier not only buffers newly added documents in ram, but also buffers deletes in ram. The following describes what happens when a flush is triggered: 1 merge ram documents into one segment and written to disk do not commit - segmentInfos is updated in memory, but not written to disk 2 for each disk segment to which a delete may apply open reader delete docs*, write new .delN file (* Care is taken to ensure that an interleaved sequence of inserts and deletes for the same document are properly serialized.) close reader, but do not commit - segmentInfos is updated in memory, but not written to disk 3 commit - write new segments_N to disk Further merges for disk segments work the same as before. As an option, we can cache readers to minimize the number of reader opens/closes. In other words, we can trade memory for better performance. The design would be modified as follows: 1 same as above 2 for each disk segment to which a delete may apply open reader and cache it if not already opened/cached delete docs*, write new .delN file 3 commit - write new segments_N to disk The logic for disk segment merge changes accordingly: open reader if not already opened/cached; after a merge is complete, close readers for the segments that have been merged. 
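The commit ordering in the three flush steps above can be sketched with toy file names (no real index format): all new files are written first, and only the final segments_N write makes them visible, so a failure before step 3 leaves the previous commit point intact.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of write-then-commit during flush: steps 1 and 2
// create new files without committing; step 3 (segments_N) commits.
class FlushSketch {
    final List<String> files = new ArrayList<>();   // stands in for the directory
    String commitPoint = "segments_1";

    void flush(boolean failDuringDeletes) {
        files.add("_ram_merged.cfs");               // step 1: write merged ram docs
        if (failDuringDeletes) {                    // step 2: write .delN files
            throw new RuntimeException("simulated I/O error");
        }
        files.add("_3_1.del");
        commitPoint = "segments_2";                 // step 3: commit last
    }
}
```

Readers keep using the old segments_N until step 3 completes, which is why segmentInfos can be updated in memory without being written to disk in the first two steps.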
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12459490 ] Ning Li commented on LUCENE-565: Many versions of the patch were submitted as new code was committed to IndexWriter.java. For each version, all changes made were included in a single patch file. I removed all but the latest version of the patch. Even this one is outdated by the commit of LUCENE-701 (lock-less commits). I was waiting for the commit of LUCENE-702 before submitting another patch. LUCENE-702 was committed this morning. So I'll submit an up-to-date patch over the holidays. On 12/18/06, Paul Elschot (JIRA) <[EMAIL PROTECTED]> wrote: > I'd like to give this a try over the upcoming holidays. That's great! We can discuss/compare the designs then. Or, we can discuss/compare the designs before submitting new patches.
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565: --- Attachment: (was: newMergePolicy.Sept08.patch)
But if our proposal is accepted, we think > those APIs should probably be deprecated. > Coding Changes > -- > Coding changes are localized to IndexWriter. Internally, the new > deleteDocuments() method works by buffering the terms to be deleted. > Deletes are deferred until the ramDirectory is flushed to disk, either > because it becomes full or because the IndexWriter is closed. Using Java > synchronization, care is taken to ensure that an interleaved sequence of > inserts and deletes for the same document are properly serialized. > We have attached a modified version of IndexWriter in Release 1.9.1 with > these changes. Only a few hundred lines of coding changes are needed. All > changes are commented by "CHANGE". We have also attached a modified version > of an example from Chapter 2.2 of Lucene in Action. > Performance Results > --- > To test the performance our proposed changes, we ran some experiments using > the TREC WT 10G dataset. The experiments were run on a dual 2.4 Ghz Intel > Xeon server running Linux. The disk storage was configured as RAID0 array > with 5 drives. Before indexes were built, the input documents were parsed > to remove the HTML from them (i.e., only the text was indexed). This was > done to minimize the impact of parsing on performance. A simple > WhitespaceAnalyzer was used during index build. > We experimented with three workloads: > - Insert only. 1.6M documents were inserted and the final > index size was 2.3GB. > - Insert/delete (big batches). The same documents were > inserted, but 25% were deleted. 1000 documents were > deleted for every 4000 inserted. > - Insert/delete (small batches). In this case, 5 documents > were deleted for every 20 inserted. 
> current current new > Workload IndexWriter IndexModifier IndexWriter > --- > Insert only 116 min 119 min116 min > Insert/delete (big batches) -- 135 min125 min > Insert/delete (small batches) -- 338 min134 min > As the experiments show, with the proposed changes, the performance > improved by 60% when inserts and deletes were interleaved in small batches. > Regards, > Ning > Ning Li > Search Technologies > IBM Almaden Research Center > 650 Harry Road > San Jose, CA 95120 -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For
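The buffered-delete scheme described under "Coding Changes" above can be sketched in plain Java. This is a hypothetical, simplified model (an in-memory list stands in for the on-disk index, and each document is identified by a single term), not Lucene's actual implementation; the class and method names mirror the proposal but are illustrative only:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of buffered deletes: deleteDocuments() only records the term
// together with how many documents were buffered before it, so a document
// inserted AFTER the delete is not wrongly removed. Deletes are applied
// lazily when the buffer is flushed.
class BufferedDeleteWriter {
    private final List<String> ramBuffer = new ArrayList<>();         // docs awaiting flush
    private final Map<String, Integer> deleteTerms = new HashMap<>(); // term -> buffer size at delete time
    private final List<String> onDisk = new ArrayList<>();            // stands in for the on-disk index

    synchronized void addDocument(String doc) {
        ramBuffer.add(doc);
    }

    // Defer the delete: just remember the term and the current buffer position.
    synchronized void deleteDocuments(String term) {
        deleteTerms.put(term, ramBuffer.size());
    }

    // Flush: apply buffered deletes to the on-disk index and to buffered
    // documents that were added before the delete, then append survivors.
    synchronized void flush() {
        onDisk.removeIf(deleteTerms::containsKey);
        for (int i = 0; i < ramBuffer.size(); i++) {
            String doc = ramBuffer.get(i);
            Integer deletedUpTo = deleteTerms.get(doc);
            if (deletedUpTo == null || i >= deletedUpTo) {
                onDisk.add(doc);  // survives: never deleted, or inserted after the delete
            }
        }
        ramBuffer.clear();
        deleteTerms.clear();
    }

    synchronized List<String> documents() {
        return onDisk;
    }
}
```

Recording the buffer position per deleted term is what lets an interleaved delete-then-reinsert of the same document behave as if each operation had been flushed immediately, which is the serialization property the proposal's Java synchronization is protecting.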
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565:
---
Attachment: (was: KeepDocCount0Segment.Sept15.patch)
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565:
---
Attachment: (was: TestWriterDelete.java)
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565:
---
Attachment: (was: NewIndexWriter.July18.patch)
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565:
---
Attachment: (was: NewIndexWriter.Aug23.patch)
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565:
---
Attachment: (was: NewIndexModifier.July09.patch)
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565:
---
Attachment: (was: IndexWriter.patch)
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565:
---
Attachment: (was: IndexWriter.July09.patch)

> Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
> ---------------------------------------------------------------------------------
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Ning Li
> Attachments: IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
>
> Today, applications have to open/close an IndexWriter and open/close an IndexReader directly or indirectly (via IndexModifier) in order to handle a mix of inserts and deletes. This performs well when inserts and deletes come in fairly large batches. However, performance can degrade dramatically when inserts and deletes are interleaved in small batches. This is because the ramDirectory is flushed to disk whenever an IndexWriter is closed, causing a lot of small segments to be created on disk, which eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We are aware that this kind of change has come up in discussions before. See http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049 . The difference this time is that we have implemented the change and tested its performance, as described below.
>
> API Changes
> -----------
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter. Using this method, inserts and deletes can be interleaved using the same IndexWriter.
> Note that, with this change, it would be very easy to add another method to IndexWriter for updating documents, allowing applications to avoid a separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting documents using an IndexReader. But if our proposal is accepted, we think those APIs should probably be deprecated.
>
> Coding Changes
> --------------
> Coding changes are localized to IndexWriter. Internally, the new deleteDocuments() method works by buffering the terms to be deleted. Deletes are deferred until the ramDirectory is flushed to disk, either because it becomes full or because the IndexWriter is closed. Using Java synchronization, care is taken to ensure that an interleaved sequence of inserts and deletes for the same document is properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with these changes. Only a few hundred lines of coding changes are needed. All changes are commented with "CHANGE". We have also attached a modified version of an example from Chapter 2.2 of Lucene in Action.
>
> Performance Results
> -------------------
> To test the performance of our proposed changes, we ran some experiments using the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz Intel Xeon server running Linux. The disk storage was configured as a RAID0 array with 5 drives. Before indexes were built, the input documents were parsed to remove the HTML from them (i.e., only the text was indexed). This was done to minimize the impact of parsing on performance. A simple WhitespaceAnalyzer was used during index build.
> We experimented with three workloads:
> - Insert only. 1.6M documents were inserted and the final index size was 2.3GB.
> - Insert/delete (big batches). The same documents were inserted, but 25% were deleted. 1000 documents were deleted for every 4000 inserted.
> - Insert/delete (small batches). In this case, 5 documents were deleted for every 20 inserted.
>
> Workload                        current IndexWriter   current IndexModifier   new IndexWriter
> ------------------------------  -------------------   ---------------------   ---------------
> Insert only                     116 min               119 min                 116 min
> Insert/delete (big batches)     --                    135 min                 125 min
> Insert/delete (small batches)   --                    338 min                 134 min
>
> As the experiments show, with the proposed changes, the performance improved by 60% when inserts and deletes were interleaved in small batches.
> Regards,
> Ning
> Ning Li
> Search Technologies
> IBM Almaden Research Center
> 650 Harry Road
> San Jose, CA 95120

-- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
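The buffered-delete design in the "Coding Changes" section can be sketched as a toy model: inserts accumulate in a RAM buffer, delete terms are queued, and both are applied together at flush. This is a hypothetical, self-contained illustration of the buffering idea only; the class and method names are invented, it is not the patched Lucene IndexWriter, and the ordering guarantees between interleaved inserts and deletes within one buffer are simplified.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the buffered-delete proposal (NOT actual Lucene code):
// inserts go to a RAM buffer, delete terms are only queued, and both are
// applied when the buffer is flushed, so interleaved small batches never
// force a flush by themselves.
class BufferedDeleteWriter {
    private final List<Map<String, String>> ramDocs = new ArrayList<>();
    private final List<String> bufferedDeleteTerms = new ArrayList<>();
    private final List<Map<String, String>> onDisk = new ArrayList<>();
    private final int maxBufferedDocs;

    BufferedDeleteWriter(int maxBufferedDocs) { this.maxBufferedDocs = maxBufferedDocs; }

    // Inserts and deletes can be interleaved on the same writer.
    synchronized void addDocument(String id, String body) {
        Map<String, String> doc = new HashMap<>();
        doc.put("id", id);
        doc.put("body", body);
        ramDocs.add(doc);
        if (ramDocs.size() >= maxBufferedDocs) flush();  // buffer became full
    }

    // Deletes are buffered as terms and deferred until flush.
    synchronized void deleteDocuments(String idTerm) {
        bufferedDeleteTerms.add(idTerm);
    }

    // Flush applies buffered inserts and deletes together; Java
    // synchronization serializes it against other writer calls.
    synchronized void flush() {
        onDisk.addAll(ramDocs);
        ramDocs.clear();
        onDisk.removeIf(d -> bufferedDeleteTerms.contains(d.get("id")));
        bufferedDeleteTerms.clear();
    }

    synchronized int numDocs() { return onDisk.size() + ramDocs.size(); }
}
```

A caller interleaves addDocument and deleteDocuments freely; no segments are written until flush, which is the performance point the proposal makes.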
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565:
---
Attachment: (was: IndexWriter.java)
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12458205 ] Ning Li commented on LUCENE-565:

> Minor question... in the places that you use Vector, is there a reason you aren't using ArrayList?
> And in methods that pass a Vector, that could be changed to a List.

ArrayList and List can be used, respectively.
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12458158 ] Ning Li commented on LUCENE-565:

> Can the same thing happen with your patch (with a smaller window), or are deletes applied between writing the new segment and writing the new segments file that references it? (hard to tell from current diff in isolation)

No, it does not happen with the patch, no matter what the window size is. This is because the results of flushing RAM - both inserts and deletes - are committed in the same transaction.
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12457865 ] Ning Li commented on LUCENE-565:

> *or* you could choose to do it before a merge of the lowest level on-disk segments. If none of the lowest level segments have deletes, you could even defer the deletes until after all the lowest-level segments have been merged. This makes the deletes more efficient since it goes from O(mergeFactor * log(maxBufferedDocs)) to O(log(mergeFactor * maxBufferedDocs))

I don't think I like these semantics, though. With the semantics in the patch, an update can be easily supported. With these semantics, an insert may be flushed while a delete issued before the insert may or may not have been flushed.

> You are right that other forms of reader caching could increase the footprint, but it's nice to have the option of trading some memory for performance.

Agree. It'd be nice to cache all readers... :-)

Thanks again for your comments. Enjoy your PTO!
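The complexity claim quoted above can be made concrete with sample numbers. The parameter values below are assumptions chosen purely for illustration (mergeFactor=10, maxBufferedDocs=1000); the point is only the relative magnitude of the two expressions.

```java
// Illustrative arithmetic for the quoted complexity claim: applying buffered
// deletes against every low-level segment costs on the order of
// mergeFactor * log2(maxBufferedDocs) term lookups, while deferring them
// until after the merge costs log2(mergeFactor * maxBufferedDocs).
// The sample values are assumptions for illustration only.
public class DeleteCostSketch {
    public static void main(String[] args) {
        int mergeFactor = 10;        // assumed value
        int maxBufferedDocs = 1000;  // assumed value

        double perSegment = mergeFactor * (Math.log(maxBufferedDocs) / Math.log(2));
        double deferred = Math.log((double) mergeFactor * maxBufferedDocs) / Math.log(2);

        System.out.printf("apply per segment: ~%.1f lookups%n", perSegment); // ~99.7
        System.out.printf("deferred to merge: ~%.1f lookups%n", deferred);   // ~13.3
    }
}
```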
[jira] Commented: (LUCENE-702) Disk full during addIndexes(Directory[]) can corrupt index
[ http://issues.apache.org/jira/browse/LUCENE-702?page=comments#action_12457858 ] Ning Li commented on LUCENE-702:

> This is actually intentional: I don't want to write to the same segments_N filename, ever, on the possibility that a reader may be reading it. Admittedly, this should be quite rare (filling up disk and then experiencing contention, only on Windows), but still I wanted to keep "write once" even in this case.

In IndexWriter, the rollbackTransaction call in commitTransaction could cause a write to the same segments_N filename, right? The "write once" semantics is not kept for segment names or .delN files. This is OK because no reader will read the old versions.

> Disk full during addIndexes(Directory[]) can corrupt index
> ----------------------------------------------------------
>
> Key: LUCENE-702
> URL: http://issues.apache.org/jira/browse/LUCENE-702
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: 2.1
> Reporter: Michael McCandless
> Assigned To: Michael McCandless
> Attachments: LUCENE-702.patch, LUCENE-702.take2.patch
>
> This is a spinoff of LUCENE-555.
> If the disk fills up during this call then the committed segments file can reference segments that were not written. Then the whole index becomes unusable.
> Does anyone know of any other cases where disk full could corrupt the index? I think disk full should at worst lose the documents that were "in flight" at the time. It shouldn't corrupt the index.
[jira] Commented: (LUCENE-702) Disk full during addIndexes(Directory[]) can corrupt index
[ http://issues.apache.org/jira/browse/LUCENE-702?page=comments#action_12457520 ] Ning Li commented on LUCENE-702:

It looks good. My two cents:

1. In the two rollbacks in mergeSegments (where inTransaction is false), the segmentInfos' generation is not always rolled back. So something like this could happen: two consecutive successful commits write segments_3 and segments_5, respectively. Nothing is broken, but it'd be nice to roll back completely (even for the IndexWriter instance) when a commit fails.

2. Code serving two purposes is (and has been) mixed in mergeSegments: one purpose is to merge segments and create the compound file if necessary; the other is to commit or roll back when inTransaction is false. It'd be nice if the two could be separated: optimize and maybeMergeSegments call mergeSegmentsAndCommit, which creates a transaction, calls mergeSegments, and commits or rolls back; mergeSegments doesn't deal with commit or rollback. However, currently the non-CFS version is committed first even if useCompoundFile is true. Until that's changed, mergeSegments probably has to continue serving both purposes.
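The rollback concern in point 1 above can be sketched with a toy transaction that snapshots both the segment list and the generation counter at begin, so a failed commit never leaves a gap in the segments_N sequence. This is a hypothetical model with invented names, not the actual IndexWriter transaction code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy transaction over a segment list (NOT actual Lucene code): begin()
// snapshots both the segment list and the generation counter, and rollback()
// restores both, so consecutive successful commits produce consecutive
// segments_N names (segments_1, segments_2, ...), never a skipped generation.
class ToySegmentTransaction {
    private List<String> segments = new ArrayList<>();
    private long generation = 0;

    private List<String> savedSegments;
    private long savedGeneration;

    void begin() {
        savedSegments = new ArrayList<>(segments);
        savedGeneration = generation;
    }

    // A commit writes a brand-new segments_N ("write once": names never reused).
    String commit(List<String> newSegments) {
        segments = new ArrayList<>(newSegments);
        generation++;
        return "segments_" + generation;
    }

    void rollback() {
        segments = savedSegments;       // restore segment list
        generation = savedGeneration;   // restore the generation as well
    }

    long generation() { return generation; }
}
```

With the generation restored on rollback, a failed commit between two successful ones cannot produce the segments_3 / segments_5 gap described in point 1.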
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12452039 ] Ning Li commented on LUCENE-565:

With the recent commits to IndexWriter, this patch no longer applies cleanly. The 5 votes for this issue encourage me to submit yet another patch. :-) But before I do that, I'd like to briefly describe the design again and welcome all suggestions that help improve it and help get it committed. :-)

With the new merge policy committed, the change to IndexWriter is minimal: three zero-or-one-line functions are added and used.
1. timeToFlushRam(): returns true if the number of ram segments >= maxBufferedDocs; used in maybeFlushRamSegments()
2. anythingToFlushRam(): returns true if the number of ram segments > 0; used in flushRamSegments()
3. doAfterFlushRamSegments(): does nothing; called in mergeSegments() if the merge is on ram segments

The new IndexModifier is a subclass of IndexWriter and only overrides the three functions described above.
1. timeToFlushRam(): returns true if the number of ram segments >= maxBufferedDocs OR if the number of buffered deletes >= maxBufferedDeletes
2. anythingToFlushRam(): returns true if the number of ram segments > 0 OR if the number of buffered deletes > 0
3. doAfterFlushRamSegments(): properly flushes buffered deletes

The new IndexModifier supports all APIs from the current IndexModifier except one: deleteDocument(int doc). I had commented on this before: "I deliberately left that one out. This is because document ids are changing as documents are deleted and segments are merged. Users don't know exactly when segments are merged thus ids are changed when using IndexModifier." This behaviour is true for both the new IndexModifier and the current IndexModifier. If this is preventing this patch from getting accepted, I'm willing to add it, but I will detail this in the Javadoc so users of this function are aware of this behaviour.
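The three-hook subclassing design described in this comment can be sketched with toy classes. The method names mirror the comment, but the bodies are invented stand-ins (not Lucene code) whose only purpose is to show how the subclass extends the flush conditions and adds delete handling.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the described design: the base writer exposes three small
// hooks, and the modifier subclass overrides exactly those three to add
// buffered-delete behaviour. Bodies are stand-ins, not Lucene internals.
class ToyIndexWriter {
    protected final List<String> ramSegments = new ArrayList<>();
    protected int maxBufferedDocs = 10;

    protected boolean timeToFlushRam() { return ramSegments.size() >= maxBufferedDocs; }
    protected boolean anythingToFlushRam() { return !ramSegments.isEmpty(); }
    protected void doAfterFlushRamSegments() { /* no-op in the base writer */ }

    void flushRamSegments() {
        if (anythingToFlushRam()) {
            ramSegments.clear();        // pretend ram segments were merged to disk
            doAfterFlushRamSegments();  // hook for subclasses
        }
    }
}

class ToyIndexModifier extends ToyIndexWriter {
    protected final List<String> bufferedDeleteTerms = new ArrayList<>();
    protected int maxBufferedDeletes = 10;
    int deletesApplied = 0;

    void deleteDocuments(String term) { bufferedDeleteTerms.add(term); }

    @Override protected boolean timeToFlushRam() {
        return super.timeToFlushRam() || bufferedDeleteTerms.size() >= maxBufferedDeletes;
    }
    @Override protected boolean anythingToFlushRam() {
        return super.anythingToFlushRam() || !bufferedDeleteTerms.isEmpty();
    }
    @Override protected void doAfterFlushRamSegments() {
        deletesApplied += bufferedDeleteTerms.size();  // "apply" buffered deletes
        bufferedDeleteTerms.clear();
    }
}
```

Because only the three hooks change, the base writer's control flow (maybeFlushRamSegments, flushRamSegments, mergeSegments in the real code) is untouched, which is the point of the design.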
[jira] Commented: (LUCENE-702) Disk full during addIndexes(Directory[]) can corrupt index
[ http://issues.apache.org/jira/browse/LUCENE-702?page=comments#action_12448006 ] Ning Li commented on LUCENE-702:

> I think we should try to make all of the addIndexes calls (and more generally any call to Lucene) "transactional".

Agree. Transactional semantics would be better. The approach you described for the three addIndexes calls looks good.

addIndexes(IndexReader[]) is transactional but has two commits: one when existing segments are merged at the beginning, the other at the end when all segments/readers are merged. addIndexes(Directory[]) can be fixed to have similar behaviour: a first commit when existing segments are merged at the beginning, then one at the end when all old/new segments are merged. addIndexesNoOptimize(Directory[]), on the other hand, does not merge existing segments at the beginning. So when fixed, it will only have one commit at the end, which captures all the changes.
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12447657 ] Ning Li commented on LUCENE-565: [[ Old comment, sent by email on Thu, 6 Jul 2006 07:53:35 -0700 ]] Hi Otis, I will regenerate the patch and add more comments. :-) Regards, Ning "Otis Gospodnetic (JIRA)" <[EMAIL PROTECTED]> To [EMAIL PROTECTED] 07/05/2006 11:25 cc PM Subject [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided) [ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12419396 ] Otis Gospodnetic commented on LUCENE-565: - I took a look at the patch and it looks good to me (anyone else had a look)? Unfortunately, I couldn't get the patch to apply :( $ patch -F3 < IndexWriter.patch (Stripping trailing CRs from patch.) patching file IndexWriter.java Hunk #1 succeeded at 58 with fuzz 1. Hunk #2 succeeded at 112 (offset 2 lines). Hunk #4 succeeded at 504 (offset 33 lines). Hunk #6 succeeded at 605 with fuzz 2 (offset 57 lines). missing header for unified diff at line 259 of patch (Stripping trailing CRs from patch.) can't find file to patch at input line 259 Perhaps you should have used the -p or --strip option? The text leading up to this was: ... ... ... File to patch: IndexWriter.java patching file IndexWriter.java Hunk #1 FAILED at 802. Hunk #2 succeeded at 745 with fuzz 2 (offset -131 lines). 1 out of 2 hunks FAILED -- saving rejects to file IndexWriter.java.rej Would it be possible for you to regenerate the patch against IndexWriter in HEAD? Also, I noticed ^Ms in the patch, but I can take care of those easily (dos2unix). Finally, I noticed in 2-3 places that the simple logging via "infoStream" variable was removed, for example: -if (infoStream != null) infoStream.print("merging segments"); Perhaps this was just an oversight? Looking forward to the new patch. Thanks! 
(An earlier discussion of a batch-deleting version of IndexWriter: http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049 ) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira > Supporting deleteDocuments in IndexWriter (Code and Performance Results > Provided) > - > > Key: LUCENE-565 > URL: http://issues.apache.org/jira/browse/LUCENE-565 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Reporter: Ning Li > Attachments: IndexWriter.java, IndexWriter.July09.patch, > IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, > NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, > NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, > newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, > perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java > > > Today, applications have to open/close an IndexWriter and open/close an > IndexReader directly or indirectly (via IndexModifier) in order to handle a > mix of inserts and deletes. This performs well when inserts and deletes > come in fairly large batches. However, the performance can degrade > dramatically when inserts and deletes are interleaved in small batches. > This is because the ramDirectory is flushed to disk whenever an IndexWriter > is closed, causing a lot of small segments to be created on disk, which > eventually need to be merged.
[jira] Commented: (LUCENE-701) Lock-less commits
[ http://issues.apache.org/jira/browse/LUCENE-701?page=comments#action_12446656 ] Ning Li commented on LUCENE-701: > That wouldn't be considered a failure because it's part of the retry logic. > At that point, an attempt would be made to open seg_2. From the description of the retry logic, I thought it only applied to the loading of the "segments_N" file, not to the entire process of loading all the files of an index. You are right, it wouldn't be a failure if the retry logic is applied to the loading of all the files of an index. > Lock-less commits > - > > Key: LUCENE-701 > URL: http://issues.apache.org/jira/browse/LUCENE-701 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.1 >Reporter: Michael McCandless > Assigned To: Michael McCandless >Priority: Minor > Attachments: index.prelockless.cfs.zip, index.prelockless.nocfs.zip, > lockless-commits-patch.txt > > > This is a patch based on discussion a while back on lucene-dev: > > http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200608.mbox/[EMAIL > PROTECTED] > The approach is a small modification over the original discussion (see > Retry Logic below). It works correctly in all my cross-machine test > cases, but I want to open it up for feedback, testing by > users/developers in more diverse environments, etc. > This is a small change to how lucene stores its index that enables > elimination of the commit lock entirely. The write lock still > remains. > Of the two, the commit lock has been more troublesome for users since > it typically serves an active role in production. Whereas the write > lock is usually more of a design check to make sure you only have one > writer against the index at a time. > The basic idea is that filenames are never reused ("write once"), > meaning, a writer never writes to a file that a reader may be reading > (there is one exception: the segments.gen file; see "RETRY LOGIC" > below). 
Instead it writes to generational files, ie, segments_1, then > segments_2, etc. Besides the segments file, the .del files and norm > files (.sX suffix) are also now generational. A generation is stored > as an "_N" suffix before the file extension (eg, _p_4.s0 is the > separate norms file for segment "p", generation 4). > One important benefit of this is it avoids file contents caching > entirely (the likely cause of errors when readers open an index > mounted on NFS) since the file is always a new file. > With this patch I can reliably instantiate readers over NFS when a > writer is writing to the index. However, with NFS, you are still forced to > refresh your reader once a writer has committed because "point in > time" searching doesn't work over NFS (see LUCENE-673 ). > The changes are fully backwards compatible: you can open an old index > for searching, or to add/delete docs, etc. I've added a new unit test > to test these cases. > All unit tests pass, and I've added a number of additional unit tests, > some of which fail on WIN32 in the current lucene but pass with this > patch. The "fileformats.xml" has been updated to describe the changes > to the files (but XXX references need to be fixed before committing). > There are some other important benefits: > * Readers are now entirely read-only. > * Readers no longer block one another (false contention) on > initialization. > * On hitting contention, we immediately retry instead of a fixed > (default 1.0 second now) pause. > * No file renaming is ever done. File renaming has caused sneaky > access denied errors on WIN32 (see LUCENE-665 ). (Yonik, I used > your approach here to not rename the segments_N file (try > segments_(N-1) on hitting IOException on segments_N): the separate > ".done" file did not work reliably under very high stress testing > when a directory listing was not "point in time"). 
> * On WIN32, you can now call IndexReader.setNorm() even if other > readers have the index open (fixes a pre-existing minor bug in > Lucene). > * On WIN32, you can now create an IndexWriter with create=true even > if readers have the index open (eg see > www.gossamer-threads.com/lists/lucene/java-user/39265) . > Here's an overview of the changes: > * Every commit writes to the next segments_(N+1). > * Loading the segments_N file (& opening the segments) now requires > retry logic. I've captured this logic into a new static class: > SegmentInfos.FindSegmentsFile. All places that need to do > something on the current segments file now use this class. > * No more deletable file. Instead, the writer computes what's > deletable on instantiation and updates this in memory whenever > files can be deleted (ie, when it commits). Created a common > class index.IndexFileDeleter shared by reader & writer, to manage > deletes. > * Storing more information into segments info file: whether it has > separate deletes (and which generation).
[jira] Commented: (LUCENE-701) Lock-less commits
[ http://issues.apache.org/jira/browse/LUCENE-701?page=comments#action_12446638 ] Ning Li commented on LUCENE-701: Can the following scenario happen with lock-less commits? 1. A reader reads segments_1, which says the index contains seg_1. 2. A writer writes segments_2, which says the index now contains seg_2, and deletes seg_1. 3. The reader tries to load seg_1 and fails.
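The scenario in the comment above is what the retry logic is meant to handle: by the time the reader follows segments_1, the writer may already have replaced it. A minimal sketch of the fallback idea follows; the Loader interface and method names are invented for illustration, and the real logic lives in SegmentInfos.FindSegmentsFile.

```java
// Sketch of the segments_N retry idea; Loader and openNewest are invented
// stand-ins, not Lucene's SegmentInfos.FindSegmentsFile API.
public class SegmentsRetry {
    interface Loader { String load(int gen) throws Exception; }

    // Try the newest generation first; if the writer has already removed
    // it, fall back to the previous generation, rethrowing only if every
    // generation fails.
    static String openNewest(int newestGen, Loader loader) throws Exception {
        Exception last = new Exception("no segments generation found");
        for (int gen = newestGen; gen >= 1; gen--) {
            try {
                return loader.load(gen);  // e.g. read "segments_" + gen
            } catch (Exception e) {
                last = e;                 // that generation is gone; retry an older one
            }
        }
        throw last;
    }
}
```

Because filenames are write-once, a fallback generation that still exists is guaranteed to be internally consistent.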
[jira] Commented: (LUCENE-702) Disk full during addIndexes(Directory[]) can corrupt index
[ http://issues.apache.org/jira/browse/LUCENE-702?page=comments#action_12446307 ] Ning Li commented on LUCENE-702: A possible solution to this issue is to check, when writing segment infos to "segments" in directory d, whether the dir of each segment info is d, and write an info only if it is. Suggestions? The following is my comment on this issue from the mailing list, documenting how Lucene can produce an inconsistent index if addIndexes(Directory[]) does not run to completion. "This makes me notice a bug in current addIndexes(Directory[]). In current addIndexes(Directory[]), segment infos in S are added to T's "segmentInfos" upfront. Then segments in S are merged to T several at a time. Every merge is committed with T's "segmentInfos". So if a reader is opened on T while addIndexes(Directory[]) is going on, it could see an inconsistent index."
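The check proposed above amounts to filtering the in-memory segment infos by directory before writing the "segments" file. A minimal sketch, with SegmentInfo as an invented stand-in for Lucene's class and directories represented by plain names:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed check: when writing "segments" in directory d,
// write only the infos whose directory is d. SegmentInfo here is an
// invented stand-in for Lucene's class.
public class SegmentCommit {
    record SegmentInfo(String name, String dir) {}

    static List<SegmentInfo> committable(List<SegmentInfo> infos, String d) {
        List<SegmentInfo> out = new ArrayList<>();
        for (SegmentInfo si : infos) {
            if (si.dir().equals(d)) out.add(si); // skip segments still living in a source directory
        }
        return out;
    }
}
```

A reader opening T mid-addIndexes would then see only segments that actually exist in T, never infos pointing into a source directory S.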
[jira] Updated: (LUCENE-528) Optimization for IndexWriter.addIndexes()
[ http://issues.apache.org/jira/browse/LUCENE-528?page=all ] Ning Li updated LUCENE-528: --- Lucene Fields: [Patch Available] > Optimization for IndexWriter.addIndexes() > - > > Key: LUCENE-528 > URL: http://issues.apache.org/jira/browse/LUCENE-528 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Steven Tamm > Assigned To: Otis Gospodnetic >Priority: Minor > Attachments: AddIndexes.patch, AddIndexesNoOptimize.patch > > > One big performance problem with IndexWriter.addIndexes() is that it has to > optimize the index both before and after adding the segments. When you have > a very large index, to which you are adding batches of small updates, these > calls to optimize make using addIndexes() impossible. It makes parallel > updates very frustrating. > Here is an optimized function that helps out by calling mergeSegments only on > the newly added documents. It will try to avoid calling mergeSegments until > the end, unless you're adding a lot of documents at once. > I also have an extensive unit test that verifies that this function works > correctly if people are interested. I gave it a different name because it > has very different performance characteristics which can make querying take > longer. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-686) Resources not always reclaimed in scorers after each search
[ http://issues.apache.org/jira/browse/LUCENE-686?page=comments#action_12444766 ] Ning Li commented on LUCENE-686: But removing TermDocs.close() will leave IndexInput.close() in a similar half-in/half-out situation: e.g. close() will not be called for freqStream and skipStream in SegmentTermDocs. Yet IndexInput.close() cannot be removed (e.g. FSIndexInput). > Resources not always reclaimed in scorers after each search > --- > > Key: LUCENE-686 > URL: http://issues.apache.org/jira/browse/LUCENE-686 > Project: Lucene - Java > Issue Type: Bug > Components: Search > Environment: All >Reporter: Ning Li > Attachments: ScorerResourceGC.patch > > > Resources are not always reclaimed in scorers after each search. > For example, close() is not always called for term docs in TermScorer. > A test will be attached to show when resources are not reclaimed. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-528) Optimization for IndexWriter.addIndexes()
[ http://issues.apache.org/jira/browse/LUCENE-528?page=all ] Ning Li updated LUCENE-528: --- Attachment: AddIndexesNoOptimize.patch This patch implements addIndexesNoOptimize() following the algorithm described earlier. - The patch is based on the latest version from trunk. - addIndexesNoOptimize() is implemented. The algorithm description is included as a comment and the code is commented. - The patch includes a test called TestAddIndexesNoOptimize which covers all the code in addIndexesNoOptimize(). - maybeMergeSegments() was conservative and checked for more merges only when "upperBound * mergeFactor <= maxMergeDocs". It is changed to check for more merges when "upperBound < maxMergeDocs". - Minor changes in TestIndexWriterMergePolicy to better verify merge invariants. - The patch passes all unit tests. One more comment on the implementation: when we copy un-merged segments from S in step 4, ideally we would simply copy those segments. However, directory does not support copy yet, and the source and the target may each use the compound file format or not. So we use mergeSegments() to copy each segment, which may cause the doc count to change because deleted docs are garbage collected. That case is handled properly.
[jira] Commented: (LUCENE-528) Optimization for IndexWriter.addIndexes()
[ http://issues.apache.org/jira/browse/LUCENE-528?page=comments#action_12443978 ] Ning Li commented on LUCENE-528: I'll submit a patch next week.
[jira] Commented: (LUCENE-528) Optimization for IndexWriter.addIndexes()
[ http://issues.apache.org/jira/browse/LUCENE-528?page=comments#action_12443911 ] Ning Li commented on LUCENE-528: > I think you need to ensure that no segments from the source index "S" remain > after the call, right? Correct, and thanks! So in step 4, in the case where the invariants hold for the last < M segments whose levels are <= h, if some of those segments are from S (not merged in step 3), properly copy them over. Does the algorithm look good otherwise? This makes me notice a bug in the current addIndexes(Directory[]): segment infos in S are added to T's "segmentInfos" upfront, then segments in S are merged to T several at a time, and every merge is committed with T's "segmentInfos". So if a reader is opened on T while addIndexes() is going on, it could see an inconsistent index.
[jira] Commented: (LUCENE-528) Optimization for IndexWriter.addIndexes()
[ http://issues.apache.org/jira/browse/LUCENE-528?page=comments#action_12443723 ] Ning Li commented on LUCENE-528: We want a robust algorithm for the version of addIndexes() which does not call optimize(). The robustness can be expressed as the two invariants guaranteed by the merge policy for adding documents (if mergeFactor M does not change and segment doc count is not reaching maxMergeDocs): B for maxBufferedDocs, f(n) defined as ceil(log_M(ceil(n/B))) 1: If i (left*) and i+1 (right*) are two consecutive segments of doc counts x and y, then f(x) >= f(y). 2: The number of committed segments on the same level (f(n)) <= M. References are at http://www.gossamer-threads.com/lists/lucene/java-dev/35147, LUCENE-565 and LUCENE-672. AddIndexes() can be viewed as adding a sequence of segments S to a sequence of segments T. Segments in T follow the invariants but segments in S may not since they could come from multiple indexes. Here is the merge algorithm for addIndexes(): 1. Flush ram segments. 2. Consider a combined sequence with segments from T followed by segments from S (same as current addIndexes()). 3. Assume the highest level for segments in S is h. Call maybeMergeSegments(), but instead of starting w/ lowerBound = -1 and upperBound = maxBufferedDocs, start w/ lowerBound = -1 and upperBound = upperBound of level h. After this, the invariants are guaranteed except for the last < M segments whose levels <= h. 4. If the invariants hold for the last < M segments whose levels <= h, done. Otherwise, simply merge those segments. If the merge results in a segment of level <= h, done. Otherwise, it's of level h+1 and call maybeMergeSegments() starting w/ upperBound = upperBound of level h+1. Suggestions? 
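The level function and the two invariants from the comment above can be written down directly. A sketch under the comment's notation (B = maxBufferedDocs, M = mergeFactor); the class and method names are invented, not the patch's code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the level function from the comment above:
// f(n) = ceil(log_M(ceil(n/B))), with B = maxBufferedDocs, M = mergeFactor.
// Class and method names are invented for illustration.
public class MergeInvariants {
    static int level(int n, int B, int M) {
        int buffers = (n + B - 1) / B;                 // ceil(n / B)
        return (int) Math.ceil(Math.log(buffers) / Math.log(M));
    }

    // Invariant 1: levels are non-increasing from left to right.
    // Invariant 2: at most M committed segments share a level.
    static boolean invariantsHold(int[] docCounts, int B, int M) {
        Map<Integer, Integer> perLevel = new HashMap<>();
        int prev = Integer.MAX_VALUE;
        for (int n : docCounts) {
            int l = level(n, B, M);
            if (l > prev) return false;                // invariant 1 violated
            prev = l;
            if (perLevel.merge(l, 1, Integer::sum) > M) return false; // invariant 2 violated
        }
        return true;
    }
}
```

For example, with B = 10 and M = 10, segments of 1000, 100, and 10 docs sit at levels 2, 1, and 0, which satisfies both invariants; reversing the order violates invariant 1.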
[jira] Commented: (LUCENE-686) Resources not always reclaimed in scorers after each search
[ http://issues.apache.org/jira/browse/LUCENE-686?page=comments#action_12442987 ] Ning Li commented on LUCENE-686: > Is there an actual memory leak problem related to this? Right now, no. For example, in FS-based directories, the index inputs that term docs use are clones, and close() on a cloned index input does not close the file descriptor. Only the original one does. However, a leak could happen with a new subclass of directory and index input whose cloned instances require reclaiming resources, or with a new subclass of scorer that holds resources which should be reclaimed when done. > In ReqExclScorer the two scorers can also be closed when they are set to > null. Thanks for pointing this out. I'll double-check all scorers and make sure close() is properly called. > It's probably better to use try/finally in IndexSearcher and call close in > the finally clause; > exceptions are occasionally used to prematurely end a search, although not in > the > lucene core afaik. Will do. Thanks again! Cheers, Ning
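The try/finally suggestion above reduces to the following pattern; Scorer here is a minimal invented stand-in interface, not Lucene's class:

```java
// Sketch of the try/finally cleanup discussed above: resources are
// released even when a search ends early via an exception. Scorer is an
// invented stand-in, not Lucene's class.
public class SafeSearch {
    interface Scorer {
        void score() throws Exception; // may throw to end the search early
        void close();                  // reclaims term docs, streams, etc.
    }

    static void search(Scorer scorer) throws Exception {
        try {
            scorer.score();
        } finally {
            scorer.close();            // runs whether or not score() threw
        }
    }
}
```

Placing close() in the finally clause covers both normal completion and the case where an exception is used to end a search prematurely.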
[jira] Updated: (LUCENE-686) Resources not always reclaimed in scorers after each search
[ http://issues.apache.org/jira/browse/LUCENE-686?page=all ] Ning Li updated LUCENE-686: --- Attachment: ScorerResourceGC.patch A patch is attached: - The patch is based on the latest version from trunk. - The patch includes a test called TestScorerResourceGC which shows that resources are not reclaimed after each search without the patch. - The patch passes TestScorerResourceGC. - The patch passes all the unit tests.
[jira] Created: (LUCENE-686) Resources not always reclaimed in scorers after each search
Resources not always reclaimed in scorers after each search --- Key: LUCENE-686 URL: http://issues.apache.org/jira/browse/LUCENE-686 Project: Lucene - Java Issue Type: Bug Components: Search Environment: All Reporter: Ning Li Resources are not always reclaimed in scorers after each search. For example, close() is not always called for term docs in TermScorer. A test will be attached to show when resources are not reclaimed. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565: --- Attachment: NewIndexModifier.Sept21.patch This is to update the delete-support patch after the commit of the new merge policy. - Very few changes to IndexWriter. - The patch passes all tests. - A new test called TestNewIndexModifierDelete is added to show different scenarios when using delete methods in NewIndexModifier. > Supporting deleteDocuments in IndexWriter (Code and Performance Results > Provided) > - > > Key: LUCENE-565 > URL: http://issues.apache.org/jira/browse/LUCENE-565 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Reporter: Ning Li > Attachments: IndexWriter.java, IndexWriter.July09.patch, > IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, > NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, > NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, > newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, > perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java > > > Today, applications have to open/close an IndexWriter and open/close an > IndexReader directly or indirectly (via IndexModifier) in order to handle a > mix of inserts and deletes. This performs well when inserts and deletes > come in fairly large batches. However, the performance can degrade > dramatically when inserts and deletes are interleaved in small batches. > This is because the ramDirectory is flushed to disk whenever an IndexWriter > is closed, causing a lot of small segments to be created on disk, which > eventually need to be merged. > We would like to propose a small API change to eliminate this problem. We > are aware that this kind of change has come up in discussions before. See > http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049 > . The difference this time is that we have implemented the change and > tested its performance, as described below. 
> API Changes
> ---
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter. Using this method, inserts and deletes can be interleaved using the same IndexWriter.
> Note that, with this change, it would be very easy to add another method to IndexWriter for updating documents, allowing applications to avoid a separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting documents using an IndexReader. But if our proposal is accepted, we think those APIs should probably be deprecated.
> Coding Changes
> --
> Coding changes are localized to IndexWriter. Internally, the new deleteDocuments() method works by buffering the terms to be deleted. Deletes are deferred until the ramDirectory is flushed to disk, either because it becomes full or because the IndexWriter is closed. Using Java synchronization, care is taken to ensure that an interleaved sequence of inserts and deletes for the same document is properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with these changes. Only a few hundred lines of coding changes are needed. All changes are commented by "CHANGE". We have also attached a modified version of an example from Chapter 2.2 of Lucene in Action.
> Performance Results
> ---
> To test the performance of our proposed changes, we ran some experiments using the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz Intel Xeon server running Linux. The disk storage was configured as a RAID0 array with 5 drives. Before indexes were built, the input documents were parsed to remove the HTML from them (i.e., only the text was indexed). This was done to minimize the impact of parsing on performance. A simple WhitespaceAnalyzer was used during index build.
> We experimented with three workloads:
> - Insert only. 1.6M documents were inserted and the final index size was 2.3GB.
> - Insert/delete (big batches).
The same documents were
> inserted, but 25% were deleted. 1000 documents were deleted for every 4000 inserted.
> - Insert/delete (small batches). In this case, 5 documents were deleted for every 20 inserted.
>
>                                  current       current        new
>  Workload                        IndexWriter   IndexModifier  IndexWriter
>  ------------------------------------------------------------------------
>  Insert only                     116 min       119 min        116 min
>  Insert/delete (big batches)     --            135 min        125 min
>  Insert/delete (small batches)   --            338 min        134 min
>
> As the experiments show, with the proposed changes, the performance improved by 60% when inserts and deletes were interleaved in small batches.
[jira] Commented: (LUCENE-672) new merge policy
[ http://issues.apache.org/jira/browse/LUCENE-672?page=comments#action_12435571 ] Ning Li commented on LUCENE-672:

> Should lowerBound start off as -1 in maybeMergeSegments if we keep 0 sized segments?

Good catch! Although the rightmost disk segment cannot be a 0-sized segment right now, it could be once NewIndexModifier is in. Should I submit a new patch?

> new merge policy
> ---
>
> Key: LUCENE-672
> URL: http://issues.apache.org/jira/browse/LUCENE-672
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.0.0
> Reporter: Yonik Seeley
> Assigned To: Yonik Seeley
> Fix For: 2.1
>
> New merge policy developed in the course of http://issues.apache.org/jira/browse/LUCENE-565
> http://issues.apache.org/jira/secure/attachment/12340475/newMergePolicy.Sept08.patch

-- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-672) new merge policy
[ http://issues.apache.org/jira/browse/LUCENE-672?page=comments#action_12435174 ] Ning Li commented on LUCENE-672:

A small fix named KeepDocCount0Segment.Sept15.patch is attached to LUCENE-565 (can't attach here). In mergeSegments(...), if the doc count of a merged segment is 0, it is not added to the index (it should be properly cleaned up). Before LUCENE-672, a merged segment was always added to the index. The use of mergeSegments(...) in, e.g., addIndexes(Directory[]) assumed that behaviour. For code simplicity, this fix restores the old behaviour that a merged segment is always added to the index. This does NOT break any of the good properties of the new merge policy. TestIndexWriterMergePolicy is slightly modified to fix a bug and to check that segments are properly cleaned up. The patch passes all the tests.
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565: --- Attachment: newMergePolicy.Sept08.patch

This patch features the new, more robust merge policy. A reference for the new policy is at http://www.gossamer-threads.com/lists/lucene/java-dev/35147
- The patch passes all the tests except one in TestIndexModifier (see an earlier comment on this issue).
- Since the test itself has a problem, it is fixed (a one-line change) and the patch passes the fixed test.
- A new test called TestIndexWriterMergePolicy is included which shows the robustness of the new merge policy.

The following is a detailed description of the new merge policy and its properties.

Overview of the merge policy:

A flush is triggered either by close() or by the number of ram segments reaching maxBufferedDocs. After a disk segment is created by the flush, further merges may be triggered. lowerBound and upperBound set the limits on the doc count of a segment which may be merged. Initially, lowerBound is set to 0 and upperBound to maxBufferedDocs. Starting from the rightmost* segment whose doc count > lowerBound and <= upperBound, count the number of consecutive segments whose doc count <= upperBound.

Case 1: number of worthy segments < mergeFactor. No merge, done.

Case 2: number of worthy segments == mergeFactor. Merge these segments. If the doc count of the merged segment <= upperBound, done. Otherwise, set lowerBound to upperBound, multiply upperBound by mergeFactor, and go through the process again.

Case 3: number of worthy segments > mergeFactor (possible when mergeFactor M changes). Merge the leftmost* M segments. If the doc count of the merged segment <= upperBound, consider the merged segment for further merges on this same level. Merge the now leftmost* M segments, and so on, until the number of worthy segments < mergeFactor. If the doc counts of all the merged segments <= upperBound, done.
Otherwise, set lowerBound to upperBound, multiply upperBound by mergeFactor, and go through the process again. Note that case 2 can be considered a special case of case 3.

This merge policy guarantees two invariants if M does not change and the segment doc count does not reach maxMergeDocs. With B for maxBufferedDocs and f(n) defined as ceil(log_M(ceil(n/B))):
1: If i (left*) and i+1 (right*) are two consecutive segments of doc counts x and y, then f(x) >= f(y).
2: The number of committed segments on the same level (f(n)) <= M.

> Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
> ---
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, IndexWriter.patch, NewIndexModifier.July09.patch, NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
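The staircase behaviour described above can be sketched with a small stand-alone simulation (a hypothetical doc-count model with example values B=10 and M=4; this is not code from the patch):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone simulation of the merge policy described above.
// Segments are represented only by their doc counts, oldest first.
class MergePolicySketch {
    final int B; // maxBufferedDocs
    final int M; // mergeFactor
    final List<Integer> segments = new ArrayList<>();

    MergePolicySketch(int maxBufferedDocs, int mergeFactor) {
        B = maxBufferedDocs;
        M = mergeFactor;
    }

    // a flush turns B buffered ram docs into one disk segment,
    // after which merges may cascade level by level
    void flush() {
        segments.add(B);
        int upperBound = B;
        while (true) {
            // "worthy" segments: the trailing run with doc count <= upperBound
            int start = segments.size();
            while (start > 0 && segments.get(start - 1) <= upperBound) start--;
            if (segments.size() - start < M) return; // Case 1: done
            // Case 2/3: merge the leftmost M worthy segments into one
            int merged = 0;
            for (int i = 0; i < M; i++) merged += segments.remove(start);
            segments.add(start, merged);
            if (merged > upperBound) upperBound *= M; // escalate one level
        }
    }
}
```

With B=10 and M=4, 64 flushes cascade into a single 640-doc segment, while intermediate states such as [160, 40, 10] illustrate the invariants: levels are non-increasing left to right, with at most M segments per level.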
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12430130 ] Ning Li commented on LUCENE-565:

Doron, thank you very much for the review! I want to briefly comment on one of your comments:

> (5) deleteDocument(int doc) not implemented

I deliberately left that one out. This is because document ids change as documents are deleted and segments are merged. Users of IndexModifier don't know exactly when segments are merged, and thus when ids change. Therefore I don't think it should be supported in IndexModifier at all.

> Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
> ---
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, IndexWriter.patch, NewIndexModifier.July09.patch, NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, TestWriterDelete.java
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565: --- Attachment: NewIndexWriter.Aug23.patch

> Yes I am including this patch as it is very useful for increasing the efficiency of updates as you described. I will be conducting more tests and will post any results. Yes a patch for IndexWriter will be useful so that the entirety of this build will work. Thanks!

I've attached a patch that works with the current code. The implementation of IndexWriter and NewIndexModifier is the same as in the last patch. I removed the "singleDocSegmentsCount" optimization from this patch since my IndexWriter checks singleDocSegmentsCount by simply calling ramSegmentInfos.size().

This patch has evolved with the help of many good discussions (thanks!) since it came out in May. Here is the current state of the patch:
- This patch aims at enabling users to do inserts and general deletes (delete-by-term, and later delete-by-query) without switching between writers and readers.
- The goal is achieved by rewriting IndexWriter in such a way that semantically it's the same as before, but it provides extension points so that delete-by-term, delete-by-query, and more functionalities can be easily supported in a subclass.
- NewIndexModifier extends IndexWriter and supports delete-by-term by simply overriding two methods: toFlushRamSegments(), which decides if a flush should happen, and doAfterFlushRamSegments(), which does the proper work after a flush is done.

Suggestions are welcome! Especially those that may help it get committed.
:-)

> Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
> ---
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, IndexWriter.patch, NewIndexModifier.July09.patch, NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, TestWriterDelete.java
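The extension-point design can be illustrated with a minimal template-method sketch (hypothetical simplified classes; documents and delete terms are plain strings here, and the ordering subtlety between interleaved adds and deletes is deliberately ignored):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the described extension-point design; not the actual patch.
class SketchIndexWriter {
    protected final List<String> ramDocs = new ArrayList<>();
    protected final List<String> diskDocs = new ArrayList<>();
    protected int maxBufferedDocs = 3;

    void addDocument(String doc) {
        ramDocs.add(doc);
        if (toFlushRamSegments()) flushRamSegments();
    }

    // extension point: should the buffered ram segments be flushed now?
    protected boolean toFlushRamSegments() {
        return ramDocs.size() >= maxBufferedDocs;
    }

    // extension point: extra work (e.g. applying deletes) after a flush
    protected void doAfterFlushRamSegments() { }

    void flushRamSegments() {
        diskDocs.addAll(ramDocs);
        ramDocs.clear();
        doAfterFlushRamSegments();
    }
}

// Delete-by-term support by overriding only the two hooks.
class SketchIndexModifier extends SketchIndexWriter {
    private final List<String> bufferedDeleteTerms = new ArrayList<>();
    private int maxBufferedDeleteTerms = 2;

    void deleteDocuments(String term) {
        bufferedDeleteTerms.add(term);
        if (toFlushRamSegments()) flushRamSegments();
    }

    @Override
    protected boolean toFlushRamSegments() {
        // flush when enough docs OR enough delete terms are buffered
        return super.toFlushRamSegments()
                || bufferedDeleteTerms.size() >= maxBufferedDeleteTerms;
    }

    @Override
    protected void doAfterFlushRamSegments() {
        diskDocs.removeAll(bufferedDeleteTerms); // "apply" buffered deletes
        bufferedDeleteTerms.clear();
    }
}
```

The base class never mentions deletes; the subclass changes only when a flush happens and what happens afterwards, which is the design the comment describes.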
[jira] Commented: (LUCENE-528) Optimization for IndexWriter.addIndexes()
[ http://issues.apache.org/jira/browse/LUCENE-528?page=comments#action_12428478 ] Ning Li commented on LUCENE-528:

In an email thread titled "LUCENE-528 and 565", I described a weakness of the proposed solution: "I'm totally for a version of addIndexes() where optimize() is not always called. However, with the one proposed in the patch, we could end up with an index where: segment 0 has 1000 docs, 1 has 2000, 2 has 4000, 3 has 8000, etc., while Lucene desires the reverse. Or we could have a sandwich index where: segment 0 has 4000 docs, 1 has 100, 2 has 100, 3 has 4000. While neither of these will occur if you use addIndexesNoOpt() carefully, there should be a more robust merge policy."

Here is an alternative solution which merges segments so that the docCount of segment i is at least twice as big as the docCount of segment i+1. If we are willing to make it a bit more complicated, we can take the merge factor into consideration.

public synchronized void addIndexesNoOpt(Directory[] dirs) throws IOException {
  for (int i = 0; i < dirs.length; i++) {
    SegmentInfos sis = new SegmentInfos(); // read infos from dir
    sis.read(dirs[i]);
    for (int j = 0; j < sis.size(); j++) {
      segmentInfos.addElement(sis.info(j)); // add each info
    }
  }

  int start = 0;
  int docCountFromStart = docCount();

  while (start < segmentInfos.size()) {
    int end;
    int docCountToMerge = 0;

    if (docCountFromStart <= minMergeDocs) {
      // if the total docCount of the remaining segments
      // is <= minMergeDocs, merge all of them
      end = segmentInfos.size() - 1;
      docCountToMerge = docCountFromStart;
    } else {
      // otherwise, merge some segments so that the docCount
      // of these segments is at least half of the remaining
      for (end = start; end < segmentInfos.size(); end++) {
        docCountToMerge += segmentInfos.info(end).docCount;
        if (docCountToMerge >= docCountFromStart / 2) {
          break;
        }
      }
    }

    mergeSegments(start, end + 1);
    start++;
    docCountFromStart -= docCountToMerge;
  }
}

> Optimization for IndexWriter.addIndexes()
> ---
>
> Key: LUCENE-528
> URL: http://issues.apache.org/jira/browse/LUCENE-528
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Steven Tamm
> Assigned To: Otis Gospodnetic
> Priority: Minor
> Attachments: AddIndexes.patch
>
> One big performance problem with IndexWriter.addIndexes() is that it has to optimize the index both before and after adding the segments. When you have a very large index, to which you are adding batches of small updates, these calls to optimize make using addIndexes() impossible. It makes parallel updates very frustrating.
> Here is an optimized function that helps out by calling mergeSegments only on the newly added documents. It will try to avoid calling mergeSegments until the end, unless you're adding a lot of documents at once.
> I also have an extensive unit test that verifies that this function works correctly if people are interested. I gave it a different name because it has very different performance characteristics which can make querying take longer.
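The merge schedule of the proposed addIndexesNoOpt() can be traced on plain integers (a hypothetical stand-alone model using no Lucene classes): each pass collapses a leading run of segments holding at least half of the remaining docs, so merged segment sizes fall off roughly geometrically.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone trace of the merge schedule in the addIndexesNoOpt()
// sketch above, operating on segment doc counts only.
class AddIndexesTrace {
    static List<Integer> schedule(List<Integer> counts, int minMergeDocs) {
        List<Integer> segs = new ArrayList<>(counts);
        int start = 0;
        int docCountFromStart = 0;
        for (int c : segs) docCountFromStart += c;
        while (start < segs.size()) {
            int end;
            int docCountToMerge = 0;
            if (docCountFromStart <= minMergeDocs) {
                // merge all remaining segments
                end = segs.size() - 1;
                docCountToMerge = docCountFromStart;
            } else {
                // merge enough leading segments to cover half the remaining docs
                for (end = start; end < segs.size(); end++) {
                    docCountToMerge += segs.get(end);
                    if (docCountToMerge >= docCountFromStart / 2) break;
                }
            }
            // "mergeSegments(start, end + 1)": collapse the range into one segment
            for (int i = end; i >= start; i--) segs.remove(i);
            segs.add(start, docCountToMerge);
            start++;
            docCountFromStart -= docCountToMerge;
        }
        return segs;
    }
}
```

Starting from eight 100-doc segments with minMergeDocs=100, the schedule yields [400, 200, 100, 100]: each segment is at least as large as the one to its right, the property the comment argues for.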
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565: --- Attachment: NewIndexWriter.July18.patch

Hopefully, third time's a charm. :-) I rewrote IndexWriter in such a way that semantically it's the same as before, but it provides extension points so that delete-by-term, delete-by-query, and more functionalities can be easily supported in a subclass. NewIndexModifier is such a subclass that supports delete-by-term. Here is an overview of the changes:

Changes to IndexWriter

Changes to IndexWriter variables:
- segmentInfos used to store the info of all segments (on disk or in ram). Now it only stores the info of segments on disk.
- ramSegmentInfos is a new variable which stores the info of just the ram segments.

Changes to IndexWriter methods:
- addDocument(): The info of the new ram segment is added to ramSegmentInfos.
- maybeMergeSegments(): toFlushRamSegments() is called at the beginning to decide whether a flush should take place.
- flushRamSegments(): doAfterFlushRamSegments() is called after all ram segments are merged and flushed to disk.

NewIndexModifier

New variables:
- bufferedDeleteTerms is a new variable which buffers delete terms before they are applied.
- maxBufferedDeleteTerms is similar to maxBufferedDocs. It controls the max number of delete terms that can be buffered before they must be flushed to disk.

Overloaded/new methods:
- deleteDocuments(), batchDeleteDocuments(): The terms are added to bufferedDeleteTerms. bufferedDeleteTerms also records the current number of documents buffered in ram, so the delete terms can be applied to ram segments as well as to the segments on disk.
- toFlushRamSegments(): In IndexWriter, a flush would be triggered only if enough documents were buffered. Now a flush is triggered if enough documents are buffered OR if enough delete terms are buffered.
- doAfterFlushRamSegments():
  Step 1: Apply buffered delete terms to all the segments on disk.
  Step 2: Apply buffered delete terms to the new segment appropriately, so that a delete term is only applied to the documents buffered before it, but not to those buffered after it.
  Step 3: Clean up the buffered delete terms.

> Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
> ---
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, IndexWriter.patch, NewIndexModifier.July09.patch, NewIndexWriter.July18.patch, TestWriterDelete.java
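Step 2 can be sketched as follows (a hypothetical simplified model: documents and terms are plain strings, and each buffered delete term records how many ram docs were buffered before it, so on flush the term deletes only those earlier docs, never later ones):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of applying buffered delete terms to buffered docs in order;
// not the actual patch, which works on SegmentInfos and Terms.
class BufferedDeletesSketch {
    final List<String> ramDocs = new ArrayList<>();            // doc keys, in buffer order
    final Map<String, Integer> bufferedDeleteTerms = new LinkedHashMap<>();

    void addDocument(String key) { ramDocs.add(key); }

    void deleteDocuments(String term) {
        // remember how many docs were buffered before this delete term
        bufferedDeleteTerms.put(term, ramDocs.size());
    }

    // flush: a doc at position i is deleted only if a matching term was
    // buffered after it, i.e. the term's recorded doc count exceeds i
    List<String> flush() {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < ramDocs.size(); i++) {
            String key = ramDocs.get(i);
            Integer docLimit = bufferedDeleteTerms.get(key);
            if (docLimit == null || docLimit <= i) kept.add(key);
        }
        ramDocs.clear();
        bufferedDeleteTerms.clear();
        return kept;
    }
}
```

Deleting term "a" between two adds of an "a" document removes only the first one, which is exactly the serialization property the comment describes.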
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565: --- Attachment: IndexWriter.July09.patch NewIndexModifier.July09.patch

Hi Otis, I've attached two patch files:
- IndexWriter.July09.patch is an updated version of the old patch.
- NewIndexModifier.July09.patch makes minimal changes to IndexWriter and puts the new functionalities in a new class called NewIndexModifier. I didn't name it IndexModifier because the two are unrelated and I don't want a diff of the two.

All unit tests succeeded except the following one:

[junit] Testcase: testIndex(org.apache.lucene.index.TestIndexModifier): FAILED
[junit] expected:<3> but was:<4>
[junit] junit.framework.AssertionFailedError: expected:<3> but was:<4>
[junit] at org.apache.lucene.index.TestIndexModifier.testIndex(TestIndexModifier.java:67)

However, the unit test has a problem, not the patch: IndexWriter's docCount() does not tell the actual number of documents in an index; only IndexReader's numDocs() does. For example, in a similar test below, where 10 documents are added, then 1 deleted, then 2 added, the last call to docCount() returns 12, not 11, with or without the patch.
public void testIndexSimple() throws IOException {
  Directory ramDir = new RAMDirectory();
  IndexModifier i = new IndexModifier(ramDir, new StandardAnalyzer(), true);
  // add 10 documents initially
  for (int count = 0; count < 10; count++) {
    i.addDocument(getDoc());
  }
  i.flush();
  i.optimize();
  assertEquals(10, i.docCount());
  i.deleteDocument(0);
  i.flush();
  assertEquals(9, i.docCount());
  i.addDocument(getDoc());
  i.addDocument(getDoc());
  i.flush();
  assertEquals(12, i.docCount());
}

The reason for the docCount() difference in the unit test (which does not affect the correctness of the patch) is that flushRamSegments() in the patch merges all and only the segments in ram and writes them to disk, whereas the original flushRamSegments() merges not only the segments in ram but *sometimes* also one segment from disk (see in that function the comment "// add one FS segment?").

Regards, Ning

> Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
> ---
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
> Type: Bug
> Components: Index
> Reporter: Ning Li
> Attachments: IndexWriter.July09.patch, IndexWriter.java, IndexWriter.patch, NewIndexModifier.July09.patch, TestWriterDelete.java