[jira] Commented: (LUCENE-1541) Trie range - make trie range indexing more flexible

2009-03-02 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678045#action_12678045
 ] 

Ning Li commented on LUCENE-1541:
-

An index size comparison would be great.

> Trie range - make trie range indexing more flexible
> ---
>
> Key: LUCENE-1541
> URL: https://issues.apache.org/jira/browse/LUCENE-1541
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.9
>Reporter: Ning Li
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1541.patch, LUCENE-1541.patch
>
>
> In the current trie range implementation, a single precision step is 
> specified. With a large precision step (say 8), a value is indexed in fewer 
> terms (8) but the number of terms for a range can be large. With a small 
> precision step (say 2), the number of terms for a range is smaller but a 
> value is indexed in more terms (32).
> We want to add an option that different precision steps can be set for 
> different precisions. An expert can use this option to keep the number of 
> terms for a range small and at the same time index a value in a small number 
> of terms. See the discussion in LUCENE-1470 that results in this issue.
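The tradeoff in the description can be made concrete with a small sketch (plain Java; class and method names are hypothetical, and the range bound is only the standard worst-case estimate, not a measured figure): with a single precision step p, a 64-bit value is indexed in ceil(64/p) terms, while each precision level can contribute up to 2 * (2^p - 1) terms to a range query (a boundary run on each end).

```java
public class TrieTradeoff {
    // Terms used to index one 64-bit value with a single precision step.
    static int termsPerValue(int step) {
        return (64 + step - 1) / step;                // ceil(64 / step)
    }

    // Rough worst case for a range query: each of the termsPerValue(step)
    // precision levels can contribute up to 2 * (2^step - 1) terms.
    static long maxRangeTerms(int step) {
        return termsPerValue(step) * 2L * ((1L << step) - 1);
    }

    public static void main(String[] args) {
        // step 8: few terms per value, but ranges may need many terms
        System.out.println(termsPerValue(8) + " / " + maxRangeTerms(8));  // 8 / 4080
        // step 2: many terms per value, but ranges stay small
        System.out.println(termsPerValue(2) + " / " + maxRangeTerms(2));  // 32 / 192
    }
}
```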

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1541) Trie range - make trie range indexing more flexible

2009-02-20 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675390#action_12675390
 ] 

Ning Li commented on LUCENE-1541:
-

When a single precision step is given, it is converted to that representation, so 
no array creation is necessary. But something like TrieUtils.FieldConfiguration 
would be better. Besides the field name and the precision steps, it should either 
also contain a type (long/int) or have a subclass for each type. It can be used 
both at indexing time and at query time.
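A minimal sketch of the FieldConfiguration idea described above (the class did not exist at the time; everything here, including the validation rule that steps must sum to the type width, is an assumption for illustration):

```java
// Hypothetical sketch of TrieUtils.FieldConfiguration: a field name,
// per-level precision steps, and a value type, shared by indexing and
// query code so both sides agree on the trie layout.
public class FieldConfiguration {
    public enum Type { INT, LONG }

    public final String field;
    public final Type type;
    public final int[] precisionSteps;   // one entry per precision level

    public FieldConfiguration(String field, Type type, int... precisionSteps) {
        int total = 0;
        for (int s : precisionSteps) total += s;
        int bits = (type == Type.LONG) ? 64 : 32;
        if (total != bits)               // assumed invariant: steps cover all bits
            throw new IllegalArgumentException("steps must sum to " + bits);
        this.field = field;
        this.type = type;
        this.precisionSteps = precisionSteps.clone();
    }
}
```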

> Trie range - make trie range indexing more flexible
> ---
>
> Key: LUCENE-1541
> URL: https://issues.apache.org/jira/browse/LUCENE-1541
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.9
>Reporter: Ning Li
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1541.patch
>
>
> In the current trie range implementation, a single precision step is 
> specified. With a large precision step (say 8), a value is indexed in fewer 
> terms (8) but the number of terms for a range can be large. With a small 
> precision step (say 2), the number of terms for a range is smaller but a 
> value is indexed in more terms (32).
> We want to add an option that different precision steps can be set for 
> different precisions. An expert can use this option to keep the number of 
> terms for a range small and at the same time index a value in a small number 
> of terms. See the discussion in LUCENE-1470 that results in this issue.




[jira] Commented: (LUCENE-1541) Trie range - make trie range indexing more flexible

2009-02-19 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675212#action_12675212
 ] 

Ning Li commented on LUCENE-1541:
-

If you are *really* concerned about the additional loop and the additional array 
allocations, a long can be used to represent the precision steps. For example, 
precision steps 2-2-2-2-8-8-8-8-8-16 are represented as 0x80008080808080aa. 
Then bitCount, shift and numberOfTrailingZeros can be used to determine the 
length of the trie array and the individual precision steps. Hmm, do we still 
have to support Java 1.4?
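One way to read this encoding (an assumption, since the comment does not spell it out): bit i of the packed long marks a step boundary after bit position i+1, so Long.bitCount gives the number of steps and repeated Long.numberOfTrailingZeros recovers each step width. (Both methods are Java 5 additions, hence the Java 1.4 remark.)

```java
public class PrecisionSteps {
    // Decode a packed long where a set bit at index i marks a precision-step
    // boundary after bit position i+1.
    static int[] decode(long packed) {
        int n = Long.bitCount(packed);           // number of precision steps
        int[] steps = new int[n];
        int prev = 0;
        long rest = packed;
        for (int i = 0; i < n; i++) {
            int boundary = Long.numberOfTrailingZeros(rest) + 1;
            steps[i] = boundary - prev;          // width of this step
            prev = boundary;
            rest &= rest - 1;                    // clear lowest set bit
        }
        return steps;
    }

    public static void main(String[] args) {
        int[] s = decode(0x80008080808080aaL);
        System.out.println(java.util.Arrays.toString(s));
        // [2, 2, 2, 2, 8, 8, 8, 8, 8, 16]
    }
}
```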

> Trie range - make trie range indexing more flexible
> ---
>
> Key: LUCENE-1541
> URL: https://issues.apache.org/jira/browse/LUCENE-1541
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.9
>Reporter: Ning Li
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1541.patch
>
>
> In the current trie range implementation, a single precision step is 
> specified. With a large precision step (say 8), a value is indexed in fewer 
> terms (8) but the number of terms for a range can be large. With a small 
> precision step (say 2), the number of terms for a range is smaller but a 
> value is indexed in more terms (32).
> We want to add an option that different precision steps can be set for 
> different precisions. An expert can use this option to keep the number of 
> terms for a range small and at the same time index a value in a small number 
> of terms. See the discussion in LUCENE-1470 that results in this issue.




[jira] Created: (LUCENE-1541) Trie range - make trie range indexing more flexible

2009-02-17 Thread Ning Li (JIRA)
Trie range - make trie range indexing more flexible
---

 Key: LUCENE-1541
 URL: https://issues.apache.org/jira/browse/LUCENE-1541
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Ning Li
Priority: Minor


In the current trie range implementation, a single precision step is specified. 
With a large precision step (say 8), a value is indexed in fewer terms (8) but 
the number of terms for a range can be large. With a small precision step (say 
2), the number of terms for a range is smaller but a value is indexed in more 
terms (32).

We want to add an option that different precision steps can be set for 
different precisions. An expert can use this option to keep the number of terms 
for a range small and at the same time index a value in a small number of 
terms. See the discussion in LUCENE-1470 that results in this issue.




[jira] Commented: (LUCENE-1470) Add TrieRangeFilter to contrib

2009-02-17 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674248#action_12674248
 ] 

Ning Li commented on LUCENE-1470:
-

Agree. Do you want to open a new issue? If you want, I can take a crack at it, 
but probably sometime next week.

> Add TrieRangeFilter to contrib
> --
>
> Key: LUCENE-1470
> URL: https://issues.apache.org/jira/browse/LUCENE-1470
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: fixbuild-LUCENE-1470.patch, fixbuild-LUCENE-1470.patch, 
> LUCENE-1470-readme.patch, LUCENE-1470-revamp.patch, LUCENE-1470-revamp.patch, 
> LUCENE-1470-revamp.patch, LUCENE-1470.patch, LUCENE-1470.patch, 
> LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, 
> LUCENE-1470.patch, trie.zip, TrieRangeFilter.java, TrieUtils.java, 
> TrieUtils.java, TrieUtils.java, TrieUtils.java, TrieUtils.java
>
>
> According to the thread in java-dev 
> (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and 
> http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to 
> include my fast numerical range query implementation into lucene 
> contrib-queries.
> I implemented (based on RangeFilter) another approach for faster
> RangeQueries, based on longs stored in index in a special format.
> The idea behind this is to store the longs at different precisions in the index
> and to partition the query range in such a way that the outer boundaries are
> searched using terms of the highest precision, while the center of the search
> range uses lower precisions. The implementation stores the longs in 8
> different precisions (using a class called TrieUtils). It also has support
> for Doubles, using the IEEE 754 floating-point "double format" bit layout
> with some bit mappings to make them binary sortable. The approach is used on
> rather big indexes; query times, even on low-performance desktop
> computers, are <<100 ms (!) for very big ranges on indexes with 50 docs.
> I called this RangeQuery variant and format "TrieRangeRange" query because
> the idea looks like the well-known Trie structures (but it is not identical
> to real tries, but algorithms are related to it).
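The "bit mappings to make them binary sortable" mentioned above follow a standard trick; a sketch under the assumption that TrieUtils uses this mapping (the method name here is illustrative, not TrieUtils' actual API):

```java
public class SortableDoubles {
    // Map IEEE 754 double bits to a long whose signed ordering matches
    // the doubles' numeric ordering. Positive doubles already sort
    // correctly by their raw bits; negative doubles have their lower
    // 63 bits flipped so larger magnitudes sort lower.
    static long doubleToSortableLong(double val) {
        long bits = Double.doubleToLongBits(val);
        if (bits < 0) bits ^= 0x7fffffffffffffffL;  // negative: invert ordering
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(doubleToSortableLong(-2.0) < doubleToSortableLong(-1.0)); // true
        System.out.println(doubleToSortableLong(-1.0) < doubleToSortableLong(0.0));  // true
        System.out.println(doubleToSortableLong(0.0) < doubleToSortableLong(1.5));   // true
    }
}
```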




[jira] Commented: (LUCENE-1470) Add TrieRangeFilter to contrib

2009-02-16 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674051#action_12674051
 ] 

Ning Li commented on LUCENE-1470:
-

Hi Uwe,

I had something similar in mind when I said we can "make things more flexible". 
Do you think it'll be too complex for users to specify? On the other hand, this 
is for experts, so let experts have all the flexibility. :) We can open a 
separate JIRA issue if we decide to go for it.

> Add TrieRangeFilter to contrib
> --
>
> Key: LUCENE-1470
> URL: https://issues.apache.org/jira/browse/LUCENE-1470
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: fixbuild-LUCENE-1470.patch, fixbuild-LUCENE-1470.patch, 
> LUCENE-1470-readme.patch, LUCENE-1470-revamp.patch, LUCENE-1470-revamp.patch, 
> LUCENE-1470-revamp.patch, LUCENE-1470.patch, LUCENE-1470.patch, 
> LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, 
> LUCENE-1470.patch, trie.zip, TrieRangeFilter.java, TrieUtils.java, 
> TrieUtils.java, TrieUtils.java, TrieUtils.java, TrieUtils.java
>
>
> According to the thread in java-dev 
> (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and 
> http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to 
> include my fast numerical range query implementation into lucene 
> contrib-queries.
> I implemented (based on RangeFilter) another approach for faster
> RangeQueries, based on longs stored in index in a special format.
> The idea behind this is to store the longs at different precisions in the index
> and to partition the query range in such a way that the outer boundaries are
> searched using terms of the highest precision, while the center of the search
> range uses lower precisions. The implementation stores the longs in 8
> different precisions (using a class called TrieUtils). It also has support
> for Doubles, using the IEEE 754 floating-point "double format" bit layout
> with some bit mappings to make them binary sortable. The approach is used on
> rather big indexes; query times, even on low-performance desktop
> computers, are <<100 ms (!) for very big ranges on indexes with 50 docs.
> I called this RangeQuery variant and format "TrieRangeRange" query because
> the idea looks like the well-known Trie structures (but it is not identical
> to real tries, but algorithms are related to it).




[jira] Commented: (LUCENE-1470) Add TrieRangeFilter to contrib

2009-02-16 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673912#action_12673912
 ] 

Ning Li commented on LUCENE-1470:
-

Good stuff!

Is it worth also having an option to specify the number of precisions at which 
to index a value?

With a large precision step (say 8), a value is indexed in fewer terms (8) but 
the number of terms for a range can be large. With a small precision step (say 
2), the number of terms for a range is smaller but a value is indexed in more 
terms (32). With precision step 2 and number of precisions set to 24, the 
number of terms for a range is still quite small but a value is indexed in 24 
terms instead of 32. For applications that usually query small ranges, the number 
of precisions can be further reduced.

We can provide more options to make things more flexible. But we probably want 
a balance of flexibility vs. the complexity of user options. Does this number 
of precisions look like a good one?
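The proposed "number of precisions" option can be sketched as a cap on how many trie levels a value is indexed at (a hypothetical helper; the name and signature are not from any patch):

```java
public class PrecisionCount {
    // Terms used to index one 64-bit value when at most numPrecisions
    // precision levels are indexed, as proposed in the comment above.
    static int termsPerValue(int precisionStep, int numPrecisions) {
        int full = (64 + precisionStep - 1) / precisionStep;  // ceil: all levels
        return Math.min(full, numPrecisions);                 // cap at the option
    }

    public static void main(String[] args) {
        // step 2 normally indexes a long in 32 terms...
        System.out.println(termsPerValue(2, Integer.MAX_VALUE));  // 32
        // ...but capping at 24 precisions reduces that to 24 terms
        System.out.println(termsPerValue(2, 24));                 // 24
        // step 8 is unaffected by a cap of 24
        System.out.println(termsPerValue(8, 24));                 // 8
    }
}
```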

> Add TrieRangeFilter to contrib
> --
>
> Key: LUCENE-1470
> URL: https://issues.apache.org/jira/browse/LUCENE-1470
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
> Attachments: fixbuild-LUCENE-1470.patch, fixbuild-LUCENE-1470.patch, 
> LUCENE-1470-readme.patch, LUCENE-1470-revamp.patch, LUCENE-1470-revamp.patch, 
> LUCENE-1470-revamp.patch, LUCENE-1470.patch, LUCENE-1470.patch, 
> LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, 
> LUCENE-1470.patch, trie.zip, TrieRangeFilter.java, TrieUtils.java, 
> TrieUtils.java, TrieUtils.java, TrieUtils.java, TrieUtils.java
>
>
> According to the thread in java-dev 
> (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and 
> http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to 
> include my fast numerical range query implementation into lucene 
> contrib-queries.
> I implemented (based on RangeFilter) another approach for faster
> RangeQueries, based on longs stored in index in a special format.
> The idea behind this is to store the longs at different precisions in the index
> and to partition the query range in such a way that the outer boundaries are
> searched using terms of the highest precision, while the center of the search
> range uses lower precisions. The implementation stores the longs in 8
> different precisions (using a class called TrieUtils). It also has support
> for Doubles, using the IEEE 754 floating-point "double format" bit layout
> with some bit mappings to make them binary sortable. The approach is used on
> rather big indexes; query times, even on low-performance desktop
> computers, are <<100 ms (!) for very big ranges on indexes with 50 docs.
> I called this RangeQuery variant and format "TrieRangeRange" query because
> the idea looks like the well-known Trie structures (but it is not identical
> to real tries, but algorithms are related to it).




[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

2008-09-03 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628025#action_12628025
 ] 

Ning Li commented on LUCENE-532:


Does the use of seek-and-write in ChecksumIndexOutput make it less likely that 
Lucene will support fully sequential writes (i.e., no seeking on write)? 
ChecksumIndexOutput is currently used by SegmentInfos.

> [PATCH] Indexing on Hadoop distributed file system
> --
>
> Key: LUCENE-532
> URL: https://issues.apache.org/jira/browse/LUCENE-532
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 1.9
>Reporter: Igor Bolotin
>Priority: Minor
> Attachments: cfs-patch.txt, indexOnDFS.patch, SegmentTermEnum.patch, 
> TermInfosWriter.patch
>
>
> In my current project we needed a way to create very large Lucene indexes on 
> Hadoop distributed file system. When we tried to do it directly on DFS using 
> Nutch FsDirectory class - we immediately found that indexing fails because 
> DfsIndexOutput.seek() method throws UnsupportedOperationException. The reason 
> for this behavior is clear - DFS does not support random updates and so 
> seek() method can't be supported (at least not easily).
>  
> Well, if we can't support random updates - the question is: do we really need 
> them? Search in the Lucene code revealed 2 places which call 
> IndexOutput.seek() method: one is in TermInfosWriter and another one in 
> CompoundFileWriter. As we weren't planning to use CompoundFileWriter - the 
> only place that concerned us was in TermInfosWriter.
>  
> TermInfosWriter uses IndexOutput.seek() in its close() method to write total 
> number of terms in the file back into the beginning of the file. It was very 
> simple to change file format a little bit and write number of terms into last 
> 8 bytes of the file instead of writing them into beginning of file. The only 
> other place that should be fixed in order for this to work is in 
> SegmentTermEnum constructor - to read this piece of information at position = 
> file length - 8.
>  
> With this format hack - we were able to use FsDirectory to write index 
> directly to DFS without any problems. Well - we still don't index directly to 
> DFS for performance reasons, but at least we can build small local indexes 
> and merge them into the main index on DFS without copying big main index back 
> and forth. 
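The format hack described above, writing the count into the last 8 bytes instead of seeking back to the beginning, can be sketched like this (class and method names are hypothetical; this is not the actual TermInfosWriter code):

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of the LUCENE-532 idea: the writer appends the entry count as the
// last 8 bytes, so it never seeks; only the reader seeks (which DFS allows).
public class AppendOnlyTermFile {
    static void write(File f, long[] entries) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            for (long e : entries) out.writeLong(e); // body: sequential writes only
            out.writeLong(entries.length);           // count stored at the end
        }
    }

    static long readCount(File f) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(f, "r")) {
            in.seek(in.length() - 8);                // read count at length - 8
            return in.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("terms", ".bin");
        write(f, new long[]{42L, 7L, 13L});
        System.out.println(readCount(f));            // 3
        f.delete();
    }
}
```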




[jira] Commented: (LUCENE-1335) Correctly handle concurrent calls to addIndexes, optimize, commit

2008-08-27 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626158#action_12626158
 ] 

Ning Li commented on LUCENE-1335:
-

Maybe this should be a separate JIRA issue. In doWait(), the comment says "as a 
defense against thread timing hazards where notifyAll() fails to be called, we 
wait for at most 1 second..." In some cases, it seems that notifyAll() simply 
isn't called, such as some of the cases related to runningMerges. Maybe we 
should take a closer look at, and possibly simplify, the concurrency control in 
IndexWriter, especially when autoCommit is disabled?
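The defensive pattern the doWait() comment describes is a timed wait in a condition loop, so a missed notifyAll() delays the waiter by at most the timeout instead of blocking it forever. A generic sketch (not IndexWriter's actual code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class DefensiveWait {
    private final AtomicBoolean done = new AtomicBoolean(false);

    // Re-checks the condition at least once per second, so even a missed
    // notifyAll() cannot block the caller indefinitely.
    public synchronized void awaitDone() throws InterruptedException {
        while (!done.get()) {
            wait(1000);   // bounded wait: the defense against missed notifies
        }
    }

    public void finishWithoutNotify() {
        done.set(true);   // deliberately no notifyAll(); the timed wait recovers
    }

    public static void main(String[] args) throws Exception {
        DefensiveWait w = new DefensiveWait();
        new Thread(() -> {
            try { Thread.sleep(200); } catch (InterruptedException ignored) {}
            w.finishWithoutNotify();
        }).start();
        w.awaitDone();    // returns within ~1s of the flag flipping
        System.out.println("done");
    }
}
```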

> Correctly handle concurrent calls to addIndexes, optimize, commit
> -
>
> Key: LUCENE-1335
> URL: https://issues.apache.org/jira/browse/LUCENE-1335
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3, 2.3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1335.patch, LUCENE-1335.patch, LUCENE-1335.patch, 
> LUCENE-1335.patch, LUCENE-1335.patch
>
>
> Spinoff from here:
> 
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200807.mbox/[EMAIL 
> PROTECTED]




[jira] Commented: (LUCENE-1335) Correctly handle concurrent calls to addIndexes, optimize, commit

2008-08-25 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625455#action_12625455
 ] 

Ning Li commented on LUCENE-1335:
-

> I don't think so: with autoCommit=true, the merges call commit(long)
> after finishing, and I think we want those commit calls to run
> concurrently?

After we disable autoCommit, all commit calls will be serialized, right?


> What'll happen is the BG merge will hit an exception, roll itself
> back, and then the FG thread will pick up the merge and try again.
> Likely it'll hit the same exception, which is then thrown back to the
> caller.  It may not hit an exception, eg say it was disk full: the BG
> merge was probably trying to merge 10 segments, whereas the FG merge
> is just copying over the 1 segment.  So it may complete successfully
> too.

Back to the issue of running an external merge in BG or FG.
In ConcurrentMergeScheduler.merge, an external merge is run in FG,
not in BG. But in ConcurrentMergeScheduler.MergeThread.run,
whether a merge is external is no longer checked. Why this difference?


> Correctly handle concurrent calls to addIndexes, optimize, commit
> -
>
> Key: LUCENE-1335
> URL: https://issues.apache.org/jira/browse/LUCENE-1335
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3, 2.3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1335.patch, LUCENE-1335.patch, LUCENE-1335.patch, 
> LUCENE-1335.patch
>
>
> Spinoff from here:
> 
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200807.mbox/[EMAIL 
> PROTECTED]




[jira] Commented: (LUCENE-1335) Correctly handle concurrent calls to addIndexes, optimize, commit

2008-08-23 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625078#action_12625078
 ] 

Ning Li commented on LUCENE-1335:
-

> It's because commit() calls prepareCommit(), which throws a
> "prepareCommit was already called" exception if the commit was already
> prepared.  Whereas commit(long) doesn't call prepareCommit (eg, it
> doesn't need to flush).  Without this, I was hitting exceptions in one
> of the tests that calls commit() from multiple threads at the same
> time.

Would it be simpler to serialize all commit()/commit(long) calls?

> This is to make sure any just-started addIndexes cleanly finish or
> abort before we enter the wait loop.  I was seeing cases where the
> wait loop would think no more merges were pending, but in fact an
> addIndexes was just getting underway and was about to start merging.
> It's OK if a new addIndexes call starts up, because it'll be forced to
> check the stop conditions (closing=true or stopMerges=true) and then
> abort the merges.  I'll add comments to this effect.

I wonder if we can simplify the logic... Currently in setMergeScheduler,
merges can start between the call to finishMerges and the setting of the
merge scheduler. This could be fixed by making setMergeScheduler synchronized.

> This method has always carried out merges in the FG, but it's in fact
> possible that a BG merge thread on finishing a previous merge may pull
> a merge involving external segments.  So I changed this method to wait
> for all such BG merges to complete, because it's not allowed to return
> until there are no more external segments in the index.

Hmm... so merges involving external segments may be in FG or BG?
So copyExternalSegments not only copies external segments, but also
waits for BG merges involving external segments to finish. We need
a better name?

> It is tempting to fully schedule these external merges (ie allow them
> to run in BG), but there is a problem: if there is some error on doing
> the merge, we need that error to be thrown in the FG thread calling
> copyExternalSegments (so the transaction above unwinds).  (Ie we
> can't just stuff these external merges into the merge queue then wait
> for their completion).

Then what about those BG merges involving external segments?

> Correctly handle concurrent calls to addIndexes, optimize, commit
> -
>
> Key: LUCENE-1335
> URL: https://issues.apache.org/jira/browse/LUCENE-1335
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3, 2.3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1335.patch, LUCENE-1335.patch, LUCENE-1335.patch, 
> LUCENE-1335.patch
>
>
> Spinoff from here:
> 
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200807.mbox/[EMAIL 
> PROTECTED]




[jira] Commented: (LUCENE-1335) Correctly handle concurrent calls to addIndexes, optimize, commit

2008-08-22 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624998#action_12624998
 ] 

Ning Li commented on LUCENE-1335:
-

I agree that we should not make any API promises about what
it means when the methods (commit, close, rollback, optimize,
addIndexes) are called concurrently from different threads.
The discussion below is on their current behavior.

> Only one addIndexes can run at once, so call to 2nd or more
> addIndexes just blocks until the one is done.

This is achieved by the read-write lock.

> close() and rollback() wait for any running addIndexes to finish
> and then blocks new addIndexes calls

Just to clarify: close(waitForMerges=false) and rollback() make
an ongoing addIndexes[NoOptimize](dirs) abort, but wait for
addIndexes(readers) to finish. It'd be nice if they made any
addIndexes* call abort for a quick shutdown, but that's for later.

> commit() waits for any running addIndexes, or any already running
> commit, to finish, then quickly takes a snapshot of the segments
> and syncs the files referenced by that snapshot. While syncing is
> happening addIndexes are then allowed to run again.

commit() and commit(long) use the read-write lock to wait for
a running addIndexes. "committing" is used to serialize commit()
calls. Why isn't it also used to serialize commit(long) calls?

> optimize() is allowed to run concurrently with addIndexes; the two
> simply wait for their respective merges to finish.

This is nice.

More detailed comments:
- In finishMerges, acquireRead and releaseRead are both called.
  Isn't addIndexes allowed again?

- In copyExternalSegments, merges involving external segments
  are carried out in the foreground. So why the changes? To relax
  that assumption? But other parts still make the assumption.

- addIndexes(readers) should optimize before startTransaction, no?

- The newly added method segString(dir) in SegmentInfos is
  not used anywhere.
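The locking scheme discussed above (one addIndexes at a time; commit waits for a running addIndexes; commits serialized among themselves) can be sketched generically. This is not IndexWriter's actual code, and the assignment of lock roles is an assumption based on the comment:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Generic sketch: the exclusive write lock serializes addIndexes calls,
// while commit takes the read lock so it waits for a running addIndexes.
// A separate monitor serializes commits among themselves.
public class WriterLocks {
    private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
    private final Object commitLock = new Object();

    public void addIndexes(Runnable work) {
        rw.writeLock().lock();          // only one addIndexes at a time
        try { work.run(); } finally { rw.writeLock().unlock(); }
    }

    public void commit(Runnable snapshot) {
        synchronized (commitLock) {     // commits are serialized
            rw.readLock().lock();       // and wait for any running addIndexes
            try { snapshot.run(); } finally { rw.readLock().unlock(); }
        }
    }
}
```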

> Correctly handle concurrent calls to addIndexes, optimize, commit
> -
>
> Key: LUCENE-1335
> URL: https://issues.apache.org/jira/browse/LUCENE-1335
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3, 2.3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1335.patch, LUCENE-1335.patch, LUCENE-1335.patch
>
>
> Spinoff from here:
> 
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200807.mbox/[EMAIL 
> PROTECTED]




[jira] Commented: (LUCENE-1335) Correctly handle concurrent calls to addIndexes, optimize, commit

2008-08-22 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624851#action_12624851
 ] 

Ning Li commented on LUCENE-1335:
-

Hi Mike, could you update the patch? I cannot apply it. Thanks!

> Correctly handle concurrent calls to addIndexes, optimize, commit
> -
>
> Key: LUCENE-1335
> URL: https://issues.apache.org/jira/browse/LUCENE-1335
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3, 2.3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1335.patch, LUCENE-1335.patch
>
>
> Spinoff from here:
> 
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200807.mbox/[EMAIL 
> PROTECTED]




[jira] Resolved: (LUCENE-1338) With non-deprecated constructors, IndexWriter's autoCommit is always true

2008-07-17 Thread Ning Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Li resolved LUCENE-1338.
-

Resolution: Invalid

When deprecated constructors are removed in 3.0, autoCommit will always be 
false.

> With non-deprecated constructors, IndexWriter's autoCommit is always true
> -
>
> Key: LUCENE-1338
> URL: https://issues.apache.org/jira/browse/LUCENE-1338
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
>Priority: Minor
>
> With non-deprecated constructors, IndexWriter's autoCommit is always true.




[jira] Commented: (LUCENE-1338) With non-deprecated constructors, IndexWriter's autoCommit is always true

2008-07-17 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614404#action_12614404
 ] 

Ning Li commented on LUCENE-1338:
-

Or is the intention to make autoCommit always false after deprecated 
constructors are removed?

> With non-deprecated constructors, IndexWriter's autoCommit is always true
> -
>
> Key: LUCENE-1338
> URL: https://issues.apache.org/jira/browse/LUCENE-1338
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
>Priority: Minor
>
> With non-deprecated constructors, IndexWriter's autoCommit is always true.




[jira] Created: (LUCENE-1338) With non-deprecated constructors, IndexWriter's autoCommit is always true

2008-07-17 Thread Ning Li (JIRA)
With non-deprecated constructors, IndexWriter's autoCommit is always true
-

 Key: LUCENE-1338
 URL: https://issues.apache.org/jira/browse/LUCENE-1338
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Ning Li
Priority: Minor


With non-deprecated constructors, IndexWriter's autoCommit is always true.




[jira] Commented: (LUCENE-1228) IndexWriter.commit() does not update the index version

2008-03-13 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578518#action_12578518
 ] 

Ning Li commented on LUCENE-1228:
-

Does SegmentInfos really need both "version" and "generation"? Is "generation" 
sufficient?

> IndexWriter.commit()  does not update the index version
> ---
>
> Key: LUCENE-1228
> URL: https://issues.apache.org/jira/browse/LUCENE-1228
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.4
>Reporter: Doron Cohen
>Assignee: Doron Cohen
> Attachments: lucene-1228-commit-reopen.patch
>
>
> IndexWriter.commit() can update the index *version* and *generation* but the 
> update of *version* is lost.
> As result added documents are not seen by IndexReader.reopen().
> (There might be other side effects that I am not aware of).
> The fix is 1 line - update also the version in 
> SegmentsInfo.updateGeneration().
> (Finding this line involved more lines though... :-) )




[jira] Updated: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

2008-03-11 Thread Ning Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Li updated LUCENE-1035:


Attachment: LUCENE-1035.contrib.patch

Re-done as a contrib package. Creating a BufferPooledDirectory with your own 
file name filter for readers lets you decide which files the caching layer is 
used for.

The package includes some tests. I also ran the modified core tests against the 
caching layer in a private setting, and all tests passed.
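A filter of this kind might look like the following sketch (the file extensions chosen and the BufferPooledDirectory wiring are assumptions for illustration, not taken from the patch):

```java
import java.io.File;
import java.io.FilenameFilter;

// Hypothetical filter: route only posting-list files (.frq, .prx) and the
// term dictionary (.tis) through the caching layer, since those tend to
// show good locality; stored-field data (.fdt) would bypass the pool.
public class PostingFileFilter implements FilenameFilter {
    public boolean accept(File dir, String name) {
        return name.endsWith(".frq") || name.endsWith(".prx")
                || name.endsWith(".tis");
    }
}
```

A BufferPooledDirectory could then be constructed with this filter so that only matching files are read through the pool (the constructor shape is assumed here).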

> Optional Buffer Pool to Improve Search Performance
> --
>
> Key: LUCENE-1035
> URL: https://issues.apache.org/jira/browse/LUCENE-1035
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Ning Li
> Attachments: LUCENE-1035.contrib.patch, LUCENE-1035.patch
>
>
> Index in RAMDirectory provides better performance over that in FSDirectory.
> But many indexes cannot fit in memory or applications cannot afford to
> spend that much memory on index. On the other hand, because of locality,
> a reasonably sized buffer pool may provide good improvement over FSDirectory.
> This issue aims at providing such an optional buffer pool layer. In cases
> where it fits, i.e. a reasonable hit ratio can be achieved, it should provide
> a good improvement over FSDirectory.




[jira] Commented: (LUCENE-1204) IndexWriter.deleteDocuments bug

2008-03-06 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12575782#action_12575782
 ] 

Ning Li commented on LUCENE-1204:
-

> I think this is a false alarm.

I just found out the same thing. It's a good test though.

> IndexWriter.deleteDocuments bug
> ---
>
> Key: LUCENE-1204
> URL: https://issues.apache.org/jira/browse/LUCENE-1204
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Yonik Seeley
>Assignee: Michael McCandless
> Attachments: LUCENE-1204.patch, LUCENE-1204.take2.patch
>
>
> IndexWriter.deleteDocuments() fails random testing




[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

2008-03-03 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574782#action_12574782
 ] 

Ning Li commented on LUCENE-1035:
-

> It looks like this was never fully done. I wonder if this should be closed, 
> esp. since Ning might be working on slightly different problems now.

Sorry for the delay. I'll spend some time later this week or early next week to 
update and make it a contrib patch.

> Optional Buffer Pool to Improve Search Performance
> --
>
> Key: LUCENE-1035
> URL: https://issues.apache.org/jira/browse/LUCENE-1035
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Ning Li
> Attachments: LUCENE-1035.patch
>
>
> Index in RAMDirectory provides better performance over that in FSDirectory.
> But many indexes cannot fit in memory or applications cannot afford to
> spend that much memory on index. On the other hand, because of locality,
> a reasonably sized buffer pool may provide good improvement over FSDirectory.
> This issue aims at providing such an optional buffer pool layer. In cases
> where it fits, i.e. a reasonable hit ratio can be achieved, it should provide
> a good improvement over FSDirectory.




[jira] Commented: (LUCENE-1194) Add deleteByQuery to IndexWriter

2008-02-27 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572957#action_12572957
 ] 

Ning Li commented on LUCENE-1194:
-

> As of LUCENE-1044, when autoCommit=true, IndexWriter only commits on
> committing a merge, not with every flush.

I see. Interesting.

> Hmmm ... but, there is actually the reverse problem now with my patch:
> an auto commit can actually commit deletes before the corresponding
> added docs are committed (from updateDocument calls). This is
> because, when we commit we only sync & commit the merged segments (not
> the flushed segments).

Yep.

> Though, autoCommit=true is deprecated; once we
> remove that (in 3.0) this problem goes away. I'll have to ponder how
> to fix that for now up until 3.0...it's tricky. Maybe before 3.0
> we'll just have to flush all deletes whenever we flush a new
> segment

I think flushing deletes when we flush a new segment is fine before 3.0.
In 3.0, is the plan to default autoCommit to false or to disable autoCommit
entirely? The latter, right?

> Also, I don't think we need updateByQuery? Eg in 3.0 when autoCommit
> is hardwired to false then you can deleteDocuments(Query) and then
> addDocument(...) and it will be atomic.

Agree. When autoCommit is disabled, we don't need any update method.

> Add deleteByQuery to IndexWriter
> 
>
> Key: LUCENE-1194
> URL: https://issues.apache.org/jira/browse/LUCENE-1194
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1194.patch
>
>
> This has been discussed several times recently:
>   http://markmail.org/message/awlt4lmk3533epbe
>   http://www.gossamer-threads.com/lists/lucene/java-user/57384#57384
> If we add deleteByQuery to IndexWriter then this is a big step towards
> allowing IndexReader to be readonly.
> I took the approach suggested in that first thread: I buffer delete
> queries just like we now buffer delete terms, holding the max docID
> that the delete should apply to.
> Then, I also decoupled flushing deletes (mapping term or query -->
> actual docIDs that need deleting) from flushing added documents, and
> now I flush deletes only when a merge is started, or on commit() or
> close().  SegmentMerger now exports the docID map it used when
> merging, and I use that to renumber the max docIDs of all pending
> deletes.
> Finally, I turned off tracking of memory usage of pending deletes
> since they now live beyond each flush.  Deletes are now only
> explicitly flushed if you set maxBufferedDeleteTerms to something
> other than DISABLE_AUTO_FLUSH.  Otherwise they are flushed at the
> start of every merge.




[jira] Commented: (LUCENE-1194) Add deleteByQuery to IndexWriter

2008-02-26 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572576#action_12572576
 ] 

Ning Li commented on LUCENE-1194:
-

Great to see deleteByQuery being added to IndexWriter!

> Then, I also decoupled flushing deletes (mapping term or query -->
> actual docIDs that need deleting) from flushing added documents, and
> now I flush deletes only when a merge is started, or on commit() or
> close().

When autoCommit is true, we have to flush deletes with added documents
for update atomicity, don't we? UpdateByQuery can be added, if there is a
need.

> SegmentMerger now exports the docID map it used when merging,
> and I use that to renumber the max docIDs of all pending deletes.

Because of renumbering, we don't have to flush deletes at the start of
every merge, right? But it is a good time to flush deletes.
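The renumbering idea being discussed might be sketched like this (hypothetical, simplified types; the actual patch works on DocumentsWriter internals and buffers Query objects rather than strings):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: each buffered delete query remembers the max docID it applies to.
// After a merge, the SegmentMerger's docID map is applied to those bounds so
// the pending deletes keep pointing at the right documents.
public class PendingDeletes {
    // query -> max docID the delete applies to (queries kept as strings here)
    private final Map<String, Integer> pending = new HashMap<String, Integer>();

    public void bufferDelete(String query, int maxDocId) {
        Integer cur = pending.get(query);
        if (cur == null || maxDocId > cur) {
            pending.put(query, maxDocId);
        }
    }

    // docMap[oldDocId] gives the docID after the merge.
    public void remapAfterMerge(int[] docMap) {
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            e.setValue(docMap[e.getValue()]);
        }
    }

    public int maxDocFor(String query) {
        Integer v = pending.get(query);
        return v == null ? -1 : v;
    }
}
```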

> Add deleteByQuery to IndexWriter
> 
>
> Key: LUCENE-1194
> URL: https://issues.apache.org/jira/browse/LUCENE-1194
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1194.patch
>
>
> This has been discussed several times recently:
>   http://markmail.org/message/awlt4lmk3533epbe
>   http://www.gossamer-threads.com/lists/lucene/java-user/57384#57384
> If we add deleteByQuery to IndexWriter then this is a big step towards
> allowing IndexReader to be readonly.
> I took the approach suggested in that first thread: I buffer delete
> queries just like we now buffer delete terms, holding the max docID
> that the delete should apply to.
> Then, I also decoupled flushing deletes (mapping term or query -->
> actual docIDs that need deleting) from flushing added documents, and
> now I flush deletes only when a merge is started, or on commit() or
> close().  SegmentMerger now exports the docID map it used when
> merging, and I use that to renumber the max docIDs of all pending
> deletes.
> Finally, I turned off tracking of memory usage of pending deletes
> since they now live beyond each flush.  Deletes are now only
> explicitly flushed if you set maxBufferedDeleteTerms to something
> other than DISABLE_AUTO_FLUSH.  Otherwise they are flushed at the
> start of every merge.




[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

2007-10-29 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538638
 ] 

Ning Li commented on LUCENE-1035:
-

> The question is whether such situations are common enough to warrant adding 
> this to the core.

Agree.

> A way around that might be to layer it on top of FSDirectory and add it to 
> contrib.

I'd be happy to do that. I'll also include the following in the javadoc which 
hopefully is a fair assessment:

"When will a buffer pool help:
  - When an index is significantly larger than the file system cache, the hit 
ratio of a buffer pool is probably low which means insignificant performance 
improvement.
  - When an index is about the size of the file system cache or smaller, a 
buffer pool with good enough hit ratio will help if the IO system calls are the 
bottleneck. An example is if you have many "AND" queries, which cause a lot of 
large skips."

> Optional Buffer Pool to Improve Search Performance
> --
>
> Key: LUCENE-1035
> URL: https://issues.apache.org/jira/browse/LUCENE-1035
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Ning Li
> Attachments: LUCENE-1035.patch
>
>
> Index in RAMDirectory provides better performance over that in FSDirectory.
> But many indexes cannot fit in memory or applications cannot afford to
> spend that much memory on index. On the other hand, because of locality,
> a reasonably sized buffer pool may provide good improvement over FSDirectory.
> This issue aims at providing such an optional buffer pool layer. In cases
> where it fits, i.e. a reasonable hit ratio can be achieved, it should provide
> a good improvement over FSDirectory.




[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

2007-10-26 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538129
 ] 

Ning Li commented on LUCENE-1035:
-

> That seems like quite a few docs to retrieve--any particular reason why?

I was guessing most applications won't want all 590K results, no? Lucene is 
used in so many different ways that no single use case represents them all.

> I echo Hoss' comment--proximity searching is important even if it isn't used 
> much directly by users.

Hmm, I agree with you and Hoss, especially in applications where proximity is 
used to rank results of OR queries.

> Optional Buffer Pool to Improve Search Performance
> --
>
> Key: LUCENE-1035
> URL: https://issues.apache.org/jira/browse/LUCENE-1035
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Ning Li
> Attachments: LUCENE-1035.patch
>
>
> Index in RAMDirectory provides better performance over that in FSDirectory.
> But many indexes cannot fit in memory or applications cannot afford to
> spend that much memory on index. On the other hand, because of locality,
> a reasonably sized buffer pool may provide good improvement over FSDirectory.
> This issue aims at providing such an optional buffer pool layer. In cases
> where it fits, i.e. a reasonable hit ratio can be achieved, it should provide
> a good improvement over FSDirectory.




[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

2007-10-26 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538112
 ] 

Ning Li commented on LUCENE-1035:
-

> I'll change to "OR" queries and see what happens.

  Query set with average 590K results, retrieving docids for the first 5K

  Buffer Pool Size    Hit Ratio    Queries per second
  0                   N/A          1.9
  16M                 53%          1.9
  32M                 68%          2.0
  64M                 90%          2.3
  128M/256M/512M      99%          2.3

As Yonik pointed out, in the previous "AND" tests, the bottleneck is the system 
call that moves data from the file system cache to user space. Here in the "OR" 
tests, far fewer such calls are made, so the speedup is less significant. I wish 
I could get a real query workload for this dataset.

> Actually, phrase queries would be really interesting too since they hit the 
> term positions.

Phrase queries are rare and term distribution is highly skewed according to the 
following study on the Excite query log:
Spink, Amanda and Xu, Jack L. (2000)   "Selected results from a large study of 
Web searching: the Excite study".  Information Research, 6(1) Available at: 
http://InformationR.net/ir/6-1/paper90.html

"4. Phase Searching: Phrases (terms enclosed by quotation marks) were seldom, 
while only 1 in 16 queries contained a phrase - but correctly used.
5. Search Terms: Distribution: Jansen, et al., (2000) report the distribution 
of the frequency of use of terms in queries as highly skewed."

I didn't find a good one on the AOL query log. In any case, this buffer pool is 
not intended as a general-purpose solution. I mentioned RAMDirectory earlier. 
This is more like an alternative to RAMDirectory (that's why it's per 
directory): you want persistent storage for the index, but you also want search 
performance close to RAMDirectory's. In addition, the entire index doesn't have 
to fit into memory, as long as the most queried part does. Hopefully, this 
benefits a subset of Lucene use cases.

> did you compare it against MMAP? I

The index I experimented on didn't fit in memory...


> Optional Buffer Pool to Improve Search Performance
> --
>
> Key: LUCENE-1035
> URL: https://issues.apache.org/jira/browse/LUCENE-1035
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Ning Li
> Attachments: LUCENE-1035.patch
>
>
> Index in RAMDirectory provides better performance over that in FSDirectory.
> But many indexes cannot fit in memory or applications cannot afford to
> spend that much memory on index. On the other hand, because of locality,
> a reasonably sized buffer pool may provide good improvement over FSDirectory.
> This issue aims at providing such an optional buffer pool layer. In cases
> where it fits, i.e. a reasonable hit ratio can be achieved, it should provide
> a good improvement over FSDirectory.




[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

2007-10-26 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537995
 ] 

Ning Li commented on LUCENE-1035:
-

> most lucene usecases store much more than just the document id... that would 
> really affect locality.

In the experiments, I was simulating the (Google) paradigm where you retrieve 
just the docids and go to document servers for everything else. If stored 
fields almost always hurt locality, we can make the buffer pool sit only on 
data/files where we expect good locality (say posting lists), but not on others.

> It seems like a simple LRU cache could really be blown out of the water by 
> certain types of queries (retrieve a lot of stored fields, or do an expanding 
> term query) that would force out all previously cached hotspots. Most OS 
> level caching has protection against this (multi-level LRU or whatever). But 
> if our user-level LRU cache fails, we've also messed up the OS level cache 
> since we've been hiding page hits from it.

That's a good point. We can improve the algorithm but hopefully still keep it 
simple and general. This buffer pool is not a fit-all solution. But hopefully 
it will benefit a number of use cases. That's why I say "optional". :)
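A plain LRU of the kind being criticized is easy to sketch; anything scan-resistant (2Q, segmented LRU) would need more bookkeeping. This is an illustration only, not the patch's implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU buffer cache via LinkedHashMap's access order. A large scan
// (e.g. an expanding term query) will evict the entire hot set, which is
// exactly the weakness pointed out above.
public class LruBufferCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruBufferCache(int maxEntries) {
        super(16, 0.75f, true);  // accessOrder = true gives LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;  // evict least recently used beyond capacity
    }
}
```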

> I'd like to see single term queries, "OR" queries, and queries across 
> multiple fields (also a common usecase) that match more documents tested also.

I'll change to "OR" queries and see what happens. The dataset is enwiki with 
four fields: docid, date (optional), title and body. Most terms are from title 
and body.


> Optional Buffer Pool to Improve Search Performance
> --
>
> Key: LUCENE-1035
> URL: https://issues.apache.org/jira/browse/LUCENE-1035
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Ning Li
> Attachments: LUCENE-1035.patch
>
>
> Index in RAMDirectory provides better performance over that in FSDirectory.
> But many indexes cannot fit in memory or applications cannot afford to
> spend that much memory on index. On the other hand, because of locality,
> a reasonably sized buffer pool may provide good improvement over FSDirectory.
> This issue aims at providing such an optional buffer pool layer. In cases
> where it fits, i.e. a reasonable hit ratio can be achieved, it should provide
> a good improvement over FSDirectory.




[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

2007-10-26 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537978
 ] 

Ning Li commented on LUCENE-1035:
-

> Were the tests run using the same set of queries they were warmed for?

Yes, the same set of queries was used. The warm-up and the real run are two 
separate runs, which means the file system cache was warmed, but not the buffer 
pool.

Yes, it'd be much better if a real query log could be obtained. I'll take a look 
at the AOL query log. I used to have an intranet query log with a lot of term 
locality. That's why I think this could provide a good improvement.

> There are better ways to optimize for that, e.g., by caching hit lists, no?

That's useful, but it only applies to exact query matches. If queries share many 
terms without matching exactly, caching hit lists won't help. This is closer to 
caching posting lists, but simpler.

> Optional Buffer Pool to Improve Search Performance
> --
>
> Key: LUCENE-1035
> URL: https://issues.apache.org/jira/browse/LUCENE-1035
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Ning Li
> Attachments: LUCENE-1035.patch
>
>
> Index in RAMDirectory provides better performance over that in FSDirectory.
> But many indexes cannot fit in memory or applications cannot afford to
> spend that much memory on index. On the other hand, because of locality,
> a reasonably sized buffer pool may provide good improvement over FSDirectory.
> This issue aims at providing such an optional buffer pool layer. In cases
> where it fits, i.e. a reasonable hit ratio can be achieved, it should provide
> a good improvement over FSDirectory.




[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

2007-10-26 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537972
 ] 

Ning Li commented on LUCENE-1035:
-

> I don't think this is any better than the NIOFileCache directory I had 
> already submitted.

Are you referring to LUCENE-414? I just read it and yes, it's similar to the 
MemoryLRUCache part. Hopefully this is more general, not just for NioFile.

> It not really approved because the community felt that it did not offer much 
> over the standard OS file system cache.

Well, it shows it has its value in cases where you can achieve a reasonable hit 
ratio, right? This is optional. An application can test with its query log 
first to see the hit ratio and then decide whether to use a buffer pool and if 
so, how large.

> Optional Buffer Pool to Improve Search Performance
> --
>
> Key: LUCENE-1035
> URL: https://issues.apache.org/jira/browse/LUCENE-1035
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Ning Li
> Attachments: LUCENE-1035.patch
>
>
> Index in RAMDirectory provides better performance over that in FSDirectory.
> But many indexes cannot fit in memory or applications cannot afford to
> spend that much memory on index. On the other hand, because of locality,
> a reasonably sized buffer pool may provide good improvement over FSDirectory.
> This issue aims at providing such an optional buffer pool layer. In cases
> where it fits, i.e. a reasonable hit ratio can be achieved, it should provide
> a good improvement over FSDirectory.




[jira] Updated: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

2007-10-25 Thread Ning Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Li updated LUCENE-1035:


Summary: Optional Buffer Pool to Improve Search Performance  (was: ptional 
Buffer Pool to Improve Search Performance)

> Optional Buffer Pool to Improve Search Performance
> --
>
> Key: LUCENE-1035
> URL: https://issues.apache.org/jira/browse/LUCENE-1035
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Ning Li
> Attachments: LUCENE-1035.patch
>
>
> Index in RAMDirectory provides better performance over that in FSDirectory.
> But many indexes cannot fit in memory or applications cannot afford to
> spend that much memory on index. On the other hand, because of locality,
> a reasonably sized buffer pool may provide good improvement over FSDirectory.
> This issue aims at providing such an optional buffer pool layer. In cases
> where it fits, i.e. a reasonable hit ratio can be achieved, it should provide
> a good improvement over FSDirectory.




[jira] Updated: (LUCENE-1035) ptional Buffer Pool to Improve Search Performance

2007-10-25 Thread Ning Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Li updated LUCENE-1035:


Lucene Fields: [Patch Available]  (was: [New])

> ptional Buffer Pool to Improve Search Performance
> -
>
> Key: LUCENE-1035
> URL: https://issues.apache.org/jira/browse/LUCENE-1035
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Ning Li
> Attachments: LUCENE-1035.patch
>
>
> Index in RAMDirectory provides better performance over that in FSDirectory.
> But many indexes cannot fit in memory or applications cannot afford to
> spend that much memory on index. On the other hand, because of locality,
> a reasonably sized buffer pool may provide good improvement over FSDirectory.
> This issue aims at providing such an optional buffer pool layer. In cases
> where it fits, i.e. a reasonable hit ratio can be achieved, it should provide
> a good improvement over FSDirectory.




[jira] Updated: (LUCENE-1035) ptional Buffer Pool to Improve Search Performance

2007-10-25 Thread Ning Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Li updated LUCENE-1035:


Attachment: LUCENE-1035.patch

Coding Changes
--
New classes are localized to the store package, and so are most of the changes.
  - Two new interfaces: BareInput and BufferPool.
  - BareInput declares a subset of IndexInput's methods, such as readBytes
(IndexInput now implements BareInput).
  - BufferPoolLRU is a simple implementation of BufferPool interface.
It uses a doubly linked list for the LRU algorithm.
  - BufferPooledIndexInput is a subclass of BufferedIndexInput. It takes
a BareInput and a BufferPool. For BufferedIndexInput's readInternal,
it will read from the BufferPool, and BufferPool will read from its
cache if it's a hit and read from BareInput if it's a miss.
  - A FSDirectory object can optionally be created with a BufferPool with
its size specified by a buffer size and number of buffers. BufferPool
is shared among IndexInput of read-only files in the directory.

Unit tests
  - TestBufferPoolLRU.java is added.
  - Minor changes were made to _TestHelper.java and TestCompoundFile.java
because they made specific assumptions about the type of IndexInput returned
by FSDirectory.openInput.
  - All unit tests pass when I switch to always use a BufferPool.
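The readInternal flow described above might be sketched as follows (simplified, hypothetical interfaces; eviction and IOException handling are omitted, and the real patch's BufferPool is shared across IndexInputs):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the read path: BufferPooledIndexInput's readInternal asks the
// pool for a page; the pool serves a hit from its cache and falls back to
// the underlying BareInput on a miss. (Real code would evict via LRU and
// propagate IOExceptions; both are left out here.)
interface BareInput {
    void readBytes(long filePos, byte[] dst, int off, int len);
}

class SimpleBufferPool {
    private final int bufferSize;
    private final Map<Long, byte[]> buffers = new HashMap<Long, byte[]>();
    long hits, misses;

    SimpleBufferPool(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    // Return one page of the file, loading it from 'in' on a miss.
    byte[] page(BareInput in, long pageNo) {
        byte[] buf = buffers.get(pageNo);
        if (buf != null) {
            hits++;
            return buf;
        }
        misses++;
        buf = new byte[bufferSize];
        in.readBytes(pageNo * bufferSize, buf, 0, bufferSize);
        buffers.put(pageNo, buf);
        return buf;
    }
}
```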


Performance Results
---
I ran some experiments using the enwiki dataset. The experiments were run on
a dual 2.0Ghz Intel Xeon server running Linux. The dataset has about 3.5M
documents and the index built from it is more than 3G. The only stored field
is a unique docid, which is retrieved for each query result. All queries are
two-term AND queries generated from the dictionary. The first set of queries
returns between 1 and 1000 results with an average of 40. The second set
returns between 1 and 3000 with an average of 560. All tests were run warm.

1 Query set with average 40 results

  Buffer Pool Size    Hit Ratio    Queries per second
  0                   N/A          230
  16M                 55%          250
  32M                 63%          282
  64M                 73%          345
  128M                85%          476
  256M                95%          672
  512M                98%          685

2 Query set with average 560 results

  Buffer Pool Size    Hit Ratio    Queries per second
  0                   N/A          27
  16M                 56%          29
  32M                 70%          37
  64M                 89%          55
  128M                97%          67
  256M                98%          71
  512M                99%          72

Of course if the tests are run cold, or if the queried portion of the index
is significantly larger than the file system cache, or there is a lot of
pre-processing of the queries and/or post-processing of the results, the
speedup will be smaller. But where it applies, i.e. where a reasonable hit
ratio can be achieved, it should provide a good improvement.


> ptional Buffer Pool to Improve Search Performance
> -
>
> Key: LUCENE-1035
> URL: https://issues.apache.org/jira/browse/LUCENE-1035
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Ning Li
> Attachments: LUCENE-1035.patch
>
>
> Index in RAMDirectory provides better performance over that in FSDirectory.
> But many indexes cannot fit in memory or applications cannot afford to
> spend that much memory on index. On the other hand, because of locality,
> a reasonably sized buffer pool may provide good improvement over FSDirectory.
> This issue aims at providing such an optional buffer pool layer. In cases
> where it fits, i.e. a reasonable hit ratio can be achieved, it should provide
> a good improvement over FSDirectory.




[jira] Created: (LUCENE-1035) ptional Buffer Pool to Improve Search Performance

2007-10-25 Thread Ning Li (JIRA)
ptional Buffer Pool to Improve Search Performance
-

 Key: LUCENE-1035
 URL: https://issues.apache.org/jira/browse/LUCENE-1035
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Ning Li


An index in a RAMDirectory provides better performance than one in an
FSDirectory. But many indexes cannot fit in memory, or applications cannot
afford to spend that much memory on the index. On the other hand, because of
locality, a reasonably sized buffer pool may provide a good improvement over
FSDirectory.

This issue aims at providing such an optional buffer pool layer. In cases
where it fits, i.e. where a reasonable hit ratio can be achieved, it should
provide a good improvement over FSDirectory.





[jira] Commented: (LUCENE-1007) Flexibility to turn on/off any flush triggers

2007-10-01 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531513
 ] 

Ning Li commented on LUCENE-1007:
-

One more thing about the approximation of the actual bytes used per buffered 
delete term: remember that Integer.SIZE returns the number of bits used, so it 
should be converted to a number of bytes.
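A tiny sketch of the conversion being pointed out; the class and method names here are illustrative, not the ones used in DocumentsWriter:

```java
public class DeleteTermRamEstimate {
    // Integer.SIZE is the number of bits in an int (32); dividing by
    // Byte.SIZE (8) gives the number of bytes (4).
    static int bytesPerInt() {
        return Integer.SIZE / Byte.SIZE;
    }

    public static void main(String[] args) {
        System.out.println(bytesPerInt()); // prints 4
    }
}
```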

> Flexibility to turn on/off any flush triggers
> -
>
> Key: LUCENE-1007
> URL: https://issues.apache.org/jira/browse/LUCENE-1007
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Ning Li
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1007.patch, LUCENE-1007.take2.patch, 
> LUCENE-1007.take3.patch
>
>
> See discussion at http://www.gossamer-threads.com/lists/lucene/java-dev/53186
> Provide the flexibility to turn on/off any flush triggers - ramBufferSize, 
> maxBufferedDocs and maxBufferedDeleteTerms. One of ramBufferSize and 
> maxBufferedDocs must be enabled.




[jira] Updated: (LUCENE-1007) Flexibility to turn on/off any flush triggers

2007-09-27 Thread Ning Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Li updated LUCENE-1007:


Attachment: LUCENE-1007.take2.patch

Take2 counts buffered delete terms towards ram buffer used. A test case for it 
is added.

> Flexibility to turn on/off any flush triggers
> -
>
> Key: LUCENE-1007
> URL: https://issues.apache.org/jira/browse/LUCENE-1007
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Ning Li
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1007.patch, LUCENE-1007.take2.patch
>
>
> See discussion at http://www.gossamer-threads.com/lists/lucene/java-dev/53186
> Provide the flexibility to turn on/off any flush triggers - ramBufferSize, 
> maxBufferedDocs and maxBufferedDeleteTerms. One of ramBufferSize and 
> maxBufferedDocs must be enabled.




[jira] Updated: (LUCENE-1007) Flexibility to turn on/off any flush triggers

2007-09-27 Thread Ning Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Li updated LUCENE-1007:


Attachment: LUCENE-1007.patch

Just got around to doing the patch:
  - The patch includes changes to IndexWriter and DocumentsWriter to provide 
the flexibility to turn on/off any flush triggers.
  - Necessary changes to a couple of unit tests.
  - Also remove some unused imports.
  - All unit tests pass.

One question: should we count buffered delete terms towards the ram buffer 
used? It feels like we should. On the other hand, numBytesUsed only counts ram 
space that can be recycled.

> Flexibility to turn on/off any flush triggers
> -
>
> Key: LUCENE-1007
> URL: https://issues.apache.org/jira/browse/LUCENE-1007
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Ning Li
>Priority: Minor
> Attachments: LUCENE-1007.patch
>
>
> See discussion at http://www.gossamer-threads.com/lists/lucene/java-dev/53186
> Provide the flexibility to turn on/off any flush triggers - ramBufferSize, 
> maxBufferedDocs and maxBufferedDeleteTerms. One of ramBufferSize and 
> maxBufferedDocs must be enabled.




[jira] Created: (LUCENE-1007) Flexibility to turn on/off any flush triggers

2007-09-27 Thread Ning Li (JIRA)
Flexibility to turn on/off any flush triggers
-

 Key: LUCENE-1007
 URL: https://issues.apache.org/jira/browse/LUCENE-1007
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Ning Li
Priority: Minor


See discussion at http://www.gossamer-threads.com/lists/lucene/java-dev/53186

Provide the flexibility to turn on/off any flush triggers - ramBufferSize, 
maxBufferedDocs and maxBufferedDeleteTerms. One of ramBufferSize and 
maxBufferedDocs must be enabled.




[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-13 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527286
 ] 

Ning Li commented on LUCENE-847:


> This was actually intentional: I thought it fine if the application is
> sending multiple threads into IndexWriter to allow merges to run
> concurrently.  Because, the application can always back down to a
> single thread to get everything serialized if that's really required?

Today, applications use multiple threads on IndexWriter to get some concurrency 
on document parsing. With this patch, applications that want concurrent merges 
would simply use ConcurrentMergeScheduler, no?

Or should SerialMergeScheduler be renamed, since it doesn't really serialize merges?

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.take7.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.




[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-13 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527239
 ] 

Ning Li commented on LUCENE-847:


Hmm, it's actually possible to have concurrent merges with SerialMergeScheduler.

Making SerialMergeScheduler.merge synchronize on SerialMergeScheduler will 
serialize all merges. A merge can still be concurrent with a ram flush.

Making SerialMergeScheduler.merge synchronize on IndexWriter will serialize all 
merges and ram flushes.
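The two variants above can be sketched with hypothetical skeleton classes (these are not the real Lucene classes; they only show which monitor each variant serializes on):

```java
// Stand-in for IndexWriter: a ram flush holds the writer's own lock.
class WriterStub {
    int merges;
    void doMerge() { merges++; } // stand-in for the actual merge work
    synchronized void flushRam() { /* holds the writer lock */ }
}

class SerialSchedulerSketch {
    // Variant 1: synchronize on the scheduler. Merges are serialized with
    // each other, but a ram flush (which holds the writer's lock) can
    // still run concurrently with a merge.
    synchronized void mergeLockingScheduler(WriterStub w) {
        w.doMerge();
    }

    // Variant 2: synchronize on the writer. Merges are serialized with
    // each other AND with ram flushes, because both take the writer's lock.
    void mergeLockingWriter(WriterStub w) {
        synchronized (w) {
            w.doMerge();
        }
    }
}
```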

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.take7.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.




[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-13 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527224
 ] 

Ning Li commented on LUCENE-847:


Access of mergeThreads in ConcurrentMergeScheduler.merge() should be 
synchronized.

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.take7.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.




[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-11 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526628
 ] 

Ning Li commented on LUCENE-847:


> OK, another rev of the patch (take6).  I think it's close!

Yes, it's close! :)

> I made one simplification to the approach: IndexWriter now keeps track
> of "pendingMerges" (merges that mergePolicy has declared are necessary
> but have not yet been started), and "runningMerges" (merges currently
> in flight).  Then MergeScheduler just asks IndexWriter for the next
> pending merge when it's ready to run it.  This also cleaned up how
> cascading works.

I like this simplification.

>   * Optimize: optimize is now fully concurrent (it can run multiple
> merges at once, new segments can be flushed during an optimize,
> etc).  Optimize will optimize only those segments present when it
> started (newly flushed segments may remain separate).

These semantics do add a bit of complexity - segmentsToOptimize, 
OneMerge.optimize.

> Good idea!  I took exactly this approach in patch I just attached.  I
> made a simple change: LogMergePolicy.findMergesForOptimize first
> checks if "normal merging" would want to do merges and returns them if
> so.  Since "normal merging" exposes concurrent merges, this gains us
> concurrency for optimize in cases where the index has too many
> segments.  I wasn't sure how otherwise to expose concurrency...

Another option is to schedule merges for the newest N segments and the next 
newest N segments and the next next... N is the merge factor.


A couple of other things:

  - It seems you intended sync() to be part of the MergeScheduler interface?

  - IndexWriter.close([true]), abort(): The behaviour should be the same 
whether or not the calling thread is the one that actually gets to do the 
closing. Right now, only the thread that actually does the closing waits for it 
to complete; the other threads do not.


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.




[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-09 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526029
 ] 

Ning Li commented on LUCENE-847:


Comments on optimize():

  - In the while loop of optimize(), LogMergePolicy.findMergesForOptimize 
returns a merge spec with one merge. If ConcurrentMergeScheduler is used, the 
one merge will be started in MergeScheduler.merge() and findMergesForOptimize 
will be called again. Before the one merge finishes, findMergesForOptimize will 
return the same spec but the one merge is already started. So only one 
concurrent merge is possible and the main thread will spin on calling 
findMergesForOptimize and attempting to merge.

  - One possible solution is to make LogMergePolicy.findMergesForOptimize 
return multiple merge candidates. This allows a higher level of concurrency, 
and it also alleviates a bit the problem of the main thread spinning. To solve 
that problem fully, maybe we can check whether a merge was actually started, 
then sleep briefly if not (which means all merge candidates are in conflict)?


A comment on concurrent merge threads:

  - One difference between the current approach on concurrent merge and the 
patch I posted a while back is that, in the current approach, a MergeThread 
object is created and started for every concurrent merge. In my old patch, 
maxThreadCount of threads are created and started at the beginning and are used 
throughout. Both have pros and cons.
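The two threading styles being compared can be sketched with plain java.util.concurrent types; this is a hypothetical illustration, not the actual MergeScheduler code from either patch:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class MergeThreadStyles {
    static final AtomicInteger merged = new AtomicInteger();

    static void mergeWork() {
        merged.incrementAndGet(); // stand-in for the actual merge work
    }

    // Style 1: a short-lived thread created and started per merge
    // (the current approach).
    static void threadPerMerge(int nMerges) {
        Thread[] ts = new Thread[nMerges];
        for (int i = 0; i < nMerges; i++) {
            ts[i] = new Thread(MergeThreadStyles::mergeWork);
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }

    // Style 2: maxThreadCount long-lived workers created up front and
    // reused throughout (the older patch).
    static void fixedPool(int nMerges, int maxThreadCount) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreadCount);
        for (int i = 0; i < nMerges; i++) {
            pool.submit(MergeThreadStyles::mergeWork);
        }
        pool.shutdown();
        try { pool.awaitTermination(10, TimeUnit.SECONDS); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

Style 1 avoids idle threads; style 2 bounds the thread count and avoids per-merge thread creation cost.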

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.




[jira] Commented: (LUCENE-992) IndexWriter.updateDocument is no longer atomic

2007-09-05 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525271
 ] 

Ning Li commented on LUCENE-992:


The patch looks good! A few comments and/or observations:

  - addDocument(Document doc, Analyzer analyzer, Term delTerm): is it better to 
name it updateDocument?

  - I didn't check all the variable accesses in DocumentsWriter, but it seems 
abort() should lock for some of the variables it accesses. Or make abort() a 
synchronized method.

  - Observation: Large documents will block small documents from being flushed 
if addDocument of the large documents is called before that of the small ones. 
This was not the case before LUCENE-843.

> I also slightly changed the exception semantics in IndexWriter:
> previously if a disk full (or other) exception was hit when flushing
> the buffered docs, the buffered deletes were retained but the
> partially flushed buffered docs (if any) were discarded.

  - Observation: Before LUCENE-843, both buffered docs and buffered deletes 
were retained when such an exception occurred. Now both buffered docs and 
buffered deletes would be discarded if an exception is hit.


> IndexWriter.updateDocument is no longer atomic
> --
>
> Key: LUCENE-992
> URL: https://issues.apache.org/jira/browse/LUCENE-992
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.2
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-992.patch
>
>
> Spinoff from LUCENE-847.
> Ning caught that as of LUCENE-843, we lost the atomicity of the delete
> + add in IndexWriter.updateDocument.
> Ning suggested a simple fix: move the buffered deletes into
> DocumentsWriter and let it do the delete + add atomically.  This has a
> nice side effect of also consolidating the "time to flush" logic in
> DocumentsWriter.




[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-31 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524084
 ] 

Ning Li commented on LUCENE-847:


> Not quite following you here... not being eligible because the merge
> is in-progress in a thread is something I think any given MergePolicy
> should not have to track?  Once I factor out CMPW as its own Merger
> subclass I think the eligibility check happens only in IndexWriter?

I was referring to the current patch: LogMergePolicy does not check for 
eligibility, but CMPW, a subclass of MergePolicy, checks for eligibility. Yes, 
the eligibility check only happens in IndexWriter after we do Merger class.

> Rename to/from what?  (It is currently called MergePolicy.optimize).
> IndexWriter steps through the merges and only runs the ones that do
> not conflict (are eligible)?

Maybe rename to MergePolicy.findMergesToOptimize?

> > The reason I asked is because none of them are used right now. So
> > they might be used in the future?
> 
> Both of these methods are now called by IndexWriter (in the patch),
> upon flushing a new segment.

I was referring to the parameters. The parameters are not used.

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.




[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-30 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523957
 ] 

Ning Li commented on LUCENE-847:


> True, but I was thinking CMPW could be an exception to this rule.  I
> guess I would change the rule to "simple merge policies don't have to
> run their own merges"?

:) Let's see if we have to make that exception.

> Good point...  I think I could refactor this so that cascading logic
> lives entirely in one place IndexWriter.

Another problem with the current cascading in CMPW.MergeThread is that, if 
multiple candidate merges are found, all of them are added to 
IndexWriter.mergingSegments. But all but the first should be removed, because 
only the first merge is carried out (and is thus removed from mergingSegments 
after the merge is done).

How do you make cascading live entirely in IndexWriter? Just removing cascading 
from CMPW.MergeThread has one drawback. For example, segment sizes of an index 
are: 40, 20, 10, buffer size is 10 and merge factor is 2. A buffer full flush 
of 10 will trigger merge of 10 & 10, then cascade to 20 & 20, then cascade to 
40 & 40. CMPW without cascading will stop after 10 & 10 since 
IndexWriter.maybeMerge has already returned. Then we have to wait for the next 
flush to merge 20 & 20.
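The cascade in the example can be simulated with a toy model: whenever the two smallest segments have equal size, merge them and re-check. This is only an illustration of the 40/20/10 example above, not LogMergePolicy's actual selection logic:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CascadeDemo {
    // Repeatedly merge the two smallest segments while they are equal in
    // size (merge factor 2 in the example), modeling the cascade.
    static List<Integer> cascade(List<Integer> segs) {
        List<Integer> s = new ArrayList<>(segs);
        boolean mergedSomething = true;
        while (mergedSomething) {
            mergedSomething = false;
            s.sort(Comparator.naturalOrder());
            if (s.size() >= 2 && s.get(0).equals(s.get(1))) {
                int merged = s.remove(0) + s.remove(0); // 10 & 10 -> 20, etc.
                s.add(merged);
                mergedSomething = true;
            }
        }
        s.sort(Comparator.reverseOrder());
        return s;
    }

    public static void main(String[] args) {
        // Segments 40, 20, 10 plus a freshly flushed 10:
        // 10 & 10 -> 20, cascades to 20 & 20 -> 40, then 40 & 40 -> 80.
        System.out.println(cascade(List.of(40, 20, 10, 10))); // prints [80]
    }
}
```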

> How would this be used?  Ie, how would one make an IndexWriter that
> uses the ConcurrentMerger?  Would we add expert methods
> IndexWriter.set/getIndexMerger(...)?  (And presumably the mergePolicy
> is now owned by IndexMerger so it would have the
> set/getMergePolicy(...)?)
> 
> Also, how would you separate what remains in IW vs what would be in
> IndexMerger?

Maybe Merger does and only does merge (so IndexWriter still owns MergePolicy)? 
Say, base class Merger.merge simply calls IndexWriter.merge. 
ConcurrentMerger.merge creates a merge thread if possible. Otherwise it calls 
super.merge, which does non-concurrent merge. IndexWriter simply calls its 
merger's merge instead of its own merge. Everything else remains in IndexWriter.


1
> Hmm ... you're right.  This is a separate issue from merge policy,
> right?  Are you proposing buffering deletes in DocumentsWriter
> instead?

Yes, this is a separate issue. And yes if we consider DocumentsWriter as 
staging area.

2
> Good catch!  How to fix?  One thing we could do is always use
> SegmentInfo.reset(...) and never swap in clones at the SegmentInfo
> level.  This way using the default 'equals' (same instance) would
> work.  Or we could establish identity (equals) of a SegmentInfo as
> checking if the directory plus segment name are equal?  I think I'd
> lean to the 2nd option

I think the 2nd option is better.

3
> Hmmm yes.  In fact I think we can remove synchronized from optimize
> altogether since within it we are synchronizing(this) at the right
> places?  If more than one thread calls optimize at once, externally,
> it is actually OK: they will each pick a merge that's viable
> (concurrently) and will run the merge, else return once there is no
> more concurrency left.  I'll add a unit test that confirms this.

That seems to be the case. The fact that "the same merge spec will be returned 
without changes to segmentInfos" reminds me: MergePolicy.findCandidateMerges 
finds merges which may not be eligible; but CMPW checks for eligibility when 
looking for candidate merges. Maybe we should unify the behaviour? BTW, 
MergePolicy.optimize (a rename?) doesn't check for eligibility either.

4
> Well, useCompoundFile(...) is given a single newly flushed segment and
> should decide whether it should be CFS.  Whereas
> useCompoundDocStore(...) is called when doc stores are flushed.  When
> autoCommit=false, segments can share a single set of doc stores, so
> there's no single SegmentInfo to pass down.

The reason I asked is because none of them are used right now. So they might be 
used in the future?

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.




[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-29 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523621
 ] 

Ning Li commented on LUCENE-847:


I include comments for both LUCENE-847 and LUCENE-870 here since they are 
closely related.

I like the stateless approach used for refactoring merge policy. But modeling 
concurrent merge (ConcurrentMergePolicyWrapper) as a MergePolicy seems to be 
inconsistent with the MergePolicy interface:
  1 As you pointed out, "the merge policy is no longer responsible for running 
the merges itself". MergePolicy.maybeMerge simply returns a merge 
specification. But ConcurrentMergePolicyWrapper.maybeMerge actually starts 
concurrent merge threads thus doing the merges.
  2 Related to 1, cascading is done in IndexWriter in non-concurrent case. But 
in concurrent case, cascading is also done in merge threads which are started 
by ConcurrentMergePolicyWrapper.maybeMerge.

MergePolicy.maybeMerge should continue to simply return a merge specification. 
(BTW, should we rename this maybeMerge to, say, findCandidateMerges?) Can we 
carve the merge process out of IndexWriter into a Merger? IndexWriter still 
provides the building blocks - merge(OneMerge), mergeInit(OneMerge), etc. 
Merger uses these building blocks. A ConcurrentMerger extends Merger but starts 
concurrent merge threads as ConcurrentMergePolicyWrapper does.


Other comments:
1 updateDocument's and deleteDocument's bufferDeleteTerm are synchronized on 
different variables in this patch. However, the semantics of updateDocument 
changed since LUCENE-843. Before LUCENE-843, updateDocument, which is a delete 
and an insert, guaranteed the delete and the insert are committed together 
(thus an update). Now it's possible that they are committed in different 
transactions. If we consider DocumentsWriter as the RAM staging area for 
IndexWriter, then deletes are also buffered in RAM staging area and we can 
restore our previous semantics, right?
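The "buffer the delete in the staging area" idea can be sketched as follows: if the delete term and the new document are recorded under the same lock, the pair always lands in the same flush, restoring update atomicity. The class and field names here are illustrative, not the actual DocumentsWriter internals:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StagingArea {
    private final List<String> bufferedDocs = new ArrayList<>();
    private final Map<String, Integer> bufferedDeletes = new HashMap<>();

    // Atomic update: the delete term and the added doc are buffered
    // together under one lock, so no flush can separate them.
    synchronized void updateDocument(String delTerm, String doc) {
        bufferedDocs.add(doc);
        // remember the doc count at which the delete applies
        bufferedDeletes.put(delTerm, bufferedDocs.size());
    }

    synchronized int bufferedDocCount() { return bufferedDocs.size(); }

    synchronized int bufferedDeleteCount() { return bufferedDeletes.size(); }
}
```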

2 OneMerge.segments seems to rely on its segment infos' reference to segment 
infos of IndexWriter.segmentInfos. The use in commitMerge, which calls 
ensureContiguousMerge, is an example. However, segmentInfos can be a cloned 
copy because of exceptions, thus the reference broken.

3 Calling optimize of an IndexWriter with the current 
ConcurrentMergePolicyWrapper may cause deadlock: the one merge spec returned by 
MergePolicy.optimize may be in conflict with a concurrent merge (the same merge 
spec will be returned without changes to segmentInfos), but a concurrent merge 
cannot finish because optimize is holding the lock.

4 Finally, a couple of minor things:
  1 LogMergePolicy.useCompoundFile(SegmentInfos infos, SegmentInfo info) and 
useCompoundDocStore(SegmentInfos infos): why the parameters?
  2 Do we need doMergeClose in IndexWriter? Can we simply close a MergePolicy 
if not null?

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.




[jira] Updated: (LUCENE-987) Deprecate IndexModifier

2007-08-22 Thread Ning Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Li updated LUCENE-987:
---

Attachment: deprecateIndexModifier.patch

> Deprecate IndexModifier
> ---
>
> Key: LUCENE-987
> URL: https://issues.apache.org/jira/browse/LUCENE-987
> Project: Lucene - Java
>  Issue Type: Test
>  Components: Index
>Reporter: Ning Li
>Priority: Minor
> Attachments: deprecateIndexModifier.patch
>
>
> See discussion at 
> http://www.gossamer-threads.com/lists/lucene/java-dev/52017?search_string=deprecating%20indexmodifier;#52017
> This is to deprecate IndexModifier before 3.0 and remove it in 3.0.
> This patch includes:
>   1 IndexModifier and TestIndexModifier are deprecated.
>   2 TestIndexWriterModify is added. It is similar to TestIndexModifier but 
> uses IndexWriter and has a few other changes. The changes are because of the 
> difference between IndexModifier and IndexWriter.
>   3 TestIndexWriterLockRelease and TestStressIndexing are switched to use 
> IndexWriter instead of IndexModifier.




[jira] Created: (LUCENE-987) Deprecate IndexModifier

2007-08-22 Thread Ning Li (JIRA)
Deprecate IndexModifier
---

 Key: LUCENE-987
 URL: https://issues.apache.org/jira/browse/LUCENE-987
 Project: Lucene - Java
  Issue Type: Test
  Components: Index
Reporter: Ning Li
Priority: Minor


See discussion at 
http://www.gossamer-threads.com/lists/lucene/java-dev/52017?search_string=deprecating%20indexmodifier;#52017

This is to deprecate IndexModifier before 3.0 and remove it in 3.0.

This patch includes:
  1 IndexModifier and TestIndexModifier are deprecated.
  2 TestIndexWriterModify is added. It is similar to TestIndexModifier but uses 
IndexWriter and has a few other changes. The changes are because of the 
difference between IndexModifier and IndexWriter.
  3 TestIndexWriterLockRelease and TestStressIndexing are switched to use 
IndexWriter instead of IndexModifier.




[jira] Updated: (LUCENE-978) GC resources in TermInfosReader when exception occurs in its constructor

2007-08-16 Thread Ning Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Li updated LUCENE-978:
---

Attachment: Readers.patch

Similar fixes are added for FieldsReader and TermVectorsReader as well.
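The fix pattern can be sketched generically: if a constructor opens several resources and then throws, the ones already opened must be closed, or they leak (the .tis file left open in the report below). The class and field names here are illustrative, not Lucene's:

```java
import java.io.Closeable;
import java.io.IOException;

class MultiFileReader implements Closeable {
    private final Closeable first;
    private final Closeable second;

    MultiFileReader(Closeable a, Closeable b, boolean failAfterOpen) throws IOException {
        boolean success = false;
        try {
            this.first = a;
            this.second = b;
            if (failAfterOpen) {
                throw new IOException("simulated constructor failure");
            }
            success = true;
        } finally {
            // On any exception, release whatever was already opened.
            if (!success) closeQuietly(a, b);
        }
    }

    private static void closeQuietly(Closeable... cs) {
        for (Closeable c : cs) {
            try { if (c != null) c.close(); } catch (IOException ignored) {}
        }
    }

    @Override public void close() throws IOException {
        closeQuietly(first, second);
    }

    // Tiny helper for demonstration: records whether close() was called.
    static class Flag implements Closeable {
        boolean closed;
        @Override public void close() { closed = true; }
    }
}
```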

> GC resources in TermInfosReader when exception occurs in its constructor
> 
>
> Key: LUCENE-978
> URL: https://issues.apache.org/jira/browse/LUCENE-978
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
>Priority: Minor
> Attachments: Readers.patch, TermInfosReader.patch
>
>
> I replaced IndexModifier with IndexWriter in test case TestStressIndexing and 
> noticed the test failed from time to time because some .tis file is still 
> open when MockRAMDirectory.close() is called. It turns out it is because .tis 
> file is not closed if an exception occurs in TermInfosReader's constructor.




[jira] Commented: (LUCENE-978) GC resources in TermInfosReader when exception occurs in its constructor

2007-08-16 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520286
 ] 

Ning Li commented on LUCENE-978:


> Agreed. Actually, it also looks like we need to do something similar for 
> FieldsReader/TermVectorsReader too?

That's right. I'll submit a new patch.





[jira] Updated: (LUCENE-978) GC resources in TermInfosReader when exception occurs in its constructor

2007-08-16 Thread Ning Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Li updated LUCENE-978:
---

Lucene Fields: [Patch Available]  (was: [New])





[jira] Updated: (LUCENE-978) GC resources in TermInfosReader when exception occurs in its constructor

2007-08-15 Thread Ning Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Li updated LUCENE-978:
---

Attachment: TermInfosReader.patch





[jira] Created: (LUCENE-978) GC resources in TermInfosReader when exception occurs in its constructor

2007-08-15 Thread Ning Li (JIRA)
GC resources in TermInfosReader when exception occurs in its constructor


 Key: LUCENE-978
 URL: https://issues.apache.org/jira/browse/LUCENE-978
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Ning Li
Priority: Minor
 Attachments: TermInfosReader.patch

I replaced IndexModifier with IndexWriter in the TestStressIndexing test case and 
noticed the test failed from time to time because a .tis file was still open 
when MockRAMDirectory.close() was called. It turns out this is because the .tis 
file is not closed if an exception occurs in TermInfosReader's constructor.




[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-08 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518520
 ] 

Ning Li commented on LUCENE-847:


> Furthermore, I think this is all contained within IndexWriter, right?
> Ie when we go to "replace/checkin" the newly merged segment, this
> "merge newly flushed deletes" would execute at that time. And, I
> think, we would block flushes while this is happening, but
> addDocument/deleteDocument/updateDocument would still be allowed?

Yes and yes. :-)

> Couldn't we also just update the docIDs of pending deletes, and not
> flush? Ie we know the mapping of old -> new docID caused by the
> merge, so we can run through all deleted docIDs and remap? 

Hmm, I was worried that quite a number of delete docIDs could be buffered, but 
remapping them is still better than having to do a flush. So yes, this is better!
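The remap idea can be sketched as follows (the names and the oldToNew array are hypothetical, not Lucene API; in reality the old-to-new docID mapping would come from the merge itself):

```java
import java.util.Arrays;

/** Sketch: remap buffered delete docIDs across a merge instead of flushing.
 *  oldToNew[i] gives the post-merge docID for pre-merge docID i, or -1 if
 *  that document was dropped by the merge. Illustrative only. */
public class DeleteRemap {
    static int[] remap(int[] bufferedDeleteIds, int[] oldToNew) {
        return Arrays.stream(bufferedDeleteIds)
                     .map(id -> oldToNew[id])   // translate to post-merge docIDs
                     .filter(id -> id != -1)    // drop deletes of removed docs
                     .toArray();
    }

    public static void main(String[] args) {
        // Docs 0..4; doc 2 was already deleted and dropped by the merge.
        int[] oldToNew = {0, 1, -1, 2, 3};
        int[] remapped = remap(new int[] {1, 2, 4}, oldToNew);
        System.out.println(Arrays.toString(remapped)); // [1, 3]
    }
}
```

Running through the buffered docIDs is linear in the number of buffered deletes, which is why it beats forcing a flush.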

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.




[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-08 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518486
 ] 

Ning Li commented on LUCENE-847:


The following comments are about the impact on merging if we add 
"deleteDocument(int doc)" (and deprecate IndexModifier). Since this concerns 
the topic of this issue, I am also posting it here to get your opinions.

I'm thinking about the impact of adding "deleteDocument(int doc)" on
LUCENE-847, especially on concurrent merge. The semantics of
"deleteDocument(int doc)" is that the document to delete is specified
by the document id on the index at the time of the call. When a merge
is finished and the result is being checked into IndexWriter's
SegmentInfos, document ids may change. Therefore, it may be necessary
to flush buffered delete doc ids (thus buffered docs and delete terms
as well) before a merge result is checked in.

The flush is not necessary if there are no buffered delete doc ids. I don't 
think this should be a reason not to support "deleteDocument(int doc)" in 
IndexWriter, but its impact on concurrent merge is a concern.





[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-08 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518453
 ] 

Ning Li commented on LUCENE-847:


On 8/8/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
> Actually I was talking about my idea (to "simplify MergePolicy.merge
> API").  With the simplification (whereby MergePolicy.merge just
> returns the MergeSpecification instead of driving the merge itself) I
> believe it's simple to make a concurrency wrapper around any merge
> policy, and, have all necessary locking for SegmentInfos inside
> IndexWriter.

I agree with Mike. In fact, MergeSelector.select, which is the counterpart
of MergePolicy.merge in the patch I submitted for concurrent merge,
simply returns a MergeSpecification. It is simple and sufficient to have
all necessary locking for SegmentInfos in one class, say IndexWriter.
For example, IndexWriter locks SegmentInfos when MergePolicy (MergeSelector)
picks a merge spec. As another example, when a merge finishes,
IndexWriter.checkin could be called to lock SegmentInfos and replace
the source segment infos with the target segment info.


On 8/7/07, Steven Parkes (JIRA) <[EMAIL PROTECTED]> wrote:
> The synchronization is still tricky, since parts of segmentInfos are
> getting changed at various times and there are references and/or
> copies of it other places. And as Ning pointed out to me, we also
> have to deal with buffered delete terms. I'd say I got about 80% of
>the way there on the last go around. I'm hoping to get all the way
> this time.

It just occurred to me that there is a neat way to handle deletes that
are flushed during a concurrent merge. For example, MergePolicy
decides to merge segments B and C, with B's delete file 0001 and
C's 100. When the concurrent merge finishes, B's delete file becomes
0011 and C's 110. We do a simple computation on the delete bit
vectors and check in the merged segment with delete file 00110.
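The bit-vector computation in the example can be sketched like this (illustrative Java, not Lucene code): docs already deleted when the merge started were dropped by the merge, so only deletes that appeared while the merge ran are carried over, mapped onto the surviving docs' positions in the merged segment.

```java
import java.util.BitSet;

/** Sketch of folding deletes flushed during a concurrent merge into the
 *  merged segment. preMerge holds each source segment's deletes when the
 *  merge started; postFlush holds them after deletes were flushed mid-merge.
 *  Bit i corresponds to docID i within a segment. Illustrative only. */
public class FoldDeletes {
    static BitSet fold(BitSet[] preMerge, BitSet[] postFlush, int[] docCounts) {
        BitSet merged = new BitSet();
        int mergedDoc = 0;  // next docID in the merged segment
        for (int seg = 0; seg < docCounts.length; seg++) {
            for (int doc = 0; doc < docCounts[seg]; doc++) {
                if (preMerge[seg].get(doc)) continue;               // dropped by the merge
                if (postFlush[seg].get(doc)) merged.set(mergedDoc); // delete flushed mid-merge
                mergedDoc++;
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // B: 4 docs, deletes 0001 -> 0011; C: 3 docs, deletes 100 -> 110.
        BitSet bPre = new BitSet();  bPre.set(3);
        BitSet bPost = new BitSet(); bPost.set(2); bPost.set(3);
        BitSet cPre = new BitSet();  cPre.set(0);
        BitSet cPost = new BitSet(); cPost.set(0); cPost.set(1);
        BitSet merged = fold(new BitSet[]{bPre, cPre},
                             new BitSet[]{bPost, cPost}, new int[]{4, 3});
        System.out.println(merged); // {2, 3}, i.e. 00110 over the 5 merged docs
    }
}
```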






[jira] Commented: (LUCENE-938) I/O exceptions can cause loss of buffered deletes

2007-07-12 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512271
 ] 

Ning Li commented on LUCENE-938:


I didn't make myself clear. Let me try again. The patch includes two sets of 
changes to IndexWriter: one adds localNumBufferedDeleteTerms and 
localBufferedDeleteTerms and uses them in startTransaction() and 
rollbackTransaction(); the other fixes the loss of buffered deletes in flush() 
(and in applyDeletes(), which flush() uses).

The second part is good and that's where you had the comment on cloning.

I was referring to the first part. In startTransaction(), 
"localBufferedDeleteTerms = bufferedDeleteTerms" reference-copies 
bufferedDeleteTerms. Then more delete terms are buffered into 
bufferedDeleteTerms... so localBufferedDeleteTerms would have the delete terms 
buffered between startTransaction() and the first flush()... 

> I/O exceptions can cause loss of buffered deletes
> -
>
> Key: LUCENE-938
> URL: https://issues.apache.org/jira/browse/LUCENE-938
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: LUCENE-938.take2.patch, LUCENE-938.txt, LUCENE-938.txt
>
>
> Some I/O exceptions that result in segmentInfos rollback operations can cause 
> buffered deletes that existed before the rollback creation point to be 
> incorrectly lost when the IOException triggers a rollback.




[jira] Commented: (LUCENE-938) I/O exceptions can cause loss of buffered deletes

2007-07-05 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12510422
 ] 

Ning Li commented on LUCENE-938:


Good catch, Steven!

One thing though: I thought we had assumed that there wouldn't be any buffered 
docs or delete terms when startTransaction() is called, so no local copies are 
necessary. That means no change to startTransaction() and rollbackTransaction(). 
If there could be buffered docs and delete terms when startTransaction() is 
called, then local copies should be made for the buffered docs, and 
localNumBufferedDeleteTerms should clone numBufferedDeleteTerms instead of just 
copying the reference.
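A tiny sketch of why a reference copy is not a usable rollback snapshot (illustrative only; the map stands in for bufferedDeleteTerms):

```java
import java.util.HashMap;

/** Mutations made after startTransaction() leak into a "saved" copy that is
 *  only a reference, while a clone preserves the pre-transaction state. */
public class SnapshotSketch {
    /** Returns {sizeSeenByReferenceCopy, sizeSeenByClone} after a
     *  post-snapshot mutation of the original map. */
    static int[] compareSnapshots() {
        HashMap<String, Integer> buffered = new HashMap<>();
        HashMap<String, Integer> byRef = buffered;                  // reference copy
        HashMap<String, Integer> byClone = new HashMap<>(buffered); // real snapshot
        buffered.put("term", 1); // a delete buffered after the "snapshot"
        return new int[] { byRef.size(), byClone.size() };
    }

    public static void main(String[] args) {
        int[] sizes = compareSnapshots();
        System.out.println(sizes[0]); // 1: the reference copy sees the new term
        System.out.println(sizes[1]); // 0: the clone kept the pre-transaction state
    }
}
```

On rollback, restoring from the reference copy would silently keep the post-transaction deletes; restoring from the clone discards them, which is the intended rollback semantics.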





[jira] Updated: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-28 Thread Ning Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Li updated LUCENE-847:
---

Attachment: concurrentMerge.patch

Here is a patch for concurrent merge as discussed in:
http://www.gossamer-threads.com/lists/lucene/java-dev/45651?search_string=concurrent%20merge;#45651

I put it under this issue because it helps design and verify a factored merge 
policy that provides good support for concurrent merging.

As described before, a merge thread is started when a writer is created and 
stopped when the writer is closed. The merge process consists of three steps: 
first, create a merge task/spec; then, carry out the actual merge; finally, 
"commit" the merged segment (replacing the segments it merged in segmentInfos), 
but only after the appropriate deletes are applied. The first and last steps are 
fast and synchronous; the second step is where concurrency is achieved. Does it 
make sense to capture them as separate steps in the factored merge policy?
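The three steps might be captured roughly like this (hypothetical interfaces, not the actual patch; in the real writer, steps 1 and 3 would hold the lock on SegmentInfos and be quick, while step 2 is the long-running concurrent part):

```java
/** Sketch of one pass of a background merge thread over the three steps. */
public class MergeStepsSketch {
    interface MergeSpec {}
    interface Segment {}
    interface Writer {
        MergeSpec nextMergeSpec();                        // step 1: pick segments (synchronized, fast)
        Segment merge(MergeSpec spec);                    // step 2: actual merge (long-running)
        void commitMerge(MergeSpec spec, Segment merged); // step 3: apply deletes, swap infos (synchronized, fast)
    }

    /** One iteration of the merge thread; returns true if a merge ran. */
    static boolean runOnce(Writer writer) {
        MergeSpec spec = writer.nextMergeSpec();
        if (spec == null) return false;   // nothing worth merging yet
        Segment merged = writer.merge(spec);
        writer.commitMerge(spec, merged);
        return true;
    }

    public static void main(String[] args) {
        Writer idle = new Writer() {
            public MergeSpec nextMergeSpec() { return null; }
            public Segment merge(MergeSpec spec) { return null; }
            public void commitMerge(MergeSpec spec, Segment merged) {}
        };
        System.out.println(runOnce(idle)); // false: no merge was needed
    }
}
```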

As discussed in 
http://www.gossamer-threads.com/lists/lucene/java-dev/45651?search_string=concurrent%20merge;#45651:
 documents can be buffered while segments are merged, but no more than 
maxBufferedDocs can be buffered at any time. So this version provides limited 
concurrency. The main goal is to achieve short ingestion hiccups, especially 
when the ingestion rate is low. After the factored merge policy, we could 
provide different versions of concurrent merge policies which provide different 
levels of concurrency. :-)

All unit tests pass. If IndexWriter is replaced with 
IndexWriterConcurrentMerge, all unit tests pass except the following:
  - TestAddIndexesNoOptimize and TestIndexWriter*
This is because they check segment sizes expecting all merges to be done. 
These tests pass if the checks are performed after the concurrent merges 
finish. The modified tests (with waits for concurrent merges to finish) are in 
TestIndexWriterConcurrentMerge*.
TestIndexWriterConcurrentMerge*.
  - testExactFieldNames in TestBackwardCompatibility and 
testDeleteLeftoverFiles in TestIndexFileDeleter
In both cases, the file name segments_a is expected, but the actual name is 
segments_7. This is because with concurrent merge, if a compound file is used, 
only the compound version is "committed" (added to segmentInfos), not the 
non-compound version, hence the lower segments generation number.

Cheers,
Ning






[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2007-01-25 Thread Ning Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Li updated LUCENE-565:
---

Lucene Fields: [Patch Available]

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: https://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: NewIndexModifier.Jan2007.patch, 
> NewIndexModifier.Sept21.patch, perf-test-res.JPG, perf-test-res2.JPG, 
> perfres.log, TestBufferedDeletesPerf.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We
> are aware that this kind of change has come up in discussions before. See
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> . The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> ---
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that, with this change it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
> Coding Changes
> --
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document are properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines of coding changes are needed. All
> changes are commented by "CHANGE". We have also attached a modified version
> of an example from Chapter 2.2 of Lucene in Action.
> Performance Results
> ---
> To test the performance our proposed changes, we ran some experiments using
> the TREC WT 10G dataset. The experiments were run on a dual 2.4 Ghz Intel
> Xeon server running Linux. The disk storage was configured as RAID0 array
> with 5 drives. Before indexes were built, the input documents were parsed
> to remove the HTML from them (i.e., only the text was indexed). This was
> done to minimize the impact of parsing on performance. A simple
> WhitespaceAnalyzer was used during index build.
> We experimented with three workloads:
>   - Insert only. 1.6M documents were inserted and the final
> index size was 2.3GB.
>   - Insert/delete (big batches). The same documents were
> inserted, but 25% were deleted. 1000 documents were
> deleted for every 4000 inserted.
>   - Insert/delete (small batches). In this case, 5 documents
> were deleted for every 20 inserted.
> Workload                       current IndexWriter  current IndexModifier  new IndexWriter
> ---
> Insert only                    116 min              119 min                116 min
> Insert/delete (big batches)    --                   135 min                125 min
> Insert/delete (small batches)  --                   338 min                134 min
> As the experiments show, with the proposed changes, the performance
> improved by 60% when inserts and deletes were interleaved in small batches.
> Regards,
> Ning
> Ning Li
> Search Technologies
> IBM Almaden Research Center
> 650 Harry Road
> San Jose, CA 95120




[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2007-01-25 Thread Ning Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Li updated LUCENE-565:
---

Attachment: NewIndexModifier.Jan2007.patch

The patch is updated to account for the code committed to IndexWriter since the 
last patch. The high-level design is the same as before; see the comments from 
18/Dec/06.

Care has been taken to ensure that if the writer/modifier hits a disk-full error 
while trying to commit, it remains consistent and usable. A test case is added 
to TestNewIndexModifierDelete to verify this.

All tests pass.


[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12459506 ] 

Ning Li commented on LUCENE-565:


Here is the design overview. Minor changes were made because of lock-less 
commits.

In the current IndexWriter, newly added documents are buffered in ram in the 
form of one-doc segments.
When a flush is triggered, all ram documents are merged into a single segment 
and written to disk.
Further merges of disk segments may be triggered.

NewIndexModifier extends IndexWriter and supports document deletion in addition 
to document addition.
NewIndexModifier not only buffers newly added documents in ram, but also 
buffers deletes in ram.
The following describes what happens when a flush is triggered:

  1 merge ram documents into one segment and write it to disk
do not commit - segmentInfos is updated in memory, but not written to disk

  2 for each disk segment to which a delete may apply
  open reader
  delete docs*, write new .delN file (* Care is taken to ensure that an 
interleaved sequence of inserts and deletes for the same document is properly 
serialized.)
  close reader, but do not commit - segmentInfos is updated in memory, but 
not written to disk

  3 commit - write new segments_N to disk

Further merges for disk segments work the same as before.


As an option, we can cache readers to minimize the number of reader 
opens/closes. In other words,
we can trade memory for better performance. The design would be modified as 
follows:

  1 same as above

  2 for each disk segment to which a delete may apply
  open reader and cache it if not already opened/cached
  delete docs*, write new .delN file

  3 commit - write new segments_N to disk

The logic for disk segment merge changes accordingly: open reader if not 
already opened/cached;
after a merge is complete, close readers for the segments that have been merged.
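The serialization point above (interleaved inserts and deletes of the same document) can be sketched by recording, for each buffered delete term, how many documents were buffered when the delete arrived; at flush time the delete then applies only to docs added before the deleteDocuments() call (the names here are hypothetical, not the actual patch):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: buffered delete terms carry the buffered-doc count at delete
 *  time, so a delete never touches a document added after it. */
public class BufferedDeletes {
    private final Map<String, Integer> bufferedDeleteTerms = new HashMap<>();
    private int numBufferedDocs = 0;

    void addDocument() { numBufferedDocs++; }  // stand-in for buffering a ram doc

    void deleteDocuments(String term) {
        // The delete applies only to docs buffered before this call.
        bufferedDeleteTerms.put(term, numBufferedDocs);
    }

    /** At flush time: should buffered doc docId (which contains term) be deleted? */
    boolean applies(String term, int docId) {
        Integer limit = bufferedDeleteTerms.get(term);
        return limit != null && docId < limit;
    }

    public static void main(String[] args) {
        BufferedDeletes b = new BufferedDeletes();
        b.addDocument();              // doc 0: old version of a document
        b.deleteDocuments("id:1");    // delete it...
        b.addDocument();              // doc 1: ...then re-add the new version
        System.out.println(b.applies("id:1", 0)); // true: old version is deleted
        System.out.println(b.applies("id:1", 1)); // false: new version survives
    }
}
```

This is the delete-then-re-add pattern an updateDocument() convenience method would rely on.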



[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12459490 ] 

Ning Li commented on LUCENE-565:


Many versions of the patch were submitted as new code was committed to 
IndexWriter.java. For each version, all changes made were included in a single 
patch file.

I removed all but the latest version of the patch. Even this one is outdated by 
the commit of LUCENE-701 (lock-less commits). I was waiting for the commit of 
LUCENE-702 before submitting another patch. LUCENE-702 was committed this 
morning. So I'll submit an up-to-date patch over the holidays.

On 12/18/06, Paul Elschot (JIRA) <[EMAIL PROTECTED]> wrote:
> I'd like to give this a try over the upcoming holidays. 

That's great! We can discuss/compare the designs then. Or, we can 
discuss/compare the designs before submitting new patches.

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: NewIndexModifier.Sept21.patch, perf-test-res.JPG, 
> perf-test-res2.JPG, perfres.log, TestBufferedDeletesPerf.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We
> are aware that this kind of change has come up in discussions before. See
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> ---
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that, with this change it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
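As a rough illustration of the proposed calling pattern (this is not the actual patch; MiniIndex is a hypothetical stand-in for IndexWriter, and a "term" is modeled as a simple field/text pair), interleaving adds, term-based deletes, and the suggested update convenience through one writer would look like this:

```java
import java.util.*;

// Hypothetical stand-in for IndexWriter: models the semantics of the
// proposed deleteDocuments(Term) API, where a delete removes every
// document whose field contains the given text.
public class MiniIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    public void addDocument(Map<String, String> doc) { docs.add(doc); }

    // Proposed API: delete all documents matching the (field, text) term.
    public void deleteDocuments(String field, String text) {
        docs.removeIf(d -> text.equals(d.get(field)));
    }

    // The "update" convenience mentioned above: a delete followed by an add.
    public void updateDocument(String field, String text, Map<String, String> doc) {
        deleteDocuments(field, text);
        addDocument(doc);
    }

    public int numDocs() { return docs.size(); }

    public static void main(String[] args) {
        MiniIndex w = new MiniIndex();
        w.addDocument(Map.of("id", "1", "body", "hello"));
        w.addDocument(Map.of("id", "2", "body", "world"));
        w.deleteDocuments("id", "1");   // interleaved delete, same writer
        w.updateDocument("id", "2", Map.of("id", "2", "body", "world v2"));
        System.out.println(w.numDocs()); // one document remains
    }
}
```

The point is only the calling pattern: no IndexReader is ever opened to perform the deletes.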
> Coding Changes
> --
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document are properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines of coding changes are needed. All
> changes are commented by "CHANGE". We have also attached a modified version
> of an example from Chapter 2.2 of Lucene in Action.
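A minimal sketch of the buffering scheme described above, under stated assumptions: the class names are invented, documents are reduced to string ids, and the real patch applies buffered deletes to on-disk segments rather than an in-memory list. It shows only the deferral: deleteDocuments() buffers the term, and buffered operations become visible at flush time.

```java
import java.util.*;

// Sketch of deferred deletes: deleteDocuments() only buffers the term;
// buffered adds/deletes are applied when flush() runs (in the real patch,
// when the ramDirectory is written to disk or the writer is closed).
// Simplified: it does not model the ordering guarantee (a delete affecting
// only earlier documents) that the patch enforces with Java synchronization.
public class BufferedDeleteSketch {
    private final List<String> docs = new ArrayList<>();        // flushed docs
    private final List<String> pendingAdds = new ArrayList<>(); // in-RAM docs
    private final Set<String> pendingDeletes = new LinkedHashSet<>();

    public synchronized void addDocument(String id) { pendingAdds.add(id); }

    public synchronized void deleteDocuments(String id) { pendingDeletes.add(id); }

    // Apply buffered operations: adds become visible, then deletes are applied.
    public synchronized void flush() {
        docs.addAll(pendingAdds);
        docs.removeAll(pendingDeletes);
        pendingAdds.clear();
        pendingDeletes.clear();
    }

    public synchronized int numDocs() { return docs.size(); }

    public static void main(String[] args) {
        BufferedDeleteSketch w = new BufferedDeleteSketch();
        w.addDocument("a");
        w.addDocument("b");
        w.deleteDocuments("a");          // buffered, not yet applied
        System.out.println(w.numDocs()); // nothing flushed yet
        w.flush();
        System.out.println(w.numDocs()); // only "b" survives
    }
}
```

Because deletes piggyback on the existing flush, no extra small segments are created on disk, which is where the performance win comes from.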
> Performance Results
> ---
> To test the performance of our proposed changes, we ran some experiments using
> the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz Intel
> Xeon server running Linux. The disk storage was configured as a RAID0 array
> with 5 drives. Before indexes were built, the input documents were parsed
> to remove the HTML from them (i.e., only the text was indexed). This was
> done to minimize the impact of parsing on performance. A simple
> WhitespaceAnalyzer was used during index build.
> We experimented with three workloads:
>   - Insert only. 1.6M documents were inserted and the final
> index size was 2.3GB.
>   - Insert/delete (big batches). The same documents were
> inserted, but 25% were deleted. 1000 documents were
> deleted for every 4000 inserted.
>   - Insert/delete (small batches). In this case, 5 documents
> were deleted for every 20 inserted.
>                                 current      current        new
> Workload                        IndexWriter  IndexModifier  IndexWriter
> -----------------------------------------------------------------------
> Insert only                     116 min      119 min        116 min
> Insert/delete (big batches)     --           135 min        125 min
> Insert/delete (small batches)   --           338 min        134 min
> As the experiments show, with the proposed changes, the performance
> improved by 60% when inserts and deletes were interleaved in small batches.
> Regards,
> Ning
> Ning Li
> Search Technologies
> IBM Almaden Research Center
> 650 Harry Road
> San Jose, CA 95120
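From the reported numbers for the small-batch workload (338 min with the current IndexModifier vs. 134 min with the new IndexWriter), the quoted improvement can be verified:

```java
// Check the reported ~60% improvement on the small-batch workload.
public class SpeedupCheck {
    public static void main(String[] args) {
        double indexModifier = 338.0; // minutes, current IndexModifier
        double newWriter = 134.0;     // minutes, new IndexWriter
        double reduction = (indexModifier - newWriter) / indexModifier;
        System.out.printf("%.0f%%%n", reduction * 100); // ~60% less time
    }
}
```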

[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: newMergePolicy.Sept08.patch)


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: KeepDocCount0Segment.Sept15.patch)


[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: TestWriterDelete.java)


[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: NewIndexWriter.July18.patch)


[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: NewIndexWriter.Aug23.patch)


[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: NewIndexModifier.July09.patch)


[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: IndexWriter.patch)

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: KeepDocCount0Segment.Sept15.patch, 
> NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, 
> NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, 
> newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, 
> perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We
> are aware that this kind of change has come up in discussions before. See
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> . The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> ---
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that, with this change it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
> Coding Changes
> --
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document are properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines of coding changes are needed. All
> changes are marked with the comment "CHANGE". We have also attached a modified version
> of an example from Chapter 2.2 of Lucene in Action.
> Performance Results
> ---
> To test the performance of our proposed changes, we ran some experiments using
> the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz Intel
> Xeon server running Linux. The disk storage was configured as a RAID0 array
> with 5 drives. Before indexes were built, the input documents were parsed
> to remove the HTML from them (i.e., only the text was indexed). This was
> done to minimize the impact of parsing on performance. A simple
> WhitespaceAnalyzer was used during index build.
> We experimented with three workloads:
>   - Insert only. 1.6M documents were inserted and the final
> index size was 2.3GB.
>   - Insert/delete (big batches). The same documents were
> inserted, but 25% were deleted. 1000 documents were
> deleted for every 4000 inserted.
>   - Insert/delete (small batches). In this case, 5 documents
> were deleted for every 20 inserted.
> Workload                       current IndexWriter  current IndexModifier  new IndexWriter
> ------------------------------------------------------------------------------------------
> Insert only                    116 min              119 min                116 min
> Insert/delete (big batches)    --                   135 min                125 min
> Insert/delete (small batches)  --                   338 min                134 min
> As the experiments show, with the proposed changes, the performance
> improved by 60% when inserts and deletes were interleaved in small batches.
> Regards,
> Ning
> Ning Li
> Search Technologies
> IBM Almaden Research Center
> 650 Harry Road
> San Jose, CA 95120
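The deferred-delete scheme in the "Coding Changes" section above can be illustrated with a toy model. This is a hedged sketch in plain Java, not the actual patch: the class name and the field/value form of deleteDocuments are hypothetical stand-ins for the proposed IndexWriter.deleteDocuments(Term).

```java
import java.util.*;

// Toy model of the proposed buffered-delete scheme (hypothetical sketch,
// NOT the actual Lucene code): deleteDocuments(...) only buffers the term;
// the deletion is applied when the in-memory documents are flushed.
class BufferedDeleteWriter {
    private final List<Map<String, String>> onDisk = new ArrayList<>();  // flushed "segments"
    private final List<Map<String, String>> ramDocs = new ArrayList<>(); // buffered inserts
    private final Map<String, Set<String>> bufferedDeletes = new HashMap<>();
    private final int maxBufferedDocs;

    BufferedDeleteWriter(int maxBufferedDocs) { this.maxBufferedDocs = maxBufferedDocs; }

    void addDocument(Map<String, String> doc) {
        ramDocs.add(doc);
        if (ramDocs.size() >= maxBufferedDocs) flush();  // flush when the buffer fills
    }

    // Analogue of the proposed IndexWriter.deleteDocuments(Term):
    // defer the delete instead of opening an IndexReader immediately.
    void deleteDocuments(String field, String value) {
        bufferedDeletes.computeIfAbsent(field, k -> new HashSet<>()).add(value);
    }

    // Buffered inserts and buffered deletes are committed together, as in the patch.
    void flush() {
        onDisk.addAll(ramDocs);
        ramDocs.clear();
        onDisk.removeIf(doc -> bufferedDeletes.entrySet().stream()
                .anyMatch(e -> e.getValue().contains(doc.get(e.getKey()))));
        bufferedDeletes.clear();
    }

    int numDocs() { flush(); return onDisk.size(); }
}
```

Unlike the real patch, this toy applies buffered deletes after all buffered inserts regardless of interleaving order; the patch uses Java synchronization to properly serialize interleaved inserts and deletes of the same document.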


[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: IndexWriter.July09.patch)

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, 
> NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, 
> NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, 
> newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, 
> perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
>


[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: (was: IndexWriter.java)

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, 
> NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, 
> NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, 
> newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, 
> perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
>


[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-13 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12458205 ] 

Ning Li commented on LUCENE-565:


> Minor question... in the places that you use Vector, is there a reason you 
> aren't using ArrayList? 
> And in methods that pass a Vector, that could be changed to a List . 

ArrayList and List can be used, respectively.

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, 
> IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, 
> NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, 
> NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, 
> newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, 
> perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
>

[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-13 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12458158 ] 

Ning Li commented on LUCENE-565:


> Can the same thing happen with your patch (with a smaller window), or are 
> deletes applied between writing the new segment and writing the new segments 
> file that references it?  (hard to tell from current diff in isolation)

No, it does not happen with the patch, no matter what the window size is.
This is because the results of flushing the RAM buffer, both inserts and
deletes, are committed in the same transaction.

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, 
> IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, 
> NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, 
> NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, 
> newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, 
> perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
>

[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-12 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12457865 ] 

Ning Li commented on LUCENE-565:


> *or* you could choose to do it before a merge of the lowest level on-disk
> segments.  If none of the lowest level segments have deletes, you could
> even defer the deletes until after all the lowest-level segments have been
> merged.  This makes the deletes more efficient since it goes from
> O(mergeFactor * log(maxBufferedDocs)) to O(log(mergeFactor*maxBufferedDocs))

I don't think I like these semantics, though. With the semantics in the patch,
an update can easily be supported. With these semantics, an insert can be flushed
while a delete issued before it may or may not have been flushed.

> You are right that other forms of reader caching could increase the footprint,
> but it's nice to have the option of trading some memory for performance.

Agree. It'd be nice to cache all readers... :-)

Thanks again for your comments. Enjoy your PTO!

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, 
> IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, 
> NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, 
> NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, 
> newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, 
> perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
>

[jira] Commented: (LUCENE-702) Disk full during addIndexes(Directory[]) can corrupt index

2006-12-12 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-702?page=comments#action_12457858 ] 

Ning Li commented on LUCENE-702:


> This is actually intentional: I don't want to write to the same
> segments_N filename, ever, on the possibility that a reader may be
> reading it.  Admittedly, this should be quite rare (filling up disk
> and then experiencing contention, only on Windows), but still I wanted
> to keep "write once" even in this case.

In IndexWriter, the rollbackTransaction call in commitTransaction could
cause a write to the same segments_N filename, right?

The "write once" semantics are not kept for segment names or .delN files. This
is OK because no reader will read the old versions.

> Disk full during addIndexes(Directory[]) can corrupt index
> --
>
> Key: LUCENE-702
> URL: http://issues.apache.org/jira/browse/LUCENE-702
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.1
>Reporter: Michael McCandless
> Assigned To: Michael McCandless
> Attachments: LUCENE-702.patch, LUCENE-702.take2.patch
>
>
> This is a spinoff of LUCENE-555
> If the disk fills up during this call then the committed segments file can 
> reference segments that were not written.  Then the whole index becomes 
> unusable.
> Does anyone know of any other cases where disk full could corrupt the index?
> I think disk full should at worst lose the documents that were "in flight" at 
> the time. It shouldn't corrupt the index.




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-702) Disk full during addIndexes(Directory[]) can corrupt index

2006-12-11 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-702?page=comments#action_12457520 ] 

Ning Li commented on LUCENE-702:


It looks good. My two cents:

1 In the two rollbacks in mergeSegments (where inTransaction is false), the 
segmentInfos' generation is not always rolled back. So something like this 
could happen: two consecutive successful commits write segments_3 and 
segments_5, respectively. Nothing is broken, but it'd be nice to roll back 
completely (even for the IndexWriter instance) when a commit fails.

2 Code serving two purposes is (and has been) mixed in mergeSegments: one to 
merge segments and create a compound file if necessary, the other to commit or 
roll back when inTransaction is false. It'd be nice if the two could be 
separated: optimize and maybeMergeSegments call mergeSegmentsAndCommit, which 
creates a transaction, calls mergeSegments and commits or rolls back; 
mergeSegments doesn't deal with commit or rollback. However, currently the 
non-CFS version is committed first even if useCompoundFile is true. Until 
that's changed, mergeSegments probably has to continue serving both purposes.


> Disk full during addIndexes(Directory[]) can corrupt index
> --
>
> Key: LUCENE-702
> URL: http://issues.apache.org/jira/browse/LUCENE-702
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.1
>Reporter: Michael McCandless
> Assigned To: Michael McCandless
> Attachments: LUCENE-702.patch
>
>
> This is a spinoff of LUCENE-555
> If the disk fills up during this call then the committed segments file can 
> reference segments that were not written.  Then the whole index becomes 
> unusable.
> Does anyone know of any other cases where disk full could corrupt the index?
> I think disk full should at worst lose the documents that were "in flight" at 
> the time.  It shouldn't corrupt the index.




[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-11-22 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12452039 ] 

Ning Li commented on LUCENE-565:


With the recent commits to IndexWriter, this patch no longer applies cleanly. 
The 5 votes for this issue encourage
me to submit yet another patch. :-) But before I do that, I'd like to briefly 
describe the design again and welcome all
suggestions that help improve it and help get it committed. :-)

With the new merge policy committed, the change to IndexWriter is minimal: 
three zero-or-one-line functions are
added and used.
  1 timeToFlushRam(): return true if number of ram segments >= maxBufferedDocs 
and used in maybeFlushRamSegments()
  2 anythingToFlushRam(): return true if number of ram segments > 0 and used in 
flushRamSegments()
  3 doAfterFlushRamSegments(): do nothing and called in mergeSegments() if the 
merge is on ram segments

The new IndexModifier is a subclass of IndexWriter and only overwrites the 
three functions described above.
  1 timeToFlushRam(): return true if number of ram segments >= maxBufferedDocs 
OR if number of buffered
 deletes >= maxBufferedDeletes
  2 anythingToFlushRam(): return true if number of ram segments > 0 OR if 
number of buffered deletes > 0
  3 doAfterFlushRamSegments(): properly flush buffered deletes
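The two flush triggers above can be sketched as a minimal model; the names and thresholds are illustrative, not Lucene's actual API:

```java
// Minimal model of the flush triggers described above; names are
// illustrative and not Lucene's actual API.
public class FlushPolicySketch {
    // IndexWriter behaviour: flush only when enough ram segments accumulate.
    static boolean writerTimeToFlushRam(int ramSegments, int maxBufferedDocs) {
        return ramSegments >= maxBufferedDocs;
    }

    // IndexModifier override: also flush when buffered deletes pile up,
    // so interleaved small batches of deletes don't sit unbounded in RAM.
    static boolean modifierTimeToFlushRam(int ramSegments, int maxBufferedDocs,
                                          int bufferedDeletes,
                                          int maxBufferedDeletes) {
        return ramSegments >= maxBufferedDocs
            || bufferedDeletes >= maxBufferedDeletes;
    }

    public static void main(String[] args) {
        // Many buffered deletes but no new documents yet:
        System.out.println(writerTimeToFlushRam(0, 10));           // false
        System.out.println(modifierTimeToFlushRam(0, 10, 10, 10)); // true
    }
}
```

The point of the override is visible in the example: with deletes buffered but no new documents, only the IndexModifier variant triggers a flush.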

The new IndexModifier supports all APIs from the current IndexModifier except 
one: deleteDocument(int doc).
I had commented on this before:  "I deliberately left that one out. This is 
because document ids are changing
as documents are deleted and segments are merged. Users don't know exactly when 
segments are merged, and thus when ids change, when using IndexModifier."

This behaviour is true for both the new IndexModifier and the current 
IndexModifier. If this is preventing this
patch from getting accepted, I'm willing to add it, but I will detail it in 
the Javadoc so users of this function
are aware of this behaviour.


> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, 
> IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, 
> NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, 
> NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, 
> newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, 
> perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We
> are aware that this kind of change has come up in discussions before. See
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> . The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> ---
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that, with this change it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
> Coding Changes
> --
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document are properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines of coding changes are needed. All
> changes are commented by "CHANGE". We have also attached a modified version
> of an example from Chap

[jira] Commented: (LUCENE-702) Disk full during addIndexes(Directory[]) can corrupt index

2006-11-07 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-702?page=comments#action_12448006 ] 

Ning Li commented on LUCENE-702:


> I think we should try to make all of the addIndexes calls (and more
> generally any call to Lucene) "transactional".

Agreed. Transactional semantics would be better.

The approach you described for three addIndexes looks good.

addIndexes(IndexReader[]) is transactional but has two commits: one
when existing segments are merged at the beginning, the other at the
end when all segment/readers are merged.

addIndexes(Directory[]) can be fixed to have a similar behaviour:
first commit when existing segments are merged at the beginning, then
at the end when all old/new segments are merged.

addIndexesNoOptimize(Directory[]), on the other hand, does not merge
existing segments at the beginning. So when fixed, it will only have
one commit at the end which captures all the changes.


> Disk full during addIndexes(Directory[]) can corrupt index
> --
>
> Key: LUCENE-702
> URL: http://issues.apache.org/jira/browse/LUCENE-702
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.1
>Reporter: Michael McCandless
> Assigned To: Michael McCandless
>
> This is a spinoff of LUCENE-555
> If the disk fills up during this call then the committed segments file can 
> reference segments that were not written.  Then the whole index becomes 
> unusable.
> Does anyone know of any other cases where disk full could corrupt the index?
> I think disk full should at worst lose the documents that were "in flight" at 
> the time.  It shouldn't corrupt the index.




[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-11-06 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12447657 ] 

Ning Li commented on LUCENE-565:



   [[ Old comment, sent by email on Thu, 6 Jul 2006 07:53:35 -0700 ]]

Hi Otis,

I will regenerate the patch and add more comments. :-)

Regards,
Ning




   
[Quoted message from "Otis Gospodnetic (JIRA)" <[EMAIL PROTECTED]>
to [EMAIL PROTECTED], 07/05/2006 11:25 PM
Subject: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in
IndexWriter (Code and Performance Results Provided)]



[
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12419396
 ]

Otis Gospodnetic commented on LUCENE-565:
-

I took a look at the patch and it looks good to me (anyone else had a
look)?
Unfortunately, I couldn't get the patch to apply :(

$ patch -F3 < IndexWriter.patch
(Stripping trailing CRs from patch.)
patching file IndexWriter.java
Hunk #1 succeeded at 58 with fuzz 1.
Hunk #2 succeeded at 112 (offset 2 lines).
Hunk #4 succeeded at 504 (offset 33 lines).
Hunk #6 succeeded at 605 with fuzz 2 (offset 57 lines).
missing header for unified diff at line 259 of patch
(Stripping trailing CRs from patch.)
can't find file to patch at input line 259
Perhaps you should have used the -p or --strip option?
The text leading up to this was:
...
...
...
File to patch: IndexWriter.java
patching file IndexWriter.java
Hunk #1 FAILED at 802.
Hunk #2 succeeded at 745 with fuzz 2 (offset -131 lines).
1 out of 2 hunks FAILED -- saving rejects to file IndexWriter.java.rej


Would it be possible for you to regenerate the patch against IndexWriter in
HEAD?

Also, I noticed ^Ms in the patch, but I can take care of those easily
(dos2unix).

Finally, I noticed in 2-3 places that the simple logging via "infoStream"
variable was removed, for example:
-if (infoStream != null) infoStream.print("merging segments");

Perhaps this was just an oversight?

Looking forward to the new patch. Thanks!


> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, 
> IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, 
> NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, 
> NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, 
> newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, 
> perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to 

[jira] Commented: (LUCENE-701) Lock-less commits

2006-11-02 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-701?page=comments#action_12446656 ] 

Ning Li commented on LUCENE-701:


> That wouldn't be considered a failure because it's part of the retry logic. 
> At that point, an attempt would be made to open seg_2. 

From the description of the retry logic, I thought the retry logic only 
applies to the loading of the "segments_N" file, but not to the entire process 
of loading all the files of an index.

You are right, it wouldn't be a failure if the retry logic is applied to the 
loading of all the files of an index.

> Lock-less commits
> -
>
> Key: LUCENE-701
> URL: http://issues.apache.org/jira/browse/LUCENE-701
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.1
>Reporter: Michael McCandless
> Assigned To: Michael McCandless
>Priority: Minor
> Attachments: index.prelockless.cfs.zip, index.prelockless.nocfs.zip, 
> lockless-commits-patch.txt
>
>
> This is a patch based on discussion a while back on lucene-dev:
> 
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200608.mbox/[EMAIL 
> PROTECTED]
> The approach is a small modification over the original discussion (see
> Retry Logic below).  It works correctly in all my cross-machine test
> cases, but I want to open it up for feedback, testing by
> users/developers in more diverse environments, etc.
> This is a small change to how lucene stores its index that enables
> elimination of the commit lock entirely.  The write lock still
> remains.
> Of the two, the commit lock has been more troublesome for users since
> it typically serves an active role in production.  Whereas the write
> lock is usually more of a design check to make sure you only have one
> writer against the index at a time.
> The basic idea is that filenames are never reused ("write once"),
> meaning, a writer never writes to a file that a reader may be reading
> (there is one exception: the segments.gen file; see "RETRY LOGIC"
> below).  Instead it writes to generational files, ie, segments_1, then
> segments_2, etc.  Besides the segments file, the .del files and norm
> files (.sX suffix) are also now generational.  A generation is stored
> as an "_N" suffix before the file extension (eg, _p_4.s0 is the
> separate norms file for segment "p", generation 4).
> One important benefit of this is it avoids file contents caching
> entirely (the likely cause of errors when readers open an index
> mounted on NFS) since the file is always a new file.
> With this patch I can reliably instantiate readers over NFS when a
> writer is writing to the index.  However, with NFS, you are still forced to
> refresh your reader once a writer has committed because "point in
> time" searching doesn't work over NFS (see LUCENE-673 ).
> The changes are fully backwards compatible: you can open an old index
> for searching, or to add/delete docs, etc.  I've added a new unit test
> to test these cases.
> All unit tests pass, and I've added a number of additional unit tests,
> some of which fail on WIN32 in the current lucene but pass with this
> patch.  The "fileformats.xml" has been updated to describe the changes
> to the files (but XXX references need to be fixed before committing).
> There are some other important benefits:
>   * Readers are now entirely read-only.
>   * Readers no longer block one another (false contention) on
> initialization.
>   * On hitting contention, we immediately retry instead of a fixed
> (default 1.0 second now) pause.
>   * No file renaming is ever done.  File renaming has caused sneaky
> access denied errors on WIN32 (see LUCENE-665 ).  (Yonik, I used
> your approach here to not rename the segments_N file(try
> segments_(N-1) on hitting IOException on segments_N): the separate
> ".done" file did not work reliably under very high stress testing
> when a directory listing was not "point in time").
>   * On WIN32, you can now call IndexReader.setNorm() even if other
> readers have the index open (fixes a pre-existing minor bug in
> Lucene).
>   * On WIN32, You can now create an IndexWriter with create=true even
> if readers have the index open (eg see
> www.gossamer-threads.com/lists/lucene/java-user/39265) .
> Here's an overview of the changes:
>   * Every commit writes to the next segments_(N+1).
>   * Loading the segments_N file (& opening the segments) now requires
> retry logic.  I've captured this logic into a new static class:
> SegmentInfos.FindSegmentsFile.  All places that need to do
> something on the current segments file now use this class.
>   * No more deletable file.  Instead, the writer computes what's
> deletable on instantiation and updates this in memory whenever
> files can be deleted (ie, when it commits).  Created a common
> class i

[jira] Commented: (LUCENE-701) Lock-less commits

2006-11-02 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-701?page=comments#action_12446638 ] 

Ning Li commented on LUCENE-701:


Can the following scenario happen with lock-less commits?

1 A reader reads segments_1, which says the index contains seg_1.
2 A writer writes segments_2, which says the index now contains seg_2, and 
deletes seg_1.
3 The reader tries to load seg_1 and fails.
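The three-step scenario above can be sketched as a toy in-memory simulation; the map-based "directory", file names, and methods are illustrative only, not Lucene's actual classes. The last step shows why the answer hinges on whether the retry logic re-reads the newest segments file:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the reader/writer race in the scenario above.
public class LocklessRaceSketch {
    // Returns { readerHitMissingFile, retrySucceeded }.
    static boolean[] simulate() {
        Map<String, String> dir = new HashMap<>(); // filename -> contents
        dir.put("segments_1", "seg_1");            // commit 1 references seg_1
        dir.put("seg_1", "docs-v1");

        // 1. Reader reads segments_1 and learns it must open seg_1.
        String toOpen = dir.get("segments_1");

        // 2. Writer commits segments_2 (referencing seg_2), deletes seg_1.
        dir.put("seg_2", "docs-v2");
        dir.put("segments_2", "seg_2");
        dir.remove("seg_1");

        // 3. Reader fails to open seg_1 ...
        boolean failed = !dir.containsKey(toOpen);

        // ... but if the retry logic re-reads the newest segments_N and
        // re-opens from there, the open succeeds.
        boolean retried = dir.containsKey(dir.get("segments_2"));
        return new boolean[] { failed, retried };
    }

    public static void main(String[] args) {
        boolean[] r = simulate();
        System.out.println(r[0] + " " + r[1]); // true true
    }
}
```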


> Lock-less commits
> -
>
> Key: LUCENE-701
> URL: http://issues.apache.org/jira/browse/LUCENE-701
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.1
>Reporter: Michael McCandless
> Assigned To: Michael McCandless
>Priority: Minor
> Attachments: index.prelockless.cfs.zip, index.prelockless.nocfs.zip, 
> lockless-commits-patch.txt
>
>
> This is a patch based on discussion a while back on lucene-dev:
> 
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200608.mbox/[EMAIL 
> PROTECTED]
> The approach is a small modification over the original discussion (see
> Retry Logic below).  It works correctly in all my cross-machine test
> cases, but I want to open it up for feedback, testing by
> users/developers in more diverse environments, etc.
> This is a small change to how lucene stores its index that enables
> elimination of the commit lock entirely.  The write lock still
> remains.
> Of the two, the commit lock has been more troublesome for users since
> it typically serves an active role in production.  Whereas the write
> lock is usually more of a design check to make sure you only have one
> writer against the index at a time.
> The basic idea is that filenames are never reused ("write once"),
> meaning, a writer never writes to a file that a reader may be reading
> (there is one exception: the segments.gen file; see "RETRY LOGIC"
> below).  Instead it writes to generational files, ie, segments_1, then
> segments_2, etc.  Besides the segments file, the .del files and norm
> files (.sX suffix) are also now generational.  A generation is stored
> as an "_N" suffix before the file extension (eg, _p_4.s0 is the
> separate norms file for segment "p", generation 4).
> One important benefit of this is it avoids file contents caching
> entirely (the likely cause of errors when readers open an index
> mounted on NFS) since the file is always a new file.
> With this patch I can reliably instantiate readers over NFS when a
> writer is writing to the index.  However, with NFS, you are still forced to
> refresh your reader once a writer has committed because "point in
> time" searching doesn't work over NFS (see LUCENE-673 ).
> The changes are fully backwards compatible: you can open an old index
> for searching, or to add/delete docs, etc.  I've added a new unit test
> to test these cases.
> All unit tests pass, and I've added a number of additional unit tests,
> some of which fail on WIN32 in the current lucene but pass with this
> patch.  The "fileformats.xml" has been updated to describe the changes
> to the files (but XXX references need to be fixed before committing).
> There are some other important benefits:
>   * Readers are now entirely read-only.
>   * Readers no longer block one another (false contention) on
> initialization.
>   * On hitting contention, we immediately retry instead of a fixed
> (default 1.0 second now) pause.
>   * No file renaming is ever done.  File renaming has caused sneaky
> access denied errors on WIN32 (see LUCENE-665 ).  (Yonik, I used
> your approach here to not rename the segments_N file(try
> segments_(N-1) on hitting IOException on segments_N): the separate
> ".done" file did not work reliably under very high stress testing
> when a directory listing was not "point in time").
>   * On WIN32, you can now call IndexReader.setNorm() even if other
> readers have the index open (fixes a pre-existing minor bug in
> Lucene).
>   * On WIN32, You can now create an IndexWriter with create=true even
> if readers have the index open (eg see
> www.gossamer-threads.com/lists/lucene/java-user/39265) .
> Here's an overview of the changes:
>   * Every commit writes to the next segments_(N+1).
>   * Loading the segments_N file (& opening the segments) now requires
> retry logic.  I've captured this logic into a new static class:
> SegmentInfos.FindSegmentsFile.  All places that need to do
> something on the current segments file now use this class.
>   * No more deletable file.  Instead, the writer computes what's
> deletable on instantiation and updates this in memory whenever
> files can be deleted (ie, when it commits).  Created a common
> class index.IndexFileDeleter shared by reader & writer, to manage
> deletes.
>   * Storing more information into segments info file: whether it has
> separate deletes (and which generatio

[jira] Commented: (LUCENE-702) Disk full during addIndexes(Directory[]) can corrupt index

2006-11-01 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-702?page=comments#action_12446307 ] 

Ning Li commented on LUCENE-702:


A possible solution to this issue is to check, when writing segment infos to 
"segments" in directory d, whether the directory of each segment info is d, 
and only write those that are. Suggestions?

The following is my comment on this issue from the mailing list, documenting 
how Lucene could produce an inconsistent index if addIndexes(Directory[]) does 
not run to completion.

"This makes me notice a bug in current addIndexes(Directory[]). In current 
addIndexes(Directory[]),
segment infos in S are added to T's "segmentInfos" upfront. Then segments in S 
are merged to T
several at a time. Every merge is committed with T's "segmentInfos". So if a 
reader is opened on T
while addIndexes(Directory[]) is going on, it could see an inconsistent index."


> Disk full during addIndexes(Directory[]) can corrupt index
> --
>
> Key: LUCENE-702
> URL: http://issues.apache.org/jira/browse/LUCENE-702
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.1
>Reporter: Michael McCandless
> Assigned To: Michael McCandless
>
> This is a spinoff of LUCENE-555
> If the disk fills up during this call then the committed segments file can 
> reference segments that were not written.  Then the whole index becomes 
> unusable.
> Does anyone know of any other cases where disk full could corrupt the index?
> I think disk full should at worst lose the documents that were "in flight" at 
> the time.  It shouldn't corrupt the index.




[jira] Updated: (LUCENE-528) Optimization for IndexWriter.addIndexes()

2006-10-26 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-528?page=all ]

Ning Li updated LUCENE-528:
---

Lucene Fields: [Patch Available]

> Optimization for IndexWriter.addIndexes()
> -
>
> Key: LUCENE-528
> URL: http://issues.apache.org/jira/browse/LUCENE-528
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Tamm
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: AddIndexes.patch, AddIndexesNoOptimize.patch
>
>
> One big performance problem with IndexWriter.addIndexes() is that it has to 
> optimize the index both before and after adding the segments.  When you have 
> a very large index, to which you are adding batches of small updates, these 
> calls to optimize make using addIndexes() impossible.  It makes parallel 
> updates very frustrating.
> Here is an optimized function that helps out by calling mergeSegments only on 
> the newly added documents.  It will try to avoid calling mergeSegments until 
> the end, unless you're adding a lot of documents at once.
> I also have an extensive unit test that verifies that this function works 
> correctly if people are interested.  I gave it a different name because it 
> has very different performance characteristics which can make querying take 
> longer.




[jira] Commented: (LUCENE-686) Resources not always reclaimed in scorers after each search

2006-10-25 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-686?page=comments#action_12444766 ] 

Ning Li commented on LUCENE-686:



But removing TermDocs.close() will leave IndexInput.close() in a
similar half-in/half-out situation: e.g. close() will not be called
for freqStream and skipStream in SegmentTermDocs. Yet
IndexInput.close() cannot be removed (e.g. FSIndexInput).


> Resources not always reclaimed in scorers after each search
> ---
>
> Key: LUCENE-686
> URL: http://issues.apache.org/jira/browse/LUCENE-686
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
> Environment: All
>Reporter: Ning Li
> Attachments: ScorerResourceGC.patch
>
>
> Resources are not always reclaimed in scorers after each search.
> For example, close() is not always called for term docs in TermScorer.
> A test will be attached to show when resources are not reclaimed.




[jira] Updated: (LUCENE-528) Optimization for IndexWriter.addIndexes()

2006-10-24 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-528?page=all ]

Ning Li updated LUCENE-528:
---

Attachment: AddIndexesNoOptimize.patch

This patch implements addIndexesNoOptimize() following the algorithm described 
earlier.
  - The patch is based on the latest version from trunk.
  - AddIndexesNoOptimize() is implemented. The algorithm description is 
included as comment and the code is commented.
  - The patch includes a test called TestAddIndexesNoOptimize which covers all 
the code in addIndexesNoOptimize().
  - maybeMergeSegments() was conservative and checked for more merges only when 
"upperBound * mergeFactor <= maxMergeDocs". Changed it to check for more merges 
when "upperBound < maxMergeDocs".
  - Minor changes in TestIndexWriterMergePolicy to better verify merge 
invariants.
  - The patch passes all unit tests.

One more comment on the implementation:
  - When we copy un-merged segments from S in step 4, ideally we want to simply 
copy those segments. However, Directory does not support copy yet. In addition, 
the source and target may each use the compound file format or not. So we use
mergeSegments() to copy each segment, which may cause doc count to change
because deleted docs are garbage collected. That case is handled properly.  
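The loosened maybeMergeSegments() guard mentioned in the patch notes can be sketched as follows; the method and parameter names are illustrative, modeling only the loop condition, not Lucene's actual merge code:

```java
public class MergeLevelCheck {
    // Count how many merge levels maybeMergeSegments() would examine.
    // upperBound starts at maxBufferedDocs and is multiplied by mergeFactor
    // per level; `conservative` selects the old guard
    // (upperBound * mergeFactor <= maxMergeDocs) vs. the loosened one
    // (upperBound < maxMergeDocs).
    static int levelsExamined(int maxBufferedDocs, int mergeFactor,
                              int maxMergeDocs, boolean conservative) {
        int levels = 0;
        long upperBound = maxBufferedDocs;
        while (conservative ? upperBound * mergeFactor <= maxMergeDocs
                            : upperBound < maxMergeDocs) {
            levels++;
            upperBound *= mergeFactor;
        }
        return levels;
    }

    public static void main(String[] args) {
        // B=10, M=10, maxMergeDocs=5000: the loosened guard examines one
        // more level than the conservative one.
        System.out.println(levelsExamined(10, 10, 5000, true));  // 2
        System.out.println(levelsExamined(10, 10, 5000, false)); // 3
    }
}
```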

> Optimization for IndexWriter.addIndexes()
> -
>
> Key: LUCENE-528
> URL: http://issues.apache.org/jira/browse/LUCENE-528
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Tamm
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: AddIndexes.patch, AddIndexesNoOptimize.patch
>
>
> One big performance problem with IndexWriter.addIndexes() is that it has to 
> optimize the index both before and after adding the segments.  When you have 
> a very large index, to which you are adding batches of small updates, these 
> calls to optimize make using addIndexes() impossible.  It makes parallel 
> updates very frustrating.
> Here is an optimized function that helps out by calling mergeSegments only on 
> the newly added documents.  It will try to avoid calling mergeSegments until 
> the end, unless you're adding a lot of documents at once.
> I also have an extensive unit test that verifies that this function works 
> correctly if people are interested.  I gave it a different name because it 
> has very different performance characteristics which can make querying take 
> longer.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-528) Optimization for IndexWriter.addIndexes()

2006-10-20 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-528?page=comments#action_12443978 ] 

Ning Li commented on LUCENE-528:


I'll submit a patch next week.

> Optimization for IndexWriter.addIndexes()
> -
>
> Key: LUCENE-528
> URL: http://issues.apache.org/jira/browse/LUCENE-528
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Tamm
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: AddIndexes.patch
>
>
> One big performance problem with IndexWriter.addIndexes() is that it has to 
> optimize the index both before and after adding the segments.  When you have 
> a very large index, to which you are adding batches of small updates, these 
> calls to optimize make using addIndexes() impossible.  It makes parallel 
> updates very frustrating.
> Here is an optimized function that helps out by calling mergeSegments only on 
> the newly added documents.  It will try to avoid calling mergeSegments until 
> the end, unless you're adding a lot of documents at once.
> I also have an extensive unit test that verifies that this function works 
> correctly if people are interested.  I gave it a different name because it 
> has very different performance characteristics which can make querying take 
> longer.




[jira] Commented: (LUCENE-528) Optimization for IndexWriter.addIndexes()

2006-10-20 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-528?page=comments#action_12443911 ] 

Ning Li commented on LUCENE-528:


> I think you need to ensure that no segments from the source index "S" remain 
> after the call, right?

Correct. And thanks!

So in step 4, in the case where the invariants hold for the last < M segments 
whose levels are <= h, if some of those segments are from S (not merged in step 
3), properly copy them over.

Algorithm looks good?

This makes me notice a bug in current addIndexes(Directory[]). In current 
addIndexes(Directory[]),
segment infos in S are added to T's "segmentInfos" upfront. Then segments in S 
are merged to T
several at a time. Every merge is committed with T's "segmentInfos". So if a 
reader is opened on T
while addIndexes() is going on, it could see an inconsistent index.

> Optimization for IndexWriter.addIndexes()
> -
>
> Key: LUCENE-528
> URL: http://issues.apache.org/jira/browse/LUCENE-528
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Tamm
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: AddIndexes.patch
>
>
> One big performance problem with IndexWriter.addIndexes() is that it has to 
> optimize the index both before and after adding the segments.  When you have 
> a very large index, to which you are adding batches of small updates, these 
> calls to optimize make using addIndexes() impossible.  It makes parallel 
> updates very frustrating.
> Here is an optimized function that helps out by calling mergeSegments only on 
> the newly added documents.  It will try to avoid calling mergeSegments until 
> the end, unless you're adding a lot of documents at once.
> I also have an extensive unit test that verifies that this function works 
> correctly if people are interested.  I gave it a different name because it 
> has very different performance characteristics which can make querying take 
> longer.




[jira] Commented: (LUCENE-528) Optimization for IndexWriter.addIndexes()

2006-10-20 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-528?page=comments#action_12443723 ] 

Ning Li commented on LUCENE-528:


We want a robust algorithm for the version of addIndexes() which
does not call optimize().

The robustness can be expressed as the two invariants guaranteed
by the merge policy for adding documents (if mergeFactor M does not
change and segment doc count is not reaching maxMergeDocs):
  B for maxBufferedDocs, f(n) defined as ceil(log_M(ceil(n/B)))
  1: If i (left*) and i+1 (right*) are two consecutive segments of doc
  counts x and y, then f(x) >= f(y).
  2: The number of committed segments on the same level (f(n)) <= M.

References are at http://www.gossamer-threads.com/lists/lucene/java-dev/35147,
LUCENE-565 and LUCENE-672.
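The level function f(n) and the two invariants above can be sketched as a
standalone check. This is an illustration only, not Lucene code; the class
and method names are made up, and M, B are example parameters:

```java
// Sketch of f(n) = ceil(log_M(ceil(n/B))) and the two invariants above.
// Hypothetical standalone code, not part of Lucene.
public class MergeInvariants {
    static int level(int n, int M, int B) {
        int units = (n + B - 1) / B;                  // ceil(n/B)
        int lvl = 0;
        for (long bound = 1; bound < units; bound *= M) lvl++;  // ceil(log_M)
        return lvl;
    }

    // docCounts[i] is the doc count of segment i, left to right.
    static boolean invariantsHold(int[] docCounts, int M, int B) {
        int run = 1;
        for (int i = 1; i < docCounts.length; i++) {
            int prev = level(docCounts[i - 1], M, B);
            int cur = level(docCounts[i], M, B);
            if (prev < cur) return false;             // invariant 1: f(x) >= f(y)
            run = (prev == cur) ? run + 1 : 1;
            if (run > M) return false;                // invariant 2: <= M per level
        }
        return true;
    }

    public static void main(String[] args) {
        // M = 10, B = 10: level 0 for <=10 docs, 1 for <=100, 2 for <=1000.
        System.out.println(invariantsHold(new int[]{1000, 100, 10, 10}, 10, 10)); // true
        System.out.println(invariantsHold(new int[]{10, 100}, 10, 10));           // false
    }
}
```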

AddIndexes() can be viewed as adding a sequence of segments S to
a sequence of segments T. Segments in T follow the invariants but
segments in S may not since they could come from multiple indexes.
Here is the merge algorithm for addIndexes():

1. Flush ram segments.

2. Consider a combined sequence with segments from T followed
by segments from S (same as current addIndexes()).

3. Assume the highest level for segments in S is h. Call maybeMergeSegments(),
but instead of starting w/ lowerBound = -1 and upperBound = maxBufferedDocs,
start w/ lowerBound = -1 and upperBound = upperBound of level h.
After this, the invariants are guaranteed except for the last < M segments
whose levels <= h.

4. If the invariants hold for the last < M segments whose levels <= h, done.
Otherwise, simply merge those segments. If the merge results in
a segment of level <= h, done. Otherwise, it's of level h+1 and call
maybeMergeSegments() starting w/ upperBound = upperBound of level h+1.

Suggestions?

> Optimization for IndexWriter.addIndexes()
> -
>
> Key: LUCENE-528
> URL: http://issues.apache.org/jira/browse/LUCENE-528
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Tamm
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: AddIndexes.patch
>
>




[jira] Commented: (LUCENE-686) Resources not always reclaimed in scorers after each search

2006-10-17 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-686?page=comments#action_12442987 ] 

Ning Li commented on LUCENE-686:


> Is there an actual memory leak problem related to this? 

Right now, no. For example, in FS-based directories, the index inputs that
term docs use are clones, and close() on a cloned index input does not close
the file descriptor; only the original one does.

However, a resource leak could happen in a new subclass of Directory and
IndexInput if cloned instances hold resources that need reclaiming. In
addition, a leak could happen in a new subclass of Scorer if there are
resources associated with the scorer which should be reclaimed when done.

> In ReqExclScorer the two scorers can also be closed when they are set to 
> null. 

Thanks for pointing this out. I'll double check all scorers and make sure 
close() are properly called.

> It's probably better to use try/finally in IndexSearcher and call close in
> the finally clause; exceptions are occasionally used to prematurely end a
> search, although not in the Lucene core afaik.

Will do. Thanks again!

Cheers,
Ning

> Resources not always reclaimed in scorers after each search
> ---
>
> Key: LUCENE-686
> URL: http://issues.apache.org/jira/browse/LUCENE-686
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
> Environment: All
>Reporter: Ning Li
> Attachments: ScorerResourceGC.patch
>
>
> Resources are not always reclaimed in scorers after each search.
> For example, close() is not always called for term docs in TermScorer.
> A test will be attached to show when resources are not reclaimed.




[jira] Updated: (LUCENE-686) Resources not always reclaimed in scorers after each search

2006-10-17 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-686?page=all ]

Ning Li updated LUCENE-686:
---

Attachment: ScorerResourceGC.patch

A patch is attached:
  - The patch is based on the latest version from trunk.
  - The patch includes a test called TestScorerResourceGC which shows that
resources are not reclaimed after each search without the patch.
  - The patch passes TestScorerResourceGC.
  - The patch passes all the unit tests.

> Resources not always reclaimed in scorers after each search
> ---
>
> Key: LUCENE-686
> URL: http://issues.apache.org/jira/browse/LUCENE-686
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
> Environment: All
>Reporter: Ning Li
> Attachments: ScorerResourceGC.patch
>
>




[jira] Created: (LUCENE-686) Resources not always reclaimed in scorers after each search

2006-10-17 Thread Ning Li (JIRA)
Resources not always reclaimed in scorers after each search
---

 Key: LUCENE-686
 URL: http://issues.apache.org/jira/browse/LUCENE-686
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
 Environment: All
Reporter: Ning Li


Resources are not always reclaimed in scorers after each search.

For example, close() is not always called for term docs in TermScorer.

A test will be attached to show when resources are not reclaimed.




[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-21 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: NewIndexModifier.Sept21.patch

This is to update the delete-support patch after the commit of the new merge 
policy.
  - Very few changes to IndexWriter.
  - The patch passes all tests.
  - A new test called TestNewIndexModifierDelete is added to show different
scenarios when using the delete methods in NewIndexModifier.

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, 
> IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, 
> NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, 
> NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, 
> newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, 
> perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We
> are aware that this kind change has come up in discusions before. See
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> . The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> ---
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that, with this change it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
> Coding Changes
> --
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document are properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines of coding changes are needed. All
> changes are commented by "CHANGE". We have also attached a modified version
> of an example from Chapter 2.2 of Lucene in Action.
> Performance Results
> ---
> To test the performance our proposed changes, we ran some experiments using
> the TREC WT 10G dataset. The experiments were run on a dual 2.4 Ghz Intel
> Xeon server running Linux. The disk storage was configured as RAID0 array
> with 5 drives. Before indexes were built, the input documents were parsed
> to remove the HTML from them (i.e., only the text was indexed). This was
> done to minimize the impact of parsing on performance. A simple
> WhitespaceAnalyzer was used during index build.
> We experimented with three workloads:
>   - Insert only. 1.6M documents were inserted and the final
> index size was 2.3GB.
>   - Insert/delete (big batches). The same documents were
> inserted, but 25% were deleted. 1000 documents were
> deleted for every 4000 inserted.
>   - Insert/delete (small batches). In this case, 5 documents
> were deleted for every 20 inserted.
> current   current  new
> Workload  IndexWriter  IndexModifier   IndexWriter
> ---
> Insert only 116 min   119 min116 min
> Insert/delete (big batches)   --  135 min125 min
> Insert/delete (small batches) --  338 min134 min
> As the experiments show, with the proposed changes, the performance
> improved by 60% when inserts and deletes were interleaved in small batches.

[jira] Commented: (LUCENE-672) new merge policy

2006-09-18 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-672?page=comments#action_12435571 ] 

Ning Li commented on LUCENE-672:


> Should lowerBound start off as -1 in maybeMergeSegments if we keep 0 sized 
> segments?

Good catch! Although the rightmost disk segment cannot be a 0-sized segment
right now, it could be once NewIndexModifier is in.

Should I submit a new patch?


> new merge policy
> 
>
> Key: LUCENE-672
> URL: http://issues.apache.org/jira/browse/LUCENE-672
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.0.0
>Reporter: Yonik Seeley
> Assigned To: Yonik Seeley
> Fix For: 2.1
>
>
> New merge policy developed in the course of 
> http://issues.apache.org/jira/browse/LUCENE-565
> http://issues.apache.org/jira/secure/attachment/12340475/newMergePolicy.Sept08.patch




[jira] Commented: (LUCENE-672) new merge policy

2006-09-15 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-672?page=comments#action_12435174 ] 

Ning Li commented on LUCENE-672:


A small fix named KeepDocCount0Segment.Sept15.patch is attached to LUCENE-565 
(can't attach here).

In mergeSegments(...), if the doc count of a merged segment is 0, it is not 
added to the index (it should be properly cleaned up). Before LUCENE-672, a 
merged segment was always added to the index. The use of mergeSegments(...) in, 
e.g. addIndexes(Directory[]), assumed that behaviour. For code simplicity, this 
fix restores the old behaviour that a merged segment is always added to the 
index. This does NOT break any of the good properties of the new merge policy.

TestIndexWriterMergePolicy is slightly modified to fix a bug and to check that
segments are properly cleaned up. The patch passes all the tests.

> new merge policy
> 
>
> Key: LUCENE-672
> URL: http://issues.apache.org/jira/browse/LUCENE-672
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.0.0
>Reporter: Yonik Seeley
> Assigned To: Yonik Seeley
> Fix For: 2.1
>
>




[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-08 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: newMergePolicy.Sept08.patch

This patch features the new, more robust merge policy. A reference for the new
policy is at http://www.gossamer-threads.com/lists/lucene/java-dev/35147
  - The patch passes all the tests except one in TestIndexModifier (see an
earlier comment on this issue).
  - Since that test itself has a problem, it is fixed (a one-line change) and
the patch passes the fixed test.
  - A new test called TestIndexWriterMergePolicy is included, which shows the
robustness of the new merge policy.


The following is a detailed description of the new merge policy and its 
properties.

 Overview of merge policy:

 A flush is triggered either by close() or by the number of ram segments
 reaching maxBufferedDocs. After a disk segment is created by the flush,
 further merges may be triggered.

 LowerBound and upperBound set the limits on the doc count of a segment
 which may be merged. Initially, lowerBound is set to 0 and upperBound
 to maxBufferedDocs. Starting from the rightmost* segment whose doc count
 > lowerBound and <= upperBound, count the number of consecutive segments
 whose doc count <= upperBound.

 Case 1: number of worthy segments < mergeFactor, no merge, done.
 Case 2: number of worthy segments == mergeFactor, merge these segments.
 If the doc count of the merged segment <= upperBound, done.
 Otherwise, set lowerBound to upperBound, and multiply upperBound
 by mergeFactor, go through the process again.
 Case 3: number of worthy segments > mergeFactor (in the case mergeFactor
 M changes), merge the leftmost* M segments. If the doc count of
 the merged segment <= upperBound, consider the merged segment for
 further merges on this same level. Merge the now leftmost* M
 segments, and so on, until number of worthy segments < mergeFactor.
 If the doc count of all the merged segments <= upperBound, done.
 Otherwise, set lowerBound to upperBound, and multiply upperBound
 by mergeFactor, go through the process again.
 Note that case 2 can be considered as a special case of case 3.

 This merge policy guarantees two invariants if M does not change and
 segment doc count is not reaching maxMergeDocs:
 B for maxBufferedDocs, f(n) defined as ceil(log_M(ceil(n/B)))
  1: If i (left*) and i+1 (right*) are two consecutive segments of doc
 counts x and y, then f(x) >= f(y).
  2: The number of committed segments on the same level (f(n)) <= M.
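A toy simulation of the cascading merge described above: each flush adds one
B-doc segment, and mergeSegments() is modeled by summing doc counts. This is
an assumed simplification (it ignores lowerBound bookkeeping and case 3), not
the actual IndexWriter code:

```java
import java.util.ArrayList;
import java.util.List;

public class MergePolicySim {
    // Each flush adds a B-doc segment; whenever M trailing segments fit
    // under upperBound, they merge into one (summed) segment, and the
    // result may cascade to the next level (case 2 above).
    static List<Integer> flushAndMerge(int flushes, int B, int M) {
        List<Integer> segs = new ArrayList<>();
        for (int i = 0; i < flushes; i++) {
            segs.add(B);
            int upperBound = B;
            while (true) {
                int worthy = 0;   // trailing segments with doc count <= upperBound
                for (int j = segs.size() - 1; j >= 0 && segs.get(j) <= upperBound; j--)
                    worthy++;
                if (worthy < M) break;               // case 1: no merge
                int merged = 0;                      // case 2: merge M segments
                for (int j = 0; j < M; j++) merged += segs.remove(segs.size() - 1);
                segs.add(merged);
                if (merged <= upperBound) break;
                upperBound *= M;                     // go up a level
            }
        }
        return segs;
    }

    public static void main(String[] args) {
        // With B = 10, M = 10, 100 flushes cascade into one 1000-doc segment.
        System.out.println(flushAndMerge(100, 10, 10)); // [1000]
    }
}
```

Running it with 25 flushes instead gives [100, 100, 10, 10, 10, 10, 10], which
satisfies both invariants above.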


> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, 
> IndexWriter.patch, NewIndexModifier.July09.patch, NewIndexWriter.Aug23.patch, 
> NewIndexWriter.July18.patch, newMergePolicy.Sept08.patch, perf-test-res.JPG, 
> perf-test-res2.JPG, perfres.log, TestBufferedDeletesPerf.java, 
> TestWriterDelete.java
>
>

[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-08-23 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12430130 ] 

Ning Li commented on LUCENE-565:


Doron, thank you very much for the review! I want to briefly comment
on one of your comments:

> (5) deleteDocument(int doc) not implemented

I deliberately left that one out. This is because document ids change as
documents are deleted and segments are merged. Users don't know exactly when
segments are merged, and thus when ids change, when using IndexModifier. So I
don't think it should be supported in IndexModifier at all.


> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, 
> IndexWriter.patch, NewIndexModifier.July09.patch, NewIndexWriter.Aug23.patch, 
> NewIndexWriter.July18.patch, TestWriterDelete.java
>
>

[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-08-23 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: NewIndexWriter.Aug23.patch

> Yes I am including this patch as it is very useful for increasing
> the efficiency of updates as you described.  I will be conducting
> more tests and will post any results.  Yes a patch for IndexWriter
> will be useful so that the entirety of this build will work.
> Thanks!

I've attached a patch that works with the current code. The
implementation of IndexWriter and NewIndexModifier is the same as
the last patch. I removed the "singleDocSegmentsCount" optimization
from this patch since my IndexWriter checks singleDocSegmentsCount
by simply calling ramSegmentInfos.size().

This patch had evolved with the help of many good discussions
(thanks!) since it came out in May. Here is the current state of
the patch:
  - This patch aims at enabling users to do inserts and general
deletes (delete-by-term, and later delete-by-query) without
switching between writers and readers.
  - The goal is achieved by rewriting IndexWriter in such a way
that semantically it's the same as before, but it provides
extension points so that delete-by-term, delete-by-query, and
more functionality can be easily supported in a subclass.
  - NewIndexModifier extends IndexWriter and supports delete-by-term
by simply overriding two methods: toFlushRamSegment() which
decides if a flush should happen, and doAfterFlushRamSegments()
which does proper work after a flush is done.

Suggestions are welcome! Especially those that may help it get
committed. :-)

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, 
> IndexWriter.patch, NewIndexModifier.July09.patch, NewIndexWriter.Aug23.patch, 
> NewIndexWriter.July18.patch, TestWriterDelete.java
>
>

[jira] Commented: (LUCENE-528) Optimization for IndexWriter.addIndexes()

2006-08-16 Thread Ning Li (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-528?page=comments#action_12428478 ] 

Ning Li commented on LUCENE-528:


In an email thread titled "LUCENE-528 and 565", I described a weakness of the 
proposed solution:

"I'm totally for a version of addIndexes() where optimize() is not always 
called. However, with the one proposed in the patch, we could end up with an 
index where: segment 0 has 1000 docs, 1 has 2000, 2 has 4000, 3 has 8000, etc. 
while Lucene desires the reverse. Or we could have a sandwich index where: 
segment 0 has 4000 docs, 1 has 100, 2 has 100, 3 has 4000. While neither of 
these will occur if you use addIndexesNoOpt() carefully, there should be a more 
robust merge policy."

Here is an alternative solution which merges segments so that the docCount of
segment i is at least twice as big as the docCount of segment i+1. If we are
willing to make it a bit more complicated, we can take the merge factor into
consideration.


  public synchronized void addIndexesNoOpt(Directory[] dirs) throws IOException {
    for (int i = 0; i < dirs.length; i++) {
      SegmentInfos sis = new SegmentInfos(); // read infos from dir
      sis.read(dirs[i]);
      for (int j = 0; j < sis.size(); j++) {
        segmentInfos.addElement(sis.info(j)); // add each info
      }
    }

    int start = 0;
    int docCountFromStart = docCount();

    while (start < segmentInfos.size()) {
      int end;
      int docCountToMerge = 0;

      if (docCountFromStart <= minMergeDocs) {
        // if the total docCount of the remaining segments
        // is lte minMergeDocs, merge all of them
        end = segmentInfos.size() - 1;
        docCountToMerge = docCountFromStart;
      } else {
        // otherwise, merge some segments so that the docCount
        // of these segments is at least half of the remaining
        for (end = start; end < segmentInfos.size(); end++) {
          docCountToMerge += segmentInfos.info(end).docCount;
          if (docCountToMerge >= docCountFromStart / 2) {
            break;
          }
        }
      }

      mergeSegments(start, end + 1);
      start++;
      docCountFromStart -= docCountToMerge;
    }
  }
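To see how the loop above shapes an index, the merge decisions can be simulated on plain doc counts. The sketch below is a toy model, not the patch itself: `simulate` stands in for the `segmentInfos` bookkeeping, and collapsing a list range stands in for `mergeSegments(start, end + 1)`. It shows that each merged segment ends up holding at least as many docs as everything after it, which is the robustness the comment is after.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of the addIndexesNoOpt() merge loop: repeatedly merge a
// prefix of the remaining segments whose combined docCount is at least half
// of the docCount remaining from 'start'. Names are illustrative only.
public class MergePolicySim {
    static List<Integer> simulate(List<Integer> segmentDocCounts, int minMergeDocs) {
        List<Integer> infos = new ArrayList<>(segmentDocCounts);
        int start = 0;
        int fromStart = infos.stream().mapToInt(Integer::intValue).sum();
        while (start < infos.size()) {
            int end;
            int toMerge = 0;
            if (fromStart <= minMergeDocs) {
                // few docs left: merge all remaining segments
                end = infos.size() - 1;
                toMerge = fromStart;
            } else {
                // take segments until they cover half of what remains
                for (end = start; end < infos.size(); end++) {
                    toMerge += infos.get(end);
                    if (toMerge >= fromStart / 2) break;
                }
            }
            // stand-in for mergeSegments(start, end + 1): collapse [start, end]
            for (int i = end; i > start; i--) infos.remove(i);
            infos.set(start, toMerge);
            start++;
            fromStart -= toMerge;
        }
        return infos;
    }
}
```

Running it on the "sandwich" index from the quoted email (4000, 100, 100, 4000) folds the small middle segments into the first one, avoiding the pathological layouts the email warns about.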


> Optimization for IndexWriter.addIndexes()
> -
>
> Key: LUCENE-528
> URL: http://issues.apache.org/jira/browse/LUCENE-528
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Tamm
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: AddIndexes.patch
>
>
> One big performance problem with IndexWriter.addIndexes() is that it has to 
> optimize the index both before and after adding the segments.  When you have 
> a very large index, to which you are adding batches of small updates, these 
> calls to optimize make using addIndexes() impossible.  It makes parallel 
> updates very frustrating.
> Here is an optimized function that helps out by calling mergeSegments only on 
> the newly added documents.  It will try to avoid calling mergeSegments until 
> the end, unless you're adding a lot of documents at once.
> I also have an extensive unit test that verifies that this function works 
> correctly if people are interested.  I gave it a different name because it 
> has very different performance characteristics which can make querying take 
> longer.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-18 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: NewIndexWriter.July18.patch

Hopefully, third time's a charm. :-)

I rewrote IndexWriter in such a way that semantically it's the same as before, 
but it provides extension points so that delete-by-term, delete-by-query, and 
more functionalities can be easily supported in a subclass. NewIndexModifier is 
such a subclass that supports delete-by-term.

Here is an overview of the changes:

Changes to IndexWriter
Changes to IndexWriter variables:
  - segmentInfos used to store the info of all segments (on disk or in ram).
    Now it only stores the info of segments on disk.
  - ramSegmentInfos is a new variable which stores the info of just ram segments.
Changes to IndexWriter methods:
  - addDocument()
The info of the new ram segment is added to ramSegmentInfos.
  - maybeMergeSegments()
toFlushRamSegments() is called at the beginning to decide whether a flush 
should take place.
  - flushRamSegments()
doAfterFlushRamSegments() is called after all ram segments are merged and 
flushed to disk.

NewIndexModifier
New variables:
  - bufferedDeleteTerms is a new variable which buffers delete terms
before they are applied.
  - maxBufferedDeleteTerms is similar to maxBufferedDocs. It controls
the max number of delete terms that can be buffered before they
must be flushed to disk.
Overloaded/new methods:
  - deleteDocuments(), batchDeleteDocuments()
The terms are added to bufferedDeleteTerms. bufferedDeleteTerms
also records the current number of documents buffered in ram,
so the delete terms can be applied to ram segments as well as
the segments on disk.
  - toFlushRamSegments()
In IndexWriter, a flush would be triggered only if enough documents were
buffered. Now a flush is triggered if enough documents are
buffered OR if enough delete terms are buffered.
  - doAfterFlushRamSegments()
Step 1: Apply buffered delete terms to all the segments on disk.
Step 2: Apply buffered delete terms to the new segment appropriately,
so that a delete term is only applied to the documents
buffered before it, but not to those buffered after it.
Step 3: Clean up the buffered delete terms.
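The key subtlety in Step 2 is that each buffered delete term must remember how many documents were buffered in ram when it arrived, so it deletes only the documents added before it. A toy model of that bookkeeping (the names `bufferDelete` and `applyToNewSegment` are illustrative, not the actual patch API):

```java
import java.util.*;

// Toy model of doAfterFlushRamSegments(): each delete term records how many
// docs were buffered when it arrived, so it only deletes docs added before it.
public class BufferedDeletes {
    // term -> number of ram-buffered docs at the time deleteDocuments(term) was called
    final Map<String, Integer> buffered = new LinkedHashMap<>();

    void bufferDelete(String term, int docsBufferedSoFar) {
        buffered.put(term, docsBufferedSoFar);
    }

    // newSegmentDocs.get(i) = the terms of the i-th doc in the freshly flushed segment;
    // returns the doc ids the buffered deletes would remove from that segment
    List<Integer> applyToNewSegment(List<Set<String>> newSegmentDocs) {
        List<Integer> deleted = new ArrayList<>();
        for (Map.Entry<String, Integer> e : buffered.entrySet()) {
            // Step 2: only docs buffered *before* this delete are candidates
            for (int doc = 0; doc < e.getValue(); doc++) {
                if (newSegmentDocs.get(doc).contains(e.getKey())) deleted.add(doc);
            }
        }
        buffered.clear(); // Step 3: clean up the buffered delete terms
        return deleted;
    }
}
```

For example, if doc 0 and doc 2 both match term "a" but the delete for "a" was buffered when only two docs were in ram, only doc 0 is deleted; doc 2 (added after the delete) survives.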

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, 
> IndexWriter.patch, NewIndexModifier.July09.patch, 
> NewIndexWriter.July18.patch, TestWriterDelete.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We
> are aware that this kind of change has come up in discussions before. See
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> . The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> ---
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that, with this change it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
> Coding Changes
> --
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document are properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines
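The quoted proposal notes that, once deletes go through IndexWriter, an update method is just a delete-by-term followed by an add. A toy model of that semantics (the class and method names are illustrative; the actual patch only adds deleteDocuments):

```java
import java.util.*;

// Toy model of the proposed updateDocument(): with adds and deletes going
// through the same writer, an update is delete-by-term then add, and the
// interleaving of operations is preserved. Not Lucene's real API.
public class UpdateModel {
    // a stand-in "index" keyed by the unique term (e.g. a doc-id field)
    private final Map<String, String> index = new LinkedHashMap<>();

    public void addDocument(String idTerm, String body) { index.put(idTerm, body); }
    public void deleteDocuments(String idTerm) { index.remove(idTerm); }

    // update = delete of the old version followed by add of the new one
    public void updateDocument(String idTerm, String body) {
        deleteDocuments(idTerm);
        addDocument(idTerm, body);
    }

    public String get(String idTerm) { return index.get(idTerm); }
    public int numDocs() { return index.size(); }
}
```

With the old two-object workflow, the same update would require closing the writer and opening a reader just to perform the delete.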

[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-09 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: IndexWriter.July09.patch
NewIndexModifier.July09.patch

Hi Otis,

I've attached two patch files:
  - IndexWriter.July09.patch is an updated version of the old patch.
  - NewIndexModifier.July09.patch makes minimal changes to IndexWriter and puts 
new functionalities in a new class called NewIndexModifier. I didn't name it 
IndexModifier because the two are unrelated and I don't want a diff of the two.

All unit tests succeeded except the following one:
[junit] Testcase: testIndex(org.apache.lucene.index.TestIndexModifier): 
FAILED
[junit] expected:<3> but was:<4>
[junit] junit.framework.AssertionFailedError: expected:<3> but was:<4>
[junit] at 
org.apache.lucene.index.TestIndexModifier.testIndex(TestIndexModifier.java:67)

However, the unit test has a problem, not the patch: IndexWriter's docCount() 
does not tell the actual number of documents in an index, only IndexReader's 
numDocs() does. For example, in a similar test below, where 10 documents are 
added, then 1 deleted, then 2 added, the last call to docCount() returns 12, 
not 11, with or without the patch.

  public void testIndexSimple() throws IOException {
    Directory ramDir = new RAMDirectory();
    IndexModifier i = new IndexModifier(ramDir, new StandardAnalyzer(), true);
    // add 10 documents initially
    for (int count = 0; count < 10; count++) {
      i.addDocument(getDoc());
    }
    i.flush();
    i.optimize();
    assertEquals(10, i.docCount());
    i.deleteDocument(0);
    i.flush();
    assertEquals(9, i.docCount());
    i.addDocument(getDoc());
    i.addDocument(getDoc());
    i.flush();
    assertEquals(12, i.docCount());
  }

The reason for the docCount() difference in the unit test (which does not 
affect the correctness of the patch) is that flushRamSegments() in the patch 
merges all and only the segments in ram and writes them to disk, whereas the 
original flushRamSegments() merges not only the segments in ram but *sometimes* 
also one segment from disk (see in that function the comment "// add one FS 
segment?").

Regards,
Ning

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
>  Key: LUCENE-565
>  URL: http://issues.apache.org/jira/browse/LUCENE-565
>  Project: Lucene - Java
> Type: Bug

>   Components: Index
> Reporter: Ning Li
>  Attachments: IndexWriter.July09.patch, IndexWriter.java, IndexWriter.patch, 
> NewIndexModifier.July09.patch, TestWriterDelete.java
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We
> are aware that this kind of change has come up in discussions before. See
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> . The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> ---
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that, with this change it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
> Coding Changes
> --
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document are properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines of coding changes are needed. All
> changes are commented by "CHANGE". We have also attached a modified version
> of an example from Chapter 2.2 of Lucene in Action.
> Performance Resul