
[jira] [Commented] (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2011-10-28 Thread Roman Alekseenkov (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13138649#comment-13138649
 ] 

Roman Alekseenkov commented on LUCENE-2680:
---

Hey, was this ported to 3.x, or not really?


> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch
>
>
> IndexWriter buffers up all deletes (by Term and Query) and only
> applies them if 1) commit or NRT getReader() is called, or 2) a merge
> is about to kickoff.
> We do this because, for a large index, it's very costly to open a
> SegmentReader for every segment in the index.  So we defer as long as
> we can.  We do it just before merge so that the merge can eliminate
> the deleted docs.
> But, most merges are small, yet in a big index we apply deletes to all
> of the segments, which is really very wasteful.
> Instead, we should only apply the buffered deletes to the segments
> that are about to be merged, and keep the buffer around for the
> remaining segments.
> I think it's not so hard to do; we'd have to have generations of
> pending deletions, because the newly merged segment doesn't need the
> same buffered deletions applied again.  So every time a merge kicks
> off, we pinch off the current set of buffered deletions, open a new
> set (the next generation), and record which segment was created as of
> which generation.
> This should be a very sizable gain for large indices that mix
> deletes, though, less so in flex since opening the terms index is much
> faster.
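
The generation idea in the description above can be sketched roughly as follows. This is a toy model only, not Lucene's implementation: the class GenerationalDeletes and its methods bufferDelete, addSegment, pendingFor and merge are hypothetical names, deletes are plain strings, and for simplicity every new segment (flushed or merged) pinches off the current generation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of generational buffered deletes; not Lucene's actual classes. */
final class GenerationalDeletes {
    // One list per generation: the delete terms buffered while that generation was open.
    private final List<List<String>> generations = new ArrayList<>();
    // Segment name -> first generation whose deletes the segment still needs.
    private final Map<String, Integer> segmentGen = new HashMap<>();

    GenerationalDeletes() {
        generations.add(new ArrayList<>());
    }

    /** Buffer a delete-by-term in the currently open generation. */
    void bufferDelete(String term) {
        generations.get(generations.size() - 1).add(term);
    }

    /** Close the current generation, open the next one, and return its index. */
    private int pinchOff() {
        generations.add(new ArrayList<>());
        return generations.size() - 1;
    }

    /** Record a new segment as of a fresh generation (assumption: earlier deletes do not apply to it). */
    void addSegment(String name) {
        segmentGen.put(name, pinchOff());
    }

    /** Deletes still pending for one segment: every generation from its recorded one onward. */
    List<String> pendingFor(String segment) {
        List<String> pending = new ArrayList<>();
        for (int g = segmentGen.get(segment); g < generations.size(); g++) {
            pending.addAll(generations.get(g));
        }
        return pending;
    }

    /** Apply deletes only to the merging segments; the merged segment starts at a new generation. */
    List<String> merge(List<String> toMerge, String mergedName) {
        List<String> applied = new ArrayList<>();
        for (String seg : toMerge) {
            applied.addAll(pendingFor(seg)); // one entry per (segment, delete) application
            segmentGen.remove(seg);
        }
        segmentGen.put(mergedName, pinchOff());
        return applied;
    }
}
```

Segments that are not part of the merge keep their recorded generation, so their pending deletes stay buffered. The sketch never prunes generations that no segment references any more; a real implementation would presumably have to.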

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2011-10-28 Thread Robert Muir (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13138661#comment-13138661
 ] 

Robert Muir commented on LUCENE-2680:
-

Hi, this was backported; it has been available since Lucene 3.1.


[jira] [Commented] (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2011-10-28 Thread Roman Alekseenkov (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13138695#comment-13138695
 ] 

Roman Alekseenkov commented on LUCENE-2680:
---

Thank you, Robert.

I was asking because we are having issues with 3.4.0 where applyDeletes() takes 
a large amount of time on commit for a 150GB index, and this is stopping all 
indexing threads. It looks like applyDeletes() is re-scanning the entire index, 
even though it's unnecessary as we are only adding documents to the index but 
not deleting them.

If this optimization was backported, then I will probably have to find a 
solution for my problem elsewhere...


[jira] [Commented] (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2011-10-28 Thread Robert Muir (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13138700#comment-13138700
 ] 

Robert Muir commented on LUCENE-2680:
-

{quote}
even though it's unnecessary as we are only adding documents to the index but 
not deleting them
{quote}

Hi Roman, I saw your post.

I think by default when you add a document with unique id X, Solr does a 
delete-by-term on X.

But I'm pretty sure it has an option (sorry, I don't know what it is) where you 
can tell it that you are sure the documents you are adding are new, and it 
won't do this.



[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-10-02 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917174#action_12917174
 ] 

Michael McCandless commented on LUCENE-2680:


Hmm... I think there's another silliness going on inside IW: when applying 
deletes, we open each SR one by one, apply deletes, and close it.

But then immediately thereafter we open the N segments to be merged.

We should somehow not do this double open, eg, use the pool temporarily, so 
that the reader is opened to apply deletes, and then kept open in order to do 
the merging.  Using the pool should be fine because the merge forcefully evicts 
the sub readers from the pool after completion.
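
A rough illustration of the "open once, reuse for the merge" idea, as a toy model: PooledReader and ReaderPool here are hypothetical stand-ins, not IndexWriter's actual reader pool, and the opens counter exists only to make the saved open visible.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a pooled segment reader; not Lucene's SegmentReader. */
final class PooledReader {
    final String segment;
    boolean open = true;

    PooledReader(String segment) {
        this.segment = segment;
    }
}

/** Toy pool: a reader opened to apply deletes stays open for the merge that follows. */
final class ReaderPool {
    private final Map<String, PooledReader> readers = new HashMap<>();
    int opens = 0; // how many times a reader was actually opened

    /** Return the pooled reader for a segment, opening it only on first use. */
    PooledReader get(String segment) {
        return readers.computeIfAbsent(segment, s -> {
            opens++;
            return new PooledReader(s);
        });
    }

    /** Evict and close a reader, as the merge would do after completion. */
    void evict(String segment) {
        PooledReader r = readers.remove(segment);
        if (r != null) {
            r.open = false;
        }
    }
}
```

With this shape, applying deletes and then merging the same segment calls get() twice but opens the reader once; the eviction at the end of the merge keeps the pool from holding readers longer than needed.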


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-10-09 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919549#action_12919549
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Maybe we should implement this as pending deletes per segment rather than using 
a generational system because with LUCENE-2655, we may need to maintain the per 
query/term docidupto per segment.  The downside is the extraneous memory 
consumed by the hash map, however, if we use BytesRefHash this'll be reduced, 
or would it?  Because we'd be writing the term bytes to a unique byte pool per 
segment?  Hmm... Maybe there's a more efficient way.


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-10-11 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919780#action_12919780
 ] 

Michael McCandless commented on LUCENE-2680:


Tracking per-segment would be easier but I worry about indices that have large 
numbers of segments... eg w/ a large mergeFactor and frequent flushing you can 
get very many segments.

So if we track per-segment, suddenly the RAM required (as well as CPU cost of 
copying these deletions to the N segments) is multiplied by the number of 
segments.


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-02 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927488#action_12927488
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

bq. The most recent one "wins", and we should do only one delete (per segment) 
for that term.

How should we define this recency and why does it matter?  Should it be per 
term/query or for the entire BD?

I think there's an issue with keeping lastSegmentIndex in DW 
(DocumentsWriter): while it's easy to maintain there, Mike had mentioned 
keeping the lastSegmentIndex per BufferedDeletes (BD) object.  Coalescing the 
BDs after a successful merge should be easier than maintaining a separate BD 
for them.  We'll see.

I'll put together another patch with these changes.


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-03 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927943#action_12927943
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

I'm redoing things a bit to take into account the concurrency of merges.  For 
example, if a merge fails, we must not have already removed the deletes that 
still need to be applied to those segments.  Also, probably the trickiest part 
is that lastSegmentIndex could have changed since a merge started, which means 
we need to be careful about how and which deletes we coalesce.


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-03 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927979#action_12927979
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Another use case that can be wacky is when commit is called and a merge 
finishes just before or after it.  In that case all (point-in-time) deletes 
will have been applied by commit; however, do we want to clear all per-segment 
deletes at the end of commit?  This would blank out deletes being applied by 
the merge, most of which should be cleared out; however, if new deletes arrived 
during the commit (is this possible?), then we want these to be attached to 
segments and not lost.  I guess we want to clear out deletes, sync'd on DW, in 
the applyDeletesAll (ADA) method.  ADA will apply those deletes; any incoming 
ones will queue up and be shuffled around.


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-03 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928075#action_12928075
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

There's an issue in that we're redundantly applying deletes in the 
applyDeletesAll (ADA) case, because the deletes may have already been applied 
to a segment when a merge happened, i.e., by applyDeletesToSegments.  In the 
ADA case we need to use applyDeletesToSegments up to the segment point when the 
buffered deletes can be used.


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-03 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928078#action_12928078
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

This brings up another issue: we're blindly iterating over docs in a segment 
reader to delete, even when we can know ahead of time (from the max doc of the 
reader) that the reader's docs are going to exceed the term/query's 
docid-upto.  In applyDeletes we're opening a term docs iterator, though I think 
we're breaking at the first doc and moving on if the docid-upto is exceeded.  
Opening this term docs iterator can be skipped.
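
A minimal sketch of the check described above, under stated assumptions: doc IDs are global across segments, a buffered delete applies only to doc IDs strictly below its docid-upto, and the names DeleteSkipCheck, canSkipSegment and segmentsToVisit are made up for illustration and do not exist in Lucene.

```java
/** Hypothetical helper; illustrates skipping segments that lie entirely past docid-upto. */
final class DeleteSkipCheck {
    /**
     * @param docBase   global doc ID of the segment's first document
     *                  (the sum of maxDoc over all earlier segments)
     * @param docIDUpto the buffered delete applies only to global doc IDs below this limit
     * @return true if no document in the segment can be affected by the delete
     */
    static boolean canSkipSegment(int docBase, int docIDUpto) {
        return docBase >= docIDUpto;
    }

    /** Count the segments that would still need a term docs iterator for one buffered delete. */
    static int segmentsToVisit(int[] maxDocs, int docIDUpto) {
        int docBase = 0;
        int visited = 0;
        for (int maxDoc : maxDocs) {
            if (!canSkipSegment(docBase, docIDUpto)) {
                visited++;
            }
            docBase += maxDoc; // the next segment starts after this one's documents
        }
        return visited;
    }
}
```

For segments with maxDoc 100, 50 and 10 and a docid-upto of 120, the third segment starts at doc 150 and can be skipped without opening anything.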


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-05 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928923#action_12928923
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

All tests pass except 
org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange.  
Odd.  I'm looking into this.

{code}
[junit] junit.framework.AssertionFailedError: maxMergeDocs=2147483647; 
numSegments=11; upperBound=10; mergeFactor=10; 
segs=_65:c5950 _5t:c10->_32 _5u:c10->_32 _5v:c10->_32 _5w:c10->_32 _5x:c10->_32 
_5y:c10->_32 _5z:c10->_32 _60:c10->_32 _61:c10->_32 _62:c3->_32 _64:c7->_62
{code}

Also, in IW deleteDocument(*) we're calling a new method, getSegmentInfos, 
which is synced on IW.  Maybe we should use an atomic reference to a 
read-only segment infos instead?
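
The atomic-reference idea could look roughly like the sketch below. It is an illustration only: SegmentInfosHolder, publish and snapshot are hypothetical names, and a list of segment names stands in for the real segment infos. The writer publishes an immutable copy after each change, and readers take it without locking.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: a list of segment names stands in for segment infos.
final class SegmentInfosHolder {
  private final AtomicReference<List<String>> current =
      new AtomicReference<>(Collections.emptyList());

  // Lock-free read; the returned list is immutable and never changes.
  List<String> snapshot() {
    return current.get();
  }

  // Called by the writer after it changes the segment list: publishes a
  // fresh read-only copy so existing snapshots are unaffected.
  void publish(List<String> segments) {
    current.set(Collections.unmodifiableList(new ArrayList<>(segments)));
  }
}
```

A reader that holds an old snapshot sees a consistent but possibly stale view, so this removes the lock from the read path without by itself guaranteeing that the view is current.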

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-05 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928933#action_12928933
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Sorry, spoke too soon.  I made a small change in apply deletes all to avoid 
deleting redundantly, and TestStressIndexing2 is now breaking.  I think we need 
to "push" segment infos changes to DW as they happen.  I'm guessing that segment 
infos are being shuffled around, so the infos passed into DW in the IW deleteDoc 
methods may be out of date by the time deletes are attached to segments.  
Hopefully there aren't any lurking deadlock issues with this.

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-06 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929229#action_12929229
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Pushing the segment infos seems to have cleared up some of the failing tests; 
however, intermittently (about 1/4 of the time) the one below still fails.

I'm going to re-add lastSegmentInfo/Index, and assert that, if we're not using 
it, the deletes obtained from the segmentinfo -> deletes map are the same.  

{code}
[junit] Testsuite: org.apache.lucene.index.TestStressIndexing2
[junit] Testcase: testRandom(org.apache.lucene.index.TestStressIndexing2):  
FAILED
[junit] expected:<12> but was:<11>
[junit] junit.framework.AssertionFailedError: expected:<12> but was:<11>
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
[junit] at 
org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:278)
[junit] at 
org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:271)
[junit] at 
org.apache.lucene.index.TestStressIndexing2.testRandom(TestStressIndexing2.java:89)
{code}

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-06 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929247#action_12929247
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

I wasn't coalescing the merged segment's deletes.  With that implemented, 
TestStressIndexing2 ran successfully 49 of 50 times.  Below is the error:

{code}
[junit] Testsuite: org.apache.lucene.index.TestStressIndexing2
[junit] Testcase: 
testMultiConfig(org.apache.lucene.index.TestStressIndexing2): FAILED
[junit] expected:<5> but was:<4>
[junit] junit.framework.AssertionFailedError: expected:<5> but was:<4>
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
[junit] at 
org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:278)
[junit] at 
org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:271)
[junit] at 
org.apache.lucene.index.TestStressIndexing2.testMultiConfig(TestStressIndexing2.java:115)
{code}

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-06 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929254#action_12929254
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Putting a block synced on DW around the bulk of the segment alterations in IW 
commitMerge seems to have quelled the TestStressIndexing2 test failures.  Nice.

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-07 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929424#action_12929424
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

In DW abort (called by IW rollbackInternal) we should be able to simply clear 
all per-segment pending deletes.  However, I'm not sure we can do that: if we 
have applied deletes for a merge and then roll back, we can't undo those 
deletes, thereby breaking our current rollback model?

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-08 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929629#action_12929629
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

I'm running test-core multiple times and am seeing some lurking test
failures (thanks to the randomized tests that have been recently added).
I'm guessing they're related to the syncs on IW and DW not being in "sync"
some of the time. 

I will clean up the patch so that others may properly review it and
hopefully we can figure out what's going on. 

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-08 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929810#action_12929810
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

The problem could be that IW deleteDocument is not synced on IW.
When I tried adding the sync, there was a deadlock, perhaps from DW
waitReady. We could be adding pending deletes to segments that
are not quite current because we're not adding them in an IW
sync block.

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-08 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929927#action_12929927
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Ok, TestThreadedOptimize works when the DW sync'ed pushSegmentInfos method
isn't called anymore (no extra per-segment deleting is going on), and stops
working when pushSegmentInfos is turned back on. Something about the sync
on DW is causing a problem.  Hmm... We need another way to pass segment
infos around consistently. 

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-10 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930658#action_12930658
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

I think I've taken LUCENE-2680 as far as I can, though I'll
probably add some more assertions for good measure, such as
whether or not a delete has in fact been applied. It seems to be
working, though again I should add more assertions to that
effect. I think there's a niggling sync issue in there, as
TestThreadedOptimize only fails when I run it hundreds of
times. I think the sync on DW is causing a wait/notify to be
missed or skipped, as occasionally the isOptimized call fails
as well. This is likely related to deletes apparently not being
applied to some segment(s), as evidenced by the difference
between the actual doc count and the expected doc count.

Below is the most common assertion failure. Maybe I should
upload my patch that includes a method that iterates 200 times
on testThreadedOptimize?

{code}
[junit] -  ---
[junit] Testsuite: org.apache.lucene.index.TestThreadedOptimize
[junit] Testcase: 
testThreadedOptimize(org.apache.lucene.index.TestThreadedOptimize):   FAILED
[junit] expected:<248> but was:<266>
[junit] junit.framework.AssertionFailedError: expected:<248> but was:<266>
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
[junit] at 
org.apache.lucene.index.TestThreadedOptimize.runTest(TestThreadedOptimize.java:119)
[junit] at 
org.apache.lucene.index.TestThreadedOptimize.testThreadedOptimize(TestThreadedOptimize.java:141)
[junit] 
[junit] 
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 1.748 sec
{code}
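
A loop of the kind mentioned above (iterating a test a couple of hundred times to provoke an intermittent failure) might be sketched as follows. This is not the method from the patch: RepeatRunner, Trial and failingIterations are hypothetical names, and the sketch avoids any JUnit dependency by taking the test body as a callback.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for provoking intermittent failures; not part of Lucene.
final class RepeatRunner {
  // One run of the test body; the iteration number is passed for logging or seeding.
  interface Trial {
    void run(int iteration) throws Throwable;
  }

  // Runs the trial n times and returns the iterations that threw,
  // including assertion failures, instead of stopping at the first one.
  static List<Integer> failingIterations(int n, Trial trial) {
    List<Integer> failed = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      try {
        trial.run(i);
      } catch (Throwable t) {
        failed.add(i);
      }
    }
    return failed;
  }
}
```

Collecting every failing iteration, rather than stopping at the first, gives a rough failure rate, which helps when comparing a run before and after a suspected fix.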

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-10 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930845#action_12930845
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

I think I've isolated this test failure to recording the applied deletes.
Because we're using last segment index/info, I was adding deletes that may
or may not have been applied to a particular segment to the last segment
info. I'm not sure what to do in this case: if we record the applied
terms per segment but keep the pending terms in last segment info, we
gain nothing from last segment info, because we're then recording all of
the terms per segment anyway. In fact, this is how I isolated the issue:
I simply removed the usage of last segment info and went to maintaining
pending deletes per segment. I'll give it some thought.

In conclusion, when deletes are recorded per segment with no last segment
info, the test passes across 200 runs. 

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-10 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930949#action_12930949
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Alright, we needed to clone the per-segment pending deletes in
_mergeInit prior to the merge, like cloning the SRs. Other terms were
arriving after the deletes had been applied to a merge, so the coalescing
of applied deletes was incorrect. I believe that this was the remaining
lingering issue. The previous failures seem to have gone away; I ran the
test 400 times. I'll upload a new patch shortly.
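
The fix described above can be illustrated with a small sketch, assuming hypothetical names (PerSegmentDeletes, mergeInit, commitMerge) and plain strings for segments and terms; it is not the code from the patch. The merge works from a copy of each source segment's pending deletes taken at merge init, and at commit only the terms that arrived after that copy are carried over to the merged segment.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of per-segment pending deletes; not Lucene's classes.
final class PerSegmentDeletes {
  private final Map<String, Set<String>> pending = new HashMap<>();

  synchronized void addSegment(String seg) {
    pending.put(seg, new HashSet<>());
  }

  // Buffers a delete term against every live segment.
  synchronized void bufferDelete(String term) {
    for (Set<String> terms : pending.values()) {
      terms.add(term);
    }
  }

  // Merge init: copy each source's pending terms; the merge applies exactly this copy.
  synchronized Map<String, Set<String>> mergeInit(List<String> sources) {
    Map<String, Set<String>> snapshot = new HashMap<>();
    for (String seg : sources) {
      Set<String> terms = pending.get(seg);
      if (terms == null) {
        throw new IllegalArgumentException("unknown segment: " + seg);
      }
      snapshot.put(seg, new HashSet<>(terms));
    }
    return snapshot;
  }

  // Merge commit: terms buffered after the snapshot were not applied by the
  // merge, so they are coalesced onto the merged segment.
  synchronized void commitMerge(Map<String, Set<String>> applied, String merged) {
    Set<String> carried = new HashSet<>();
    for (Map.Entry<String, Set<String>> e : applied.entrySet()) {
      Set<String> late = pending.remove(e.getKey());
      if (late == null) {
        throw new IllegalStateException("segment vanished: " + e.getKey());
      }
      late.removeAll(e.getValue());
      carried.addAll(late);
    }
    pending.put(merged, carried);
  }

  synchronized Set<String> pendingFor(String seg) {
    return new HashSet<>(pending.get(seg));
  }
}
```

Without the copy, a term that arrives mid-merge would look as if it had already been applied and would be dropped at commit, which matches the symptom of too many documents remaining in the index.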

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-11 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931131#action_12931131
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

I'm still seeing the error no matter what I do. Sometimes the index is not
optimized, and sometimes there are too many docs. It requires thousands of
iterations to provoke either test error. Perhaps it's simply related to
merges that are scheduled but that IW close isn't properly waiting on.


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-11 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931133#action_12931133
 ] 

Michael McCandless commented on LUCENE-2680:


TestThreadedOptimize is a known intermittent failure -- I'm trying to track it 
down!!  (LUCENE-2618)


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-11 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931143#action_12931143
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Ah, nice, I should have looked for previous intermittent failures via Jira.  


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-15 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1293#action_1293
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Now that the intermittent failures have been successfully dealt with, ie,
LUCENE-2618, LUCENE-2576, and LUCENE-2118, I'll merge this patch to trunk,
then it's probably time for benchmarking. That'll probably include
something like indexing, then updating many documents and comparing the
index time vs. trunk? 


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-16 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932508#action_12932508
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Straight indexing and deleting will probably not show much of an
improvement from this patch. In trunk, apply deletes (all) is called on
all segments prior to a merge, so we need a synthetic way to measure the
improvement. One way is to monitor the merge time of small segments (of an
index with many deletes, and many existing large segments) with this patch
vs. trunk. This'll show that this patch in that case is faster (because
we're only applying deletes to the smaller segments). 

I think I'll add a merge start time variable to OneMerge that'll be set in
mergeinit. The var could also be useful for the info stream debug log. The
benchmark will simply print out the merge times (which'll be manufactured
synthetically). 
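
As a rough illustration of the timing idea above, here is a minimal sketch. It is not the patch: MergeTimingSketch and its methods are made-up names standing in for a start-time field on OneMerge, and the clock value is passed in by the caller (for example System.nanoTime()) so the arithmetic is easy to check.

```java
/** Illustrative only: a stand-in for the start-time variable proposed for OneMerge. */
final class MergeTimingSketch {
  private long mergeStartNanos;
  private boolean started;

  /** Would be called from merge init; the caller passes System.nanoTime(). */
  void markStart(long nowNanos) {
    mergeStartNanos = nowNanos;
    started = true;
  }

  /** Elapsed merge time in milliseconds, e.g. for an info stream message. */
  long elapsedMillis(long nowNanos) {
    if (!started) {
      throw new IllegalStateException("merge not started");
    }
    return (nowNanos - mergeStartNanos) / 1_000_000L;
  }
}
```

A benchmark could then print elapsedMillis for each small merge and compare the numbers with and without the patch.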


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-17 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932915#action_12932915
 ] 

Michael McCandless commented on LUCENE-2680:


Why do we still have deletesFlushed?  And why do we still need to
remap docIDs on merge?  I thought with this new approach the docIDUpto
for each buffered delete Term/Query would be a local docID to that
segment?

On flush the deletesInRAM should be carried directly over to the
segmentDeletes, and there shouldn't be a deletesFlushed?

A few other small things:

  * You can use SegmentInfos.clone to copy the segment infos? (it
    makes a deep copy)

  * SegmentDeletes.clearAll() need not iterate through the
    terms/queries to subtract the RAM used?  Ie just multiply by
    .size() instead and make one call to deduct RAM used?

  * The SegmentDeletes use less than BYTES_PER_DEL_TERM because it's a
    simple HashSet not a HashMap?  Ie we are over-counting RAM used
    now?  (Same for by query)

  * Can we store segment's deletes elsewhere?  The SegmentInfo should
    be a lightweight class... eg it's used by DirectoryReader to read
    the index, and if it's read only DirectoryReader there's no need
    for it to allocate the SegmentDeletes?  These data structures
    should only be held by IndexWriter/DocumentsWriter.

  * Do we really need to track appliedTerms/appliedQueries?  Ie is
    this just an optimization so that if the caller deletes by the
    Term/Query again we know to skip it?  Seems unnecessary if that's
    all...
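
The clearAll() point in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: SegmentDeletesRamSketch is a made-up class, the value given to BYTES_PER_DEL_TERM is a placeholder rather than Lucene's constant, and the cost per buffered term is assumed to be fixed so that multiplying by size() is exact.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: names and the byte constant are assumptions, not the patch. */
final class SegmentDeletesRamSketch {
  static final long BYTES_PER_DEL_TERM = 64; // placeholder value

  private final Set<String> terms = new HashSet<>();
  private final AtomicLong bytesUsed; // shared RAM counter, assumed owned by the writer

  SegmentDeletesRamSketch(AtomicLong bytesUsed) {
    this.bytesUsed = bytesUsed;
  }

  void addTerm(String term) {
    // Only charge RAM the first time a term is buffered.
    if (terms.add(term)) {
      bytesUsed.addAndGet(BYTES_PER_DEL_TERM);
    }
  }

  void clearAll() {
    // One deduction for the whole set instead of one per term.
    bytesUsed.addAndGet(-terms.size() * BYTES_PER_DEL_TERM);
    terms.clear();
  }
}
```

If the real accounting also charges for the length of each term's text, a single multiplication would not be enough and a running per-instance total would be needed instead.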



[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-17 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932945#action_12932945
 ] 

Michael McCandless commented on LUCENE-2680:


Also: why are we tracking the last segment info/index?  Ie, this should only be 
necessary on cutover to DWPT right?  Because effectively today we have only a 
single "DWPT"?


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-17 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932988#action_12932988
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

{quote}Why do we still have deletesFlushed? And why do we still need to
remap docIDs on merge? I thought with this new approach the docIDUpto for
each buffered delete Term/Query would be a local docID to that
segment?{quote}

Deletes flushed can be removed if we store the docid-upto per segment.
Then we'll go back to having a hash map of deletes. 

{quote}The SegmentDeletes use less than BYTES_PER_DEL_TERM because it's a
simple HashSet not a HashMap? Ie we are over-counting RAM used now? (Same
for by query){quote}

Intuitively, yes, however here's the constructor of hash set:

{code} public HashSet() { map = new HashMap(); } {code}

bq. why are we tracking the last segment info/index?

I thought last segment was supposed to be used to mark the last segment of
a commit/flush. This way we save on the hash(set,map) space on the
segments upto the last segment when the commit occurred.

{quote}Can we store segment's deletes elsewhere?{quote}

We can, however I had to minimize places in the code that were potentially
causing errors (trying to reduce the problem set, which helped locate the
intermittent exceptions); syncing segment infos with the per-segment
deletes was one of those places. That, and I thought it'd be worth
a try to simplify (at the expense of breaking the unstated intention of the
SI class).

{quote}Do we really need to track appliedTerms/appliedQueries? Ie is this
just an optimization so that if the caller deletes by the Term/Query again
we know to skip it? {quote}

Yes to the 2nd question. Why would we want to try deleting multiple times?
The cost is the terms dictionary lookup which you're saying is in the
noise? I think potentially cracking open a query again could be costly in
cases where the query is indeed expensive.

{quote}not iterate through the terms/queries to subtract the RAM
used?{quote}

Well, the RAM usage tracking can't be completely defined until we finish
how we're storing the terms/queries. 


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-17 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933067#action_12933067
 ] 

Michael McCandless commented on LUCENE-2680:



{quote}
Deletes flushed can be removed if we store the docid-upto per segment.
Then we'll go back to having a hash map of deletes.
{quote}

I think we should do this?

Ie, each flushed segment stores the map of del Term/Query to
docid-upto, where that docid-upto is private to the segment (no
remapping on merges needed).

When it's time to apply deletes to about-to-be-merged segments, we
must apply all "future" segments deletions unconditionally to each
segment, and then conditionally (respecting the local docid-upto)
apply that segment's deletions.
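
A toy sketch of that two-step rule, for illustration only: the class and method names below are made up and are not from the patch, each doc is reduced to a single id term, and deletes by Query are left out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: class and method names are made up, not taken from the patch. */
final class PerSegmentDeletesSketch {

  /** A flushed segment: the id term of each doc (index = local docID) and its buffered deletes. */
  static final class Segment {
    final String[] docTerms;
    // delete term -> local docid-upto: only docs with docID < upto were added before the delete
    final Map<String, Integer> deletes = new HashMap<>();

    Segment(String... docTerms) {
      this.docTerms = docTerms;
    }
  }

  /** Returns the deleted-docs flags for segments.get(target), e.g. just before merging it. */
  static boolean[] applyDeletes(List<Segment> segments, int target) {
    Segment seg = segments.get(target);
    boolean[] deleted = new boolean[seg.docTerms.length];
    for (int docID = 0; docID < seg.docTerms.length; docID++) {
      String term = seg.docTerms[docID];
      // Deletes buffered with any later ("future") segment arrived after every doc
      // in this segment was added, so they apply unconditionally.
      for (int later = target + 1; later < segments.size(); later++) {
        if (segments.get(later).deletes.containsKey(term)) {
          deleted[docID] = true;
        }
      }
      // This segment's own deletes apply only to docs added before the delete.
      Integer upto = seg.deletes.get(term);
      if (upto != null && docID < upto) {
        deleted[docID] = true;
      }
    }
    return deleted;
  }
}
```

Because each docid-upto is local to its own segment, nothing in this sketch needs remapping when other segments are merged.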

{quote}
Intuitively, yes, however here's the constructor of hash set:

{noformat}
public HashSet() { map = new HashMap(); }
{noformat}
{quote}

Ugh I forgot about that.  Is that still true?  That's awful.

{quote}
bq. why are we tracking the last segment info/index?

I thought last segment was supposed to be used to mark the last segment of
a commit/flush. This way we save on the hash(set,map) space on the
segments upto the last segment when the commit occurred.
{quote}

Hmm... I think lastSegment was needed only for the multiple DWPT
case, to record the last segment already flushed in the index as of
when that DWPT was created.  This is so we know "going back" when we
can start unconditionally applying the buffered delete term.

With the single DWPT we effectively have today isn't last segment
always going to be what we just flushed?  (Or null if we haven't yet
done a flush in the current session).

{quote}
bq. Do we really need to track appliedTerms/appliedQueries? Ie is this just an 
optimization so that if the caller deletes by the Term/Query again we know to 
skip it?

Yes to the 2nd question. Why would we want to try deleting multiple times?
The cost is the terms dictionary lookup which you're saying is in the
noise? I think potentially cracking open a query again could be costly in
cases where the query is indeed expensive.
{quote}

I'm saying this is unlikely to be a worthwhile way to spend RAM.

EG most apps wouldn't delete by same term again, like they'd
"typically" go and process a big batch of docs, deleting by an id
field and adding the new version of the doc, where a given id is seen
only once in this session, and then IW is committed/closed?



[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-17 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933082#action_12933082
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

DWPT deletes has perhaps confused this issue a little bit. 

{quote}Tracking per-segment would be easier but I worry about indices that
have large numbers of segments... eg w/ a large mergeFactor and frequent
flushing you can get very many segments.{quote}

I think we may be backtracking here, as I had earlier proposed we simply
store each term/query in a map per segment; however, I think that was nixed
in favor of last segment + deletes per segment afterwards. We're not
worried about the cost of storing pending deletes in a map per segment
anymore?

{quote}With the single DWPT we effectively have today isn't last segment
always going to be what we just flushed? (Or null if we haven't yet done a
flush in the current session).{quote}

Pretty much. 

{quote}EG most apps wouldn't delete by same term again, like they'd
"typically" go and process a big batch of docs, deleting by an id field
and adding the new version of the doc, where a given id is seen only once
in this session, and then IW is committed/closed?{quote}

In an extreme RT app that uses Lucene like a database, it could in fact
update a doc many times, then we'd start accumulating and deleting the
same ID over and over again. However in the straight batch indexing model
outlined, that is unlikely to happen. 

{quote}When it's time to apply deletes to about-to-be-merged segments, we
must apply all "future" segments deletions unconditionally to each
segment, and then conditionally (respecting the local docid-upto) apply
that segment's deletions.{quote}

I'll use this as the go-ahead design then.

bq. Is that still true?

That's from Java 1.6.


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-17 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933182#action_12933182
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Additionally we need to decide how accounting will work for
maxBufferedDeleteTerms. We won't have a centralized place to keep track of
the number of terms, and the unique term count in aggregate over many
segments could be a little too time consuming to calculate in a method like
doApplyDeletes. An alternative is to maintain a global unique term count,
such that when a term is added, every other per-segment deletes map is checked
for that term, and if it hasn't already been tallied, we increment the number
of buffered terms.


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-17 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933227#action_12933227
 ] 

Michael McCandless commented on LUCENE-2680:


{quote}
I think we may be back tracking here as I had earlier proposed we simply
store each term/query in a map per segment, however I think that was nixed
in favor of last segment + deletes per segment afterwards. We're not
worried about the cost of storing pending deletes in a map per segment
anymore?
{quote}

OK sorry now I remember.

Hmm but, my objection then was to carrying all deletes backward to all
segments?

Whereas now I think what we can do is only record the deletions that
were added when that segment was a RAM buffer, in its pending deletes
map?  This should be fine, since we aren't storing a single deletion
in multiple places (well, until DWPTs anyway).  It's just that on
applying deletes to a segment because it's about to be merged we have
to do a merge sort of the buffered deletes of all "future" segments.
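
One possible reading of that merge step, sketched under the assumption that
each segment keeps its delete terms sorted (a TreeSet here).
FutureDeletesMergeSketch and mergeSorted are hypothetical names, and the
duplicate handling is an illustration rather than what the patch does.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.SortedSet;

class FutureDeletesMergeSketch {
  // One cursor per "future" segment, positioned on its next delete term.
  private static final class Cursor implements Comparable<Cursor> {
    String current;
    final Iterator<String> rest;

    Cursor(Iterator<String> it) {
      rest = it;
      current = it.next();
    }

    @Override
    public int compareTo(Cursor other) {
      return current.compareTo(other.current);
    }
  }

  // K-way merge of the naturally sorted per-segment delete terms into one
  // sorted list, so the merging segment's terms are visited once, in order.
  static List<String> mergeSorted(List<? extends SortedSet<String>> perSegmentTerms) {
    PriorityQueue<Cursor> queue = new PriorityQueue<>();
    for (SortedSet<String> terms : perSegmentTerms) {
      if (!terms.isEmpty()) {
        queue.add(new Cursor(terms.iterator()));
      }
    }
    List<String> merged = new ArrayList<>();
    while (!queue.isEmpty()) {
      Cursor c = queue.poll();
      // The same term may be buffered in several segments; emit it once.
      if (merged.isEmpty() || !merged.get(merged.size() - 1).equals(c.current)) {
        merged.add(c.current);
      }
      if (c.rest.hasNext()) {
        c.current = c.rest.next();
        queue.add(c);
      }
    }
    return merged;
  }
}
```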

BTW it could also be possible to not necessarily apply deletes when a
segment is merged; eg if there are few enough deletes it may not be
worthwhile.  But we can leave that to another issue.

{quote}
Additionally we need to decide how accounting will work for
maxBufferedDeleteTerms. We won't have a centralized place to keep track of
the number of terms, and the unique term count in aggregate over many
segments could be a little too time consuming to calculate in a method like
doApplyDeletes. An alternative is to maintain a global unique term count,
such that when a term is added, every other per-segment deletes map is checked
for that term, and if it hasn't already been tallied, we increment the number
of buffered terms.
{quote}

Maybe we should change the definition to be total number of pending
delete term/queries?  (Ie, not dedup'd across segments).  This seems
reasonable since w/ this new approach the RAM consumed is in
proportion to that total number and not to dedup'd count?
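
A rough sketch of that accounting, assuming one pending-deletes map per
segment and a single running total. DeleteAccountingSketch and bufferDelete
are hypothetical names, and counting only new map entries is an assumption
made here because that is what costs RAM.

```java
import java.util.Map;

// Hypothetical sketch, not Lucene's implementation.
class DeleteAccountingSketch {
  private final int maxBufferedDeleteTerms; // threshold that triggers applying deletes
  private int numPendingDeletes;            // total entries, NOT dedup'd across segments

  DeleteAccountingSketch(int maxBufferedDeleteTerms) {
    this.maxBufferedDeleteTerms = maxBufferedDeleteTerms;
  }

  // Buffers a delete term in one segment's pending-deletes map. Returns true
  // once the total number of buffered entries has reached the limit.
  boolean bufferDelete(Map<String, Integer> segmentDeletes, String term, int docidUpto) {
    Integer previous = segmentDeletes.put(term, docidUpto);
    if (previous == null) {
      // A new entry in this segment's map, even if another segment already
      // holds the same term: no cross-segment lookup is needed.
      numPendingDeletes++;
    }
    return numPendingDeletes >= maxBufferedDeleteTerms;
  }

  int numPendingDeletes() {
    return numPendingDeletes;
  }
}
```

The same term buffered against two segments counts twice, which keeps the
check O(1) per delete and tracks the memory actually held.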



[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-17 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933299#action_12933299
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

{quote}Maybe we should change the definition to be total number of pending
delete term/queries? {quote}

Let's go with this; even though we could record the total unique term
count, the approach outlined is more conservative.

{quote}I think what we can do is only record the deletions that were added
when that segment was a RAM buffer, in its pending deletes map{quote}

Ok, sounds like a design that'll work well.


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-17 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933305#action_12933305
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Flush deletes equals true means that all deletes are applied; when it's
false, we're moving the pending deletes into the newly flushed segment,
as is, with no docId-upto remapping.


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-17 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933306#action_12933306
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

We can "upgrade" to an int[] from an ArrayList for the aborted docs.


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-19 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934058#action_12934058
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

I'm seeing the following error, which is probably triggered by the new
per-segment deletes code, though it could also be related to the recent CFS
format changes?

{code}
MockDirectoryWrapper: cannot close: there are still open files: {_0.cfs=1, _1.cfs=1}
[junit] java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still open files: {_0.cfs=1, _1.cfs=1}
[junit] at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:395)
[junit] at org.apache.lucene.index.TestIndexReader.testReopenChangeReadonly(TestIndexReader.java:1717)
[junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:921)
[junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:859)
[junit] Caused by: java.lang.RuntimeException: unclosed IndexInput
[junit] at org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:350)
[junit] at org.apache.lucene.store.Directory.openInput(Directory.java:138)
[junit] at org.apache.lucene.index.CompoundFileReader.<init>(CompoundFileReader.java:67)
[junit] at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:121)
[junit] at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:527)
[junit] at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:628)
[junit] at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:603)
[junit] at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:1081)
[junit] at org.apache.lucene.index.IndexWriter.applyDeletesAll(IndexWriter.java:4300)
[junit] at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:3440)
[junit] at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3276)
[junit] at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3266)
[junit] at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:3131)
[junit] at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3206)
{code}


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-21 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934381#action_12934381
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Running TestStressIndexing2 500 times on trunk causes this error which is 
probably intermittent:

{code}
[junit] Testsuite: org.apache.lucene.index.TestStressIndexing2
[junit] Testcase: testMultiConfigMany(org.apache.lucene.index.TestStressIndexing2): Caused an ERROR
[junit] Array index out of range: 0
[junit] java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 0
[junit] at java.util.Vector.get(Vector.java:721)
[junit] at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:1049)
[junit] at org.apache.lucene.index.IndexWriter.applyDeletes(IndexWriter.java:4291)
[junit] at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:3444)
[junit] at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3279)
[junit] at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3269)
[junit] at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1760)
[junit] at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1723)
[junit] at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1687)
[junit] at org.apache.lucene.index.TestStressIndexing2.indexRandom(TestStressIndexing2.java:233)
[junit] at org.apache.lucene.index.TestStressIndexing2.testMultiConfig(TestStressIndexing2.java:123)
[junit] at org.apache.lucene.index.TestStressIndexing2.testMultiConfigMany(TestStressIndexing2.java:97)
[junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:950)
[junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:888)
{code}


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-21 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934384#action_12934384
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

The above isn't on trunk; I misread the screen.


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-21 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934388#action_12934388
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

I've isolated the mismatch in num docs between the CMS vs. SMS generated
indexes to applying the deletes to the merging segments (whereas currently
we were/are not applying deletes to merging segments and
TestStressIndexing2 passes). Assuming the deletes are being applied
correctly to the merging segments, perhaps the logic of gathering up
forward segment deletes is incorrect somehow in the concurrent merge case.
When deletes were held in a map per segment, this test was passing. 


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-21 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934390#action_12934390
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

One way to test whether the problem is in the per-segment go-forward deletes
logic is to iterate over the deletes flushed map, using the docid-upto to stay
within the boundaries of the segment(s) being merged.


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-24 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935441#action_12935441
 ] 

Michael McCandless commented on LUCENE-2680:


What a nice small patch :)

I think the getDeletesSegmentsForward shouldn't be folding in the
deletesInRAM?  Ie, that newly flushed info will have carried over the
previous deletes in RAM?

I think pushDeletes/pushSegmentDeletes should be merged, and we should
nuke DocumentsWriter.deletesFlushed?  Ie, we should push directly from
deletesInRAM to the new SegmentInfo?  EG you are now pushing all
deletesFlushed into the new SegmentInfo when actually you should only
push the deletes for that one segment.

We shouldn't do the remap deletes anymore.  We can remove
DocumentsWriter.get/set/updateFlushedDocCount too.

Hmm... so what are we supposed to do if someone opens IW, does a bunch
of deletes, then commits?  Ie flushDocs is false, so there's no new
SegmentInfo.  I think in this case we can stick the deletes against
the last segment in the index, with the docidUpto set to the maxDoc()
of that segment?
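
A small sketch of that fallback, assuming a toy SegmentInfoSketch with a
pending-deletes map (term -> docidUpto). The class and method names are
hypothetical, and passing null for "no newly flushed segment" is just a
convention of this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a segment's info; not Lucene's SegmentInfo.
class SegmentInfoSketch {
  final int maxDoc;
  final Map<String, Integer> pendingDeletes = new HashMap<>(); // term -> docidUpto

  SegmentInfoSketch(int maxDoc) {
    this.maxDoc = maxDoc;
  }
}

class PushDeletesSketch {
  // Moves the deletes buffered in RAM onto a segment. 'newlyFlushed' is the
  // segment just written by this flush, or null if no docs were flushed.
  static void pushDeletes(Map<String, Integer> deletesInRAM,
                          List<SegmentInfoSketch> existingSegments,
                          SegmentInfoSketch newlyFlushed) {
    if (newlyFlushed != null) {
      // Normal case: the deletes travel with the new segment, as is.
      newlyFlushed.pendingDeletes.putAll(deletesInRAM);
    } else if (!existingSegments.isEmpty()) {
      // No new segment: park the deletes on the last segment, with
      // docidUpto = maxDoc so they cover every doc in that segment.
      SegmentInfoSketch last = existingSegments.get(existingSegments.size() - 1);
      for (String term : deletesInRAM.keySet()) {
        last.pendingDeletes.put(term, last.maxDoc);
      }
    }
    // With an empty index there is nothing the deletes could match.
    deletesInRAM.clear();
  }
}
```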


> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch
>
>
> IndexWriter buffers up all deletes (by Term and Query) and only
> applies them if 1) commit or NRT getReader() is called, or 2) a merge
> is about to kickoff.
> We do this because, for a large index, it's very costly to open a
> SegmentReader for every segment in the index.  So we defer as long as
> we can.  We do it just before merge so that the merge can eliminate
> the deleted docs.
> But, most merges are small, yet in a big index we apply deletes to all
> of the segments, which is really very wasteful.
> Instead, we should only apply the buffered deletes to the segments
> that are about to be merged, and keep the buffer around for the
> remaining segments.
> I think it's not so hard to do; we'd have to have generations of
> pending deletions, because the newly merged segment doesn't need the
> same buffered deletions applied again.  So every time a merge kicks
> off, we pinch off the current set of buffered deletions, open a new
> set (the next generation), and record which segment was created as of
> which generation.
> This should be a very sizable gain for large indices that mix
> deletes, though, less so in flex since opening the terms index is much
> faster.
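
The generation scheme described above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: deletes are plain strings, segments are identified by name, and GenerationalDeletes with all its methods is a made-up class, not code from any attached patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: buffered delete terms grouped into generations. */
class GenerationalDeletes {
  // generation number -> delete terms buffered during that generation
  private final List<Set<String>> generations = new ArrayList<>();
  // segment name -> first generation whose deletes may still apply to it
  private final Map<String, Integer> segmentGen = new HashMap<>();

  GenerationalDeletes() {
    generations.add(new HashSet<>()); // generation 0 is open from the start
  }

  /** Buffer a delete-by-term in the current (open) generation. */
  void bufferDelete(String term) {
    generations.get(generations.size() - 1).add(term);
  }

  /** Register a segment that predates every buffered delete. */
  void addExistingSegment(String name) {
    segmentGen.put(name, 0);
  }

  /**
   * A merge kicks off: pinch off the current generation, open the next
   * one, and record that the merged segment only needs the deletes of
   * the new generation onwards.
   */
  void onMergeStart(String mergedSegmentName) {
    generations.add(new HashSet<>());
    segmentGen.put(mergedSegmentName, generations.size() - 1);
  }

  /** Deletes that still have to be applied to a registered segment. */
  Set<String> pendingFor(String segmentName) {
    Set<String> pending = new HashSet<>();
    for (int gen = segmentGen.get(segmentName); gen < generations.size(); gen++) {
      pending.addAll(generations.get(gen));
    }
    return pending;
  }
}
```

A segment that did not take part in the merge still sees every generation, while the merged segment skips the generations that were already applied to its source segments.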


[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-27 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964441#action_12964441
 ] 

Michael McCandless commented on LUCENE-2680:


bq. In the apply merge deletes case, won't we want to add deletesInRAM in the 
getForwardDeletes method?

No, we can't add those deletes until the current buffered segment is 
successfully flushed.

Eg, say the segment hits a disk full on flush, and DocsWriter aborts (discards 
all buffered docs/deletions from that segment).  If we included these 
deletesInRAM when applying deletes then suddenly the app will see that some 
deletes were applied yet the added documents were not.  So on disk full during 
flush, calls to .updateDocument may wind up deleting the old doc but not adding 
the new one.

So we need to keep them segregated for proper error case semantics.
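
A rough sketch of that segregation, assuming a simplified model where documents and delete terms are strings. PendingSegmentBuffer and its methods are invented for illustration and do not mirror the real DocumentsWriter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual DocumentsWriter code. */
class PendingSegmentBuffer {
  private final List<String> docsInRAM = new ArrayList<>();
  private final List<String> deletesInRAM = new ArrayList<>();
  // deletes that are safe to apply against existing segments
  private final List<String> flushedDeletes = new ArrayList<>();

  /** updateDocument buffers the delete of the old doc plus the new doc. */
  void updateDocument(String idTerm, String newDoc) {
    deletesInRAM.add(idTerm);
    docsInRAM.add(newDoc);
  }

  /** Flush succeeded: only now do the buffered deletes become applicable. */
  void onFlushSuccess() {
    flushedDeletes.addAll(deletesInRAM);
    deletesInRAM.clear();
    docsInRAM.clear();
  }

  /** Flush failed (e.g. disk full): drop the docs and their deletes together. */
  void abort() {
    deletesInRAM.clear();
    docsInRAM.clear();
  }

  /** What applying deletes is allowed to see; never includes deletesInRAM. */
  List<String> deletesToApply() {
    return new ArrayList<>(flushedDeletes);
  }
}
```

Because abort() discards the delete and the replacement document as a pair, a failed flush can never leave the old document deleted without the new one being added.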

{quote}
Though it does not matter for the failing unit test, we need to figure
out a solution for the pending docID deletions, eg, they can't simply
be transferred around; they probably need to be applied as soon as possible.
Otherwise they require remapping.
{quote}

Hmm why must we remap?  Can't we carry these buffered deleteByDocIDs along with 
the segment?  The docIDs would be the segment's docIDs (ie no base added) so no 
shifting is needed?

These deleted docIDs would only apply to the current segment, ie would not be 
included in getForwardDeletes?
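
As a small illustration of why segment-relative docIDs need no remapping, here is a sketch that subtracts the base once, at buffering time. SegmentLocalDocIdDeletes and the explicit docBase parameter are assumptions made for the example only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: delete-by-docID stored relative to the owning segment. */
class SegmentLocalDocIdDeletes {
  private final Map<String, List<Integer>> docIdsBySegment = new HashMap<>();

  /** Strip the base when buffering, so the stored ID is segment-relative. */
  void bufferDelete(String segment, int docBase, int globalDocId) {
    docIdsBySegment
        .computeIfAbsent(segment, k -> new ArrayList<>())
        .add(globalDocId - docBase);
  }

  /** Segment-relative IDs; they apply to this segment only. */
  List<Integer> docIdsFor(String segment) {
    return docIdsBySegment.getOrDefault(segment, new ArrayList<>());
  }
}
```

If earlier segments are later merged and this segment's base shifts, the stored values remain valid because they never included the base.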

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-27 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964450#action_12964450
 ] 

Michael McCandless commented on LUCENE-2680:


So nice to see remapDeletes deleted!

  * Don't forget to remove DocumentsWriter.get/set/updateFlushedDocCount too.

  * Can you move the deletes out of SegmentInfo?  We can just use a
Map?  But remember to delete segments
from the map once we commit the merge...

  * I think DocsWriter shouldn't hold onto the SegmentInfos; we should
pass it in to only those methods that need it.  That SegmentInfos
is protected under IW's monitor so it makes me nervous if it's
also a member on DW.

  * Hmm we're no longer accounting for RAM usage of per-segment
deletes?  I think we need an AtomicInt, which we incr w/ RAM used
on pushing deletes into a segment, and decr on clearing?

  * The change to the message(...) in DW.applyDeletes is wrong (ie
switching to deletesInRAM); I think we should just remove the
details, ie so it says "applying deletes on N segments"?  But then
add a more detailed message per-segment with the aggregated
(forward) deletes details?

  * I think we should move this delete handling out of DW as much as
possible... that's really IW's role (DW is "about" flushing the
next segment, not tracking details associated with all other
segments in the index)

  * Instead of adding pushDeletesLastSegment, can we just have IW call
pushDeletes(lastSegmentInfo)?

  * Calling .getForwardDeletes inside the for loop iterating over the
infos is actually O(N^2) cost, and it could matter for
delete-intensive many-segment indices.  Can you change this,
instead, to walk the infos backwards, incrementally building up
the forward deletes to apply to each segment by adding in that
info's deletions?
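
The backwards walk from the last bullet might look like the sketch below. It is a simplification under stated assumptions: deletes are plain strings, each segment's own deletes are folded in without any docIDUpto handling, and ForwardDeletesWalk.walkBackwards is a hypothetical name.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;

/** Hypothetical sketch of building forward deletes in a single pass. */
class ForwardDeletesWalk {
  /**
   * perSegment.get(i) holds the deletes pushed onto segment i. Walking
   * from the newest segment to the oldest, one running set accumulates
   * them, so segment i sees the deletes of segments i..N-1 without
   * rescanning the later segments each time.
   */
  static void walkBackwards(List<Set<String>> perSegment,
                            BiConsumer<Integer, Set<String>> apply) {
    Set<String> running = new HashSet<>();
    for (int i = perSegment.size() - 1; i >= 0; i--) {
      running.addAll(perSegment.get(i));
      // The callback gets a read-only view; it must not retain it,
      // because the set keeps growing as the walk continues.
      apply.accept(i, Collections.unmodifiableSet(running));
    }
  }
}
```

Each segment's deletes are added to the running set exactly once, instead of being re-collected for every earlier segment as a per-segment getForwardDeletes call would do.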


> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-28 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964606#action_12964606
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

I guess you think the sync on doc writer is the cause of the
TestStressIndexing2 unit test failure?

bq. I think we should move this delete handling out of DW

I agree. I originally took this approach; however, unit tests were failing
when the segment infos were passed directly into the apply-deletes method(s).
This will be the 2nd attempt, though apparently the 3rd time's the charm.

I'll make the changes and cross my fingers.

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-28 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964610#action_12964610
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

I started on taking the approach of moving deletes to a SegmentDeletes class
that's a member of IW. Removing DW's addDeleteTerm is/was fairly trivial. 

In moving deletes out of DW, how should we handle the bufferDeleteTerms sync on
DW and the waitReady call it contains? The purpose of BDT is to check whether
RAM consumption has reached its peak and, if so, balance out the RAM usage
and/or flush the pending deletes that are consuming RAM. This is probably why
deletes are intertwined with DW. We could change DW's BDT method, though I'm
loath to change the wait logic of DW for fear of causing a ripple effect of
inexplicable unit test failures elsewhere.

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-29 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964680#action_12964680
 ] 

Michael McCandless commented on LUCENE-2680:


{quote}
I guess you think the sync on doc writer is the cause of the
TestStressIndexing2 unit test failure?
{quote}

I'm not sure what's causing the failure, but, I think getting the net approach 
roughly right is the first goal, and then we see what's failing.

{quote}
bq. I think we should move this delete handling out of DW

I agree. I originally took this approach; however, unit tests were failing
when the segment infos were passed directly into the apply-deletes method(s).
This will be the 2nd attempt, though apparently the 3rd time's the charm.
{quote}

We should not only move the SegmentInfos out of DW as a member, but also move 
all the applyDeletes logic out.  Ie it should be IW that pulls readers from the 
pool, walks the merged del term/queries/per-seg docIDs and actually does the 
deletion.

bq. In moving deletes out of DW, how should we handle the bufferDeleteTerms 
sync on DW and the containing waitReady?

I think all the bufferDeleteX methods would move into IW, along with 
timeToFlushDeletes. The RAM accounting can be done fully inside IW.

The waitReady(null) is there so that DW.pauseAllThreads also pauses any threads 
doing deletions.  But, in moving these methods to IW, we'd make them sync on IW 
(they are now sync'd on DW), which takes care of pausing these threads.

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-29 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964938#action_12964938
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

bq. The waitReady(null) is there so that DW.pauseAllThreads also pauses any 
threads doing deletions

waitReady is used in getThreadState as well as bufferDeleteX, so we may need to 
redundantly add it to SegmentDeletes?  Maybe not.  We'll be sync'ing on IW when 
adding deletions, and that seems like it'll be OK.  

{quote}in moving these methods to IW, we'd make them sync on IW (they are now 
sync'd on DW), which takes care of pausing these threads{quote}

Because we're sync'ing on IW we don't need to pause the indexing threads?  Ok 
this is because doFlushX is sync'd on IW.  

{quote}The RAM accounting can be done fully inside IW.{quote}

Well, inside of SegmentDeletes.
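
A minimal sketch of that accounting, assuming byte counts are supplied by the caller and segments are keyed by name. DeletesRamTracker is a hypothetical class, and it uses AtomicLong rather than the "AtomicInt" mentioned earlier in the thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of RAM accounting for per-segment deletes. */
class DeletesRamTracker {
  private final AtomicLong bytesUsed = new AtomicLong();
  private final Map<String, Long> bytesBySegment = new HashMap<>();

  /** Increment when deletes are pushed into a segment's entry. */
  synchronized void push(String segment, long bytes) {
    bytesBySegment.merge(segment, bytes, Long::sum);
    bytesUsed.addAndGet(bytes);
  }

  /** Decrement when a segment's deletes are cleared, e.g. after a merge commits. */
  synchronized void clear(String segment) {
    Long bytes = bytesBySegment.remove(segment);
    if (bytes != null) {
      bytesUsed.addAndGet(-bytes);
    }
  }

  /** Cheap read for a flush-by-RAM check; takes no lock. */
  long bytesUsed() {
    return bytesUsed.get();
  }
}
```

Removing the map entry in clear() also covers the earlier reminder to drop segments from the map once the merge is committed.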

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-30 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965468#action_12965468
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Here's a random guess: with this patch we sometimes apply deletes multiple
times, whereas before we applied all of them and cleared them out at once, so
there may be a mismatch in terms of over- or under-applying deletes. Oddly,
when deletes are applied in _mergeInit on all segments rather than only on the
segments being merged, the former has a much higher success rate. This is
strange because all deletes will have been applied by the time
commit/getReader is called anyway.

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-12-08 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969504#action_12969504
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

When applying the patch, there are errors in IndexWriter.

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-12-09 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969754#action_12969754
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

We're close, I think SegmentDeletes is missing?

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-12-09 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969799#action_12969799
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

The patch applied.

OK, a likely cause of the TestStressIndexing2 failures was that when we're
flushing deletes to the last segment (because no new segment is being flushed),
we also needed to move those deletes to the newly merged segment?

In the patch we've gone away from sync'ing on IW when deleting, which was a
challenge because we needed the sync on DW to properly wait on flushing threads
etc.

> Improve how IndexWriter flushes deletes against existing segments
> -
>
> Key: LUCENE-2680
> URL: https://issues.apache.org/jira/browse/LUCENE-2680
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
> LUCENE-2680.patch, LUCENE-2680.patch

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-12-09 Thread Jason Rutherglen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969822#action_12969822
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

All tests pass.

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-12-09 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969914#action_12969914
 ] 

Michael McCandless commented on LUCENE-2680:


{quote}
Ok, a likely cause of the TestStressIndexing2 failures was that when we're
flushing deletes to the last segment (because a segment isn't being flushed),
we needed to move deletes also to the newly merged segment?
{quote}

Right, and also the case where a merge just ahead of you kicks off and dumps 
its merged deletes onto you.

[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-12-11 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970446#action_12970446
 ] 

Michael McCandless commented on LUCENE-2680:


OK, I committed this to trunk.

Since it's a biggish change, I'll hold off on back-porting to 3.x for now... 
let's let Hudson chew on it some first.
