[jira] Commented: (LUCENE-1750) Create a MergePolicy that limits the maximum size of its segments

2009-07-21 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733498#action_12733498
 ] 

Shai Erera commented on LUCENE-1750:


bq. we could add an optimize(long maxSegmentSize)

I think this would be useful anyway, and more or less required if we introduce 
the proposed merge policy. Otherwise, if someone's code calls optimize (with or 
without a num-segments limit), those large segments will be merged as well.
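
To illustrate, here is a minimal sketch of the selection step such an 
optimize(long maxSegmentSize) could perform. SegmentStats and its size field 
are hypothetical stand-ins, not Lucene's SegmentInfo API, and a real 
implementation would of course drive IndexWriter's merge machinery rather than 
just filter a list:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the selection step an optimize(long maxSegmentSize)
// could perform. SegmentStats is a stand-in, not Lucene's SegmentInfo.
public class OptimizeBySizeSketch {

  static class SegmentStats {
    final String name;
    final long sizeInBytes;
    SegmentStats(String name, long sizeInBytes) {
      this.name = name;
      this.sizeInBytes = sizeInBytes;
    }
  }

  // Only segments at or below the cap take part in the optimize; the
  // large, "finished" segments are left untouched.
  static List<SegmentStats> optimizeCandidates(List<SegmentStats> segments,
                                               long maxSegmentSize) {
    List<SegmentStats> candidates = new ArrayList<SegmentStats>();
    for (SegmentStats si : segments) {
      if (si.sizeInBytes <= maxSegmentSize) {
        candidates.add(si);
      }
    }
    return candidates;
  }
}
{code}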

bq. except if it accumulates too many deletes (as a percentage of docs) then it 
can be compacted and new segments merged into it?

If one calls expungeDeletes, and that segment goes below the max size, then it 
becomes eligible for merging again, right? But I have a question here, and it 
may be that I'm missing something in the merge process. Say I have the 
following segments, each at 4 GB (the limit), except D:
A (docs 0-99), B (docs 100-230), C (docs 231-450) and D (docs 451-470). Then A 
accumulates 50 deletes. On one hand, we'd want it to be merged, but if we want 
that, we have to merge B and C as well, right? We cannot merge A w/ D, because 
the doc IDs need to be in increasing order and retain the order they were added 
to the index?

So will the merge policy detect that? I think it should, and the way to 
handle it is to ensure that the first segment which falls below the limit 
triggers the merge of all following segments (in doc ID order), regardless of 
their size?
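
In other words, something like this hypothetical rule (segment sizes are 
assumed to be listed in doc ID order; this is just a sketch, not the patch's 
logic):

{code:java}
// Sketch of the rule above: the first segment below the size limit starts
// the merge, and every following segment (in doc ID order) is pulled in
// regardless of its own size.
public class ContiguousMergeSketch {

  // Returns the index of the first segment under the cap, i.e. the start
  // of the merge range [start, segmentSizes.length), or -1 if every
  // segment has already reached the limit and no merge is needed.
  static int mergeStart(long[] segmentSizes, long maxSegmentSize) {
    for (int i = 0; i < segmentSizes.length; i++) {
      if (segmentSizes[i] < maxSegmentSize) {
        return i;
      }
    }
    return -1;
  }
}
{code}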

I don't know if your patch already takes care of this case, and whether my 
understanding is correct, so if you already handle it that way (or some other 
way), then that's fine.

 Create a MergePolicy that limits the maximum size of its segments
 --

 Key: LUCENE-1750
 URL: https://issues.apache.org/jira/browse/LUCENE-1750
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1750.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 Basically I'm trying to create largish 2-4GB shards using
 LogByteSizeMergePolicy; however, in the attached unit test I've
 found segments that exceed maxMergeMB.
 The goal is for segments to be merged up to 2GB, then all
 merging to that segment stops, and then another 2GB segment is
 created. This helps when replicating in Solr where if a single
 optimized 60GB segment is created, the machine stops working due
 to IO and CPU starvation. 
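
For reference, a rough sketch of the setup in question, against the 2.4 API 
(index path and analyzer choice are illustrative):

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Rough sketch of the setup described above. maxMergeMB is meant to cap
// the size of segments considered for merging, but the attached test shows
// merged segments exceeding it.
public class ShardWriterSetup {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.getDirectory("/path/to/shard");
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
    LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
    mp.setMaxMergeMB(2 * 1024);  // aim for ~2GB shards
    writer.setMergePolicy(mp);
    // ... add documents, commit, close ...
    writer.close();
  }
}
{code}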

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1750) Create a MergePolicy that limits the maximum size of its segments

2009-07-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733740#action_12733740
 ] 

Jason Rutherglen commented on LUCENE-1750:
--

{quote}We cannot merge A w/ D, because the doc IDs need to be in
increasing order and retain the order they were added to the
index?{quote}

The segments are merged in order because they may be sharing doc
stores. I think we can refine this to only force in-order merges for
contiguous segments that are sharing doc stores; otherwise we can merge
non-contiguous segments, which continues the work of LUCENE-1076?
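
A tiny sketch of that check, with docStoreKey as a stand-in for whatever 
identifies a shared doc store (roughly the doc store segment name in 
SegmentInfo; this is not the exact API):

{code:java}
// Only segments that share a doc store must be merged contiguously. A null
// key means the segment has its own private stores.
public class DocStoreContiguitySketch {

  static boolean mustStayContiguous(String docStoreKeyA, String docStoreKeyB) {
    // Sharing a doc store forces an in-order merge; otherwise the two
    // segments could be merged non-contiguously (cf. LUCENE-1076).
    return docStoreKeyA != null && docStoreKeyA.equals(docStoreKeyB);
  }
}
{code}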

When the shards are in their own directories (which is how Katta
works), the building process is somewhat easier, as we're dealing
with a separate SegmentInfos for each shard. I am not sure how
Solr would handle an index sharded into multiple directories. 


[jira] Commented: (LUCENE-1750) Create a MergePolicy that limits the maximum size of its segments

2009-07-21 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733762#action_12733762
 ] 

Shai Erera commented on LUCENE-1750:


bq. I think we can refine this to only merge contiguous segments that are 
sharing doc stores

So in this case it means that segment A will remain smaller than 4 GB and will 
never get merged (b/c segments B and C reached their limit)?


[jira] Commented: (LUCENE-1750) Create a MergePolicy that limits the maximum size of its segments

2009-07-20 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733389#action_12733389
 ] 

Jason Rutherglen commented on LUCENE-1750:
--

bq. Wouldn't you want them to be merged into an even larger segment?

I think once a segment reaches the limit (i.e. 4GB), it's
effectively done and nothing more happens to it, except that if it
accumulates too many deletes (as a percentage of docs), it can be
compacted and new segments merged into it?
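
Roughly the trigger I have in mind, as a sketch (the threshold is 
illustrative; delCount and docCount would come from the segment's metadata, 
e.g. maxDoc minus numDocs on a reader):

{code:java}
// A finished segment becomes eligible for compaction once its deleted
// docs exceed a percentage threshold.
public class DeleteTriggerSketch {

  static boolean eligibleForCompaction(int delCount, int docCount,
                                       double maxDeletePct) {
    if (docCount == 0) {
      return false;
    }
    double pct = 100.0 * delCount / docCount;
    return pct > maxDeletePct;  // e.g. maxDeletePct = 10.0
  }
}
{code}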

First of all, as we reach the capacity of the machine's IO and RAM,
large segment merges thrash the machine (i.e. the IO cache is ruined
and must be rebuilt, IO is unavailable for searches, further indexing
stops), and the segments become too large to pass between servers
(e.g. via Hadoop, Katta, or Solr's replication). 

I'm not sure how much search degrades with 10-20 larger segments as
opposed to a single massive 60GB segment. But if search is otherwise
unavailable on a machine due to the CPU and IO thrashing of massive
segment merges, it seems like a fair tradeoff?

I think optimize remains as is although I would never call it.
Or we could add an optimize(long maxSegmentSize) method which is
analogous to optimize(int maxSegments). 


[jira] Commented: (LUCENE-1750) Create a MergePolicy that limits the maximum size of its segments

2009-07-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732974#action_12732974
 ] 

Shai Erera commented on LUCENE-1750:


What happens after several such large segments are created? Wouldn't you want 
them to be merged into an even larger segment? Otherwise, you'll have many such 
segments and search performance will degrade.

I guess I never thought this was a problem. If I have enough disk space, and my 
index size reaches 600 GB (which is a huge index), split across 10 different 
segments of 60 GB each, I guess I'd want them to be merged into one larger 
600 GB segment. It will take eons until I accumulate another such 600 GB 
segment, no?

Maybe we can have two merge factors: 1) for small segments, up to a set size 
threshold, we do the merges regularly; 2) for really large segments, the merge 
factor is different. For example, we can say that up to 1GB the merge factor 
is 10, and beyond that it is 20. That will postpone the large IO merges until 
enough such segments accumulate.
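
As a minimal sketch, using the example numbers above (threshold and factors 
are just illustrations, not a worked-out policy):

{code:java}
// The merge factor depends on how big the segments being merged are.
public class TieredMergeFactorSketch {

  static int mergeFactorFor(long segmentSizeBytes) {
    final long tierThresholdBytes = 1L << 30;  // 1 GB
    return segmentSizeBytes <= tierThresholdBytes ? 10 : 20;
  }
}
{code}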

Also, w/ the current proposal, how will optimize work? Will it skip the very 
large segments, or will they be included too?
