[jira] Commented: (LUCENE-1750) Create a MergePolicy that limits the maximum size of it's segments
[ https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733498#action_12733498 ]

Shai Erera commented on LUCENE-1750:
------------------------------------

bq. we could add an optimize(long maxSegmentSize)

I think this would be useful anyway, and is more or less required if we introduce the proposed merge policy. Otherwise, if someone's code calls optimize (with or without a limit on the number of segments), those large segments will be optimized as well.

bq. except if it accumulates too many deletes (as a percentage of docs) then it can be compacted and new segments merged into it?

If one calls expungeDeletes, and that segment drops below the max size, it will be eligible for merging again, right?

But I have a question here, and it may be that I'm missing something in the merge process. Say I have the following segments, each at 4 GB (the limit), except D: A (docs 0-99), B (docs 100-230), C (docs 231-450) and D (docs 451-470). Then A accumulates 50 deletes. On one hand we'd want A to be merged, but to do that we'd have to merge B and C as well, right? We cannot merge A with D, because the doc IDs need to stay in increasing order, i.e. retain the order in which documents were added to the index.

So will the merge policy detect that? I think it should, and the way to handle it is to ensure that the first segment which falls below the limit triggers the merge of all following segments (in doc ID order), regardless of their size. I don't know if your patch already takes care of this case, or whether my understanding is correct, so if you already handle it that way (or some other way), then that's fine.
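The A/B/C/D scenario above can be sketched as a small selection routine: under a contiguity constraint, merging the first under-limit segment with a later under-limit one drags every oversized segment in between along. The following plain-Java model is an illustration only (class and method names are hypothetical, not Lucene's MergePolicy API or the attached patch):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical model of the contiguity constraint described above:
 * segments may only be merged with their doc-ID-order neighbors, so
 * combining under-limit segments A and D pulls in the over-limit
 * segments B and C between them. Not Lucene's actual MergePolicy API.
 */
public class ContiguousMergeModel {

    /**
     * Returns the indices of the contiguous run that must be merged to
     * combine the first two under-limit segments, or an empty list if
     * fewer than two segments are under the limit.
     */
    public static List<Integer> contiguousMergeCandidates(long[] sizes, long maxSize) {
        List<Integer> run = new ArrayList<>();
        int first = -1;
        for (int i = 0; i < sizes.length; i++) {
            if (sizes[i] < maxSize) {
                if (first < 0) {
                    first = i; // first under-limit segment found
                } else {
                    // second under-limit segment: the merge must span
                    // everything from 'first' through i, inclusive
                    for (int j = first; j <= i; j++) run.add(j);
                    return run;
                }
            }
        }
        return run; // fewer than two under-limit segments: nothing to do
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        // A (under limit after deletes), B, C (at the limit), D (small)
        long[] sizes = { 3 * gb, 4 * gb, 4 * gb, 1 * gb };
        System.out.println(contiguousMergeCandidates(sizes, 4 * gb));
        // merging A with D drags B and C along: [0, 1, 2, 3]
    }
}
```

This matches the "first under-limit segment triggers a merge of all following segments" reading: the run always spans the oversized segments sitting between the two small ones.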
Create a MergePolicy that limits the maximum size of it's segments
------------------------------------------------------------------

                Key: LUCENE-1750
                URL: https://issues.apache.org/jira/browse/LUCENE-1750
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
   Affects Versions: 2.4.1
           Reporter: Jason Rutherglen
           Priority: Minor
            Fix For: 3.1
        Attachments: LUCENE-1750.patch
  Original Estimate: 48h
 Remaining Estimate: 48h

Basically I'm trying to create largish 2-4GB shards using LogByteSizeMergePolicy; however, in the attached unit test I've found segments that exceed maxMergeMB. The goal is for segments to be merged up to 2GB, then for all merging into that segment to stop, and for another 2GB segment to be started. This helps when replicating in Solr, where if a single optimized 60GB segment is created, the machine stops working due to IO and CPU starvation.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
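The behavior the issue asks for (merge incoming segments into a "current" large segment until it hits the cap, then freeze it and start a new one) can be modeled with a small accumulator. This is a toy sketch of the desired packing, not LogByteSizeMergePolicy or the attached patch:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of the requested behavior: flush-sized segments are merged
 * into a current large segment until it reaches the cap (e.g. 2GB);
 * after that the large segment is left alone and a new one is started.
 * Illustration only, not an actual Lucene merge policy.
 */
public class SizeCappedMerging {

    /** Packs incoming segment sizes into capped merged segments. */
    public static List<Long> mergeUpTo(long[] incoming, long cap) {
        List<Long> merged = new ArrayList<>();
        long current = 0;
        for (long size : incoming) {
            if (current + size > cap && current > 0) {
                merged.add(current); // cap reached: freeze this segment
                current = 0;
            }
            current += size;
        }
        if (current > 0) merged.add(current);
        return merged;
    }

    public static void main(String[] args) {
        // sizes in MB: five 800MB flushed segments, 2048MB cap
        long[] flushed = { 800, 800, 800, 800, 800 };
        System.out.println(mergeUpTo(flushed, 2048));
        // no merged segment exceeds the cap: [1600, 1600, 800]
    }
}
```

The invariant the unit test in the attachment is checking for is exactly this: no produced segment may exceed the configured maximum.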
[jira] Commented: (LUCENE-1750) Create a MergePolicy that limits the maximum size of it's segments
[ https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733740#action_12733740 ]

Jason Rutherglen commented on LUCENE-1750:
------------------------------------------

{quote}We cannot merge A w/ D, because the doc IDs need to be in increasing order and retain the order they were added to the index?{quote}

The segments are merged in order because they may be sharing doc stores. I think we can refine this to merge only those contiguous segments that share doc stores; otherwise we can merge non-contiguous segments, which follows on from LUCENE-1076. When the shards are in their own directories (which is how Katta works), the building process is somewhat easier because we're dealing with a separate SegmentInfos per shard. I'm not sure how Solr would handle an index sharded into multiple directories.
[jira] Commented: (LUCENE-1750) Create a MergePolicy that limits the maximum size of it's segments
[ https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733762#action_12733762 ]

Shai Erera commented on LUCENE-1750:
------------------------------------

bq. I think we can refine this to only merge contiguous segments that are sharing doc stores

So in this case segment A will remain smaller than 4 GB and will never get merged (because segments B and C have reached their limit)?
[jira] Commented: (LUCENE-1750) Create a MergePolicy that limits the maximum size of it's segments
[ https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733389#action_12733389 ]

Jason Rutherglen commented on LUCENE-1750:
------------------------------------------

bq. Wouldn't you want them to be merged into an even larger segment?

I think once a segment reaches the limit (i.e. 4GB), it's effectively done and nothing more happens to it, except if it accumulates too many deletes (as a percentage of docs), in which case it can be compacted and new segments merged into it.

First of all, as we reach the capacity of the machine's IO and RAM, large segment merges thrash the machine (the IO cache is ruined and must be rebuilt, IO is unavailable for searches, further indexing stops), and the segments become too large to pass between servers (e.g. Hadoop, Katta, or Solr's replication). I'm not sure how much search degrades with 10-20 larger segments as opposed to a single massive 60GB segment, but if search is otherwise unavailable on a machine due to the CPU and IO thrashing of massive segment merges, it seems like a fair tradeoff.

I think optimize remains as is, although I would never call it. Or we could add an optimize(long maxSegmentSize) method, analogous to optimize(int maxSegments).
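The proposed optimize(long maxSegmentSize) does not exist in Lucene; by analogy with optimize(int maxNumSegments), it would presumably merge only segments below the size cap and skip the already-large ones. A hypothetical model of that selection (ignoring the contiguity constraint discussed elsewhere in this thread, for simplicity):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Model of the proposed optimize(long maxSegmentSize): segments below
 * the cap are eligible for merging during optimize; segments at or
 * above the cap are left untouched. The method does not exist in
 * Lucene; this only illustrates the proposed selection semantics.
 */
public class SizeCappedOptimize {

    /** Returns indices of segments a size-capped optimize would merge. */
    public static List<Integer> selectForOptimize(long[] sizes, long maxSegmentSize) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < sizes.length; i++) {
            if (sizes[i] < maxSegmentSize) {
                selected.add(i); // below the cap: eligible for merging
            }
            // segments at or above the cap are skipped entirely
        }
        return selected;
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        long[] sizes = { 4 * gb, 1 * gb, 4 * gb, 2 * gb };
        System.out.println(selectForOptimize(sizes, 4 * gb)); // [1, 3]
    }
}
```

This is the property Shai's earlier comment asks for: a plain optimize() would rewrite the large segments too, while a size-capped variant leaves them alone.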
[jira] Commented: (LUCENE-1750) Create a MergePolicy that limits the maximum size of it's segments
[ https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732974#action_12732974 ]

Shai Erera commented on LUCENE-1750:
------------------------------------

What happens after several such large segments are created? Wouldn't you want them to be merged into an even larger segment? Otherwise you'll have many such segments and search performance will degrade.

I guess I never thought this was a problem. If I have enough disk space, and my index reaches 600 GB (which is a huge index) split across 10 segments of 60GB each, I'd want them merged into one larger 600GB segment. It will take eons until I accumulate another such 600 GB segment, no?

Maybe we can have two merge factors: (1) for small segments, up to a set size threshold, where we do the merges regularly; (2) for really large segments, a different merge factor. For example, we can say that up to 1GB the merge factor is 10, and beyond that it is 20. That would postpone the large IO merges until enough such segments accumulate.

Also, with the current proposal, how will optimize work? Will it skip the very large segments, or will they be included too?
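The two-tier merge factor idea above can be sketched in a few lines. The threshold and factors come from the comment's own example; the class and method names are hypothetical, and nothing like this exists in LogByteSizeMergePolicy as shipped:

```java
/**
 * Sketch of the two-tier merge factor idea: a smaller merge factor for
 * segments under a size threshold, a larger one above it, so the big
 * IO-heavy merges happen less often. The 1GB/10/20 numbers are taken
 * from the comment's example; this is not an existing Lucene API.
 */
public class TieredMergeFactor {

    static final long THRESHOLD_BYTES = 1L << 30; // 1GB, per the example
    static final int SMALL_MERGE_FACTOR = 10;     // regular merging
    static final int LARGE_MERGE_FACTOR = 20;     // postpone big merges

    /** Picks the merge factor to apply for a segment of the given size. */
    public static int mergeFactorFor(long segmentBytes) {
        return segmentBytes <= THRESHOLD_BYTES ? SMALL_MERGE_FACTOR
                                               : LARGE_MERGE_FACTOR;
    }

    public static void main(String[] args) {
        System.out.println(mergeFactorFor(512L << 20)); // 512MB -> 10
        System.out.println(mergeFactorFor(60L << 30));  // 60GB  -> 20
    }
}
```

With a factor of 20 above 1GB, twenty large segments must accumulate before the next big merge fires, which is exactly the "postpone the large IO merges" effect the comment is after.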