[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13885673#comment-13885673 ] Sergey Shelukhin commented on HBASE-7667: - the docs were committed to the book as part of HBASE-9854; sorry for late reply Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Fix For: 0.98.0, 0.99.0 Attachments: Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, stripe-cdf.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13883479#comment-13883479 ] Andrew Purtell commented on HBASE-7667: --- Any doc updates available? 0.98.0RC1 is open. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Fix For: 0.98.0, 0.99.0 Attachments: Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, stripe-cdf.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843340#comment-13843340 ] Sergey Shelukhin commented on HBASE-7667: - With 98 coming so soon, probably not.[~stack] wdyt? Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Fix For: 0.98.0, 0.99.0 Attachments: Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, stripe-cdf.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843355#comment-13843355 ] stack commented on HBASE-7667: -- Yeah. Too late for 0.96. We are trying to get back on to a bugs-only in point releases praxis -- unless there a citizen revolt. Also need reason for folks to upgrade to 0.98! Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Fix For: 0.98.0, 0.99.0 Attachments: Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, stripe-cdf.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13842713#comment-13842713 ] Otis Gospodnetic commented on HBASE-7667: - Btw. is this going to get into any 0.96.x releases by any chance? Thanks. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Fix For: 0.98.0, 0.99.0 Attachments: Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, stripe-cdf.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808236#comment-13808236 ] stack commented on HBASE-7667: -- bq. Yes, unless there's non-uniform key access there will be more, but smaller, compactions. As per [~srivas] supposition above I suppose. bq. You probably don't want to turn it on by default, as it makes sense either for non-uniform data or for large regions. So for what cases should we turn it on? bq. The documents, one of them targeted at users (ans all of them out of date), are attached to this very JIRA Pardon me. I've reviewed a bunch of this feature -- docs and code -- and am just having trouble quantifying the benefit this slew of new code brings in. Sorry if I am being thick. If I read the attached user doc, will it be clear (though it is out of date?) Let me try it. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: stripe-cdf.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808248#comment-13808248 ] stack commented on HBASE-7667: -- If I read the user doc., it says: This improves readperformance in common scenarios and greatly reduces variability, by avoiding large and/orinefficient compactions If I read further, there is no clear message on when enabling stripe compactions makes sense. Doc is missing a section on when NOT to use stripe compactions. The doc. has detail but seems like a bunch no longer applies after recent reworkings (as you say above). Do you have some stats on how it can improve life under certain workloads and what those workloads are? My concern is that a bunch of code will go into hbase and it will sit there unused. We have enough of that already. I'd like to have some clear messaging around this feature, both how it can benefit, and also how a user could enable it and see the effects of its workings. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: stripe-cdf.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808404#comment-13808404 ] Sergey Shelukhin commented on HBASE-7667: - [~stack] I understand your concern. I will get to documenting it this or next week, so it should be able to get at least into 98. As far as I know, there are some people who wanted to try it out,and also there's timestamp compaction jira which might be obviated... let's see if as experimental feature it can get adoption. It's pretty well isolated now, so it should be easy to remove later, or move into separate module out of the way. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: stripe-cdf.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13806903#comment-13806903 ] Sergey Shelukhin commented on HBASE-7667: - I have run the maven tests and rebased all patches (no changes except a tiny one in compator)... if no objections I will commit HBASE-7679, HBASE-7680, HBASE-7967, HBASE-8000 to trunk today afternoon if there are no objections by then. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: stripe-cdf.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807290#comment-13807290 ] stack commented on HBASE-7667: -- So, stripe compactions does more i/o unless it is the time series use case? I cannot turn this on by default? Where do I go to read on benefits of this new addition? Thanks [~sershe] Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: stripe-cdf.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807397#comment-13807397 ] Sergey Shelukhin commented on HBASE-7667: - Yes, unless there's non-uniform key access there will be more, but smaller, compactions. HBASE-8541 makes for less IO amplification, Enis is reviewing it, it will probably follow quickly. You probably don't want to turn it on by default, as it makes sense either for non-uniform data or for large regions. The documents, one of them targeted at users (ans all of them out of date), are attached to this very JIRA ;) The configuration was simplified quite a bit compared to the state of the doc. Let me file a JIRA to document things in e.g. the book. For the first release it will be positioned as an experimental feature... Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: stripe-cdf.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690757#comment-13690757 ] Sergey Shelukhin commented on HBASE-7667: - Size-based scheme works by splitting stripes when they grow big. This splitting is good for sequential sharded keys, because lower part of the split is written as one file, and doesn't receive new data (or doesn't receive a lot of it anyway), so it doesn't have to participate in compactions. If you have uniform data, splitting result in more rewriting and both stripes keep growing after the split. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: stripe-cdf.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690761#comment-13690761 ] Sergey Shelukhin commented on HBASE-7667: - So it's easier to configure (just say I want 500Mb-1Gb-... stripes) but in the net, results in more I/O during initial data population before region reaches stable size. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: stripe-cdf.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13688945#comment-13688945 ] stack commented on HBASE-7667: -- Rereading the design doc and how-to-use. They are very nice. Can go into the book. High-level, and I think you have suggested this yourself elsewhere, it'd be coolio if user didn't have to choose between size and count -- if it'd just figure itself based off incoming load. I've seen case where a compaction produces a zero-length file (all deletes) so would that mess w/ this invariant: Compaction mustproduce at least one file(seeHBASE-6059). or ...No stripe can everbe leftwith0 files... I almost asked a few questions you'd already answered above in my previous read of the doc (smile). How would region merge work? We'd just drop all files into L0? Sounds like we'd have to drop references if we are not to break snapshotting. You think this true? stripescheme useslarger number of files than default to ensure all compactions are small, which can affect verywidescans. Any measure of how much? Should stripe be on by default? Or have it as experimental for now until we get more data? How to use doc is excellent (though too many configs). Will review patch again next. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: stripe-cdf.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase)
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689954#comment-13689954 ] Sergey Shelukhin commented on HBASE-7667: - I'm looking at how to merge policies. Unfortunately splitting uses more I/O than not splitting (who would have though... :)), resulting in worse perf. Also, the system cannot really predict future data patterns, no more than region splitting can do it (at least not with a lot of complexity added), so hint flag for how to split would need to be provided. W.r.t. producing files to contain metadata, that is unfortunately necessary. These files shouldn't have effect. Stripes with only expired files can be merged. I've taken a stab at auto-detecting stripes from file metadata, in general case it's very complex, in simplified realistic case it's just complex. Merge will drop everything into L0, yes. This could be improved, but has to be done now anyway due to references, same with split, so no need to do it now. On-by-default would require smart default settings. Let me comment tomorrow on HBASE-7680, if I can make a size-count hybrid quickly I will post final patch without a lot of logic changes there, and hopefully we can commit 3 initial patches and build on top of that. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: stripe-cdf.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690015#comment-13690015 ] stack commented on HBASE-7667: -- bq. splitting uses more I/O than not splitting Sorry. You mean stripes uses more i/o because we L0 first then rewrite into stripes? Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: stripe-cdf.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13654572#comment-13654572 ] Jonathan Hsieh commented on HBASE-7667: --- [~sershe], thanks for the graph. Seeing that, I think a avg/std deviation could be an even simpler way of showing where this compaction approach demonstrates a win. It looks like the variance of times will be significantly higher with default and it seems that avg time would be about the same. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: stripe-cdf.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13651556#comment-13651556 ] Elliott Clark commented on HBASE-7667: -- Thanks for the doc. Reading this tonight. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: stripe-cdf.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13649312#comment-13649312 ] Raymond commented on HBASE-7667: great, more region lead to better load balance and good compaction effect, less region lead to easy management and fast failover, I think stripe (or sub-region) is a good trade-off. And I think stripe compaction is similar with Level compaction with L0+L1 only. Another difficult is about configuration, in big hbase cluster, there are so many applications, how to build suitable configuation for each one will be a huge challenge. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf, Using stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13618967#comment-13618967 ] Sergey Shelukhin commented on HBASE-7667: - Currently only one compaction per store is allowed. The need to compact several stripes in parallel can probably be alleviated by just having less stripes? As future improvement it is possible to add. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13619380#comment-13619380 ] Sergey Shelukhin commented on HBASE-7667: - Actually, judging by logs what can be done is triggering compaction thread if store can compact. In 25-stripe case I see gaps between compactions which are unnecessary, when the compaction only triggers on flush despite plenty of tiny stripe compactions being possible Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13618189#comment-13618189 ] Matt Corgan commented on HBASE-7667: Sergey - i'm curious how are compactions of the stripes being scheduled/queued? Does a region still make a single region-wide compaction request, and the compactor picks a single stripe? Or can multiple stripes be in the compaction queue at once? Given that regions could be allowed to grow much larger with stripe compaction enabled it would probably be good to allow multiple stripes to compact in parallel. Just a thought for another next step... you've probably considered it already. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617951#comment-13617951 ] Sergey Shelukhin commented on HBASE-7667: - I did a c1.xlarge test (with default, 3, 10 and 25 stripes, 2 times each). The results for different stripe configurations are very consistent across both runs. Compared to m1.large test the positive effect of increasing number of stripes on write speed is less. For this load, sweet spot appears to be around 10-12 stripes based on two tests. 3 stripes have large compactions similar to default (well, not as large); 25 stripes does too many small compactions, so select-compact loop cannot keep up with the number of files produced - on Iteration 2 test described in the doc at least some stripes in 25-stripe case always have 6-8 small files (as they get compacted other stripes get more files). This appears to be the limiting factor on increasing the number of stripes. I think the main point is that, for count scheme, there's perf parity (writes are generally slightly slower, reads slightly faster), despite existing and fixable write amplification; and there's reduction of variability, which was the goal. I will try to devise a more realistic read workload, but I don't think it should change much given above. For sequential data, with size-based stripe scheme there's reduction in compactions, as expected, despite even L0. Next steps: 1) On existing data I want to correlate read/write perf with compactions. It is interesting that stripe scheme has slower writes in general, as Jimmy has noted - it touches read path but not anything at all on write path, so it is probably I/O related, or stresses some interaction between existing write and compaction paths. 2) Run tests for more realistic read workloads (and parallel read/writes), by not using LoadTestTool? Optional-ish. 3) Clean up integration test patch in HBASE-8000. 4) Review and commit? :) 5) Get rid of L0? Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really.
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617956#comment-13617956 ] Ted Yu commented on HBASE-7667: --- bq. 5) Get rid of L0? Can we do this frist ? Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616654#comment-13616654 ] Sergey Shelukhin commented on HBASE-7667: - I will update the doc, although in this case m1.large moderate IO capacity works even better for the test, making it easier to simulate IO-constrained cluster on such small number of nodes/time frame. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616976#comment-13616976 ] Sergey Shelukhin commented on HBASE-7667: - Btw, the 3 next child JIRAs are rady for review. Please feel free to +1 them, I will only commit all 3 together and with integration test included. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compaction perf evaluation.pdf, Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf, Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614805#comment-13614805 ] Andrew Purtell commented on HBASE-7667: --- I'd recommend retesting with c1.xlarge instance types, this will get you a lot closer to real hardware IMHO. The IO capability of the c1.xlarge is high vs. only moderate for m1.large and the c1.xlarge has 8 vcores as opposed to _2 only_ for the m1.large. The c1.xlarge will have 4 locally attached instance-store volumes while the m1.large has only 2 IIRC. Also, I didn't see it mentioned in the perf doc but you should use only the locally attached instance store volumes as datanode storage volumes to avoid variance introduced by EBS. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compaction perf evaluation.pdf, Stripe compactions.pdf, Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13612653#comment-13612653 ] M. C. Srivas commented on HBASE-7667: - @mcorgan and @stack: The total i/o in terms of i/o bandwidth consumed is the same. But the disk iops are much, much worse. And disk iops are at a premium, and bg activity like compactions should consume as few as possible. Let's say we split a region into a 100 sub-regions, such that each sub-region is in the few 10's of MB. If the data is written uniformly randomly, each sub-region will write out a store at approx the same time. That is, a RS will write 100x more files into HDFS (100x more random i/o on the local file-system). Next, all sub-regions will do a compaction at almost the same time, which is again 100x more read iops to read the old stores for merging. One can try to stagger the compactions to avoid the sudden burst by incorporating, say, a queue of to-be-compacted-subregions. But while the sub-regions at the head of the queue will compact in time, the ones at the end of the queue will have many more store files to merge, and will use much more than their fair-share of iops (not to mention that the read-amplification in these sub-regions will be higher too). The iops profile will be worse than just 100x. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13612808#comment-13612808 ] stack commented on HBASE-7667: -- [~srivas] bq. But the disk iops are much, much worse. See Sergey's writeup. We flush same as we always did writing a single file to the L0 tier. It is later at compaction time -- i.e. NOT random i/o -- that we'd write a file per sub-region/stripe. If the write is evenly distributed, we'd do the same overall i/o except with stripe compacting it would be done in smaller bite sizes ([~sershe] Would the compaction of stripes run in //? Hopefully, for the case Srivas describes, we'd progress serially through the stripes/sub-regions or at least it would be an option and then later, ergonomically, we'd recognize the even-loading case and add compaction accordingly) You have a point that we will be making more files in the fs. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13612847#comment-13612847 ] Matt Corgan commented on HBASE-7667: {quote}a RS will write 100x more files into HDFS (100x more random i/o on the local file-system){quote}I think this is a point of confusion. A typical HBase file could be 4MB to 40GB, where those files are a series of 4KB (very small) underlying disk blocks. Ignoring complexities of multiple tasks running simultaneously on the regionserver, only the first 4KB block of each file is a random write, while the following blocks are sequential writes. The few extra random writes should be lost in the noise of all the other random IO requests happening. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13612956#comment-13612956 ] Sergey Shelukhin commented on HBASE-7667: - bq. The first half of the sentence seems to be incomplete. bq. I think I understand what you mean - such KVs would be written by previous writer. bq. Missing author, date, and JIRA pointer? bq. I think you need to say stripe == sub-range of the region key range. You almost do. Just do it explicitly. bq. What does this mean andold boundaries rarely, if ever, moving.? Give doc an edit? bq. Say in doc that you mean storefile metadata else it is ambiguous. bq. Not sure I follow here: This compaction is performed whenthe number of L0 files exceeds somethreshold and producesthe number of files equivalent to the number of stripes,withenforcedexisting boundaries. Fixed these. bq. An interesting comment by LarsH recently was that maybe we should ship w /major compactions off; most folks don't delete Hmm... in general I agree but we'll have to insert really good warnings everywhere. Can we detect if they delete? :) bq. Missing is one a pointer at least to how it currently works (could just point at src file I'd say with its description of 'sigma' compactions) and a sentence on whats wrong w/ it bq. Later I suppose we could have a combination of count-based and size-based if an edge stripe is N time bigger than any other, add a new stripe? Yeah, it's mentioned in code comment somewhere. bq. I was wondering if you could make use of liang xie's bit of code for making keys for the block cache where he chooses a byte sequence that falls between the last key in the former block and the first in the next block but the key is shorter than either. but it doesn't make sense here I believe; your boundaries have to be hard actual keys given inserts are always coming in so nevermind this suggestion. For boundary determination it does make sense; can you point at the code? After cursory look I cannot find it. bq. You write the stripe info to the storefile. I suppose it is up to the hosting region whether or not it chooses to respect those boundaries. It could ignore them and just respect the seqnum and we'd have the old-style storefile handling, right? (Oh, I see you allow for this -- good) Yes. bq. Thinking on L0 again, as has been discussed, we could have flushes skip L0 and flush instead to stripes (one flush turns into N files, one per stripe) but even if we had this optimization, it looks like we'd still want the L0 option if only for bulk loaded files or for files whose metadata makes no sense to the current region context. • The aggregate range of files going in mustbe contiguous... Not sure I follow. Hmm... could do with going into a compaction Yes, that was my thinking too. bq. If the stripe boundaries are changed by compaction, the entire stripes withold boundaries mustbe replaced ...What would bring this on? And then how would old boundaries get redone? This one is a bit confusing. Clarified. Basically one cannot have 3 files in (-inf, 3) and 3 in [3, inf), then take 3 and 2 respectively, and rewrite them with boundary 4, because then there will be a file with [3, inf) remaining that overlaps. bq. I was going to suggest an optimization for later for the case that an L0 fits fully inside a stripe, I was thinking you could just 'move' it into its respective stripe... but I suppose you can't do that because you need to write the metadata to put a file into a stripe... Yeah. Also wouldn't expect it to be a common case. bq. Would it help naming files for the stripe they belong too? Would that help? In other words do NOT write stripe data to the storefiles and just let the region in memory figure which stripe a file belongs too. When we write, we write with say a L0 suffix. When compacting we add S1, S2, etc suffix for stripe1, etc. To figure what the boundaries of an S0 are, it'd be something the region knew. On open of the store files, it could use the start and end keys that are currently in the file metadata to figure which stripe they fit in. bq. Would be a bit looser. Would allow moving a file between stripes with a rename only. The delete dropping section looks right. I like the major compaction along a stripe only option. This could be done as future improvement. The implications of change of naming scheme for other parts of the systems need to be determined. Also for all I know it might break snapshots (moving files does). And, code to figure ut stripes on the fly would be more complex. bq. Foremptyranges,empty files are
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613004#comment-13613004 ] stack commented on HBASE-7667: -- HBASE-7845 optimize hfile index key is the key/boundary determination work I was referring to (I don't think it applies here but adding the reference since you asked for it) Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613454#comment-13613454 ] Lars Hofhansl commented on HBASE-7667: -- bq. An interesting comment by LarsH recently was that maybe we should ship w /major compactions off; most folks don't delete Hmm... I don't doubt that I said this, but I'm not sure that I agree :) Many people do delete and just not removing the delete markers would be unexpected. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compactions.pdf, Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13611728#comment-13611728 ] M. C. Srivas commented on HBASE-7667: - There is of course one major caveat with this approach. If data insertion is uniformly spread (ie, key is uniform random), this proposal performs much worse than the existing scheme. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13611820#comment-13611820 ] Sergey Shelukhin commented on HBASE-7667: - There are two approaches discussed, one similar to having many small regions and one for sequential data. Which one do you mean? I am testing the first one with uniformly distributed keys now and it's somewhat slower than default case on average (on writes mostly) but has no big compaction associated latency spikes... I suspect if there was not so much compaction due to L0 the write slowness could also be alleviated. I haven't tested the 2nd one yet (requires a more custom test, next week) but it's very specialized, for sequential data, so yes it is not good for common case. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13611843#comment-13611843 ] stack commented on HBASE-7667: -- bq. There is of course one major caveat with this approach. If data insertion is uniformly spread (ie, key is uniform random), this proposal performs much worse than the existing scheme. [~srivas] Why? Won't it do same total i/o? Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13611844#comment-13611844 ] stack commented on HBASE-7667: -- {code} On attached doc, it is lovely. Missing author, date, and JIRA pointer? An interesting comment by LarsH recently was that maybe we should ship w /major compactions off; most folks don't delete Missing is one a pointer at least to how it currently works (could just point at src file I'd say with its description of 'sigma' compactions) and a sentence on whats wrong w/ it or the problems it leads too when left run amok (you say it for major compactions but even w/o major compactions enabled, an i/o tsunami can hit and wipe us out What does this mean andold boundaries rarely, if ever, moving.? Give doc an edit? I think you need to say stripe == sub-range of the region key range. You almost do. Just do it explicitly. I see your extra justification for l0, the need to be able to bulk load. It is kinda important that we continue to support that. Good one. Later I suppose we could have a combination of count-based and size-based if an edge stripe is N time bigger than any other, add a new stripe? I was wondering if you could make use of liang xie's bit of code for making keys for the block cache where he chooses a byte sequence that falls between the last key in the former block and the first in the next block but the key is shorter than either. but it doesn't make sense here I believe; your boundaries have to be hard actual keys given inserts are always coming in so nevermind this suggestion. You write the stripe info to the storefile. I suppose it is up to the hosting region whether or not it chooses to respect those boundaries. It could ignore them and just respect the seqnum and we'd have the old-style storefile handling, right? (Oh, I see you allow for this -- good) Say in doc that you mean storefile metadata else it is ambiguous. Thinking on L0 again, as has been discussed, we could have flushes skip L0 and flush instead to stripes (one flush turns into N files, one per stripe) but even if we had this optimization, it looks like we'd still want the L0 option if only for bulk loaded files or for files whose metadata makes no sense to the current region context. • The aggregate range of files going in mustbe contiguous... Not sure I follow. Hmm... could do with going into a compaction If the stripe boundaries are changed by compaction, the entire stripes withold boundaries mustbe replaced ...What would bring this on? And then how would old boundaries get redone? This one is a bit confusing. Get key before is a PITA Not sure I follow here: This compaction is performed when the number of L0 files exceeds somethreshold and producesthe number of files equivalent to the number of stripes,withenforcedexistingboundaries. I was going to suggest an optimization for later for the case that an L0 fits fully inside a stripe, I was thinking you could just 'move' it into its respective stripe... but I suppose you can't do that because you need to write the metadata to put a file into a stripe... Would it help naming files for the stripe they belong too? Would that help? In other words do NOT write stripe data to the storefiles and just let the region in memory figure which stripe a file belongs too. When we write, we write with say a L0 suffix. When compacting we add S1, S2, etc suffix for stripe1, etc. To figure what the boundaries of an S0 are, it'd be something the region knew. On open of the store files, it could use the start and end keys that are currently in the file metadata to figure which stripe they fit in. Would be a bit looser. Would allow moving a file between stripes with a rename only. The delete dropping section looks right. I like the major compaction along a stripe only option. Foremptyranges,empty files are created. Is this necessary? Would be good to avoid doing this. {code} Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13611909#comment-13611909 ] Matt Corgan commented on HBASE-7667: {quote}If data insertion is uniformly spread (ie, key is uniform random), this proposal performs much worse than the existing scheme.{quote}I think the goal for uniformly random keys is to have the same amount of total work done but to stagger that work. Instead of doing 1 big 24 GB compaction per day, it could do a 1 GB compaction each hour. The savings/efficiency become more pronounced with less random keys, with the biggest savings for sequential keys. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13611324#comment-13611324 ] Sergey Shelukhin commented on HBASE-7667: - Hmm... I am starting to think that we might want to consider getting rid of L0 after all. Somehow it escaped me that L0 gives you x2 write amplification right there as all data has to be re-striped. Small files actually don't increase the number of files for gets/non-overlapping scans because each L0 file still counts for each stripe. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13609873#comment-13609873 ] Ted Yu commented on HBASE-7667: --- Nice document, Sergey. bq. This can obviously be improved since most (or all) stripes, see future improvements. The first half of the sentence seems to be incomplete. bq. Before starting a new writer, compactor ensures that all the KVs for the last row in the previous writer go to the previous writer. I think I understand what you mean - such KVs would be written by previous writer. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: Stripe compactions.pdf So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578445#comment-13578445 ] Jimmy Xiang commented on HBASE-7667: Getting rid of L0 means memstore flushing will take longer and hold update/read longer? Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578513#comment-13578513 ] stack commented on HBASE-7667: -- bq. Getting rid of L0 means memstore flushing will take longer and hold update/read longer? Yes but on back side, less compactions (no need to compact on region open since no half files/references) and for many workloads, less compactions overall because only a subset of files will be picked up from L0 tier. Can do work too to minimize how much we hold flushes. Could be dumb and on flush, inline, examine key spread so can make ruling on where to insert boundaries (figuring boundaries would be for first flush only? Or I suppose boundary making would be ongoing over the life of a region... four boundaries per region seems like a nice number to work with)... so this would be a scan of the 64MB memstore keys or, we could keep a running tally as we insert into the memstore so at flush time we knew where the boundaries were... could do stuff like // request to NN to open the files to flush too. In general, we want to add smarts around flushing (stripes, in-memory compacting, etc.) so eventually we are going to have this friction on flush (IMO). Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578037#comment-13578037 ] Sergey Shelukhin commented on HBASE-7667: - If the user copies the file, few things may happen. If he uses the non-stripe compaction, files will just get compacted into fewer files eventually. If all files are copied, stripe compactions are used and stripe config is different, SFM will load old stripes and eventually re-stripe the data to conform to new config, if necessary. That will depend on compactionpolicy implementation. If part of the files are copied (whatever sense that makes), either the above will still happen, or SFM will give up on metadata, and put all files to L0, re-striping them according to new config after some time. SFM needs metadata from all files, but as far as I see from HStore that is already loaded, because other code makes use of metadata without special considerations. One thing is that with more files there will be more to open... Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578087#comment-13578087 ] Matt Corgan commented on HBASE-7667: Sergey - I'm curious what's the reasoning behind flushing memstore to a single L0 file rather than splitting the memstore into the stripes during each flush? Keep flushes faster, less files, etc? Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578090#comment-13578090 ] Sergey Shelukhin commented on HBASE-7667: - My reasoning was - too many really tiny files, plus scope creep into memstore. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578099#comment-13578099 ] stack commented on HBASE-7667: -- Would it simplify the filesystem implementation if you did the split in memory (caveat the scope creep up into memstore) so no special L0 tier? Regards files being too small, that is a different issue (e.g. a Toddea -- Todd+idea -- that I tripped over recently in an issue here is that rather than flush immediately, that we'd instead do a purge and compaction in memory before flushing to ensure the content large enough to make a file... but yeah, that'd be something else). Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578105#comment-13578105 ] Matt Corgan commented on HBASE-7667: Gotcha. Agree about limiting scope. If the special L0 tier turns out to be more difficult to implement than originally intended for whatever reason, might be worth evaluating splitting during flush. Seems like the same number of files might get created anyway when you split the L0 file? Or do you plan on doing some logical striping across the L1 boundary as Nicolas says above where the L0 files are never truly split? Like Stack mentions, longer term I think we'll need to split memstore while in use, and those splits should probably have some alignment with these stripe boundaries. For another day... Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578108#comment-13578108 ] Sergey Shelukhin commented on HBASE-7667: - L0 actually should be relatively easy to implement, it's the special cases about all the possible stripe boundaries that cause all the complexity. L0 files will hopefully not be split individually, but as soon as we reach some number of files, similarly to level db algorithm. But yeah with insta-stripe solution we could get rid of L0 and also get less files for reads. Could be future improvement. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578192#comment-13578192 ] stack commented on HBASE-7667: -- I'd like to get rid of L0 so can do splitting w/o resorting to file References and half hfiles (smile) but yes, could be future as you say Sergey. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13576798#comment-13576798 ] Sergey Shelukhin commented on HBASE-7667: - The stripe boundaries are supposed to be an internal detail not visible to the user, the user only configured the scheme (#of stripes, or size-based stripes as described for time series data). What kind of stripe configuration do you have in mind? After talking about splits yesterday I am planning to make it a little more flexible for splits, but still internal. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13576826#comment-13576826 ] stack commented on HBASE-7667: -- bq. Currently Sergey puts the boundaries in metadata of store files. What is wrong w/ this? It is the bounds of the keys the file contains? (Is this not there already, the first and last key?) What are the thoughts on figuring region stripe boundaries. Will it be done by looking at the memstore content just before flush and dividing it into n files? Or will it be done after the first flush on subsequent compactions by looking at content of L0? Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13576829#comment-13576829 ] Sergey Shelukhin commented on HBASE-7667: - In the N stripes scheme, boundaries will be determined at first L0 compaction. I am intending to add parameter to config to make first L0 compaction wait for more files to get better ones. Then, the stripes can be rebalanced but the hope is that it happens infrequently. In the growing key range scheme (for time series above) stripe boundaries don't matter much, and are basically determined based on size. If nothing intervenes, by EOW or next week I hope to have preliminary compaction policy/compactor patch. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13576836#comment-13576836 ] Ted Yu commented on HBASE-7667: --- bq. It is the bounds of the keys the file contains? (Is this not there already, the first and last key?) Currently HFile provides the following methods: {code} byte[] getFirstRowKey(); byte[] getLastRowKey(); {code} Stripe boundaries should be different from the above. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13576845#comment-13576845 ] stack commented on HBASE-7667: -- [~ted_yu] You raise a 'concern' that is nebulous to shoot down a suggested implementation and then when asked why you do not answer the question asked. [~sershe] Would it be easier looking at memstore than at content of L0? Would have to do appropriate weighting... in that there will be hot spots in the keyspace and these should get more stripes. Rejiggering the stripes joining up the infrequently used and allotting more stripes to the hot keyspace area sounds right. Good stuff. Maybe its worth writing up a doc at this stage. This issue is full of goodness. A doc could distill it out and make it easier on the digestion and easier to think about it. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13576881#comment-13576881 ] Sergey Shelukhin commented on HBASE-7667: - Discussed with Ted, what was meant is configuring it like region splitting, which is not planned at this point. Doc - probably needed... I will get to it eventually. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13577311#comment-13577311 ] Ted Yu commented on HBASE-7667: --- Here is my understanding of the difference between existing store file metadata and the new stripe boundary metadata. Existing store file metadata (first row key, last row key, etc) is intrinsic to the store file. i.e., their meaning doesn't change when the store file gets copied to another cluster. However stripe boundary doesn't seem to carry the same characteristics. Let's consider table A1 in cluster C1 and table A2 in cluster C2. They have the same schema and region boundaries. When store file F1 is copied from C1 to C2, user may switch to a different striping strategy. The reason could be that clusters C1 and C2 serve different patterns of load. Embedding stripe boundary in store file may potentially cause confusion in cluster C2. Another factor is that StoreFileManager needs to scan all store files to establish / validate stripe boundaries when region opens. Please correct me if I am wrong. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575883#comment-13575883 ] Nicolas Spiegelberg commented on HBASE-7667: Some thoughts I had about this: Overall, I think it's a good idea. Seems like it's not crazy to add and would have multiple benefits. Logical striping across the L1 boundary is a simple solution to both proactively handle splits and reduce compaction times. Thoughts on this feature 1. Fixed configs : in the same way that we got a lot of stability by limiting the regions/server to a fixed number, we might want to similarly limit the number of stripes per region to 10 (or X) instead of every Y bytes. This will help us understand the benefit we get from striping and it's easy to double the striping and chart the difference. 2. NameNode pressure : Obviously, a 10x striping factor will cause 10x scaling of the FS. Can we offset this by increasing the HDFS block size, since addBlock dominates at scale? Really, unlike Hadoop, you have all of the HFile or none of it. Missing a portion of the HFile currently invalidates the whole file. You really need 1 HDFS block == 1 HFile. However, we could probably just toy with increasing it by the striping factor right now and seeing if that balances things. 3. Open Times : I think this will be an issue, specifically on server start. Need to be careful here. 4. Major compaction : you can perform a major compaction (remove deletes) as long as you have [i,end) contiguous. I don't think you'd need to involve L0 files in an MC at all. Save the complexity. Furthermore, part of the reason why we created the tiered compaction is to prevent small/new files from participating in MC because of cache thrashing, poor minor compactions, and a handful of other reasons. So, some thoughts on related pain points we seem to have that tie into this feature: 1. Reduce Cache thrashing : region moves kill us a lot of time because we have a cold cache. There is a worry that more aggressive compactions mean more thrashing. I think it will actual even this out better since right now a MC causes a lot of churn. Just should be thinking about this if perf after the feature isn't what we desire. 2. Unnecessary IOPS : outside of this algorithm, we should just completely get rid of the requirement to compact after a split. We have the block cache, so given a [start,end) in the file, we can easily tell our mid point for future splits. There's little reason to aggressively churn in this way after splitting. 3. Poor locality : for grid topology setups, we should eventually make the striping algorithm a little more intelligent about picking our replicas. If all stripes go to the same secondary tertiary node, then splits have a very restricted set of servers to chose for datanode locality. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575936#comment-13575936 ] Sergey Shelukhin commented on HBASE-7667: - bq. It may be a follow on to this jira, but having striper dynamically add stripes at the end of the region would let allow all the stripes before the last one go cold which is critical for avoiding hugely wasteful compactions of non-changing data Actually, it can be added as part of the main work, HBASE-7679 (file management) code includes such capabilities. I wonder how, no matter the compactions, does region management work for such scenario. Wouldn't all the load always be on last region if you have TS keys? Or, if you have artificial partitioning but query by TS, wouldn't all queries go to all servers? bq. To major compact a stripe, all L0 files, if any, can be split into stripes, then merge all files belonging to the stripe. Can you explain more about the delete marker limitation? Suppose in current compaction selection, I choose a set of files starting at the oldest file but not including all files. Wouldn't that be enough to process delete markers that delete the updates within those files? Granted, I might not process all delete markers, but I don't have to see all files. E.g. if I only have 3 files with one entry for K each, K=V, delete K, K=V2, and I compact the first two, I can remove entries for K from them, right? bq. 1. Fixed configs : in the same way that we got a lot of stability by limiting the regions/server to a fixed number, we might want to similarly limit the number of stripes per region to 10 (or X) instead of every Y bytes. This will help us understand the benefit we get from striping and it's easy to double the striping and chart the difference. That is the original idea. Thanks for other comments :) Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575951#comment-13575951 ] stack commented on HBASE-7667: -- In a striped region, if stripes = 2 and the key distribution is basically even, a split could be done w/o references, halfhfiles, and rewriting from parent to daughter; a split could just rename the parent files into the daughter regions. It could make for split simplification and possibly make for some i/o savings. This is a load of great stuff in this issue. Best read I've had in a long time. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575995#comment-13575995 ] Matt Corgan commented on HBASE-7667: {quote}Open Times : I think this will be an issue, specifically on server start. Need to be careful here.{quote}Hopefully could be mitigated by making regions larger, like doubling region size and setting max 2 stripes/region. Theoretically should be able to have the same overall number of files as normal regions, or are there other factors at play? {quote}Wouldn't all the load always be on last region if you have TS keys? Or, if you have artificial partitioning but query by TS, wouldn't all queries go to all servers?{quote}An easy strategy for P partitions is to * prepend a single byte to each key where prefix=hash(row)%P * pre-split the table into P regions * tweak the balancer to evenly spread the tail partitions for each region * writes get sprayed evenly to all tail partitions * a single Get query will only hit one region since you know hash(row)%P beforehand * you scan all P partitions using a P-way collating iterator ** so yes, scans go to all servers but presumably they are huge and would hit lots of data anyway ** because they are huge, a client that scans the partitions concurrently will be faster * a big multi-Get will spray to the exact servers necessary, possibly all of them, but like scans may be faster because done in parallel I'm not sure what most people are doing with time series data but this seems like a good approach to me. You basically just choose arbitrarily large P. An MD5 prefix is essentially P=2^128 (I wouldn't recommend pre-splitting at that granularity). Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13576023#comment-13576023 ] Sergey Shelukhin commented on HBASE-7667: - bq. a split could just rename the parent files into the daughter regions I was told this is impossible due to snapshots relying on files not moving between regions (or on references during splits?) We just discussed this here, some improvements for splits should definitely be possible. bq. so yes, scans go to all servers but presumably they are huge and would hit lots of data anyway Yeah, I meant the scans, was assuming scans for TS data mostly. I see. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13576044#comment-13576044 ] Jimmy Xiang commented on HBASE-7667: bq. Wouldn't that be enough to process delete markers that delete the updates within those files? I think so, as long as the files not included are newer (like in L0). Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13576296#comment-13576296 ] Ted Yu commented on HBASE-7667: --- I think we need to find the best way of storing stripe boundaries. Currently Sergey puts the boundaries in metadata of store files. This is not flexible. Once this feature is released, users would request ways to configure stripes in different manners. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575449#comment-13575449 ] Ted Yu commented on HBASE-7667: --- bq. Sub-region is not a good name either though. Other names we can consider: arena, range, realm, section, sector, zone. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575578#comment-13575578 ] Jimmy Xiang commented on HBASE-7667: Stripes don't have overlapping keyrange with other stripes. So each stripe is just like a sub-region. L0 files are special, could overlap with multiple stripes. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575583#comment-13575583 ] Lars Hofhansl commented on HBASE-7667: -- I see, then a major compaction needs to at least include all current L0 files (in any), right? Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575589#comment-13575589 ] Jimmy Xiang commented on HBASE-7667: That's right. To major compact a stripe, all L0 files, if any, can be split into stripes, then merge all files belonging to the stripe. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575206#comment-13575206 ] Jimmy Xiang commented on HBASE-7667: bq. Overall, I think the sub-region approach cuts the rope in the tug-of-war. It lets you have a smaller number of regions at the same time as having the most efficient compactions. I agree. I was wondering if it will take longer to open a region if there are many hfiles. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575297#comment-13575297 ] stack commented on HBASE-7667: -- [~mcorgan] Nice write up. So you are in favor of Sergey's project? [~jxiang] If I were to guess, it will take longer than the case where we have monolithic files that cover the total region namespace as we currently have (Because there will be more files). If rather than strip'ing inside a region, we instead had a region per stripe, my guess is that stripe'ing will take less time to open since less region machinations going on (less .regioninfos to open, less looking in fs for stuff to clean up since last file open, less listing of storefiles under families). Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575309#comment-13575309 ] Matt Corgan commented on HBASE-7667: {quote}So you are in favor of Sergey's project?{quote}oh yes, if you could not tell. Thinking of an HBASE-7667 tatoo =) One of the few major things hbase is missing in my opinion is the ability to load time-series through the normal api, rather than having to go off and write some separate bulk load code. HBase currently takes a dump when you do that. Main culprits are HBASE-5479 and my comment in HBASE-3484 (https://issues.apache.org/jira/browse/HBASE-3484?focusedCommentId=13410934page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13410934). Even during normal operation as opposed to a one-off import of data, the inefficiencies are still happening, just at a less obvious pace. It may be a follow on to this jira, but having striper dynamically add stripes at the end of the region would let allow all the stripes before the last one go cold which is critical for avoiding hugely wasteful compactions of non-changing data. Ideally, it would be able to allocate small stripes as new data comes in (each flush?) and then later go on to merge older stripes to reduce hfile count (at major compaction time?). With this in place on an N node cluster, you could partition your data with N or 2N regions using a hash prefix and basically let the regions grow infinitely large. Currently I have to limit region size to ~2GB which results in hundreds of regions per node which is a bit of a management hassle because it's beyond human readable, and a bit wasteful with all the empty memstores among other things. I do wonder if there's a more accurate name than stripe. Stripes make me think of RAID stripes which is a different concept than sub-regions. Sub-region is not a good name either though. It would be cool if you could set a column family attribute like layout=TIME_SERIES which HBase could use to automatically pick the compaction strategy, split-point strategy, balancer strategy, and allow future niceties like using stronger compression on old data. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575312#comment-13575312 ] Lars Hofhansl commented on HBASE-7667: -- Stripes can have overlapping keyrange with other stripes, correct? I.e. if two L0 files are compacted with a L0-compation their each L0 is striped but since the L0-files could overlap, the stripes could too with other stripes from different L0 files. A nice property in LevelDB is that only L0 files have overlapping keyspaces with other files, all level L0 have no overlapping keys within a level and no file at level L overlaps more than 10 files at L+1. Right now only major compactions can remove delete markers because only by looking at all data you can guarantee that you will see each KV that might be affected by the delete marker. It's not clear to me how we get around this, unless we introduce a formal notion of levels and know which L+1 files overlap with a file at level L. Would love to discuss more at the PowWow on the Feb 19th. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575012#comment-13575012 ] Jimmy Xiang commented on HBASE-7667: A stripe is like a sub-region. That's a good idea. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575024#comment-13575024 ] Matt Corgan commented on HBASE-7667: Right now there is a tug-of-war between region size and number of regions. You want fewer regions for: * server startup/shutdown * minimizing RPC calls * having fewer/bigger memstore flushes * fewer open files per server You want more regions for: * spreading load among servers * having more efficient compactions {quote}spreading load among servers{quote} Since BigTable was designed machines have grown tremendously, with many running 24 cpus and 48+GB memory. These machines can serve many regions even if each region is 30GB. So we can achieve good load distribution even with enormous regions. {quote}having more efficient compactions{quote} But, huge regions are bad for compactions. 30GB compactino is still considered expensive. Data in HBase is generally not perfectly evenly distributed across a table, or if it is you are not taking full advantage of HBase's sorted architecture. Huge regions therefore have hot and cold stripes/sub-regions. If you compact a 30GB region, you are wasting a ton of time on the cold stripes. Overall, I think the sub-region approach cuts the rope in the tug-of-war. It lets you have a smaller number of regions at the same time as having the most efficient compactions. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Components: Compaction Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562257#comment-13562257 ] Sergey Shelukhin commented on HBASE-7667: - What do you guys think about the general idea? Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562268#comment-13562268 ] Matt Corgan commented on HBASE-7667: So a stripe is like a sub-region? In terms of compactions, it sounds like it serves the same purpose as splitting a table into regions, so you can compact a region that is hot while letting cold regions stay cold. If this is the case and stripes are allowed to auto-split, it may be very beneficial for time series data. If you had a region approaching 10gb with 100mb stripes, the last stripe would keep splitting and the first 99 would never get touched. The problem without stripes is that the first 9900mb keeps getting re-written even though it never changes. Am i understanding it correctly that stripes in a region would act similarly to regions in a table? Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562273#comment-13562273 ] chunhui shen commented on HBASE-7667: - Interesting ideas. With stripe compaction, we could support many files in one region, each file belongs to one stripe, and no overlap keys cross stripes except LO, is it right? I think it is useful for the sequential write scenario. bq.if we could move files between regions, no references would be needed) Moving files would break snapshots, references are needed all the same Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562278#comment-13562278 ] Sergey Shelukhin commented on HBASE-7667: - Hmm, my thinking would be that number of stripes will be fixed and we would rebalance, but never split as such. Yes, the idea for sequential data plays very well into this, the code will probably be almost the same. With non-seq data, it would just try to achieve effect similar to level. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562286#comment-13562286 ] Sergey Shelukhin commented on HBASE-7667: - If there are no general objections I will try to start on this tomorrow, in preference to level... I am looking at some random HCM issues now so patch may be next week provided that HBASE-7516 and HBASE-7603 can go in. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7667) Support stripe compaction
[ https://issues.apache.org/jira/browse/HBASE-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562290#comment-13562290 ] Matt Corgan commented on HBASE-7667: I've been brainstorming something similar for splitting the memstore into stripes, mentioned in https://issues.apache.org/jira/browse/HBASE-3484?focusedCommentId=13410934page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13410934 I think it's a good idea now that region sizes have become so large. It's easy to have a few hot stripes in a region if it's 10GB, and not necessarily wrong from a primary-key design perspective. It's often very wasteful to be compacting the whole region. Support stripe compaction - Key: HBASE-7667 URL: https://issues.apache.org/jira/browse/HBASE-7667 Project: HBase Issue Type: New Feature Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin So I was thinking about having many regions as the way to make compactions more manageable, and writing the level db doc about how level db range overlap and data mixing breaks seqNum sorting, and discussing it with Jimmy, Matteo and Ted, and thinking about how to avoid Level DB I/O multiplication factor. And I suggest the following idea, let's call it stripe compactions. It's a mix between level db ideas and having many small regions. It allows us to have a subset of benefits of many regions (wrt reads and compactions) without many of the drawbacks (managing and current memstore/etc. limitation). It also doesn't break seqNum-based file sorting for any one key. It works like this. The region key space is separated into configurable number of fixed-boundary stripes (determined the first time we stripe the data, see below). All the data from memstores is written to normal files with all keys present (not striped), similar to L0 in LevelDb, or current files. Compaction policy does 3 types of compactions. First is L0 compaction, which takes all L0 files and breaks them down by stripe. It may be optimized by adding more small files from different stripes, but the main logical outcome is that there are no more L0 files and all data is striped. Second is exactly similar to current compaction, but compacting one single stripe. In future, nothing prevents us from applying compaction rules and compacting part of the stripe (e.g. similar to current policy with rations and stuff, tiers, whatever), but for the first cut I'd argue let it major compact the entire stripe. Or just have the ratio and no more complexity. Finally, the third addresses the concern of the fixed boundaries causing stripes to be very unbalanced. It's exactly like the 2nd, except it takes 2+ adjacent stripes and writes the results out with different boundaries. There's a tradeoff here - if we always take 2 adjacent stripes, compactions will be smaller but rebalancing will take ridiculous amount of I/O. If we take many stripes we are essentially getting into the epic-major-compaction problem again. Some heuristics will have to be in place. In general, if, before stripes are determined, we initially let L0 grow before determining the stripes, we will get better boundaries. Also, unless unbalancing is really large we don't need to rebalance really. Obviously this scheme (as well as level) is not applicable for all scenarios, e.g. if timestamp is your key it completely falls apart. The end result: - many small compactions that can be spread out in time. - reads still read from a small number of files (one stripe + L0). - region splits become marvelously simple (if we could move files between regions, no references would be needed). Main advantage over Level (for HBase) is that default store can still open the files and get correct results - there are no range overlap shenanigans. It also needs no metadata, although we may record some for convenience. It also would appear to not cause as much I/O. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira