Matt Corgan summarized the pros and cons of having a large number of regions here:
https://issues.apache.org/jira/browse/HBASE-7667?focusedCommentId=13575024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13575024

Cheers

On Sun, Feb 10, 2013 at 7:43 PM, Joarder KAMAL <joard...@gmail.com> wrote:

> Hi Kevin,
>
> Thanks a lot for your great answers.
> Regarding Q5, to clarify:
>
> Let's say Facebook is using HBase for its integrated messaging/chat/email
> system in a very large-scale setup. The schema design of such a system can
> change over the years (even over the months). Workload patterns may also
> change due to different usage characteristics (for example, the rate of
> messaging may be higher during a protest or specific event in a particular
> country). So region/table hotspots are created at random region servers
> within the cluster despite careful schema design and pre-planning.
>
> The Facebook team rushes to split the hotspotted regions manually and
> redistribute them over a new set of physical machines that were recently
> added to the system to increase scalability in the face of high user
> demand. The hotspotted region data could then be transferred to the new
> physical machines gradually to handle the situation. If the shard (region)
> size is small enough, the data transfer cost over the network could be
> minimal; otherwise a large volume of data needs to be transferred at once.
>
> I have found that in many places it is discouraged to have a large number
> of regions in a system. However, would it be possible to have a very large
> number of regions in a system, thus minimizing the data transfer cost in
> case of hotspotting due to workload/design characteristics? Are there any
> drawbacks or known side effects? I am rethinking possibilities other than
> pre-planned schema and row-key designs.
>
> Thanks again.
>
> Regards,
> Joarder Kamal
>
> On 11 February 2013 13:32, Kevin O'dell <kevin.od...@cloudera.com> wrote:
>
> > Hi Joarder,
> >
> > Welcome to the HBase world.
> > Let me take some time to address your questions the best I can:
> >
> >    1. How often are you facing Region or Table Hotspotting in HBase
> >    production systems? <--- Hotspotting is not something that just
> >    happens. It is usually caused by bad key design and writing to one
> >    region more than the others. I would recommend watching some of
> >    Lars's YouTube videos on Schema Design in HBase.
> >    2. If a hotspot is created, how quickly is it automatically cleared
> >    out (assuming sudden workload change)? <--- It will not be
> >    automatically "cleared out"; I think you may be misinformed here.
> >    Basically, it is on you to watch your table and your write
> >    distribution, determine that you have a hotspot, and take the
> >    necessary action. Usually the only action is to split the region. If
> >    hotspots become a habitual problem, you would most likely want to go
> >    back and re-evaluate your current key.
> >    3. How often does this kind of situation happen: a hotspot is
> >    detected and vanishes before any action is taken? Or do hotspots stay
> >    for a longer period of time? <--- Please see above.
> >    4. Or if the hotspot stays, how is it handled (in general) in
> >    production systems? <--- Some people have to hotspot on purpose early
> >    on, because they only write to a subset of regions. You will have to
> >    manually watch for hotspots (which is much easier in later releases).
> >    5. How is the large data transfer cost minimized or avoided when
> >    re-sharding regions within a cluster in a single data center or over
> >    a WAN? <--- Not quite sure what you are saying here, so I will take a
> >    best guess at it. Sharding is handled in HBase by region splitting.
> >    The best way to succeed in HBase is to understand your data and your
> >    needs BEFORE you create your table and start writing into HBase. This
> >    way you can presplit your table to handle the incoming data and you
> >    won't have to do a massive number of splits.
> >    Later you can allow HBase to split your tables automatically, or you
> >    can set the max file size high and manually control the splits or
> >    sharding.
> >    6. Is hotspotting in an HBase cluster really a (big!) issue nowadays
> >    for OLAP workloads and real-time analytics? <--- Just design your
> >    schema correctly and this should not be a problem for you.
> >
> > Please let me know if this answers your questions.
> >
> > On Sun, Feb 10, 2013 at 9:17 PM, Joarder KAMAL <joard...@gmail.com>
> > wrote:
> >
> > > This is my first email in the group. I have a more general and
> > > open-ended question but hope to get some reasoning from the HBase user
> > > community.
> > > I am a very basic HBase user and still learning. My intention is to
> > > use HBase in one of our research projects. Recently I was looking
> > > through Lars George's book "HBase - The Definitive Guide" and two
> > > particular topics caught my eye. One is 'Region and Table Hotspotting'
> > > and the other is 'Region Auto-Sharding and Merging'.
> > >
> > > *Scenario:*
> > > If a hotspot is created in a particular region or in a table (having
> > > multiple regions) due to a sudden workload change, then one may split
> > > the region into smaller pieces and distribute them to a number of
> > > available physical machines in the cluster. This process should
> > > require large data transfers between different machines in the cluster
> > > and incur a performance cost. One may also change the 'key' definition
> > > and manage the regions. But I am not sure how effective or logical it
> > > is to change key designs on a production system.
> > >
> > > *Questions:*
> > >
> > >    1. How often are you facing Region or Table Hotspotting in HBase
> > >    production systems?
> > >    2. If a hotspot is created, how quickly is it automatically cleared
> > >    out (assuming sudden workload change)?
> > >    3. How often does this kind of situation happen: a hotspot is
> > >    detected and vanishes before any action is taken? Or do hotspots
> > >    stay for a longer period of time?
> > >    4. Or if the hotspot stays, how is it handled (in general) in
> > >    production systems?
> > >    5. How is the large data transfer cost minimized or avoided when
> > >    re-sharding regions within a cluster in a single data center or
> > >    over a WAN?
> > >    6. Is hotspotting in an HBase cluster really a (big!) issue
> > >    nowadays for OLAP workloads and real-time analytics?
> > >
> > > Further directions to more information about region/table hotspotting
> > > are most welcome.
> > >
> > > Many thanks in advance.
> > >
> > > Regards,
> > > Joarder Kamal
> >
> > --
> > Kevin O'Dell
> > Customer Operations Engineer, Cloudera
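
As a footnote to the advice in the thread about key design and presplitting, here is a minimal Python sketch (not taken from the thread) of one common way to spread writes: prefix each row key with a hash-derived salt bucket, and presplit the table on the bucket boundaries. The bucket count, the key format, and the function names `salted_key`, `split_points`, and `region_for` are all assumptions made for illustration; nothing here calls the HBase API.

```python
import bisect
import hashlib
from typing import List

NUM_BUCKETS = 16  # assumed value; choose it to match the planned region count


def salted_key(row_key: str, num_buckets: int = NUM_BUCKETS) -> str:
    """Prefix row_key with a two-digit salt derived from a stable hash."""
    if not 1 <= num_buckets <= 100:
        raise ValueError("two-digit salts support 1..100 buckets")
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    bucket = digest[0] % num_buckets
    return f"{bucket:02d}-{row_key}"


def split_points(num_buckets: int = NUM_BUCKETS) -> List[str]:
    """Boundary keys that give each salt bucket its own key range."""
    return [f"{b:02d}-" for b in range(1, num_buckets)]


def region_for(key: str, splits: List[str]) -> int:
    """Index of the [start, end) key range that a salted key falls into."""
    return bisect.bisect_right(splits, key)


if __name__ == "__main__":
    splits = split_points()
    # Monotonically increasing keys (a classic hotspot source) get scattered.
    for i in range(5):
        key = salted_key(f"msg-{1360500000 + i}")
        print(key, "-> range", region_for(key, splits))
```

The trade-off in this sketch is that salting spreads sequential writes across key ranges, but a range scan over the original keys has to be issued once per bucket and merged by the client.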