I would go with a single table and encode the group in the row key.

= Row Key Structure =
<group-depth><A group><B group><C group><D group>

group-depth: 1..4, encoded as 1 byte
A-D group: each encoded as 1 byte, not as a string

Examples:
<1><192>
<2><192><168>
<3><192><168><1>
<4><192><168><1><10>

Column Qualifier: "c" - stands for counters
Column Qualifier: "t" - stands for total

When you get a request for 192.168.1.10, you need to increase 4 rows, so
build 4 Increment objects and send them to HBase using HTable.batch. Each
Increment object will increase the "t" column.

When you scan, simply scan for the range based on the group. For example,
all the groups under 192.168 can be fetched by scanning for rows with the
prefix <3><192><168> (each number is one byte in the byte array you compose
as the prefix; the depth byte is 3 because the matching rows are the
three-octet groups 192.168.x). You'll get back at most 256 rows.

With IPv4, a popular site can see 6-7 million unique IPs in 10 minutes of
traffic.

You can enhance this by adding a column qualifier for each hour, built by
converting the epoch of that hour (a long) into a byte array, on top of the
all-hours total counter. This way you can filter the traffic by a range of
dates/hours.

On Sun, Jan 27, 2013 at 6:51 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:

> Hi,
>
> Let's imagine this scenario.
>
> I want to store IPs with counters. And I want to have counters by
> groups of IPs. All of that will be calculated with MR jobs and stored
> in HBase.
>
> Let's take some IPs and make sure they are ordered by adding some "0"
> when required.
>
> 037.113.031.119
> 058.022.018.176
> 058.022.159.151
> 109.169.201.076
> 109.169.201.150
> 109.254.019.140
> 122.031.039.016
> 122.224.005.210
> 178.137.167.041
>
> I want to have counters for all "levels" of those IPs. Which mean for
> those groups.
>
> Group 1:
> 037
> 058
> 109
> 122
> 178
>
> Group 2:
>
> 037.113
> 058.022
> 109.169
> 109.254
> 122.031
> 122.224
> 178.167
>
> Group 3:
>
> 037.113.031
> 058.022.018
> 058.022.159
> 109.169.201
> 109.254.019
> 122.031.039
> 122.224.005
> 178.137.167
>
> And group 4 is the complete IPs list.
>
> Each time I see an IP, I will increment the required values into the 4
> groups.
>
> What's the bests way to store that knowing that I want to be able to
> easily list all the entries (ranged based) from one group.
>
> Option 1 is to have one table per group. 1CF, 1C
> Pros: Very easy to access, retrieve, etc.
> Cons: Will generate 4 tables
>
> Option 2 is to have one table, but 1 CF per group.
> Pros: Only one table, easy access.
> Cons: Heard that we should try to keep CFs under 3. Might have bad
> performances impacts.
>
> Option 3 is to have one table, one CF and one C per group.
> Pros: Only one table, only one CF.
> Cons: Access is less easy than option 1 and 2.
>
> I think Option 2 is the worst one. Option 1 is very easy to implement.
> And for option 3, I don't see any benefit compared to option 1.
>
> So I'm tempted to go with option 1, but I don't like the idea of
> multiplying the table.
>
> Does anyone have any comment on which options might be the best one,
> or even proposed another option?
>
> JM
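
P.S. To make the row-key scheme from the top of this message concrete, here
is a minimal sketch in plain Java (no HBase dependency) of how the byte
arrays could be built. The class and method names (IpRowKeys, rowKeysFor,
childPrefix, hourQualifier) are made up for illustration. It also assumes
that the "epoch of that hour" is epoch milliseconds truncated to the hour
and encoded as 8 big-endian bytes, which the message does not specify.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: builds the byte arrays only.
final class IpRowKeys {

    private static final long MILLIS_PER_HOUR = 3_600_000L;

    /** Parses 1 to 4 dotted octets, e.g. "192.168" or "192.168.1.10". */
    static byte[] parseOctets(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length < 1 || parts.length > 4) {
            throw new IllegalArgumentException("expected 1-4 octets: " + dotted);
        }
        byte[] octets = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            int value = Integer.parseInt(parts[i]);
            if (value < 0 || value > 255) {
                throw new IllegalArgumentException("octet out of range: " + value);
            }
            octets[i] = (byte) value; // one byte per octet, not a string
        }
        return octets;
    }

    /** Row key layout: <group-depth><first `depth` octets>. */
    static byte[] rowKey(byte[] octets, int depth) {
        if (depth < 1 || depth > octets.length) {
            throw new IllegalArgumentException("bad depth: " + depth);
        }
        byte[] key = new byte[depth + 1];
        key[0] = (byte) depth;
        System.arraycopy(octets, 0, key, 1, depth);
        return key;
    }

    /** The 4 rows to increment when a full IPv4 address is seen. */
    static List<byte[]> rowKeysFor(String ip) {
        byte[] octets = parseOctets(ip);
        if (octets.length != 4) {
            throw new IllegalArgumentException("not a full IPv4 address: " + ip);
        }
        List<byte[]> keys = new ArrayList<>();
        for (int depth = 1; depth <= 4; depth++) {
            keys.add(rowKey(octets, depth));
        }
        return keys;
    }

    /** Scan prefix listing the groups one level below, e.g. 192.168 -> 192.168.x. */
    static byte[] childPrefix(String group) {
        byte[] octets = parseOctets(group);
        if (octets.length >= 4) {
            throw new IllegalArgumentException("a full IP has no sub-groups");
        }
        byte[] prefix = new byte[octets.length + 1];
        prefix[0] = (byte) (octets.length + 1); // depth of the child rows
        System.arraycopy(octets, 0, prefix, 1, octets.length);
        return prefix;
    }

    /** Assumed encoding: epoch millis truncated to the hour, 8 bytes big-endian. */
    static byte[] hourQualifier(long epochMillis) {
        long hourStart = (epochMillis / MILLIS_PER_HOUR) * MILLIS_PER_HOUR;
        return ByteBuffer.allocate(Long.BYTES).putLong(hourStart).array();
    }

    private static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        for (byte[] key : rowKeysFor("192.168.1.10")) {
            System.out.println("row key:      " + hex(key));
        }
        System.out.println("child prefix: " + hex(childPrefix("192.168")));
        System.out.println("hour column:  " + hex(hourQualifier(System.currentTimeMillis())));
    }
}
```

Each of the four keys from rowKeysFor would go into its own Increment, as
described at the top. Because every key of a given depth has a fixed length,
a prefix whose depth byte equals its own number of octets can only match
that single group's row, so childPrefix uses the next depth to list the
sub-groups. A one-byte prefix such as <2> would list every row of Group 2.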