Accumulo sorts keys and then compresses them in blocks. Each block is compressed without any information external to the block, so you're going to see different compression ratios depending on the relative entropy of the keys inside each compressed block. For the purpose of discussion, let's simplify this to two elements in the row: A|B or B|A.
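You can see the effect with a small experiment. This is only a sketch: zlib stands in for Accumulo's block compressor, the 4-key block size mirrors the toy example below rather than a real RFile block, and the key sets are made up for illustration.

```python
import zlib

# Hypothetical key components: many short numeric A values, few long B values.
a_vals = [str(i) for i in range(1000)]
b_vals = ["maroon", "orange", "purple", "yellow"]

# The same data keyed two ways: A|B vs. B|A, sorted as Accumulo would sort them.
ab_keys = sorted(f"{a}|{b}" for a in a_vals for b in b_vals)
ba_keys = sorted(f"{b}|{a}" for b in b_vals for a in a_vals)

def compressed_size(keys, block_size=4):
    # Compress each block independently -- no shared state across blocks,
    # which is the property that makes key ordering matter.
    total = 0
    for i in range(0, len(keys), block_size):
        block = "\n".join(keys[i:i + block_size]).encode()
        total += len(zlib.compress(block))
    return total

print("A|B ordering:", compressed_size(ab_keys))
print("B|A ordering:", compressed_size(ba_keys))
```

With this layout, each B|A block is dominated by one repeated prefix (e.g. `maroon|`), while each A|B block has to spell out all four color strings, so the B|A ordering compresses noticeably better.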
Suppose set A is {0,1,2,3,4} and set B is {maroon,orange,purple,yellow}. We can make the following keys:

0|maroon 0|orange 0|purple 0|yellow 1|maroon ...

and

maroon|0 maroon|1 maroon|2 maroon|3 orange|0 orange|1 ...

Now, let's also assume that a block fits 4 keys. In the A|B case the first block has to represent {0,maroon,orange,purple,yellow}. In the B|A case the first block has to represent {maroon,0,1,2,3}.

The block in the A|B case has higher relative entropy, since the B set contains more information and the entire B set has to be represented in the block. You can see this in the string representations above: the information in the A|B block is roughly twice as long as the information in the B|A block. This is admittedly a crude example, but hopefully it helps you see some of the elements that contribute to the compression ratio.

Practically speaking, the best way to estimate the size of a table is to load some real data and take measurements. Try to add data in such a way that your compressed blocks will be similar to those in the full table: in the A|B case, sample from A and use the complete B set; in the B|A case, sample from B and use the complete A set. If your blocks are representative of the full table, then a linear extrapolation will give you a pretty good size estimate. Doing this piece-wise for each type of block (each table, in your case) should also work.

Hope that helps!
Adam

On Tue, Sep 8, 2015 at 9:19 AM, z11373 <z11...@outlook.com> wrote:
> I have 3 tables, all of them have the same column family name and an empty
> column qualifier.
> For the row id, let's say each table has something like below ('|' is a
> delimiter char in this context).
>
> Table1:
> A|B|C
>
> Table2:
> B|C|A
>
> Table3:
> C|A|B
>
> So as we can see above, all of them have pretty much similar content (and
> actually the same row id length), and they all have the same number of rows
> (I have verified it): 2,181,193 rows.
> However, when I check their table sizes I found different results:
>
> root@dev> du -h -t Table1
> 17.70M [Table1]
> root@dev> du -h -t Table2
> 27.58M [Table2]
> root@dev> du -h -t Table3
> 32.48M [Table3]
>
> I am a bit surprised to see the different results, but I realize that
> Accumulo applies compression to the data. Looking at those table size
> numbers, am I right to conclude that A|B|C somehow seems to have a better
> compression rate than B|C|A, which apparently is better than C|A|B?
>
> With this fact, it makes my job a bit more difficult to give management a
> disk space estimate for storing our data in Accumulo. Earlier I was
> thinking we could guesstimate how many rows we may have in the future and
> multiply that by a factor x (and perhaps also multiply by 3 for
> replication), and that would be the estimate I give, but now I can't even
> figure out that 'x'. Does any of you have experience with this, and
> perhaps can share?
>
> Thanks,
> Z
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/table-size-questions-tp15079.html
> Sent from the Developers mailing list archive at Nabble.com.