Accumulo sorts keys and then compresses them in blocks. Each block is
compressed without any information external to the block, so you're going
to see different compression ratios depending on the relative entropy of
the keys inside each compressed block. For the purposes of discussion,
let's simplify this to two elements in the row: A|B or B|A

Suppose set A is {0,1,2,3,4}, and set B is {maroon,orange,purple,yellow}.
We can make the following keys:

0|maroon
0|orange
0|purple
0|yellow
1|maroon
...

and

maroon|0
maroon|1
maroon|2
maroon|3
orange|0
orange|1
...

Now, let's also assume that a block fits 4 keys. In the A|B case the
first block has to represent the following:

{0,maroon,orange,purple,yellow}

In the B|A case the first block has to represent the following:

{maroon,0,1,2,3}

The block in the A|B case has a higher relative entropy, since the B set
contains more information and we have to represent the entire B set in the
block. You can see this visually: the string representation of the
information in the A|B block is roughly twice as long as the string
representation of the information in the B|A block. This is admittedly a
crude example, but hopefully it helps you see some of the elements that
contribute to the compression ratio.
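To make this concrete, here is a minimal, self-contained Java sketch (my
own toy code, not anything from Accumulo: plain java.util.zip GZIP stands
in for the per-block compression, and real RFile blocks are far larger and
add their own key encoding). It builds both sorted orderings, compresses
them in independent 4-key blocks, and sums the compressed bytes:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class BlockOrderingDemo {

    // Gzip a string and return the compressed length in bytes.
    static int gzippedSize(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.size();
    }

    // Compress the keys in independent blocks of blockSize keys each,
    // and sum the per-block compressed sizes.
    static int totalCompressed(List<String> keys, int blockSize)
            throws IOException {
        int total = 0;
        for (int i = 0; i < keys.size(); i += blockSize) {
            StringBuilder block = new StringBuilder();
            int end = Math.min(i + blockSize, keys.size());
            for (String k : keys.subList(i, end)) {
                block.append(k).append('\n');
            }
            total += gzippedSize(block.toString());
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        String[] a = {"0", "1", "2", "3", "4"};
        String[] b = {"maroon", "orange", "purple", "yellow"};

        List<String> ab = new ArrayList<>();   // A|B ordering, already sorted
        for (String x : a)
            for (String y : b)
                ab.add(x + "|" + y);

        List<String> ba = new ArrayList<>();   // B|A ordering, already sorted
        for (String y : b)
            for (String x : a)
                ba.add(y + "|" + x);

        System.out.println("A|B blocks: " + totalCompressed(ab, 4) + " bytes");
        System.out.println("B|A blocks: " + totalCompressed(ba, 4) + " bytes");
    }
}

You should see the B|A total come out smaller, since each of its blocks is
mostly a run of a single color prefix that deflate can back-reference
cheaply; the exact numbers will depend on the codec and block size.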

Practically speaking, the best way to get an estimate for the size of a
table is to load some real data and take measurements. Try to add data in
such a way that your compressed blocks will be similar to those in the
full table. So, in the A|B case, sample from A and use the complete set B;
in the B|A case, sample from B and use the complete set A. If you make
your blocks representative of the full table, then a linear extrapolation
will give you a pretty good estimate of its size. Doing this piecewise for
each type of block (each table, in your case) should also work.
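As a back-of-envelope illustration of that extrapolation (every figure
below is invented, and I'm assuming the shell's du reports pre-replication
file sizes):

public class SizeEstimate {
    public static void main(String[] args) {
        long sampleRows  = 100_000;    // rows ingested for the test load (made up)
        long sampleBytes = 900_000;    // 'du' result for the test table (made up)
        long targetRows  = 2_181_193;  // expected rows in the full table
        int  replication = 3;          // HDFS replication factor (assumption)

        double bytesPerRow = (double) sampleBytes / sampleRows;
        long logical  = (long) (bytesPerRow * targetRows);
        long physical = logical * replication; // raw disk, if du is pre-replication

        System.out.printf("logical=%,d bytes physical=%,d bytes%n",
                logical, physical);
    }
}

The key caveat is the one above: the sample only extrapolates well if its
blocks look like the full table's blocks.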

Hope that helps!

Adam


On Tue, Sep 8, 2015 at 9:19 AM, z11373 <z11...@outlook.com> wrote:

> I have 3 tables, all of them have the same column family name and an
> empty column qualifier.
> For the row id, let's say each table has something like the following
> ('|' is a delimiter char in this context).
>
> Table1:
> A|B|C
>
> Table2:
> B|C|A
>
> Table3:
> C|A|B
>
> So as we can see above, all of them have pretty much the same content
> (and actually the same row id length), and they all have the same number
> of rows (I have verified it): 2,181,193 rows.
> However, when I check their table size I found different result:
> root@dev> du -h -t Table1
>    17.70M [Table1]
> root@dev> du -h -t Table2
>    27.58M [Table2]
> root@dev> du -h -t Table3
>    32.48M [Table3]
>
> I am a bit surprised to see the different results, but I realize that
> Accumulo applies compression to the data. Looking at those table size
> figures, am I right to conclude that A|B|C somehow seems to have a better
> compression ratio than B|C|A, which apparently is better than C|A|B?
>
> With this fact, it becomes a bit more difficult to give management a
> disk space estimate for storing our data in Accumulo. Earlier I was
> thinking that if we could guesstimate how many rows we may have in the
> future, and multiply that by a factor x (and perhaps also multiply by 3
> for replication), then that's the guesstimate I could give, but now I
> can't even figure out that 'x'. Does any of you have experience with
> this, and perhaps can share?
>
> Thanks,
> Z
>
