sum of mutation.numBytes() significantly different from rfile size

Slater, David M. Tue, 29 Oct 2013 14:51:20 -0700

Hello,

I'm seeing roughly an order-of-magnitude difference between the summed number 
of bytes returned by mutation.numBytes() and the size of the rfiles on disk 
(Accumulo 1.4.2). Note that all of my mutations are new entries, and there are 
no combiners running.


While I understand that the rfile is compressed to some degree, I would be 
really surprised if the ratio were 10:1.

My entries are composed of a row ID (most of which is identical to the 
previous row ID), an empty column family, a nonempty column qualifier (which 
likely shares a lot with the previous qualifier), and an empty value. An 
example of the row ID and column qualifier might be:

(forward table)
0000000000000|9|fa19        IP|127.000.000.001
0000000000000|9|fa19        PORT|00080
...
0000000000000|9|fa22        IP|128.032.144.139
...
<timeblock>|<hash>|<uid>    <index>|<textual value>

OR
(reverse table)
0000000000000|IP|127.000.000.001    fa19
0000000000000|IP|127.000.000.001    fd02
0000000000000|IP|127.000.000.002    123
...
0000000000000|PORT|00080            fa19

The numBytes() method appears to return the string length of the row ID plus 
the string lengths of the column qualifiers, plus 26 bytes for each column 
qualifier.
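
To make that estimate concrete, here is a minimal sketch in Python of the 
arithmetic described above. It is not Accumulo code: estimate_num_bytes and 
PER_UPDATE_OVERHEAD are hypothetical names, and the 26-byte figure is only 
what I observed, not a documented constant.

```python
# Hypothetical helper mirroring the observed estimate; not Accumulo's own code.
PER_UPDATE_OVERHEAD = 26  # bytes per column qualifier, as observed (assumption)


def estimate_num_bytes(row_id, qualifiers):
    """Row ID counted once, plus each qualifier's length and a fixed overhead."""
    row_len = len(row_id.encode("utf-8"))
    update_len = sum(
        len(q.encode("utf-8")) + PER_UPDATE_OVERHEAD for q in qualifiers
    )
    return row_len + update_len


# One mutation from the forward-table example: one row ID, two qualifiers.
row = "0000000000000|9|fa19"
quals = ["IP|127.000.000.001", "PORT|00080"]
print(estimate_num_bytes(row, quals))
```

Summing this over all mutations gives the figure I am comparing against the 
rfile size.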

Is there something else that I'm missing, or would this possibly compress by 
that much?
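
One rough way to get a feel for how compressible keys like these are is to 
compress a synthetic batch of them. The sketch below is only an approximation 
under stated assumptions: zlib stands in for whatever codec the rfile actually 
uses, the entries are generated text rather than real rfile blocks, and 
synthetic_forward_entries is a hypothetical helper.

```python
import random
import zlib


def synthetic_forward_entries(n_uids, seed=0):
    """Build sorted, forward-table-style entries with made-up IPs and ports."""
    rng = random.Random(seed)
    lines = []
    for uid in range(n_uids):
        row = "0000000000000|9|%04x" % uid
        ip = ".".join("%03d" % rng.randrange(256) for _ in range(4))
        port = "%05d" % rng.randrange(65536)
        lines.append("%s IP|%s" % (row, ip))
        lines.append("%s PORT|%s" % (row, port))
    return "\n".join(lines).encode("utf-8")


raw = synthetic_forward_entries(5000)
packed = zlib.compress(raw)  # zlib is a stand-in codec (assumption)
ratio = len(raw) / len(packed)
print("raw=%d compressed=%d ratio=%.1f:1" % (len(raw), len(packed), ratio))
```

This only measures the key text itself. It does not include the 26 bytes per 
qualifier counted by numBytes(), so the ratio it prints is not directly 
comparable to the one I am seeing.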

Thanks,
David
