GZ (gzip) typically compresses text fairly well, assuming that's the
compression codec you're using.
I don't believe 1.4 has anything extra at the RFile level for size
savings; however, I think that 1.5+ has some additional encoding to
reduce the size on disk.
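As a rough sanity check, here is a minimal Python sketch (standard library only) that gzips synthetic keys shaped like the forward-table examples quoted below. The key generator, the sizes, and the function names are assumptions made up for illustration, and it does not model RFile's actual block layout or index overhead, so treat the ratio as indicative only.

```python
import gzip
import random


def synthetic_forward_keys(n_uids=2000, seed=0):
    """Build sorted keys shaped like '<timeblock>|<hash>|<uid> <index>|<value>'.

    Every value here is invented for illustration; only the shape follows
    the examples in the quoted message.
    """
    rng = random.Random(seed)
    keys = []
    for uid in range(n_uids):
        # Row ID: constant timeblock, one hex digit of "hash", 4-hex-digit uid.
        row = "0000000000000|%x|%04x" % (uid % 16, uid)
        ip = ".".join("%03d" % rng.randrange(256) for _ in range(4))
        port = "%05d" % rng.choice([22, 80, 443, 8080])
        keys.append(row + " IP|" + ip)
        keys.append(row + " PORT|" + port)
    return sorted(keys)


def gzip_ratio(keys):
    """Return (raw size, gzipped size, raw/gzipped) for newline-joined keys."""
    raw = "\n".join(keys).encode("utf-8")
    compressed = gzip.compress(raw)
    return len(raw), len(compressed), len(raw) / len(compressed)


raw_len, gz_len, ratio = gzip_ratio(synthetic_forward_keys())
print(raw_len, gz_len, round(ratio, 1))
```

Running the same measurement over a dump of the real keys would give a better estimate than this synthetic data can.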
On 10/29/13, 5:50 PM, Slater, David M. wrote:
Hello,
I’m seeing about an order of magnitude difference between the number of
bytes returned by mutation.numBytes() and the size of the rfiles on disk
(Accumulo 1.4.2). Note that all of my mutations are new entries, and
there are no combiners running.
While I understand that there is some compression on the rfile, I would
be really surprised if it were 10:1.
My entries are composed of a row ID (most of which is equivalent to the
previous row ID), an empty column family, a nonempty column qualifier
(which likely shares a lot with the previous qualifier), and an empty
value. An example of the rowID and column qualifier might be:
(forward table)
0000000000000|9|fa19 IP|127.000.000.001
0000000000000|9|fa19 PORT|00080
…
0000000000000|9|fa22 IP|128.032.144.139
…
<timeblock>|<hash>|<uid> <index>|<textual value>
OR
(reverse table)
0000000000000|IP|127.000.000.001 fa19
0000000000000|IP|127.000.000.001 fd02
0000000000000|IP|127.000.000.002 123
…
0000000000000|PORT|00080 fa19
The numBytes() method appears to return the combined string length of
the row ID and column qualifiers, plus 26 bytes per column qualifier.
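Written out as a short Python sketch, that estimate looks like the following. It encodes only the observation above (string lengths plus 26 bytes per column qualifier), not Accumulo's actual serialization, and the function name is hypothetical. It also assumes one mutation per row ID carrying several column qualifiers.

```python
PER_COLUMN_OVERHEAD = 26  # bytes per column qualifier, as observed above


def estimated_num_bytes(row_id, column_qualifiers):
    """Approximate mutation size from the observed pattern (not Accumulo's code)."""
    row_bytes = len(row_id.encode("utf-8"))
    cq_bytes = sum(len(cq.encode("utf-8")) for cq in column_qualifiers)
    return row_bytes + cq_bytes + PER_COLUMN_OVERHEAD * len(column_qualifiers)


# One forward-table row with two column qualifiers, taken from the example above.
row = "0000000000000|9|fa19"
cqs = ["IP|127.000.000.001", "PORT|00080"]
estimate = estimated_num_bytes(row, cqs)
overhead_share = PER_COLUMN_OVERHEAD * len(cqs) / estimate
print(estimate, round(overhead_share, 2))
```

With entries this short, the fixed per-column overhead is a large share of the reported size, so part of the gap may come from numBytes() counting bytes that do not end up on disk in the same form.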
Is there something else that I’m missing, or could this data really
compress by that much?
Thanks,
David