Hello, I've seen issues similar to this one come up once or twice before,
but I haven't ever seen a solution to the problem I'm having. I was
following the Compressed Storage page on the Hive Wiki
(http://wiki.apache.org/hadoop/CompressedStorage) and realized that the
sequence files created in the warehouse directory are actually
uncompressed and larger than the originals.

For example, I have a table 'test1' whose input data looks something like:

0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
...

And after creating a second table, 'test1_comp', with the STORED AS
SEQUENCEFILE directive and the compression options SET as described in the
wiki, I can look at the resulting sequence files and see that they're just
plain (uncompressed) text:

SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Text+�c�!Y�M
��Z^��=80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
...
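
For reference, here's roughly what I ran, following the wiki page. The
column names and types below are just my guesses from the sample rows
above, not the real schema:

-- compression settings from the CompressedStorage wiki page
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;

-- column names/types are placeholders guessed from the sample data
CREATE TABLE test1_comp (
  flag INT,
  seq BIGINT,
  dt STRING,
  tm STRING,
  tag STRING,
  len INT,
  val STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE test1_comp SELECT * FROM test1;

As far as I can tell, a compressed sequence file's header should also name
the codec class, and there's no codec anywhere in the header dumped above.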

I've tried messing around with different org.apache.hadoop.io.compress.*
options (examples below), but the sequence files always come out
uncompressed. Has anybody seen this before, or does anyone know a way to
keep the data compressed? Since the input text is so uniform, we get huge
space savings from compression and would like to store the data this way
if possible. I'm using Hadoop 0.20.1 and a Hive build that I checked out
from SVN about a week ago.
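
To be concrete, these are the sorts of job-level settings I've been
toggling (just examples; I'm not sure which combination is actually
supposed to take effect through Hive):

-- Hadoop output compression knobs, tried in various combinations
SET mapred.output.compress=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
-- also tried swapping in the default codec:
-- SET mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;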

Thanks,
Brent
