Hello, I've seen issues similar to this one come up once or twice before, but I haven't ever seen a solution to the problem I'm having. I was following the Compressed Storage page on the Hive wiki (http://wiki.apache.org/hadoop/CompressedStorage) and realized that the sequence files created in the warehouse directory are actually uncompressed and larger than the originals.
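To be concrete, here's roughly what I ran (just a sketch: the column names are placeholders for my real schema, and GzipCodec is only one of the codecs I tried):

    -- settings per the wiki; in the Hive CLI, 'SET <property>;' with no
    -- value echoes the current setting, so these can be checked in-session
    SET hive.exec.compress.output=true;
    SET io.seqfile.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    -- placeholder names for the seven comma-separated fields in the sample below
    CREATE TABLE test1_comp (
      f1 INT, f2 BIGINT, f3 STRING, f4 STRING, f5 STRING, f6 INT, f7 STRING
    ) STORED AS SEQUENCEFILE;

    INSERT OVERWRITE TABLE test1_comp SELECT * FROM test1;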
For example, I have a table 'test1' whose input data looks something like:

    0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
    0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
    0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
    ...

After creating a second table 'test1_comp' with the STORED AS SEQUENCEFILE directive and the compression options SET as described in the wiki, I can look at the resulting sequence files and see that they're just plain (uncompressed) text:

    SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Text+�c�!Y�M ��Z^��=80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141= ...

I've tried messing around with different org.apache.hadoop.io.compress.* options, but the sequence files always come out uncompressed. Has anybody ever seen this, or does anyone know a way to keep the data compressed? Since the input text is so uniform, we get huge space savings from compression and would like to store the data this way if possible. I'm using Hadoop 0.20.1 and a Hive build that I checked out from SVN about a week ago.

Thanks,
Brent