Hey - not sure if anyone responded. SequenceFiles are the way to go if you want parallelism across the files as well (since gzip-compressed text files cannot be split).
One simple way to do this is to start with text files, build a (potentially external) table on them, and load them into another table that is declared to be stored as a SequenceFile. The load can simply be an 'insert overwrite table XXX select * from YYY' on the first table (YYY). The first table is just a temporary table used to do the loading.

Whether the data is compressed as a result is controlled by the Hive option 'hive.exec.compress.output'. If this is set to true, the codec used is whatever is dictated by the Hadoop options that control the codec. The relevant options are:

  mapred.output.compression.codec
  mapred.output.compression.type

You want to set them to org.apache.hadoop.io.compress.GzipCodec and BLOCK respectively.

Hope this helps,
Joydeep

-----Original Message-----
From: Bob Schulze [mailto:b.schu...@ecircle.com]
Sent: Wednesday, March 18, 2009 8:07 AM
To: hive-user@hadoop.apache.org
Subject: Keeping Data compressed

Hi,

I want to keep data in Hadoop compressed, ready for Hive selects to access. Is using SequenceFiles with compression the way to go? How can I get my data into Hive tables "as sequencefile", with an underlying compression?

Thanks for any ideas,

Bob
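Putting the steps above together, here is a rough end-to-end sketch. The table names, the single-column schema, and the HDFS path are made up for illustration; adjust them to your actual data:

  -- external table over the existing (uncompressed) text files in HDFS;
  -- path and schema are placeholders
  CREATE EXTERNAL TABLE logs_text (line STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/user/bob/raw_logs';

  -- target table declared to be stored as a SequenceFile
  CREATE TABLE logs_seq (line STRING)
  STORED AS SEQUENCEFILE;

  -- enable compressed output and pick the codec/type as described above
  SET hive.exec.compress.output=true;
  SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
  SET mapred.output.compression.type=BLOCK;

  -- the load itself: rewrites the data as block-compressed SequenceFiles
  INSERT OVERWRITE TABLE logs_seq SELECT * FROM logs_text;

After the insert, logs_text can be dropped (being external, dropping it leaves the raw files in place), and subsequent Hive selects against logs_seq will read the compressed SequenceFiles in parallel.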