Hey - not sure if anyone responded.

SequenceFiles are the way to go if you want parallelism on the files as well 
(since gzip-compressed files cannot be split). 

One simple way to do this is to start with text files, build a (potentially 
external) table on them, and load them into another table that is declared to 
be stored as a SequenceFile. The load can simply be an 'insert overwrite table 
XXX select * from YYY' on the first table (YYY). The first table is just a 
temporary table used to do the loading.
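For example, the whole flow looks something like this (the table names and 
schema here are just made up for illustration, not taken from your setup):

    -- staging table over the raw text files already in HDFS
    CREATE EXTERNAL TABLE logs_txt (ts STRING, msg STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/user/bob/logs_txt';

    -- final table, stored as a SequenceFile
    CREATE TABLE logs_seq (ts STRING, msg STRING)
    STORED AS SEQUENCEFILE;

    -- rewrite the data into the SequenceFile table
    INSERT OVERWRITE TABLE logs_seq
    SELECT * FROM logs_txt;

Once the insert is done you can drop the external table - the files it 
points at are left in place.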

Whether the resulting data is compressed is controlled by the Hive option 
'hive.exec.compress.output'. If this is set to true, the codec used is 
whatever is dictated by the Hadoop options that control the codec. The 
relevant options are:

mapred.output.compression.codec
mapred.output.compression.type

You want to set them to org.apache.hadoop.io.compress.GzipCodec and BLOCK 
respectively.
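For example, in the Hive CLI (or in your config files) before running the 
insert above:

    set hive.exec.compress.output=true;
    set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
    set mapred.output.compression.type=BLOCK;

With these set, the 'insert overwrite' writes block-compressed gzip 
SequenceFiles, which stay splittable.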

Hope this helps,

Joydeep

-----Original Message-----
From: Bob Schulze [mailto:b.schu...@ecircle.com] 
Sent: Wednesday, March 18, 2009 8:07 AM
To: hive-user@hadoop.apache.org
Subject: Keeping Data compressed

Hi,

        I want to keep data in Hadoop compressed, ready for Hive selects to
access.

Is using SequenceFiles with compression the way to go?

How can I get my data into Hive tables "as sequencefile", with an
underlying compression?

Thanks for any ideas,

        Bob
