Hive and Lzo Compression

w00t w00t Thu, 08 Aug 2013 02:04:00 -0700


Hello,
 
I have started running Hive with Lzo compression on Hortonworks 1.2.
 
I have managed to install and configure Lzo, and hive -e "set io.compression.codecs" shows me the Lzo codecs:
io.compression.codecs=
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec,
org.apache.hadoop.io.compress.BZip2Codec
 
However, I have some questions and would be grateful for your help.

(1) CREATE TABLE statement

I read in several postings that the CREATE TABLE statement has to use the following STORED AS clause:
 
CREATE EXTERNAL TABLE txt_table_lzo (
   txt_line STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '||||'
STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/myuser/data/in/lzo_compressed';
 
Executing SELECT statements on this table with Lzo data now works without any problems.
 
However, I also created a table on the same data without this STORED AS clause:
 
CREATE EXTERNAL TABLE txt_table_lzo_tst (
   txt_line STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '||||'
LOCATION '/user/myuser/data/in/lzo_compressed';
 
Interestingly, SELECT statements on this table work as well.
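
One way to compare the two tables is to look at the storage information Hive recorded for each of them. A minimal sketch, using the two table names from above; the InputFormat and OutputFormat rows of the output should show which classes each table actually uses:

```sql
-- Print the full table metadata, including the InputFormat and OutputFormat
-- classes that Hive stored for each table definition.
DESCRIBE FORMATTED txt_table_lzo;
DESCRIBE FORMATTED txt_table_lzo_tst;
```

If the second table reports a plain-text input format, that would suggest the Lzo files are being decompressed through the codecs registered in io.compression.codecs rather than through the Lzo-specific input format.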
 
Can you help me understand why the second CREATE TABLE statement works as well?
What should I use in DDLs?
Is it best practice to use the STORED AS clause with "DeprecatedLzoTextInputFormat"? Or should I remove it?
 
(2) Output and Intermediate Compression Settings
 
I want to use output compression.
 
In "Programming Hive" by Capriolo, Wampler, and Rutherglen, the following commands are recommended:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
 
However, in some forum posts I found the following recommended settings:
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
 
Am I right that the first settings are for Hadoop versions prior to 0.23?
Or is there another reason why the settings differ?
 
I am using Hadoop 1.1.2 with Hive 0.10.0.
Which settings would you recommend?
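
For context, this is roughly the test I have in mind. It is only a sketch: it uses the property names from the book, the output directory is a made-up path, and whether these are the right names for Hadoop 1.1.2 is exactly what I am unsure about.

```sql
-- Enable compression of the final job output
-- (settings as quoted from "Programming Hive").
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

-- Write query results to a directory; '/user/myuser/data/out/lzo_test' is a
-- hypothetical path chosen for illustration.
INSERT OVERWRITE DIRECTORY '/user/myuser/data/out/lzo_test'
SELECT txt_line FROM txt_table_lzo;
```

If the settings take effect, the files written to the output directory should carry an .lzo extension.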
 
--------------
I also want to compress intermediate results.

Again, "Programming Hive" recommends the following settings:
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
 
Are these the right settings?

Or should I again use the following settings (which look more appropriate for Hadoop 0.23 and later)?
SET hive.exec.compress.intermediate=true;
SET mapreduce.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
 
Thanks
