Re: Hive compression with external table

2012-11-06 Thread Krishna Rao
Thanks for the reply. Compressed sequence files might work. However, it's not
clear to me whether it's possible to read SequenceFiles using an external
table.
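A minimal sketch of the sort of DDL in question, assuming the SequenceFiles
already sit in HDFS (the table name, columns and path below are placeholders):

    -- Hypothetical external table over existing SequenceFile data; Hive's
    -- default SerDe reads the value part of each record as delimited text,
    -- and the files stay where they are in HDFS.
    CREATE EXTERNAL TABLE web_logs_seq (
      ts      STRING,
      user_id STRING,
      url     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS SEQUENCEFILE
    LOCATION '/data/warehouse/web_logs_seq';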

On 5 November 2012 16:04, Edward Capriolo edlinuxg...@gmail.com wrote:

 Compression is a confusing issue. Sequence files that are in block
 format are always splittable regardless of which compression codec is
 chosen for the block. The Programming Hive book has an entire section
 dedicated to the permutations of compression options.

 Edward
 On Mon, Nov 5, 2012 at 10:57 AM, Krishna Rao krishnanj...@gmail.com
 wrote:
  Hi all,
 
  I'm looking into finding a suitable format to store data in HDFS, so that
  it's available for processing by Hive. Ideally I would like to satisfy the
  following:
 
  1. store the data in a format that is readable by multiple Hadoop projects
  (eg. Pig, Mahout, etc.), not just Hive
  2. work with a Hive external table
  3. store data in a compressed format that is splittable
 
  (1) is a requirement because Hive isn't appropriate for all the problems
  that we want to throw at Hadoop.
 
  (2) is really more of a consequence of (1). Ideally we want the data stored
  in some open format that is compressed in HDFS.
  This way we can just point Hive, Pig, Mahout, etc at it depending on the
  problem.
 
  (3) is obviously so it plays well with Hadoop.
 
  Gzip is no good because it is not splittable. Snappy looked promising, but
  it is splittable only if used with a non-external Hive table.
  LZO also looked promising, but I wonder whether it is future proof
  given the licensing issues surrounding it.
 
  So far, the only solution I could find that satisfies all the above seems to
  be bzip2 compression, but concerns about its performance make me wary about
  choosing it.
 
  Is bzip2 the only option I have? Or have I missed some other compression
  option?
 
  Cheers,
 
  Krishna



Re: Hive compression with external table

2012-11-06 Thread Bejoy KS
Hi Krishna

SequenceFiles with Snappy compression would be my recommendation as well. They
can be processed by managed as well as external tables.

There is no difference in storage formats between managed and external tables.

Also, this data can be consumed by MapReduce or Pig directly.
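A rough sketch of how one might produce Snappy block-compressed SequenceFiles
from Hive (the table names and source query are placeholders; the mapred.*
property names are the classic pre-YARN ones, and SnappyCodec needs the native
snappy libraries installed on the cluster):

    -- Ask Hive/Hadoop to compress job output as block-compressed Snappy.
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    SET mapred.output.compression.type=BLOCK;

    -- Write the data as SequenceFiles; the resulting files are ordinary
    -- Hadoop SequenceFiles, so Pig or plain MapReduce can read them too.
    CREATE TABLE events_seq (event_id STRING, payload STRING)
    STORED AS SEQUENCEFILE;

    INSERT OVERWRITE TABLE events_seq
    SELECT event_id, payload FROM events_raw;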


Regards
Bejoy KS

Sent from handheld, please excuse typos.

-Original Message-
From: Krishna Rao krishnanj...@gmail.com
Date: Tue, 6 Nov 2012 09:50:33 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Hive compression with external table

Thanks for the reply. Compressed sequence files might work. However, it's not
clear to me whether it's possible to read SequenceFiles using an external
table.

On 5 November 2012 16:04, Edward Capriolo edlinuxg...@gmail.com wrote:

 Compression is a confusing issue. Sequence files that are in block
 format are always splittable regardless of which compression codec is
 chosen for the block. The Programming Hive book has an entire section
 dedicated to the permutations of compression options.

 Edward
 On Mon, Nov 5, 2012 at 10:57 AM, Krishna Rao krishnanj...@gmail.com
 wrote:
  Hi all,
 
  I'm looking into finding a suitable format to store data in HDFS, so that
  it's available for processing by Hive. Ideally I would like to satisfy the
  following:
 
  1. store the data in a format that is readable by multiple Hadoop projects
  (eg. Pig, Mahout, etc.), not just Hive
  2. work with a Hive external table
  3. store data in a compressed format that is splittable
 
  (1) is a requirement because Hive isn't appropriate for all the problems
  that we want to throw at Hadoop.
 
  (2) is really more of a consequence of (1). Ideally we want the data stored
  in some open format that is compressed in HDFS.
  This way we can just point Hive, Pig, Mahout, etc at it depending on the
  problem.
 
  (3) is obviously so it plays well with Hadoop.
 
  Gzip is no good because it is not splittable. Snappy looked promising, but
  it is splittable only if used with a non-external Hive table.
  LZO also looked promising, but I wonder whether it is future proof
  given the licensing issues surrounding it.
 
  So far, the only solution I could find that satisfies all the above seems to
  be bzip2 compression, but concerns about its performance make me wary about
  choosing it.
 
  Is bzip2 the only option I have? Or have I missed some other compression
  option?
 
  Cheers,
 
  Krishna




Hive compression with external table

2012-11-05 Thread Krishna Rao
Hi all,

I'm looking into finding a suitable format to store data in HDFS, so that
it's available for processing by Hive. Ideally I would like to satisfy the
following:

1. store the data in a format that is readable by multiple Hadoop projects
(eg. Pig, Mahout, etc.), not just Hive
2. work with a Hive external table
3. store data in a compressed format that is splittable

(1) is a requirement because Hive isn't appropriate for all the problems
that we want to throw at Hadoop.

(2) is really more of a consequence of (1). Ideally we want the data stored
in some open format that is compressed in HDFS.
This way we can just point Hive, Pig, Mahout, etc at it depending on the
problem.

(3) is obviously so it plays well with Hadoop.

Gzip is no good because it is not splittable. Snappy looked promising, but
it is splittable only if used with a non-external Hive table.
LZO also looked promising, but I wonder whether it is future proof
given the licensing issues surrounding it.

So far, the only solution I could find that satisfies all the above seems
to be bzip2 compression, but concerns about its performance make me wary
about choosing it.

Is bzip2 the only option I have? Or have I missed some other compression
option?

Cheers,

Krishna
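For reference, the bzip2 route mentioned above needs nothing special on the
Hive side: an ordinary text-format external table over .bz2 files works,
provided the Hadoop version in use can split bzip2 input (HADOOP-4012). A rough
sketch, with placeholder columns and path:

    -- Plain text external table over bzip2-compressed files; TextInputFormat
    -- recognises the .bz2 extension and decompresses transparently.
    CREATE EXTERNAL TABLE logs_bz2 (
      ts   STRING,
      line STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/raw/logs_bz2';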


Re: Hive compression with external table

2012-11-05 Thread Edward Capriolo
Compression is a confusing issue. Sequence files that are in block
format are always splittable regardless of which compression codec is
chosen for the block. The Programming Hive book has an entire section
dedicated to the permutations of compression options.

Edward
On Mon, Nov 5, 2012 at 10:57 AM, Krishna Rao krishnanj...@gmail.com wrote:
 Hi all,

 I'm looking into finding a suitable format to store data in HDFS, so that
 it's available for processing by Hive. Ideally I would like to satisfy the
 following:

 1. store the data in a format that is readable by multiple Hadoop projects
 (eg. Pig, Mahout, etc.), not just Hive
 2. work with a Hive external table
 3. store data in a compressed format that is splittable

 (1) is a requirement because Hive isn't appropriate for all the problems
 that we want to throw at Hadoop.

 (2) is really more of a consequence of (1). Ideally we want the data stored
 in some open format that is compressed in HDFS.
 This way we can just point Hive, Pig, Mahout, etc at it depending on the
 problem.

 (3) is obviously so it plays well with Hadoop.

 Gzip is no good because it is not splittable. Snappy looked promising, but
 it is splittable only if used with a non-external Hive table.
 LZO also looked promising, but I wonder whether it is future proof
 given the licensing issues surrounding it.

 So far, the only solution I could find that satisfies all the above seems to
 be bzip2 compression, but concerns about its performance make me wary about
 choosing it.

 Is bzip2 the only option I have? Or have I missed some other compression
 option?

 Cheers,

 Krishna