And also hive.exec.compress.*. So that makes it three sets of configuration variables:
mapred.output.compress.*
io.seqfile.compress.*
hive.exec.compress.*

What's the relationship between these configuration parameters, and which ones should I set to achieve a well-compressed output table?

Saurabh.

On Fri, Feb 19, 2010 at 7:16 PM, Saurabh Nanda <saurabhna...@gmail.com> wrote:

> I'm confused here, Zheng. There are two sets of configuration variables:
> those starting with io.* and those starting with mapred.*. For making sure
> that the final output table is compressed, which ones do I have to set?
>
> Saurabh.
>
> On Fri, Feb 19, 2010 at 12:37 AM, Zheng Shao <zsh...@gmail.com> wrote:
>
>> Did you also:
>>
>> SET mapred.output.compression.codec=org.apache....GzipCodec;
>>
>> Zheng
>>
>> On Thu, Feb 18, 2010 at 8:25 AM, Saurabh Nanda <saurabhna...@gmail.com> wrote:
>>
>> > Hi Zheng,
>> >
>> > I cross-checked. I am setting the following in my Hive script before
>> > the INSERT command:
>> >
>> > SET io.seqfile.compression.type=BLOCK;
>> > SET hive.exec.compress.output=true;
>> >
>> > A 132 MB (gzipped) input file going through a cleanup and getting
>> > populated into a SequenceFile table is growing to 432 MB. What could
>> > be going wrong?
>> >
>> > Saurabh.
>> >
>> > On Wed, Feb 3, 2010 at 2:26 PM, Saurabh Nanda <saurabhna...@gmail.com> wrote:
>> >
>> >> Thanks, Zheng. Will do some more tests and get back.
>> >>
>> >> Saurabh.
>> >>
>> >> On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao <zsh...@gmail.com> wrote:
>> >>
>> >>> I would first check whether it is really block compression or
>> >>> record compression.
>> >>> Also, maybe the block size is too small, but I am not sure whether
>> >>> that is tunable in SequenceFile or not.
>> >>>
>> >>> Zheng
>> >>>
>> >>> On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda <saurabhna...@gmail.com> wrote:
>> >>>
>> >>> > Hi,
>> >>> >
>> >>> > The size of my gzipped weblog files is about 35 MB. However, upon
>> >>> > enabling block compression and inserting the logs into another
>> >>> > Hive table (SequenceFile), the file size bloats up to about 233 MB.
>> >>> > I've done similar processing on a local Hadoop/Hive cluster, and
>> >>> > while the compression is not as good as gzipping, it still is not
>> >>> > this bad. What could be going wrong?
>> >>> >
>> >>> > I looked at the header of the resulting file, and here's what it
>> >>> > says:
>> >>> >
>> >>> > SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec
>> >>> >
>> >>> > Does Amazon Elastic MapReduce behave differently, or am I doing
>> >>> > something wrong?
>> >>> >
>> >>> > Saurabh.

--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
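
To summarize how the three families typically fit together: hive.exec.compress.output is the Hive-level switch, which Hive translates into the Hadoop job-level settings (mapred.output.compress and friends); mapred.output.compression.codec picks the codec; and for SequenceFile output the per-record vs. per-block choice is generally read from mapred.output.compression.type, while io.seqfile.compression.type only supplies a default for writers created directly against the SequenceFile API, which would explain why setting it alone appears to have no effect. A minimal sketch of a Hive session that sets all three layers for block-compressed gzip SequenceFiles; the table names are hypothetical, and the exact interaction can vary across Hive/Hadoop versions, so treat this as a starting point rather than a definitive recipe:

-- Hive-level switch: ask Hadoop to compress the job's final output.
SET hive.exec.compress.output=true;

-- Hadoop job-level setting the switch above relies on:
-- the codec used for the output files...
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- ...and, for SequenceFile output, per-block rather than per-record
-- compression. SequenceFile output formats generally read this key,
-- not io.seqfile.compression.type, which only sets a default for
-- writers created directly through the SequenceFile API.
SET mapred.output.compression.type=BLOCK;

-- Hypothetical table names, for illustration only.
INSERT OVERWRITE TABLE weblogs_seq
SELECT * FROM weblogs_raw;

With these set, the resulting file headers should report both GzipCodec and block compression; if a header still indicates record compression, mapred.output.compression.type is the setting to double-check first.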