Thanks, all, for the advice. One more question on re-reading Harsh's helpful reply: 
"Intermediate (M-to-R) files use a custom IFile format these days". How 
recent is "these days", and can this change be pinned down to a specific 
version of Hadoop?

Tony





-----Original Message-----
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: 09 January 2012 16:50
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Tony,

* Yeah, SequenceFiles aren't human-readable, but "fs -text" can decode them 
(unlike a plain "fs -cat"). But if you are going to export your files into a 
system you do not have much control over, it is probably best not to produce 
the resultant files in SequenceFile/Avro-DataFile format.
* Intermediate (M-to-R) files use a custom IFile format these days, which is 
built purely for that purpose.
* Hive can use SequenceFiles very well. This is also documented in the Hive 
wiki pages (check the DDL pages, IIRC).
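
To illustrate the first point (why a raw "fs -cat" shows unreadable bytes while a format-aware reader such as "fs -text" recovers the records), here is a minimal Python sketch. The container layout below (a length prefix followed by a zlib-compressed "key<TAB>value" payload) is purely hypothetical and is not the real SequenceFile layout; the function names are made up for the example.

```python
import struct
import zlib


def write_records(records):
    """Pack (key, value) string pairs into a toy binary container.

    Each record is a 4-byte big-endian length followed by the
    zlib-compressed text "key<TAB>value". This is NOT the real
    SequenceFile layout, only an illustration of a binary format.
    """
    out = bytearray()
    for key, value in records:
        payload = zlib.compress(f"{key}\t{value}".encode("utf-8"))
        out += struct.pack(">I", len(payload)) + payload
    return bytes(out)


def read_records(blob):
    """Format-aware reader, playing the role that "fs -text" plays."""
    pos = 0
    records = []
    while pos < len(blob):
        (length,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        text = zlib.decompress(blob[pos:pos + length]).decode("utf-8")
        pos += length
        key, value = text.split("\t", 1)
        records.append((key, value))
    return records


records = [("user1", "42"), ("user2", "17")]
blob = write_records(records)
decoded = read_records(blob)

print(blob[:16])  # raw bytes, roughly what a plain "cat" would dump
print(decoded)    # the original key/value pairs, readable again
```

The point is only that the bytes on disk need a reader that knows the format, which is also why such files are awkward to hand to a system you do not control.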

On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:

> Thanks for the quick reply and the clarification about the documentation.
> 
> Regarding sequence files: am I right in thinking that they're a good choice 
> for intermediate steps in chained MR jobs, or for file transfer between the 
> Map and the Reduce phases of a job; but they shouldn't be used for 
> human-readable files at the end of one or more MapReduce jobs? How about if 
> the only use of a job's output is analysis via Hive - can Hive create tables 
> from sequence files? 
> 
> Tony
> 
> 
> 
> -----Original Message-----
> From: Harsh J [mailto:ha...@cloudera.com] 
> Sent: 09 January 2012 15:34
> To: common-user@hadoop.apache.org
> Subject: Re: has bzip2 compression been deprecated?
> 
> Bzip2 is pretty slow. You probably do not want to use it, even though it 
> supports file splits (a feature not available in the stable line of 
> 0.20.x/1.x, but available in 0.22+).
> 
> To answer your question though, bzip2 was removed from that document because 
> it isn't a native library (it's pure Java). I think bzip2 was listed there 
> earlier due to an oversight, as even 0.20 did not have a native bzip2 library. This 
> change in docs does not mean that BZip2 is deprecated -- it is still fully 
> supported and available in the trunk as well. See 
> https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes 
> that led to this.
> 
> The best way would be to use either:
> 
> (a) Hadoop sequence files with any compression codec of choice (best would be 
> lzo, gz, maybe even snappy). This file format is built for HDFS and MR and is 
> splittable. Another choice would be Avro DataFiles from the Apache Avro 
> project.
> (b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo (and 
> hadoop-lzo-packager for packages). This requires you to run indexing 
> operations before the .lzo can be made splittable, but works great with this 
> extra step added.
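
A rough sketch of why a container format can be splittable with any codec: if records are compressed in independent blocks, each preceded by a sync marker, a reader handed an arbitrary byte range can scan forward to the next marker and start decoding there. The Python below is a toy under stated assumptions, not the real SequenceFile or LZO layout; the marker value, block size and function names are all hypothetical.

```python
import hashlib
import struct
import zlib

# Hypothetical 16-byte sync marker, fixed here so the example is reproducible.
SYNC = hashlib.md5(b"toy-sync-marker").digest()


def write_blocks(lines, lines_per_block=1000):
    """Compress lines in independent blocks: SYNC + 4-byte length + zlib data."""
    out = bytearray()
    for i in range(0, len(lines), lines_per_block):
        raw = "\n".join(lines[i:i + lines_per_block]).encode("utf-8")
        comp = zlib.compress(raw)
        out += SYNC + struct.pack(">I", len(comp)) + comp
    return bytes(out)


def read_split(blob, start, end):
    """Decode every block whose sync marker begins inside [start, end)."""
    lines = []
    pos = blob.find(SYNC, start)  # skip ahead to the next block boundary
    while pos != -1 and pos < end:
        (length,) = struct.unpack_from(">I", blob, pos + len(SYNC))
        body_start = pos + len(SYNC) + 4
        body = blob[body_start:body_start + length]
        lines.extend(zlib.decompress(body).decode("utf-8").split("\n"))
        pos = blob.find(SYNC, body_start + length)
    return lines


lines = [f"record-{i}" for i in range(5000)]
blob = write_blocks(lines)

# Two "splits" cut at an arbitrary byte offset, processed independently.
mid = len(blob) // 2
first = read_split(blob, 0, mid)
second = read_split(blob, mid, len(blob))
print(len(first), len(second))
```

Each block is read by exactly one split (the one containing its marker), so the two halves together return every record once. A single compressed stream without such boundaries gives a reader nowhere to start in the middle, which is the problem that the indexing step in option (b) addresses for .lzo files.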
> 
> On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
> 
>> Hi,
>> 
>> I'm trying to work out which compression algorithm I should be using in my 
>> MapReduce jobs.  It seems to me that the best solution is a compromise 
>> between speed, efficiency and splittability. The only compression algorithm 
>> to handle file splits (according to Hadoop: The Definitive Guide 2nd edition 
>> p78 etc) is bzip2, at the expense of compression speed.
>> 
>> However, I see from the documentation at 
>> http://hadoop.apache.org/common/docs/current/native_libraries.html that the 
>> bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, 
>> see http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html - 
>> however the bzip2 Codec is still in the API at 
>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html.
>> 
>> Has bzip2 support been removed from Hadoop, or will it be removed soon?
>> 
>> Thanks,
>> 
>> Tony
>> 
>> 
>> 
>> **********************************************************************
>> This email and any attachments are confidential, protected by copyright and 
>> may be legally privileged.  If you are not the intended recipient, then the 
>> dissemination or copying of this email is prohibited. If you have received 
>> this in error, please notify the sender by replying by email and then delete 
>> the email completely from your system.  Neither Sporting Index nor the 
>> sender accepts responsibility for any virus, or any other defect which might 
>> affect any computer or IT system into which the email is received and/or 
>> opened.  It is the responsibility of the recipient to scan the email and no 
>> responsibility is accepted for any loss or damage arising in any way from 
>> receipt or use of this email.  Sporting Index Ltd is a company registered in 
>> England and Wales with company number 2636842, whose registered office is at 
>> Brookfield House, Green Lane, Ivinghoe, Leighton Buzzard, LU7 9ES.  Sporting 
>> Index Ltd is authorised and regulated by the UK Financial Services Authority 
>> (reg. no. 150404). Any financial promotion contained herein has been issued 
>> and approved by Sporting Index Ltd.
>> 
>> Outbound email has been scanned for viruses and SPAM
> 

