RE: BZip2 Splittable?

2012-02-27 Thread Daniel Baptista
Thanks to everyone for their help on this.

We are currently using Pig, but I don't think this is a feature we are
currently making use of. I will pass this recommendation on!

Thanks again, Dan.


Re: BZip2 Splittable?

2012-02-24 Thread Srinivas Surasani
@Daniel,

If you want to process bz2 files in parallel (more than one mapper/reducer),
you can go for Pig.

See below.

Pig has inbuilt support for processing .bz2 files in parallel (.gz support
is coming soon). If the input file name extension is .bz2, Pig decompresses
the file on the fly and passes the decompressed input stream to your load
function.

Regards,


-- 
Regards,
-- Srinivas
srini...@cloudwick.com


Re: BZip2 Splittable?

2012-02-24 Thread Rohit
Hi Daniel,  

Because your MapReduce jobs will not split bzip2 files, each entire bzip2 file 
will be processed by one Map task. Thus, if your job takes multiple bzip2 text 
files as input, you'll have as many Map tasks as you have files, and those 
tasks can run in parallel.

The Map tasks will be run by your TaskTrackers. Usually the cluster setup has 
the DataNode and TaskTracker processes running on the same machines, so 
with 6 DataNodes, you have 6 TaskTrackers.
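
The effect on parallelism can be sketched with a little arithmetic. The
Python below is a simplified model, not Hadoop's actual split calculation:
the function name, the 64 MB split size, and the file sizes are assumptions
chosen purely for illustration.

```python
import math

def estimate_map_tasks(file_sizes_mb, split_size_mb=64, splittable=False):
    """Rough count of map tasks for a list of input file sizes (in MB).

    Simplified model: a non-splittable file always gets exactly one map
    task; a splittable file gets one task per split_size_mb chunk.
    """
    if not splittable:
        return len(file_sizes_mb)
    return sum(max(1, math.ceil(size / split_size_mb)) for size in file_sizes_mb)

# Hypothetical input: three bzip2 files of 500 MB, 200 MB and 30 MB.
files = [500, 200, 30]
print(estimate_map_tasks(files, splittable=False))  # 3
print(estimate_map_tasks(files, splittable=True))   # 8 + 4 + 1 = 13
```

Under this model, a job with fewer non-splittable files than TaskTrackers
would leave some of the 6 nodes without a Map task.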

Hope that answers your question.


Rohit Bakhshi



www.hortonworks.com (http://www.hortonworks.com/)






Re: BZip2 Splittable?

2012-02-24 Thread John Heidemann
On Fri, 24 Feb 2012 15:43:10 GMT, Daniel Baptista wrote: 
>Hi All,
>
>I have a cluster of 6 datanodes, all running hadoop version 0.20.2, r911707 
>that take a series of bzip2 compressed text files as input.
>
>I have read conflicting articles regarding whether or not hadoop can split 
>these bzip2 files, can anyone give me a definite answer?
>
>Thanks in advance, Dan.

Support for bzip2 splitting was only added in 0.21.0; see 
https://issues.apache.org/jira/browse/MAPREDUCE-830

You need to roll forward (or backport the patch) if you want bzip2
splitting.

(And since 1.0.0 is a fork from 0.20-security, it also lacks bzip2
splitting, AFAIK.  Hopefully some future 1.x will pick up more of the
0.21 features.)
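
Because the version numbers are not monotonic here (1.0.0 is newer than
0.21.0 but lacks the feature), a small lookup may be clearer than a plain
"greater than" comparison. The Python below is only a sketch of what this
thread states: the function name is hypothetical, and treating every 0.x
release from 0.21 onwards as supporting splitting is an assumption.

```python
def bzip2_splitting_status(version):
    """Map a Hadoop version string to what this thread says about bzip2 splitting.

    Returns True (supported), False (not supported), or None when the
    thread does not settle the question for that release line.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    if major == 0:
        # Assumption: 0.x releases from 0.21 onwards keep the feature.
        return minor >= 21
    if (major, minor) == (1, 0):
        # 1.0.0 was forked from 0.20-security, so it lacks the feature.
        return False
    return None

print(bzip2_splitting_status("0.20.2"))  # False
print(bzip2_splitting_status("0.21.0"))  # True
print(bzip2_splitting_status("1.0.0"))   # False
```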

   -John Heidemann


RE: BZip2 Splittable?

2012-02-24 Thread Tim Broberg
Support starts in 0.21, yes. It will soon be backported and available in 1.1.0. 
A patch to 1.0.0 to enable bzip2 splittability is here, 
https://issues.apache.org/jira/browse/HADOOP-7823, if you feel up to patching 
and rebuilding.

- Tim.


The information and any attached documents contained in this message
may be confidential and/or legally privileged.  The message is
intended solely for the addressee(s).  If you are not the intended
recipient, you are hereby notified that any use, dissemination, or
reproduction is strictly prohibited and may be unlawful.  If you are
not the intended recipient, please contact the sender immediately by
return e-mail and destroy all copies of the original message.


RE: BZip2 Splittable?

2012-02-24 Thread Daniel Baptista
Hi Rohit, thanks for the response. This is pretty much as I expected and 
hopefully adds weight to my other thoughts...

Could this mean that all my datanodes are being sent all of the data, or that 
only one datanode is executing the job?

Thanks again, Dan.





CONFIDENTIALITY - This email and any files transmitted with it, are 
confidential, may be legally privileged and are intended solely for the use of 
the individual or entity to whom they are addressed. If this has come to you in 
error, you must not copy, distribute, disclose or use any of the information it 
contains. Please notify the sender immediately and delete them from your system.

SECURITY - Please be aware that communication by email, by its very nature, is 
not 100% secure and by communicating with Perform Group by email you consent to 
us monitoring and reading any such correspondence.

VIRUSES - Although this email message has been scanned for the presence of 
computer viruses, the sender accepts no liability for any damage sustained as a 
result of a computer virus and it is the recipient’s responsibility to ensure 
that email is virus free.

AUTHORITY - Any views or opinions expressed in this email are solely those of 
the sender and do not necessarily represent those of Perform Group.

COPYRIGHT - Copyright of this email and any attachments belongs to Perform 
Group, Companies House Registration number 6324278.


Re: BZip2 Splittable?

2012-02-24 Thread Rohit Bakhshi
Daniel, 

I just noticed your Hadoop version - 0.20.2.

The JIRA fix I linked (HADOOP-4012) is for Hadoop 0.21.0, which is a later 
version, so it may not be supported on your version of Hadoop. 

-- 
Rohit Bakhshi
www.hortonworks.com (http://www.hortonworks.com/)







Re: BZip2 Splittable?

2012-02-24 Thread Rohit Bakhshi
Hi Daniel, 

The bzip2 compression codec allows for splittable files.

According to this Hadoop JIRA improvement, splitting of bzip2 compressed files 
in Hadoop jobs is supported:
https://issues.apache.org/jira/browse/HADOOP-4012
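
A small experiment can make "splittable" more concrete. bzip2 output is a
sequence of independently compressed blocks, each beginning with the 48-bit
marker 0x314159265359, and blocks are not byte-aligned. A reader that starts
in the middle of a file can scan forward to the next marker and begin
decoding there. The Python below only locates those markers using the
standard bz2 module; it is not Hadoop's implementation, and the function
name, the random test data, and the compression level are assumptions for
illustration.

```python
import bz2
import random

# Marker that starts every bzip2 block (the BCD digits of pi).
BLOCK_MAGIC = (0x314159265359).to_bytes(6, "big")

def find_block_bit_offsets(compressed):
    """Return the sorted bit offsets at which the block marker appears."""
    as_int = int.from_bytes(compressed, "big")
    offsets = set()
    for shift in range(8):
        # Shifting right by `shift` bits moves any marker that starts
        # `shift` bits before a byte boundary onto that boundary, so a
        # plain byte search can find it.
        shifted = (as_int >> shift).to_bytes(len(compressed), "big")
        pos = shifted.find(BLOCK_MAGIC)
        while pos != -1:
            bit_offset = pos * 8 - shift
            if bit_offset >= 0:
                offsets.add(bit_offset)
            pos = shifted.find(BLOCK_MAGIC, pos + 1)
    return sorted(offsets)

# 350,000 pseudo-random bytes at compresslevel=1 (about 100 kB per block)
# should span several blocks.
n = 350_000
data = random.Random(0).getrandbits(8 * n).to_bytes(n, "big")
compressed = bz2.compress(data, compresslevel=1)

offsets = find_block_bit_offsets(compressed)
print(len(offsets), "block markers found")
print("first marker at bit", offsets[0])  # right after the 4-byte "BZh1" header
```

Each offset after the first is a point where a second reader could, in
principle, start work without having decoded the earlier blocks.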

-- 
Rohit Bakhshi
www.hortonworks.com (http://www.hortonworks.com/)







BZip2 Splittable?

2012-02-24 Thread Daniel Baptista
Hi All,

I have a cluster of 6 datanodes, all running Hadoop version 0.20.2, r911707, 
that take a series of bzip2-compressed text files as input.

I have read conflicting articles regarding whether or not Hadoop can split 
these bzip2 files. Can anyone give me a definite answer?

Thanks in advance, Dan.