decompressing bzip2 data with a custom InputFormat

2012-03-12 Thread Tony Burton
 Hi,

I'm setting up a map-only job that reads large bzip2-compressed data files, 
parses the XML and writes out the same data in plain text format. My XML 
InputFormat extends TextInputFormat and has a RecordReader based upon the one 
you can see at http://xmlandhadoop.blogspot.com/ (my version of it works great 
for uncompressed XML input data). For compressed data, I've added 
io.compression.codecs to my core-site.xml and set it to 
org.apache.hadoop.io.compress.BZip2Codec. I'm using Hadoop 0.20.2.

Have I forgotten something basic when running a Hadoop job to read compressed 
data? Or, given that I've written my own InputFormat, should I be using an 
InputStream that can carry out the decompression itself?

Thanks

Tony
 
**
This email and any attachments are confidential, protected by copyright and may 
be legally privileged.  If you are not the intended recipient, then the 
dissemination or copying of this email is prohibited. If you have received this 
in error, please notify the sender by replying by email and then delete the 
email completely from your system.  Neither Sporting Index nor the sender 
accepts responsibility for any virus, or any other defect which might affect 
any computer or IT system into which the email is received and/or opened.  It 
is the responsibility of the recipient to scan the email and no responsibility 
is accepted for any loss or damage arising in any way from receipt or use of 
this email.  Sporting Index Ltd is a company registered in England and Wales 
with company number 2636842, whose registered office is at Brookfield House, 
Green Lane, Ivinghoe, Leighton Buzzard, LU7 9ES.  Sporting Index Ltd is 
authorised and regulated by the UK Financial Services Authority (reg. no. 
150404). Any financial promotion contained herein has been issued 
and approved by Sporting Index Ltd.

Outbound email has been scanned for viruses and SPAM


RE: decompressing bzip2 data with a custom InputFormat

2012-03-14 Thread Tony Burton
Hi - sorry to bump this, but I'm having trouble resolving this. 

Essentially the question is: If I create my own InputFormat by subclassing 
TextInputFormat, does the subclass have to handle its own streaming of 
compressed data? If so, can anyone point me at an example where this is done?

Thanks!

Tony









Re: decompressing bzip2 data with a custom InputFormat

2012-03-14 Thread Joey Echeverria
Yes, you have to deal with the compression yourself. Usually, you'll load
the compression codec in your RecordReader. You can see an example of how
TextInputFormat's LineRecordReader does it:

https://github.com/apache/hadoop-common/blob/release-1.0.1/src/mapred/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java

-Joey




-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434
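
The pattern described in the reply above (pick a decompressor from the file name, wrap the raw stream, and let the record-reading logic see only decompressed bytes) can be sketched without any Hadoop dependency. This is a minimal stand-in under stated assumptions, not the LineRecordReader code itself: it uses the JDK's GZIP streams because the JDK ships no bzip2 decoder, and the class and method names (CodecAwareReader, openForReading, readRecords, gzip) are hypothetical. In a Hadoop RecordReader the corresponding steps would be CompressionCodecFactory.getCodec(path) followed by codec.createInputStream(rawStream) when a codec is found.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CodecAwareReader {

    /** Wraps the raw stream in a decompressor chosen from the file name, or returns it unchanged. */
    public static InputStream openForReading(String fileName, InputStream raw) {
        try {
            // Stand-in for a codec lookup by file extension (GZIP only in this sketch).
            if (fileName.endsWith(".gz")) {
                return new GZIPInputStream(raw);
            }
            return raw; // no known extension: treat the data as uncompressed
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Record-reading logic; it only ever sees decompressed bytes. */
    public static List<String> readRecords(String fileName, byte[] fileBytes) {
        List<String> records = new ArrayList<>();
        InputStream in = openForReading(fileName, new ByteArrayInputStream(fileBytes));
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(line); // one record per line, for simplicity
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    /** Helper for trying the sketch out: gzip a string in memory. */
    public static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Calling readRecords("data.xml.gz", gzip(text)) and readRecords("data.xml", text.getBytes(StandardCharsets.UTF_8)) should yield the same list of records. One design point carries over to the real thing: once the stream is wrapped in a decompressor, byte offsets in the compressed file no longer correspond to positions in the decoded data, so a reader that seeks to a split's start offset in the raw file needs to skip that step for compressed input and read the file from the beginning.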


RE: decompressing bzip2 data with a custom InputFormat

2012-03-16 Thread Tony Burton

Cool - thanks for the confirmation and link, Joey, very helpful.




