Hadoop 0.20.2 and JobConf deprecation

2011-11-03 Thread Tony Burton
Hi

After a while away from Hadoop coding, I'm refreshing myself with a walkthrough 
of the Hadoop 0.20.2 WordCount tutorial at 
http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html. I've got the 
hadoop-0.20.2-*.jar files included on my Build Path, and I'm using Eclipse 
Helios.

In my main() or run() methods (depending on whether I'm looking at WordCount v1 
or v2), my JobConf object is deprecated (it's org.apache.hadoop.mapred.JobConf 
to be more precise). Has JobConf been replaced by something else without the 
tutorial being updated? If so, is there a more accurate WordCount example for 
Hadoop 0.20.2 somewhere, or something equivalent?
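(Editorial note for the archive: JobConf and the rest of org.apache.hadoop.mapred were deprecated in 0.20 in favour of the new org.apache.hadoop.mapreduce API, built around Job plus Mapper/Reducer classes with a Context object. A sketch of WordCount against the new API follows - offered as a hedged starting point, not the official tutorial code.)

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewApiWordCount {

  // Map: tokenize each input line and emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");  // new Job(conf, name) is the 0.20-era form
    job.setJarByClass(NewApiWordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```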

Thanks

Tony

**
This email and any attachments are confidential, protected by copyright and may 
be legally privileged.  If you are not the intended recipient, then the 
dissemination or copying of this email is prohibited. If you have received this 
in error, please notify the sender by replying by email and then delete the 
email completely from your system.  Neither Sporting Index nor the sender 
accepts responsibility for any virus, or any other defect which might affect 
any computer or IT system into which the email is received and/or opened.  It 
is the responsibility of the recipient to scan the email and no responsibility 
is accepted for any loss or damage arising in any way from receipt or use of 
this email.  Sporting Index Ltd is a company registered in England and Wales 
with company number 2636842, whose registered office is at Brookfield House, 
Green Lane, Ivinghoe, Leighton Buzzard, LU7 9ES.  Sporting Index Ltd is 
authorised and regulated by the UK Financial Services Authority (reg. no. 
150404). Any financial promotion contained herein has been issued 
and approved by Sporting Index Ltd.

Outbound email has been scanned for viruses and SPAM


has bzip2 compression been deprecated?

2012-01-09 Thread Tony Burton
Hi,

I'm trying to work out which compression algorithm I should be using in my 
MapReduce jobs.  It seems to me that the best solution is a compromise between 
speed, efficiency and splittability. The only compression algorithm to handle 
file splits (according to Hadoop: The Definitive Guide 2nd edition p78 etc) is 
bzip2, at the expense of compression speed.

However, I see from the documentation at 
http://hadoop.apache.org/common/docs/current/native_libraries.html that the 
bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, see 
http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html - however 
the bzip2 Codec is still in the API at 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html.

Has bzip2 support been removed from Hadoop, or will it be removed soon?

Thanks,

Tony





RE: has bzip2 compression been deprecated?

2012-01-09 Thread Tony Burton
Thanks for the quick reply and the clarification about the documentation.

Regarding sequence files: am I right in thinking that they're a good choice 
for intermediate steps in chained MR jobs, or for transferring data between 
the Map and Reduce phases of a job, but that they shouldn't be used for 
human-readable files at the end of one or more MapReduce jobs? And what if 
the only use of a job's output is analysis via Hive - can Hive create tables 
from sequence files? 

Tony



-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: 09 January 2012 15:34
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Bzip2 is pretty slow. You probably do not want to use it, even though it does 
support file splits (a feature not available in the stable 0.20.x/1.x line, but 
available in 0.22+).

To answer your question though: bzip2 was removed from that document because it 
isn't a native library (it's pure Java). I think bzip2 was added there earlier 
due to an oversight, as even 0.20 did not have a native bzip2 library. This 
change in the docs does not mean that BZip2 is deprecated -- it is still fully 
supported and available in trunk as well. See 
https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes 
that led to this.

The best way would be to use either:

(a) Hadoop sequence files with any compression codec of choice (best would be 
lzo, gz, maybe even snappy). This file format is built for HDFS and MR and is 
splittable. Another choice would be Avro DataFiles from the Apache Avro project.
(b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo (and 
hadoop-lzo-packager for packages). This requires you to run indexing operations 
before the .lzo can be made splittable, but works great with this extra step 
added.
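(Editorial note: Harsh's option (a) amounts to a few lines of job setup. A sketch under the new org.apache.hadoop.mapreduce API is below; GzipCodec is just a stand-in here - substitute the LZO or Snappy codec class if you have one installed.)

```java
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SeqFileOutputConfig {
  // Configure a job to write compressed SequenceFile output.
  static void configure(Job job) {
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    // BLOCK compression compresses runs of records together; the container
    // stays splittable even when the codec alone would not be.
    SequenceFileOutputFormat.setOutputCompressionType(
        job, SequenceFile.CompressionType.BLOCK);
  }
}
```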


RE: has bzip2 compression been deprecated?

2012-01-09 Thread Tony Burton
Thanks Bejoy - I'm fairly new to Hive so may be wrong here, but I was under the 
impression that the STORED AS part of a CREATE TABLE in Hive refers to how the 
data in the table will be stored once the table is created, rather than the 
compression format of the data used to populate the table. Can you clarify 
which is the correct interpretation? If it's the latter, how would I read a 
sequence file into a Hive table?

Thanks,

Tony




-Original Message-
From: Bejoy Ks [mailto:bejoy.had...@gmail.com] 
Sent: 09 January 2012 17:33
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Hi Tony
   Adding on to Harsh's comments: if you want the generated sequence
files to be utilized by a Hive table, define your Hive table as

CREATE EXTERNAL TABLE tableName(col1 INT, col2 STRING)
...
...

STORED AS SEQUENCEFILE;


Regards
Bejoy.K.S

On Mon, Jan 9, 2012 at 10:32 PM, alo.alt  wrote:

> Tony,
>
> snappy is also available:
> http://code.google.com/p/hadoop-snappy/
>
> best,
>  Alex
>
> --
> Alexander Lorenz
> http://mapredit.blogspot.com
>
> On Jan 9, 2012, at 8:49 AM, Harsh J wrote:
>
> > Tony,
> >
> > * Yeah, SequenceFiles aren't human-readable, but "fs -text" can read it
> out (instead of a plain "fs -cat"). But if you are gonna export your files
> into a system you do not have much control over, probably best to have the
> resultant files not be in SequenceFile/Avro-DataFile format.
> > * Intermediate (M-to-R) files use a custom IFile format these days,
> which is built purely for that purpose.
> > * Hive can use SequenceFiles very well. There is also documented info on
> this in the Hive's wiki pages (Check the DDL pages, IIRC).
> >

RE: has bzip2 compression been deprecated?

2012-01-10 Thread Tony Burton
Thanks all for advice - one more question on re-reading Harsh's helpful reply. 
" Intermediate (M-to-R) files use a custom IFile format these days". How 
recently is "these days", and can this addition be pinned down to any one 
version of Hadoop?

Tony






RE: has bzip2 compression been deprecated?

2012-01-10 Thread Tony Burton
Thanks for this Bejoy, very helpful. 

So, to summarise: when I CREATE EXTERNAL TABLE in Hive, the STORED AS, ROW 
FORMAT and other parameters you mention tell Hive what to expect when it 
reads the data I want to analyse, even though it doesn't check the data to 
see whether it meets those criteria?

Do these guidelines still apply if the table is not EXTERNAL?

Tony

 

-Original Message-
From: Bejoy Ks [mailto:bejoy.had...@gmail.com] 
Sent: 09 January 2012 19:00
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Hi Tony
   As I understand your requirement, your MapReduce job produces a
sequence file as output and you need to use this file as input to a Hive
table.
   When you CREATE an EXTERNAL TABLE in Hive, you specify a location
where your data is stored and also the format of that data (the field
delimiter, row delimiter, file type and so on). You are not actually
loading data anywhere when you create a Hive external table (issue the
DDL); you are just specifying where the data lies in the file system. In
fact, there is not even any validation performed at that time to check the
data quality. When you query/retrieve your data through Hive QL, the
parameters specified along with CREATE TABLE, such as ROW FORMAT, FIELDS
TERMINATED BY and STORED AS, are used to execute the right MapReduce job(s).

 In short, STORED AS refers to the type of files that a table's data
directory holds.

For details:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable

Hope it helps!

Regards
Bejoy.K.S


decompressing bzip2 data with a custom InputFormat

2012-03-12 Thread Tony Burton
 Hi,

I'm setting up a map-only job that reads large bzip2-compressed data files, 
parses the XML and writes out the same data in plain text format. My XML 
InputFormat extends TextInputFormat and has a RecordReader based upon the one 
you can see at http://xmlandhadoop.blogspot.com/ (my version of it works great 
for uncompressed XML input data). For compressed data, I've added 
io.compression.codecs to my core-site.xml and set it to 
o.a.h.io.compress.BZip2Codec. I'm using Hadoop 0.20.2.

Have I forgotten something basic when running a Hadoop job to read compressed 
data? Or, given that I've written my own InputFormat, should I be using an 
InputStream that can carry out the decompression itself?

Thanks

Tony
 


RE: decompressing bzip2 data with a custom InputFormat

2012-03-14 Thread Tony Burton
Hi - sorry to bump this, but I'm having trouble resolving this. 

Essentially the question is: If I create my own InputFormat by subclassing 
TextInputFormat, does the subclass have to handle its own streaming of 
compressed data? If so, can anyone point me at an example where this is done?

Thanks!

Tony









RE: decompressing bzip2 data with a custom InputFormat

2012-03-16 Thread Tony Burton

Cool - thanks for the confirmation and link, Joey, very helpful.





-Original Message-
From: Joey Echeverria [mailto:j...@cloudera.com] 
Sent: 14 March 2012 19:03
To: common-user@hadoop.apache.org
Subject: Re: decompressing bzip2 data with a custom InputFormat

Yes you have to deal with the compression. Usually, you'll load the
compression codec in your RecordReader. You can see an example of how
TextInputFormat's LineRecordReader does it:

https://github.com/apache/hadoop-common/blob/release-1.0.1/src/mapred/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java

-Joey
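
(Editorial note: the pattern Joey describes - asking a CompressionCodecFactory for a codec matching the input file, and wrapping the raw stream if one is found - looks roughly like the sketch below. The helper class and method names are illustrative, not taken from the linked source; only the Hadoop API calls are real.)

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecAwareOpen {
  // Open an input split's file, decompressing transparently if its
  // extension matches a configured codec (e.g. ".bz2" -> BZip2Codec).
  static InputStream open(Path file, Configuration conf) throws IOException {
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream raw = fs.open(file);
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
    // No codec registered for this extension: read the file as-is.
    return (codec == null) ? raw : codec.createInputStream(raw);
  }
}
```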




-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434

MapReduce on autocomplete

2012-03-28 Thread Tony Burton
I have a lot of small files on S3 that I need to consolidate, so I headed to 
Google to see the best way to do it in a MapReduce job. Looks like someone's 
got a different idea, according to Google's autocomplete:

[cid:image001.jpg@01CD0D09.CDEB9E90]
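(For the actual consolidation question: a common pattern with the old `org.apache.hadoop.mapred` API this list discusses is an identity map plus a single reducer, so every record lands in one output file. As a minimal local sketch of what that job produces — plain Java with no Hadoop dependency, and hypothetical file names — it boils down to concatenating the small inputs into one output:)

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class Consolidate {

    // Merge many small line-oriented files into one large file, in input
    // order - the same result an identity mapper feeding a single reducer
    // would write as part-00000.
    static void consolidate(List<Path> inputs, Path output) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path in : inputs) {
                for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
                    w.write(line);
                    w.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical small files standing in for the S3 objects.
        Path dir = Files.createTempDirectory("smallfiles");
        Path a = Files.write(dir.resolve("part-0001"), Arrays.asList("alpha"), StandardCharsets.UTF_8);
        Path b = Files.write(dir.resolve("part-0002"), Arrays.asList("beta"), StandardCharsets.UTF_8);
        Path merged = dir.resolve("merged.txt");
        consolidate(Arrays.asList(a, b), merged);
        System.out.println(Files.readAllLines(merged, StandardCharsets.UTF_8)); // prints [alpha, beta]
    }
}
```

(On a real cluster the interesting knob is `setNumReduceTasks(1)` — one reducer means one output file, at the cost of funnelling all data through a single task.)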




RE: MapReduce on autocomplete

2012-03-29 Thread Tony Burton
Thanks for the heads-up Harsh:

http://dl.dropbox.com/u/6327451/mapreduce.jpg

:)




-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: 28 March 2012 18:27
To: common-user@hadoop.apache.org
Subject: Re: MapReduce on autocomplete

Looks like your mail client (or the list) stripped away your image
attachment. Could you post the image as a link from imageshack/etc.
instead?

On Wed, Mar 28, 2012 at 10:10 PM, Tony Burton  wrote:
>
> So I have a lot of small files on S3 that I need to consolidate, so headed to 
> Google to see the best way to do it in a MapReduce job. Looks like someone's 
> got a different idea, according to Google's autocomplete:




--
Harsh J


RE: MapReduce on autocomplete

2012-03-29 Thread Tony Burton
I think we've all been there at some point, haven't we?


-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: 29 March 2012 10:28
To: common-user@hadoop.apache.org
Subject: Re: MapReduce on autocomplete

I swear I wasn't the one who searched that.

On Thu, Mar 29, 2012 at 2:24 PM, Tony Burton  wrote:
> Thanks for the heads-up Harsh:
>
> http://dl.dropbox.com/u/6327451/mapreduce.jpg
>
> :)
>
>
>
>
> -Original Message-
> From: Harsh J [mailto:ha...@cloudera.com]
> Sent: 28 March 2012 18:27
> To: common-user@hadoop.apache.org
> Subject: Re: MapReduce on autocomplete
>
> Looks like your mail client (or the list) stripped away your image
> attachment. Could you post the image as a link from imageshack/etc.
> instead?
>
> On Wed, Mar 28, 2012 at 10:10 PM, Tony Burton  
> wrote:
>>
>> So I have a lot of small files on S3 that I need to consolidate, so headed 
>> to Google to see the best way to do it in a MapReduce job. Looks like 
>> someone's got a different idea, according to Google's autocomplete:
>
>
>
>
> --
> Harsh J



-- 
Harsh J