RE: Strange behavior with bzip2 input files w/release 0.19.0

2008-12-06 Thread Andy Sautins

   Abdul,

   Please note that I applied patch 4012 version 4 to release 0.19.0 and
re-ran tests with mixed results.  My simple test ( 20 million simple
records ) for both pbzip2/bzip2 generated the same correct results which
is great.  However, a larger test case ( described in more detail below )
had a discrepancy in the results when compared to gzip and plain text
files.  bzip2/gzip/text had all produced the same results pre-patch.  The
bzip2 run had 3 additional records compared to the text/gzip runs
post-patch.

   The following are timings and results for a sample dataset running a
simple MapReduce job ( MapReduce version of unix 'wc' ).  Note the
dataset consists of 11 files that are a total of 27G uncompressed, 4.5G
gzip compressed and 3.1G bzip2 compressed.  All 3 datasets are identical
and produce the same md5sum.  Also the bzip2 files in the test were
compressed using bzip2, not pbzip2.

Release 0.19.0 Pre patch:
   Type    Timing    MapReduce Result
   -----------------------------------
   Gzip  - 4m55s     323,234,098
   Bzip2 - 16m14s    323,234,098
   Txt   - 6m23s     323,234,098

Release 0.19.0 Post patch 4012 Version 4 ( w/ results ):
   Type    Timing    MapReduce Result
   -----------------------------------
   Gzip  - 5m14s     332,234,098
   Bzip2 - 9m36s     332,234,101
   Txt   - 6m28s     332,234,098

   Both Gzip/Txt timings were about the same between runs.  Bzip2
elapsed time was reduced significantly.

   So, generally positive, although it looks like there might be an
edge case causing slightly different results.  I'll work on putting
together a test case of manageable size that reproduces the result
discrepancy.

   Thanks again for the help.

   Andy

-Original Message-
From: Andy Sautins [mailto:[EMAIL PROTECTED] 
Sent: Thursday, December 04, 2008 2:29 PM
To: core-user@hadoop.apache.org
Subject: RE: Strange behavior with bzip2 input files w/release 0.19.0


   Thanks Abdul.  Very exciting that hadoop will soon be able to not
only handle pbzip2 files but also split bzip2 files.

   I will apply the patch and report back. 

   Thank you

   Andy

-Original Message-
From: Abdul Qadeer [mailto:[EMAIL PROTECTED] 
Sent: Thursday, December 04, 2008 1:49 PM
To: core-user@hadoop.apache.org
Subject: Re: Strange behavior with bzip2 input files w/release 0.19.0

Andy,

As was mentioned earlier, splitting support is being added for bzip2
files and the patch is actually under review now.  I think pbzip2
generated files should work fine with that, because the split algorithm
finds the next start of block marker and does not use the end of stream
marker.  We rather use the physical end of file to know when the stream
ends.


So if you look at https://issues.apache.org/jira/browse/HADOOP-4012
you can download the version 4 patch, apply it to the Hadoop code and
see if it's working for you, or you can wait for the review process to
complete so that the code becomes a part of standard Hadoop.  You can
add yourself as a watcher there at JIRA 4012, so that you know when it's
done.  Please let me know if pbzip2 generated files do not work even on
that code.

Thank you,
Abdul Qadeer


On Thu, Dec 4, 2008 at 11:46 AM, Andy Sautins
<[EMAIL PROTECTED]>wrote:

>
>   Thanks for the response Abdul.
>
>   So, the bzip2 file in question is _kindof_ a concatenation of
> multiple bzip2 files.  It's not concatenated using cat a.bz2 b.bz2 >
> yourFile.bz2, but it is created using pbzip2 ( pbzip2 v1.0.2 running on
> CentOS 5.2 installed from the EPEL repository ).  My understanding is
> that pbzip2 does roughly what you're saying and concatenates in some
> manner.
>
>   I created a simple test case that reproduces the behavior.  I created
> a file using the following perl script:
>
> for($i=0;$i<20000000;$i++) {
>  print "Line $i\n";
> }
>
>    I then created two different bzip2 files.  One with bzip2 and one
> with pbzip2.  They do have different sizes:
>
> 21994233 simple.bzip2.txt.bz2
> 21999416 simple.pbzip2.txt.bz2
>
>They do decompress to give the same output file
> bunzip2 -c simple.bzip2.txt.bz2 | md5sum
> 581ad242e6cf22650072edd44d6a2d38  -
>
> bunzip2 -c simple.pbzip2.txt.bz2 | md5sum
> 581ad242e6cf22650072edd44d6a2d38  -
>
>   Running both through the simple line count MapReduce job I get the
> same behavior where bzip2 correctly calculates 20,000,000 records, but
> the pbzip2 generated file only processes the first block ( 82,829
> records ).
>
>   So, what you're saying about having multiple end of stream markers
> makes sense.  I will say it would be very beneficial to be able to use
> pbzip2 generated files to compress hadoop input files.  Using pbzip2
> can greatly reduce the amount of time required to bzip2 compress files
> and seems to generate a valid bzip2 file ( at least it bunzip2
> decompresses correctly ).

Re: Strange behavior with bzip2 input files w/release 0.19.0

2008-12-05 Thread John Heidemann
On Thu, 04 Dec 2008 09:55:35 PST, "Alex Loddengaard" wrote: 
>Currently in Hadoop you cannot split bzip2 files:
>
>
>
>However, gzip files can be split:
>
>
>
>Hope this helps.

Just to clarify, gzip files are only sort of split---it's only one file
per "split", not many splits per file.  For many of our datasets we have
only a few large files, so this level of split support is a serious
limitation to parallelism.  This limitation is (I believe) fundamental
to gzip, where the decompression state is never checkpointed.

This limitation is what prompted us to add support for bzip2 and bzip2
splitting, although splitting support is only in progress as Abdul said.
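The point about decompression state can be sketched outside Hadoop with
Python's standard gzip/zlib modules (an illustrative sketch, not Hadoop
code): a gzip stream can only be decoded from its header, so a reader
dropped at an arbitrary byte offset has neither a header nor any
checkpointed dictionary state to resume from.

```python
import gzip
import zlib

blob = gzip.compress(b"line\n" * 1000)

# Decoding from the start of the stream works fine.
assert gzip.decompress(blob) == b"line\n" * 1000

# But a decoder handed the middle of the stream finds no gzip header
# and has none of the dictionary state built up so far, so it cannot
# resume -- which is why a .gz file cannot be cut into multiple splits.
decoder = zlib.decompressobj(wbits=31)  # wbits=31: expect a gzip header
try:
    decoder.decompress(blob[len(blob) // 2:])
    print("unexpectedly decoded mid-stream data")
except zlib.error as exc:
    print("cannot resume mid-stream:", exc)
```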

   -John Heidemann

>
>Alex
>
>On Thu, Dec 4, 2008 at 9:11 AM, Andy Sautins <[EMAIL PROTECTED]>wrote:
>
>>
>>
>>I'm seeing some strange behavior with bzip2 files and release
>> 0.19.0.  I'm wondering if anyone can shed some light on what I'm seeing.
>> Basically it _looks_ like the processing of a particular bzip2 input
>> file is stopping after the first bzip2 block.  Below is a comparison of
>> tests  between a .gz file which seems to do what I expect, and the same
>> file .bz2 which doesn't behave as I expect.
>>
>>
>>
>>I have the same file stored in hadoop compressed as both bzip2 and
>> gz formats.  The uncompressed file size is 660,841,894 bytes.  Comparing
>> the files they both seem to be valid archives of the exact same file.
>>
>>
>>
>> /usr/local/hadoop/bin/hadoop dfs -cat
>> bzip2.example/data.bz2/file.txt.bz2 | bunzip2 -c | md5sum
>>
>> 2c82901170f44245fb04d24ad4746e38  -
>>
>>
>>
>> /usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.gz/file.txt.gz
>> | gunzip -c | md5sum
>>
>> 2c82901170f44245fb04d24ad4746e38  -
>>
>>
>>
>>Given the md5 sums match it seems like the files are the same and
>> uncompress correctly.
>>
>>
>>
>>Now when I run a simple Map/Reduce application that just counts
>> lines in the file I get different results.
>>
>>
>>
>>  Expected Results:
>>
>>
>>
>>  /usr/local/hadoop/bin/hadoop dfs -cat
>> bzip2.bug.example/data.gz/file.txt.gz | gunzip -c | wc -l
>>
>> 6884024
>>
>>
>>
>>   Gzip input file Results: 6,884,024
>>
>>   Bzip2 input file Results: 9,420
>>
>>
>>
>>
>>
>>   Looking at the task log files the MAP_INPUT_BYTES of the .gz file
>> looks correct ([(MAP_INPUT_BYTES)(Map input bytes)(660,841,894)] ) and
>> matches the size of the uncompressed file.  However, looking at
>> MAP_INPUT_BYTES for the .bz2 file it's 900,000 ([(MAP_INPUT_BYTES)(Map
>> input bytes)(900,000)] ) which matches the block size of the bzip2
>> compressed file.  So that makes me think for some reason that only the
>> first bzip2 block of the bzip2 compressed file is being processed.
>>
>>
>>
>>So I'm wondering if my analysis is correct and if there could be an
>> issue with the processing of bzip2 input files.
>>
>>
>>
>>   Andy
>>
>>


RE: Strange behavior with bzip2 input files w/release 0.19.0

2008-12-04 Thread Andy Sautins

   Thanks Abdul.  Very exciting that hadoop will soon be able to not
only handle pbzip2 files but also split bzip2 files.

   I will apply the patch and report back. 

   Thank you

   Andy

-Original Message-
From: Abdul Qadeer [mailto:[EMAIL PROTECTED] 
Sent: Thursday, December 04, 2008 1:49 PM
To: core-user@hadoop.apache.org
Subject: Re: Strange behavior with bzip2 input files w/release 0.19.0

Andy,

As was mentioned earlier, splitting support is being added for bzip2
files and the patch is actually under review now.  I think pbzip2
generated files should work fine with that, because the split algorithm
finds the next start of block marker and does not use the end of stream
marker.  We rather use the physical end of file to know when the stream
ends.


So if you look at https://issues.apache.org/jira/browse/HADOOP-4012
you can download the version 4 patch, apply it to the Hadoop code and
see if it's working for you, or you can wait for the review process to
complete so that the code becomes a part of standard Hadoop.  You can
add yourself as a watcher there at JIRA 4012, so that you know when it's
done.  Please let me know if pbzip2 generated files do not work even on
that code.

Thank you,
Abdul Qadeer


On Thu, Dec 4, 2008 at 11:46 AM, Andy Sautins
<[EMAIL PROTECTED]>wrote:

>
>   Thanks for the response Abdul.
>
>   So, the bzip2 file in question is _kindof_ a concatenation of
> multiple bzip2 files.  It's not concatenated using cat a.bz2 b.bz2 >
> yourFile.bz2, but it is created using pbzip2 ( pbzip2 v1.0.2 running on
> CentOS 5.2 installed from the EPEL repository ).  My understanding is
> that pbzip2 does roughly what you're saying and concatenates in some
> manner.
>
>   I created a simple test case that reproduces the behavior.  I created
> a file using the following perl script:
>
> for($i=0;$i<20000000;$i++) {
>  print "Line $i\n";
> }
>
>    I then created two different bzip2 files.  One with bzip2 and one
> with pbzip2.  They do have different sizes:
>
> 21994233 simple.bzip2.txt.bz2
> 21999416 simple.pbzip2.txt.bz2
>
>They do decompress to give the same output file
> bunzip2 -c simple.bzip2.txt.bz2 | md5sum
> 581ad242e6cf22650072edd44d6a2d38  -
>
> bunzip2 -c simple.pbzip2.txt.bz2 | md5sum
> 581ad242e6cf22650072edd44d6a2d38  -
>
>   Running both through the simple line count MapReduce job I get the
> same behavior where bzip2 correctly calculates 20,000,000 records, but
> the pbzip2 generated file only processes the first block ( 82,829
> records ).
>
>   So, what you're saying about having multiple end of stream markers
> makes sense.  I will say it would be very beneficial to be able to use
> pbzip2 generated files to compress hadoop input files.  Using pbzip2
> can greatly reduce the amount of time required to bzip2 compress files
> and seems to generate a valid bzip2 file ( at least it bunzip2
> decompresses correctly ).
>
>   Thank you
>
>   Andy
>
> -Original Message-----
> From: Abdul Qadeer [mailto:[EMAIL PROTECTED]
> Sent: Thursday, December 04, 2008 12:07 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Strange behavior with bzip2 input files w/release 0.19.0
>
> Andy,
>
> As you said, you suspect that only one bzip2 block is being
> decompressed and used; is your bzip2 file the concatenation of multiple
> bzip2 files (i.e. are you doing something like cat a.bz2 b.bz2 c.bz2 >
> yourFile.bz2 ?)  In such a case, there will be many bzip2 end of stream
> markers in a single file and the bzip2 decompressor will stop on
> encountering the first end of stream marker when, in fact, the stream
> has more data in it.
>
> If this is not the case, then bzip2 should work as gzip or plaintext
> are working.  Currently only one mapper gets the whole file (just like
> gzip; splitting support for bzip2 is being added in HADOOP-4012, as
> Alex mentioned).  The LineRecordReader gets the uncompressed data and
> does the rest of the things the same as in the case of gzip or
> plaintext.  So can you provide your bzip2 compressed file?  (Maybe
> uploading it somewhere and sending the link.)  I will look into this
> issue.
>
>
> Abdul Qadeer
>
> On Thu, Dec 4, 2008 at 9:11 AM, Andy Sautins
> <[EMAIL PROTECTED]>wrote:
>
> >
> >
> >    I'm seeing some strange behavior with bzip2 files and release
> > 0.19.0.  I'm wondering if anyone can shed some light on what I'm seeing.
> > Basically it _looks_ like the processing of a particular bzip2 input
> > file is stopping after the first bzip2 block.  Below is a comparison of
> > tests between a .gz file which seems to do what I expect, and the same
> > file .bz2 which doesn't behave as I expect.

Re: Strange behavior with bzip2 input files w/release 0.19.0

2008-12-04 Thread Abdul Qadeer
Andy,

As was mentioned earlier, splitting support is being added for bzip2
files and the patch is actually under review now.  I think pbzip2
generated files should work fine with that, because the split algorithm
finds the next start of block marker and does not use the end of stream
marker.  We rather use the physical end of file to know when the stream
ends.


So if you look at https://issues.apache.org/jira/browse/HADOOP-4012
you can download the version 4 patch, apply it to the Hadoop code and
see if it's working for you, or you can wait for the review process to
complete so that the code becomes a part of standard Hadoop.  You can
add yourself as a watcher there at JIRA 4012, so that you know when it's
done.  Please let me know if pbzip2 generated files do not work even on
that code.

Thank you,
Abdul Qadeer


On Thu, Dec 4, 2008 at 11:46 AM, Andy Sautins
<[EMAIL PROTECTED]>wrote:

>
>   Thanks for the response Abdul.
>
>   So, the bzip2 file in question is _kindof_ a concatenation of
> multiple bzip2 files.  It's not concatenated using cat a.bz2 b.bz2 >
> yourFile.bz2, but it is created using pbzip2 ( pbzip2 v1.0.2 running on
> CentOS 5.2 installed from the EPEL repository ).  My understanding is
> that pbzip2 does roughly what you're saying and concatenates in some
> manner.
>
>   I created a simple test case that reproduces the behavior.  I created
> a file using the following perl script:
>
> for($i=0;$i<20000000;$i++) {
>  print "Line $i\n";
> }
>
>    I then created two different bzip2 files.  One with bzip2 and one
> with pbzip2.  They do have different sizes:
>
> 21994233 simple.bzip2.txt.bz2
> 21999416 simple.pbzip2.txt.bz2
>
>They do decompress to give the same output file
> bunzip2 -c simple.bzip2.txt.bz2 | md5sum
> 581ad242e6cf22650072edd44d6a2d38  -
>
> bunzip2 -c simple.pbzip2.txt.bz2 | md5sum
> 581ad242e6cf22650072edd44d6a2d38  -
>
>   Running both through the simple line count MapReduce job I get the
> same behavior where bzip2 correctly calculates 20,000,000 records, but
> the pbzip2 generated file only processes the first block ( 82,829
> records ).
>
>   So, what you're saying about having multiple end of stream markers
> makes sense.  I will say it would be very beneficial to be able to use
> pbzip2 generated files to compress hadoop input files.  Using pbzip2
> can greatly reduce the amount of time required to bzip2 compress files
> and seems to generate a valid bzip2 file ( at least it bunzip2
> decompresses correctly ).
>
>   Thank you
>
>   Andy
>
> -Original Message-----
> From: Abdul Qadeer [mailto:[EMAIL PROTECTED]
> Sent: Thursday, December 04, 2008 12:07 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Strange behavior with bzip2 input files w/release 0.19.0
>
> Andy,
>
> As you said, you suspect that only one bzip2 block is being
> decompressed and used; is your bzip2 file the concatenation of multiple
> bzip2 files (i.e. are you doing something like cat a.bz2 b.bz2 c.bz2 >
> yourFile.bz2 ?)  In such a case, there will be many bzip2 end of stream
> markers in a single file and the bzip2 decompressor will stop on
> encountering the first end of stream marker when, in fact, the stream
> has more data in it.
>
> If this is not the case, then bzip2 should work as gzip or plaintext
> are working.  Currently only one mapper gets the whole file (just like
> gzip; splitting support for bzip2 is being added in HADOOP-4012, as
> Alex mentioned).  The LineRecordReader gets the uncompressed data and
> does the rest of the things the same as in the case of gzip or
> plaintext.  So can you provide your bzip2 compressed file?  (Maybe
> uploading it somewhere and sending the link.)  I will look into this
> issue.
>
>
> Abdul Qadeer
>
> On Thu, Dec 4, 2008 at 9:11 AM, Andy Sautins
> <[EMAIL PROTECTED]>wrote:
>
> >
> >
> >    I'm seeing some strange behavior with bzip2 files and release
> > 0.19.0.  I'm wondering if anyone can shed some light on what I'm seeing.
> > Basically it _looks_ like the processing of a particular bzip2 input
> > file is stopping after the first bzip2 block.  Below is a comparison of
> > tests between a .gz file which seems to do what I expect, and the same
> > file .bz2 which doesn't behave as I expect.
> >
> >
> >
> >    I have the same file stored in hadoop compressed as both bzip2 and
> > gz formats.  The uncompressed file size is 660,841,894 bytes.  Comparing
> > the files they both seem to be valid archives of the exact same file.
> >
> >
> >
> > /usr/local/hadoop/bin/hadoop dfs -cat
> > bzip2.example/data.bz2/file.txt.bz2 | bunzip2 -c | md5sum

RE: Strange behavior with bzip2 input files w/release 0.19.0

2008-12-04 Thread Andy Sautins

   Thanks for the response Abdul.

   So, the bzip2 file in question is _kindof_ a concatenation of
multiple bzip2 files.  It's not concatenated using cat a.bz2 b.bz2 >
yourFile.bz2, but it is created using pbzip2 ( pbzip2 v1.0.2 running on
CentOS 5.2 installed from the EPEL repository ).  My understanding is
that pbzip2 does roughly what you're saying and concatenates in some
manner.

   I created a simple test case that reproduces the behavior.  I created
a file using the following perl script:

for($i=0;$i<20000000;$i++) {
  print "Line $i\n";
}

I then created two different bzip2 files.  One with bzip2 and one
with pbzip2.  They do have different sizes:

21994233 simple.bzip2.txt.bz2
21999416 simple.pbzip2.txt.bz2

They do decompress to give the same output file
bunzip2 -c simple.bzip2.txt.bz2 | md5sum
581ad242e6cf22650072edd44d6a2d38  -

bunzip2 -c simple.pbzip2.txt.bz2 | md5sum
581ad242e6cf22650072edd44d6a2d38  -

   Running both through the simple line count MapReduce job I get the
same behavior where bzip2 correctly calculates 20,000,000 records, but
the pbzip2 generated file only processes the first block ( 82,829
records ).  

   So, what you're saying about having multiple end of stream markers
makes sense.  I will say it would be very beneficial to be able to use
pbzip2 generated files to compress hadoop input files.  Using pbzip2
can greatly reduce the amount of time required to bzip2 compress files
and seems to generate a valid bzip2 file ( at least it bunzip2
decompresses correctly ).
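The size difference is consistent with pbzip2 writing one independent
bzip2 stream per worker chunk and concatenating them.  A rough check
can be sketched with Python's bz2 module (a hedged sketch: pbzip2 is
simulated here by concatenating independently compressed streams; each
concatenated stream starts byte-aligned with the "BZh" magic, while
interior block headers are bit-aligned and so are not matched):

```python
import bz2
import re

# Signature of a byte-aligned bzip2 stream start: "BZh", a block-size
# digit 1-9, then the 48-bit compressed-block magic 0x314159265359.
STREAM_MAGIC = re.compile(rb"BZh[1-9]\x31\x41\x59\x26\x53\x59")

def count_streams(data: bytes) -> int:
    """Count concatenated bzip2 streams in a buffer (byte-aligned only)."""
    return len(STREAM_MAGIC.findall(data))

one_stream = bz2.compress(b"Line 0\n" * 1000)
# Simulate a pbzip2-style file: independent streams back to back.
many_streams = b"".join(bz2.compress(b"Line %d\n" % i * 1000) for i in range(4))

print(count_streams(one_stream))    # 1
print(count_streams(many_streams))  # 4
```

A plain bzip2 file should report one stream; a pbzip2 file many.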

   Thank you

   Andy

-Original Message-
From: Abdul Qadeer [mailto:[EMAIL PROTECTED] 
Sent: Thursday, December 04, 2008 12:07 PM
To: core-user@hadoop.apache.org
Subject: Re: Strange behavior with bzip2 input files w/release 0.19.0

Andy,

As you said, you suspect that only one bzip2 block is being decompressed
and used; is your bzip2 file the concatenation of multiple bzip2 files
(i.e. are you doing something like cat a.bz2 b.bz2 c.bz2 > yourFile.bz2 ?)
In such a case, there will be many bzip2 end of stream markers in a
single file and the bzip2 decompressor will stop on encountering the
first end of stream marker when, in fact, the stream has more data in it.

If this is not the case, then bzip2 should work as gzip or plaintext are
working.  Currently only one mapper gets the whole file (just like gzip;
splitting support for bzip2 is being added in HADOOP-4012, as Alex
mentioned).  The LineRecordReader gets the uncompressed data and does
the rest of the things the same as in the case of gzip or plaintext.  So
can you provide your bzip2 compressed file?  (Maybe uploading it
somewhere and sending the link.)  I will look into this issue.


Abdul Qadeer

On Thu, Dec 4, 2008 at 9:11 AM, Andy Sautins
<[EMAIL PROTECTED]>wrote:

>
>
>    I'm seeing some strange behavior with bzip2 files and release
> 0.19.0.  I'm wondering if anyone can shed some light on what I'm seeing.
> Basically it _looks_ like the processing of a particular bzip2 input
> file is stopping after the first bzip2 block.  Below is a comparison of
> tests between a .gz file which seems to do what I expect, and the same
> file .bz2 which doesn't behave as I expect.
>
>
>
>    I have the same file stored in hadoop compressed as both bzip2 and
> gz formats.  The uncompressed file size is 660,841,894 bytes.  Comparing
> the files they both seem to be valid archives of the exact same file.
>
>
>
> /usr/local/hadoop/bin/hadoop dfs -cat
> bzip2.example/data.bz2/file.txt.bz2 | bunzip2 -c | md5sum
>
> 2c82901170f44245fb04d24ad4746e38  -
>
>
>
> /usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.gz/file.txt.gz
> | gunzip -c | md5sum
>
> 2c82901170f44245fb04d24ad4746e38  -
>
>
>
>Given the md5 sums match it seems like the files are the same and
> uncompress correctly.
>
>
>
>Now when I run a simple Map/Reduce application that just counts
> lines in the file I get different results.
>
>
>
>  Expected Results:
>
>
>
>  /usr/local/hadoop/bin/hadoop dfs -cat
> bzip2.bug.example/data.gz/file.txt.gz | gunzip -c | wc -l
>
> 6884024
>
>
>
>   Gzip input file Results: 6,884,024
>
>   Bzip2 input file Results: 9,420
>
>
>
>
>
>   Looking at the task log files the MAP_INPUT_BYTES of the .gz file
> looks correct ([(MAP_INPUT_BYTES)(Map input bytes)(660,841,894)] ) and
> matches the size of the uncompressed file.  However, looking at
> MAP_INPUT_BYTES for the .bz2 file it's 900,000 ([(MAP_INPUT_BYTES)(Map
> input bytes)(900,000)] ) which matches the block size of the bzip2
> compressed file.  So that makes me think for some reason that only the
> first bzip2 block of the bzip2 compressed file is being processed.
>
>
>
>So I'm wondering if my analysis is correct and if there could be an
> issue with the processing of bzip2 input files.
>
>
>
>   Andy
>
>


Re: Strange behavior with bzip2 input files w/release 0.19.0

2008-12-04 Thread Abdul Qadeer
Andy,

As you said, you suspect that only one bzip2 block is being decompressed
and used; is your bzip2 file the concatenation of multiple bzip2 files
(i.e. are you doing something like cat a.bz2 b.bz2 c.bz2 > yourFile.bz2 ?)
In such a case, there will be many bzip2 end of stream markers in a
single file and the bzip2 decompressor will stop on encountering the
first end of stream marker when, in fact, the stream has more data in it.
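That stop-at-the-first-marker behavior is easy to reproduce outside
Hadoop (a hedged sketch with Python's bz2 module; a single-stream
decompressor here stands in for the codec behavior described above):

```python
import bz2

# Two independent bzip2 streams concatenated, as cat a.bz2 b.bz2 (or
# pbzip2) would produce.
data = bz2.compress(b"first stream\n") + bz2.compress(b"second stream\n")

# A single-stream decompressor stops at the first end-of-stream marker,
# even though more compressed data follows in the file.
decomp = bz2.BZ2Decompressor()
out = decomp.decompress(data)

print(out)          # b'first stream\n'
print(decomp.eof)   # True: the decompressor considers itself finished
print(len(decomp.unused_data) > 0)  # True: the second stream was ignored
```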

If this is not the case, then bzip2 should work as gzip or plaintext are
working.  Currently only one mapper gets the whole file (just like gzip;
splitting support for bzip2 is being added in HADOOP-4012, as Alex
mentioned).  The LineRecordReader gets the uncompressed data and does
the rest of the things the same as in the case of gzip or plaintext.  So
can you provide your bzip2 compressed file?  (Maybe uploading it
somewhere and sending the link.)  I will look into this issue.


Abdul Qadeer

On Thu, Dec 4, 2008 at 9:11 AM, Andy Sautins <[EMAIL PROTECTED]>wrote:

>
>
>I'm seeing some strange behavior with bzip2 files and release
> 0.19.0.  I'm wondering if anyone can shed some light on what I'm seeing.
> Basically it _looks_ like the processing of a particular bzip2 input
> file is stopping after the first bzip2 block.  Below is a comparison of
> tests  between a .gz file which seems to do what I expect, and the same
> file .bz2 which doesn't behave as I expect.
>
>
>
>I have the same file stored in hadoop compressed as both bzip2 and
> gz formats.  The uncompressed file size is 660,841,894 bytes.  Comparing
> the files they both seem to be valid archives of the exact same file.
>
>
>
> /usr/local/hadoop/bin/hadoop dfs -cat
> bzip2.example/data.bz2/file.txt.bz2 | bunzip2 -c | md5sum
>
> 2c82901170f44245fb04d24ad4746e38  -
>
>
>
> /usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.gz/file.txt.gz
> | gunzip -c | md5sum
>
> 2c82901170f44245fb04d24ad4746e38  -
>
>
>
>Given the md5 sums match it seems like the files are the same and
> uncompress correctly.
>
>
>
>Now when I run a simple Map/Reduce application that just counts
> lines in the file I get different results.
>
>
>
>  Expected Results:
>
>
>
>  /usr/local/hadoop/bin/hadoop dfs -cat
> bzip2.bug.example/data.gz/file.txt.gz | gunzip -c | wc -l
>
> 6884024
>
>
>
>   Gzip input file Results: 6,884,024
>
>   Bzip2 input file Results: 9,420
>
>
>
>
>
>   Looking at the task log files the MAP_INPUT_BYTES of the .gz file
> looks correct ([(MAP_INPUT_BYTES)(Map input bytes)(660,841,894)] ) and
> matches the size of the uncompressed file.  However, looking at
> MAP_INPUT_BYTES for the .bz2 file it's 900,000 ([(MAP_INPUT_BYTES)(Map
> input bytes)(900,000)] ) which matches the block size of the bzip2
> compressed file.  So that makes me think for some reason that only the
> first bzip2 block of the bzip2 compressed file is being processed.
>
>
>
>So I'm wondering if my analysis is correct and if there could be an
> issue with the processing of bzip2 input files.
>
>
>
>   Andy
>
>


Re: Strange behavior with bzip2 input files w/release 0.19.0

2008-12-04 Thread Alex Loddengaard
Currently in Hadoop you cannot split bzip2 files:



However, gzip files can be split:



Hope this helps.

Alex

On Thu, Dec 4, 2008 at 9:11 AM, Andy Sautins <[EMAIL PROTECTED]>wrote:

>
>
>I'm seeing some strange behavior with bzip2 files and release
> 0.19.0.  I'm wondering if anyone can shed some light on what I'm seeing.
> Basically it _looks_ like the processing of a particular bzip2 input
> file is stopping after the first bzip2 block.  Below is a comparison of
> tests  between a .gz file which seems to do what I expect, and the same
> file .bz2 which doesn't behave as I expect.
>
>
>
>I have the same file stored in hadoop compressed as both bzip2 and
> gz formats.  The uncompressed file size is 660,841,894 bytes.  Comparing
> the files they both seem to be valid archives of the exact same file.
>
>
>
> /usr/local/hadoop/bin/hadoop dfs -cat
> bzip2.example/data.bz2/file.txt.bz2 | bunzip2 -c | md5sum
>
> 2c82901170f44245fb04d24ad4746e38  -
>
>
>
> /usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.gz/file.txt.gz
> | gunzip -c | md5sum
>
> 2c82901170f44245fb04d24ad4746e38  -
>
>
>
>Given the md5 sums match it seems like the files are the same and
> uncompress correctly.
>
>
>
>Now when I run a simple Map/Reduce application that just counts
> lines in the file I get different results.
>
>
>
>  Expected Results:
>
>
>
>  /usr/local/hadoop/bin/hadoop dfs -cat
> bzip2.bug.example/data.gz/file.txt.gz | gunzip -c | wc -l
>
> 6884024
>
>
>
>   Gzip input file Results: 6,884,024
>
>   Bzip2 input file Results: 9,420
>
>
>
>
>
>   Looking at the task log files the MAP_INPUT_BYTES of the .gz file
> looks correct ([(MAP_INPUT_BYTES)(Map input bytes)(660,841,894)] ) and
> matches the size of the uncompressed file.  However, looking at
> MAP_INPUT_BYTES for the .bz2 file it's 900,000 ([(MAP_INPUT_BYTES)(Map
> input bytes)(900,000)] ) which matches the block size of the bzip2
> compressed file.  So that makes me think for some reason that only the
> first bzip2 block of the bzip2 compressed file is being processed.
>
>
>
>So I'm wondering if my analysis is correct and if there could be an
> issue with the processing of bzip2 input files.
>
>
>
>   Andy
>
>


Strange behavior with bzip2 input files w/release 0.19.0

2008-12-04 Thread Andy Sautins
 

I'm seeing some strange behavior with bzip2 files and release
0.19.0.  I'm wondering if anyone can shed some light on what I'm seeing.
Basically it _looks_ like the processing of a particular bzip2 input
file is stopping after the first bzip2 block.  Below is a comparison of
tests  between a .gz file which seems to do what I expect, and the same
file .bz2 which doesn't behave as I expect.

 

I have the same file stored in hadoop compressed as both bzip2 and
gz formats.  The uncompressed file size is 660,841,894 bytes.  Comparing
the files they both seem to be valid archives of the exact same file.  

 

/usr/local/hadoop/bin/hadoop dfs -cat
bzip2.example/data.bz2/file.txt.bz2 | bunzip2 -c | md5sum

2c82901170f44245fb04d24ad4746e38  -

 

/usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.gz/file.txt.gz
| gunzip -c | md5sum

2c82901170f44245fb04d24ad4746e38  -

 

Given the md5 sums match it seems like the files are the same and
uncompress correctly. 

 

Now when I run a simple Map/Reduce application that just counts
lines in the file I get different results.  

 

  Expected Results:

 

 /usr/local/hadoop/bin/hadoop dfs -cat
bzip2.bug.example/data.gz/file.txt.gz | gunzip -c | wc -l

6884024   

 

   Gzip input file Results: 6,884,024

   Bzip2 input file Results: 9,420

 

 

   Looking at the task log files the MAP_INPUT_BYTES of the .gz file
looks correct ([(MAP_INPUT_BYTES)(Map input bytes)(660,841,894)] ) and
matches the size of the uncompressed file.  However, looking at
MAP_INPUT_BYTES for the .bz2 file it's 900,000 ([(MAP_INPUT_BYTES)(Map
input bytes)(900,000)] ) which matches the block size of the bzip2
compressed file.  So that makes me think for some reason that only the
first bzip2 block of the bzip2 compressed file is being processed.
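The 900,000 figure matches bzip2's maximum block size: the digit in the
stream header ("BZh1" through "BZh9") is the uncompressed block size in
units of 100 kB, so a default bzip2 -9 file carries up to 900 kB of data
per block.  A quick illustrative check with Python's bz2 module:

```python
import bz2

# bzip2's stream header is "BZh" plus a digit 1-9; the digit is the
# uncompressed block size in units of 100 kB, so level 9 = 900 kB blocks.
for level in (1, 9):
    header = bz2.compress(b"sample", compresslevel=level)[:4]
    print(level, header)  # 1 b'BZh1' ... 9 b'BZh9'
```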

 

So I'm wondering if my analysis is correct and if there could be an
issue with the processing of bzip2 input files.

 

   Andy