null value output from map...

2009-03-13 Thread Andy Sautins
 

   In writing a Map/Reduce job I ran across something I found a little
strange.  I have a situation where I don't need a value output from map.
If I set the value passed to OutputCollector<Text, IntWritable> to null
I get the following exception:

 

java.lang.NullPointerException

   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:562)

 

Looking at the code in MapTask.java ( Hadoop .19.1 ) it makes sense
why it would throw the exception:

 

  if (value.getClass() != valClass) {
    throw new IOException("Type mismatch in value from map: expected "
        + valClass.getName() + ", recieved "
        + value.getClass().getName());
  }

 

  I guess my question is as follows: is it a bad idea/not normal to
collect a null value in map?  Outputting from reduce through
TextOutputFormat with a null value works as I expect: if the value is
null, only the key and newline are output.
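
   A common way to avoid passing null, if the value genuinely isn't
needed, is to declare NullWritable as the map output value type and emit
the NullWritable singleton.  A minimal sketch, assuming the 0.19 mapred
API ( the class name is illustrative ):

   import java.io.IOException;

   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.NullWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapred.MapReduceBase;
   import org.apache.hadoop.mapred.Mapper;
   import org.apache.hadoop.mapred.OutputCollector;
   import org.apache.hadoop.mapred.Reporter;

   public class KeyOnlyMapper extends MapReduceBase
       implements Mapper<LongWritable, Text, Text, NullWritable> {

     public void map(LongWritable key, Text value,
                     OutputCollector<Text, NullWritable> output, Reporter reporter)
         throws IOException {
       // Emit the line as the key and the NullWritable singleton as the
       // value, so collect() never sees a null reference.
       output.collect(value, NullWritable.get());
     }
   }

   The job would also call conf.setMapOutputValueClass(NullWritable.class)
so the type check in MapOutputBuffer.collect passes, and TextOutputFormat
treats a NullWritable value like a null one: only the key and newline are
written.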

 

   Any thoughts would be appreciated.

  

 

   



Best practices on splitting an input line?

2009-02-10 Thread Andy Sautins
 

   I have a question.  I've dabbled with different ways of tokenizing an
input file line for processing.  I've noticed in my somewhat limited
tests that there seem to be some pretty reasonable performance
differences between the different tokenizing methods.  For example, to
split a line into tokens ( tab delimited in my case ), Scanner seems
roughly the slowest, followed by String.split, with StringTokenizer
being the fastest.  StringTokenizer, for my application, has the
unfortunate characteristic of not returning blank tokens ( i.e., parsing
"a,b,c,,d" would return a, b, c, d instead of a, b, c, "", d ).
The WordCount example uses StringTokenizer, which makes sense to me,
except I'm currently getting hung up on not returning blank tokens.  I
did run across the com.Ostermiller.util StringTokenizer replacement that
handles null/blank tokens
( http://ostermiller.org/utils/StringTokenizer.html ), which seems
possible to use, but it sure seems like someone else has solved this
problem better than I have.

 

   So, my question is, is there a best practice for splitting an input
line especially when NULL tokens are expected ( i.e., two consecutive
delimiter characters )?
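
   For what it's worth, String.split takes a limit argument that
preserves empty tokens; a minimal sketch ( the class name is
illustrative ):

   public class SplitExample {
     public static void main(String[] args) {
       String line = "a\tb\tc\t\td";           // two consecutive tabs -> one empty field
       String[] fields = line.split("\t", -1);  // a negative limit keeps empty tokens
       // fields is ["a", "b", "c", "", "d"] -- length 5, unlike StringTokenizer
       System.out.println(fields.length);       // prints 5
     }
   }

   If split's regex overhead turns out to matter, a hand-rolled loop over
String.indexOf('\t') is another common way to keep empty fields without
the StringTokenizer behavior.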

 

   Any thoughts would be appreciated

 

   Thanks

 

   Andy



API Documentation question - WritableComparable

2008-12-11 Thread Andy Sautins
 

  I have a question regarding the Hadoop API documentation for .19.  The
question is in regard to:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/WritableComparable.html.
The document shows the following for the compareTo method:

 

   public int compareTo(MyWritableComparable w) {
     int thisValue = this.value;
     int thatValue = ((IntWritable)o).value;
     return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
   }

 

 

   Taking the full class example, it doesn't compile.  What I _think_
would be right would be:

 

   public int compareTo(Object o) {
     int thisValue = this.value;
     int thatValue = ((MyWritableComparable)o).value;
     return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
   }

 

  But even at that it's unclear why the compareTo function is comparing
value ( which isn't a member of the class in the example ) and not the
counter and timestamp variables in the class.
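
   For what it's worth, a compareTo consistent with the example's own
fields might look something like the sketch below ( illustrative only;
this is not the published javadoc code, and it assumes the example class
has an int counter and a long timestamp ):

     // inside MyWritableComparable, which has int counter and long timestamp fields
     public int compareTo(Object o) {
       MyWritableComparable that = (MyWritableComparable) o;
       // Order by counter first, then break ties on timestamp.
       if (this.counter != that.counter) {
         return this.counter < that.counter ? -1 : 1;
       }
       return this.timestamp < that.timestamp ? -1
            : (this.timestamp == that.timestamp ? 0 : 1);
     }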

 

   Am I understanding this right?  Is there something amiss with the
documentation?

 

   Thanks

 

   Andy

 

 



internal/external interfaces for hadoop...

2008-12-08 Thread Andy Sautins
 

   I'm trying to set up what I think would be a common hadoop
configuration.  I have 4 data nodes on an internal 10.x network.  Each
of the data nodes only has access to the 10.x network.  The name node
has both an internal 10.x network interface and an external interface.
I want the hdfs filesystem and job tracker to be available on the
external network, but the communication within the cluster to be on the
10.x network.  Is this possible to do?  By changing the fs.default.name
configuration parameter I can move the filesystem from listening on the
internal interface to the external one; however, the data nodes then
can't communicate with the name node.  I also tried setting the
fs.default.name IP address to 0.0.0.0 to see if it would bind to all
interfaces, but that didn't seem to work.
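
   For reference, the client side of this is just whatever fs.default.name
the client's Configuration points at; a hedged sketch ( the hostname and
port are hypothetical, and whether this works end-to-end depends on the
datanode reachability discussed in the follow-up reply ):

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class ExternalClient {
     public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();
       // The cluster's own config can keep the internal 10.x address; this only
       // changes which namenode address this particular client connects to.
       conf.set("fs.default.name", "hdfs://namenode-external.example.com:9000");
       FileSystem fs = FileSystem.get(conf);
       System.out.println(fs.exists(new Path("/")));
     }
   }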

 

   Is it possible to configure hadoop so that the datanodes communicate
on an internal network, but access to hdfs and the job tracker are done
through an external interface?

 

   Any help would be much appreciated.

 

   Thank you

 

   Andy  



RE: internal/external interfaces for hadoop...

2008-12-08 Thread Andy Sautins

  Ah.  Thanks.  That makes what I was trying to do sound rather
ridiculous now, doesn't it.

  I appreciate the insight.

  Thanks

  Andy

-Original Message-
From: Taeho Kang [mailto:[EMAIL PROTECTED] 
Sent: Monday, December 08, 2008 6:10 PM
To: core-user@hadoop.apache.org
Subject: Re: internal/external interfaces for hadoop...

When reading from or writing to a file on HDFS, data blocks never go
through the namenode.  They are directly handled/transferred between
your client and the datanodes that contain the blocks.

 Hence, datanodes must be accessible by your client.  In this case, since
your client is on an external network, your datanodes must be accessible
to external networks.






Can mapper get access to filename being processed?

2008-12-07 Thread Andy Sautins
 

   I'm having trouble finding a way to do what I want, so I'm wondering
if I'm just not looking at the right place or if I'm thinking about the
problem in the wrong way.  Any insight would be appreciated.

 

   Let's say I have a directory of files that contains a combination of
different file types.  The MapReduce job needs to process all files in
the directory but generates different key/value pairs depending on the
file being processed.  What I'd like to do is use the filename to
identify the file type being processed and use that information in the
map job.  What it seems like I'd want is for the map job to have access
to the filename of the input file split being processed.  I haven't been
able to find out if that is available to a derived class of
MapReduceBase.

 

   Does what I'm trying to do make sense or is there a better way of
processing a job like the one I'm describing?

 

   Thank you

 

   Andy

   

 





RE: Can mapper get access to filename being processed?

2008-12-07 Thread Andy Sautins

  Thanks.  map.input.file is exactly what I need. 

  One more question.  Is there a way to ignore a file in an input path?
So, for example, say the data in hadoop is stored in a directory
structure /<date>/<machine>.txt.  For Dec 1, 2008, with a file from
machine a and a file from machine b, I would have the following directory
structure:

   /20081201/a.txt
   /20081201/b.txt

   What I'd like to do is have a job that, depending on the
configuration, would either process all files or only the files for a
given machine ( say a, but not b ).

   Is that possible to do or am I trying to do something that's using
Hadoop in a way that it's not intended to be used?  I looked briefly at
MultipleInputs which seems to be able to handle different input paths,
but not handle a single input path in different ways depending on
filename.
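
   One option for the ignore-a-file case is an input PathFilter.  A
hedged sketch, assuming FileInputFormat.setInputPathFilter is available
in your release ( it appears to be in the 0.19 mapred API ); class and
file names are illustrative:

   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.fs.PathFilter;
   import org.apache.hadoop.mapred.FileInputFormat;
   import org.apache.hadoop.mapred.JobConf;

   public class MachineAFilter implements PathFilter {
     public boolean accept(Path path) {
       // Keep machine a's files, skip other machines' .txt files,
       // and accept anything that isn't a .txt file (e.g. directories).
       String name = path.getName();
       return !name.endsWith(".txt") || name.equals("a.txt");
     }
   }

   // In the job driver (MyJob is a placeholder for your job class):
   //   JobConf conf = new JobConf(MyJob.class);
   //   FileInputFormat.setInputPathFilter(conf, MachineAFilter.class);

   To make the machine name configurable rather than hard-coded, the
filter can also implement org.apache.hadoop.conf.Configurable and read a
property from the JobConf; as far as I can tell the framework passes the
job configuration to Configurable filters when it instantiates them.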

   Thanks again.

   Andy

-Original Message-
From: Devaraj Das [mailto:[EMAIL PROTECTED] 
Sent: Sunday, December 07, 2008 12:11 PM
To: core-user@hadoop.apache.org
Subject: Re: Can mapper get access to filename being processed?




That's map.input.file, available in the map via JobConf.  The mapper
class has to override the implementation of configure in MapReduceBase
and get the filename via JobConf.get("map.input.file").  Store that in
some field variable of your mapper class.  You can then inspect that in
your map method.
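
A minimal sketch of that approach ( the class name and the file-name
check are illustrative ):

   import java.io.IOException;

   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapred.JobConf;
   import org.apache.hadoop.mapred.MapReduceBase;
   import org.apache.hadoop.mapred.Mapper;
   import org.apache.hadoop.mapred.OutputCollector;
   import org.apache.hadoop.mapred.Reporter;

   public class FilenameAwareMapper extends MapReduceBase
       implements Mapper<LongWritable, Text, Text, Text> {

     private String inputFile;

     public void configure(JobConf job) {
       // The framework sets map.input.file to the path of the file
       // backing the split this mapper is processing.
       inputFile = job.get("map.input.file");
     }

     public void map(LongWritable key, Text value,
                     OutputCollector<Text, Text> output, Reporter reporter)
         throws IOException {
       // Branch on the filename to decide which key/value pairs to emit.
       if (inputFile != null && inputFile.endsWith("typeA.txt")) {
         output.collect(new Text("typeA"), value);
       } else {
         output.collect(new Text("other"), value);
       }
     }
   }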

 
Does what I'm trying to do make sense or is there a better way of
 processing a job like the one I'm describing?
 

Look at MultipleInputs class (in the mapred.lib directory). That could
prove
useful.  
 




RE: Strange behavior with bzip2 input files w/release 0.19.0

2008-12-06 Thread Andy Sautins

   Abdul,

   Please note that I applied patch 4012 version 4 to release 0.19.0 and
re-ran tests with mixed results.  My simple test ( 20 million simple
records ) for both pbzip2/bzip2 generated the same correct results which
is great.  However, a larger test case ( described in more detail below
) had a discrepancy in the results when compared to gzip and plain text
files. bzip2/gzip/text all had produced the same results pre-patch.  The
bzip2 run had 3 additional records compared to the text/gzip runs post
patch.

   The following are timings and results for a sample dataset running a
simple MapReduce job ( MapReduce version of unix 'wc' ).  Note the
dataset consists of 11 files that are a total of 27G uncompressed, 4.5G
gzip compressed and 3.1G bzip2 compressed.  All 3 datasets are identical
and produce the same md5sum.  Also the bzip2 files in the test were
compressed using bzip2, not pbzip2.

Release .19.0 Pre patch:
   Type     Timing    MapReduce Result
   -----------------------------------
   Gzip     4m55s     323,234,098
   Bzip2    16m14s    323,234,098
   Txt      6m23s     323,234,098

Release .19.0 Post patch 4012 Version 4 ( w/ results ):
   Type     Timing    MapReduce Result
   -----------------------------------
   Gzip     5m14s     332,234,098
   Bzip2    9m36s     332,234,101
   Txt      6m28s     332,234,098

   Both Gzip/Txt timings were about the same between runs.  Bzip2
elapsed time was reduced significantly.

   So, generally positive, although it looks like there might be an
edge case causing slightly different results.  I'll work on putting
together a test case of manageable size that reproduces the result
discrepancy.

   Thanks again for the help.

   Andy


Strange behavior with bzip2 input files w/release 0.19.0

2008-12-04 Thread Andy Sautins
 

I'm seeing some strange behavior with bzip2 files and release
0.19.0.  I'm wondering if anyone can shed some light on what I'm seeing.
Basically it _looks_ like the processing of a particular bzip2 input
file is stopping after the first bzip2 block.  Below is a comparison of
tests  between a .gz file which seems to do what I expect, and the same
file .bz2 which doesn't behave as I expect.

 

I have the same file stored in hadoop compressed as both bzip2 and
gz formats.  The uncompressed file size is 660,841,894 bytes.  Comparing
the files they both seem to be valid archives of the exact same file.  

 

/usr/local/hadoop/bin/hadoop dfs -cat
bzip2.example/data.bz2/file.txt.bz2 | bunzip2 -c | md5sum

2c82901170f44245fb04d24ad4746e38  -

 

/usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.gz/file.txt.gz
| gunzip -c | md5sum

2c82901170f44245fb04d24ad4746e38  -

 

Given the md5 sums match it seems like the files are the same and
uncompress correctly. 

 

Now when I run a simple Map/Reduce application that just counts
lines in the file I get different results.  

 

  Expected Results:

 

 /usr/local/hadoop/bin/hadoop dfs -cat
bzip2.bug.example/data.gz/file.txt.gz | gunzip -c | wc -l

6884024   

 

   Gzip input file Results: 6,884,024

   Bzip2 input file Results: 9,420

 

 

   Looking at the task log files the MAP_INPUT_BYTES of the .gz file
looks correct ([(MAP_INPUT_BYTES)(Map input bytes)(660,841,894)] ) and
matches the size of the uncompressed file.  However, looking at
MAP_INPUT_BYTES for the .bz2 file it's 900,000 ([(MAP_INPUT_BYTES)(Map
input bytes)(900,000)] ) which matches the block size of the bzip2
compressed file.  So that makes me think for some reason that only the
first bzip2 block of the bzip2 compressed file is being processed.

 

So I'm wondering if my analysis is correct and if there could be an
issue with the processing of bzip2 input files.
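
   For reference, a line-counting job of the sort described above could
look roughly like the sketch below ( the original code isn't shown in the
thread, so this is only an illustration against the 0.19 mapred API ):

   import java.io.IOException;
   import java.util.Iterator;

   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapred.*;

   public class LineCount {

     public static class Map extends MapReduceBase
         implements Mapper<LongWritable, Text, Text, LongWritable> {
       private static final Text LINES = new Text("lines");
       private static final LongWritable ONE = new LongWritable(1);
       public void map(LongWritable key, Text value,
                       OutputCollector<Text, LongWritable> output, Reporter reporter)
           throws IOException {
         output.collect(LINES, ONE);   // one count per input line
       }
     }

     public static class Reduce extends MapReduceBase
         implements Reducer<Text, LongWritable, Text, LongWritable> {
       public void reduce(Text key, Iterator<LongWritable> values,
                          OutputCollector<Text, LongWritable> output, Reporter reporter)
           throws IOException {
         long sum = 0;
         while (values.hasNext()) {
           sum += values.next().get();   // sum the per-line counts
         }
         output.collect(key, new LongWritable(sum));
       }
     }

     public static void main(String[] args) throws IOException {
       JobConf conf = new JobConf(LineCount.class);
       conf.setJobName("linecount");
       conf.setOutputKeyClass(Text.class);
       conf.setOutputValueClass(LongWritable.class);
       conf.setMapperClass(Map.class);
       conf.setCombinerClass(Reduce.class);
       conf.setReducerClass(Reduce.class);
       FileInputFormat.setInputPaths(conf, new Path(args[0]));
       FileOutputFormat.setOutputPath(conf, new Path(args[1]));
       JobClient.runJob(conf);
     }
   }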

 

   Andy 



RE: Strange behavior with bzip2 input files w/release 0.19.0

2008-12-04 Thread Andy Sautins

   Thanks for the response Abdul.

   So, the bzip2 file in question is _kindof_ a concatenation of
multiple bzip2 files.  It's not concatenated using "cat a.bz2 b.bz2 >
yourFile.bz2", but it is created using pbzip2 ( pbzip2 v1.0.2 running on
CentOS 5.2 installed from the EPEL repository ).  My understanding is
that pbzip does roughly what you're saying and concatenates in some
manner.

   I created a simple test case that reproduces the behavior.  I created
a file using the following perl script:

for ($i = 0; $i < 20000000; $i++) {
  print "Line $i\n";
}

I then created two different bzip2 files.  One with bzip2 and one
with pbzip2.  They do have different sizes:

21994233 simple.bzip2.txt.bz2
21999416 simple.pbzip2.txt.bz2

They do decompress to give the same output file
bunzip2 -c simple.bzip2.txt.bz2 | md5sum
581ad242e6cf22650072edd44d6a2d38  -

bunzip2 -c simple.pbzip2.txt.bz2 | md5sum
581ad242e6cf22650072edd44d6a2d38  -

   Running both through the simple line count MapReduce job I get the
same behavior where bzip2 correctly calculates 20,000,000 records, but
the pbzip2 generated file only processes the first block ( 82,829
records ).  

   So, what you're saying about multiple end of stream markers sounds
like it makes sense.  I will say it would be very beneficial to be able
to use pbzip2 generated files to compress hadoop input files.  Using
pbzip2 can greatly reduce the amount of time required to bzip2 compress
files, and it seems to generate a valid bzip2 file ( at least bunzip2
decompresses it correctly ).

   Thank you

   Andy

-Original Message-
From: Abdul Qadeer [mailto:[EMAIL PROTECTED] 
Sent: Thursday, December 04, 2008 12:07 PM
To: core-user@hadoop.apache.org
Subject: Re: Strange behavior with bzip2 input files w/release 0.19.0

Andy,

As you said, you suspect that only one bzip2 block is being decompressed
and used; is your bzip2 file the concatenation of multiple bzip2 files
(i.e. are you doing something like cat a.bz2 b.bz2 c.bz2 > yourFile.bz2)?
In such a case, there will be many bzip2 end of stream markers in a
single file and the bzip2 decompressor will stop on encountering the
first end of block marker when in fact the stream has more data in it.

If this is not the case, then bzip2 should work as gzip or plaintext are
working.  Currently only one mapper gets the whole file (just like gzip;
splitting support for bzip2 is being added in HADOOP-4012, as Alex
mentioned).  The LineRecordReader gets the uncompressed data and does the
rest the same as in the case of gzip or plaintext.  So can you provide
your bzip2 compressed file?  (Maybe upload it somewhere and send the
link.)  I will look into this issue.


Abdul Qadeer





RE: Strange behavior with bzip2 input files w/release 0.19.0

2008-12-04 Thread Andy Sautins

   Thanks Abdul.  Very exciting that hadoop will soon be able to handle
not only pbzip2 files but also be able to split bzip2 files.  

   I will apply the patch and report back. 

   Thank you

   Andy

-Original Message-
From: Abdul Qadeer [mailto:[EMAIL PROTECTED] 
Sent: Thursday, December 04, 2008 1:49 PM
To: core-user@hadoop.apache.org
Subject: Re: Strange behavior with bzip2 input files w/release 0.19.0

Andy,

As was mentioned earlier, splitting support is being added for bzip2
files and the patch is actually under review now.  I think pbzip2
generated files should work fine with that, because the split algorithm
finds the next start of block marker and does not use the end of stream
marker.  We rather use the physical end of file to know when the stream
ends.


So if you look at https://issues.apache.org/jira/browse/HADOOP-4012
you can download the version 4 patch, apply it to the Hadoop code and see
if it's working for you, or you can wait for the review process to
complete so that the code becomes a part of standard Hadoop.  You can add
yourself as a watcher there at JIRA 4012, so that you know when it's
done.  Please let me know if pbzip2 generated files do not work even with
that code.

Thank you,
Abdul Qadeer

