Re: Input split for a streaming job!

2011-11-14 Thread Raj V
Milind

I realised that, thanks to Joey from Cloudera. I have given up on bzip.

Raj



>
>From: "milind.bhandar...@emc.com" 
>To: common-user@hadoop.apache.org; rajv...@yahoo.com; cdh-u...@cloudera.org
>Sent: Monday, November 14, 2011 2:02 PM
>Subject: Re: Input split for a streaming job!
>
>It looks like your hadoop distro does not have
>https://issues.apache.org/jira/browse/HADOOP-4012.
>
>- milind
>
>On 11/10/11 2:40 PM, "Raj V"  wrote:
>
>>All
>>
>>I assumed that the input splits for a streaming job will follow the same
>>logic as a map reduce java job but I seem to be wrong.
>>
>>I started out with 73 gzipped files that vary between 23MB to 255MB in
>>size. My default block size was 128MB.  8 of the 73 files are larger than
>>128 MB
>>
>>When I ran my streaming job, it ran, as expected,  73 mappers ( No
>>reducers for this job).
>>
>>Since I have 128 Nodes in my cluster , I thought I would use more systems
>>in the cluster by increasing the number of mappers. I changed all the
>>gzip files into bzip2 files. I expected the number of mappers to increase
>>to 81. The mappers remained at 73.
>>
>>I tried a second experiment- I changed my dfs.block.size to 32MB. That
>>should have increased my mappers to about ~250. It remains steadfast at
>>73.
>>
>>Is my understanding wrong? With a smaller block size and bzipped files,
>>should I not get more mappers?
>>
>>Raj
>
>
>
>

Re: Input split for a streaming job!

2011-11-14 Thread Milind.Bhandarkar
It looks like your hadoop distro does not have
https://issues.apache.org/jira/browse/HADOOP-4012.

- milind

On 11/10/11 2:40 PM, "Raj V"  wrote:

>All
>
>I assumed that the input splits for a streaming job will follow the same
>logic as a map reduce java job but I seem to be wrong.
>
>I started out with 73 gzipped files that vary between 23MB to 255MB in
>size. My default block size was 128MB.  8 of the 73 files are larger than
>128 MB
>
>When I ran my streaming job, it ran, as expected,  73 mappers ( No
>reducers for this job).
>
>Since I have 128 Nodes in my cluster , I thought I would use more systems
>in the cluster by increasing the number of mappers. I changed all the
>gzip files into bzip2 files. I expected the number of mappers to increase
>to 81. The mappers remained at 73.
>
>I tried a second experiment- I changed my dfs.block.size to 32MB. That
>should have increased my mappers to about ~250. It remains steadfast at
>73.
>
>Is my understanding wrong? With a smaller block size and bzipped files,
>should I not get more mappers?
>
>Raj



RE: Input split for a streaming job!

2011-11-11 Thread Tim Broberg
Or you could use the LZO patch and get *fast* splittable compression that 
doesn't depend on the bz2 generalized splittability scheme:

http://www.cloudera.com/blog/2009/06/parallel-lzo-splittable-compression-for-hadoop/
http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

- Tim.
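[Not part of the original thread] A minimal sketch of the core-site.xml registration the hadoop-lzo posts above describe; the property names follow the hadoop-lzo README and may differ across versions, and .lzo files still need indexing (e.g. with hadoop-lzo's LzoIndexer) before they become splittable:

```xml
<!-- Register the LZO codecs (assumes the hadoop-lzo jar and native libs are installed) -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```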

From: bejoy.had...@gmail.com [bejoy.had...@gmail.com]
Sent: Friday, November 11, 2011 10:44 AM
To: common-user@hadoop.apache.org; Raj V; Tim Broberg
Subject: Re: Input split for a streaming job!

Hi Raj
   AFAIK 0.21 is an unstable release and I doubt anyone would recommend it 
for production. You can play around with it, but a better approach would be 
patching your CDH3u1 with the required patches for splittable BZip2; make 
sure that your new patch doesn't break anything else.

Regards
Bejoy K S

-Original Message-
From: Raj V 
Date: Fri, 11 Nov 2011 10:34:18
To: Tim Broberg; 
common-user@hadoop.apache.org
Reply-To: common-user@hadoop.apache.org
Subject: Re: Input split for a streaming job!

Tim

I am using CDH3 U1 (0.20.2+923), which does not have the patch.

I will try and use 0.21

Raj



>
>From: Tim Broberg 
>To: "common-user@hadoop.apache.org" ; Raj V 
>; Joey Echeverria 
>Sent: Friday, November 11, 2011 10:25 AM
>Subject: RE: Input split for a streaming job!
>
>
>
>What version of hadoop are you using?
>
>We just stumbled on the Jira item for BZIP2 splitting, and it appears to have 
>been added in 0.21.
>
>When I diff 0.20.205 vs trunk, I see
>< public class BZip2Codec implements
>< org.apache.hadoop.io.compress.CompressionCodec {
>---
>> @InterfaceAudience.Public
>> @InterfaceStability.Evolving
>> public class BZip2Codec implements SplittableCompressionCodec {
>So, it appears you need at least 0.21 to play with splittability in BZIP2.
>
> - Tim.
>
>
>From: Raj V [rajv...@yahoo.com]
>Sent: Friday, November 11, 2011 9:18 AM
>To: Joey Echeverria
>Cc: common-user@hadoop.apache.org
>Subject: Re: Input split for a streaming job!
>
>Joey,Anirudh, Bejoy
>
>I am using TextInputFormat Class. (org.apache.hadoop.mapred.TextInputFormat).
>
>And the input files were created using 32MB block size and the files are bzip2.
>
>So all things point to my input files being splittable.
>
>I  will continue poking around.
>
>- best regards
>
>Raj
>
>
>
>>
>>From: Joey Echeverria 
>>To: Raj V 
>>Sent: Friday, November 11, 2011 2:56 AM
>>Subject: Re: Input split for a streaming job!
>>
>>U1 should be able to split the bzip2 files. What input format are you using?
>>
>>-Joey
>>
>>On Thu, Nov 10, 2011 at 9:06 PM, Raj V  wrote:
>>> Sorry to bother you offline.
>>> From the release notes for CDH3U1
>>> ( http://archive.cloudera.com/cdh/3/hadoop-0.20.2+923.97.releasenotes.html)
>>> I understand that split of the bzip files was available.
>>> But returning to my old problem I still see 73 mappers. Did I misunderstand
>>> something?
>>> If necessary, I can re-post the mail to the group.
>>>
>>> 
>>> From: Joey Echeverria 
>>> To: rajv...@yahoo.com
>>> Sent: Thursday, November 10, 2011 3:11 PM
>>> Subject: Re: Input split for a streaming job!
>>>
>>> No problem. Out of curiosity, why are you still using B3?
>>>
>>> -Joey
>>>
>>> On Thu, Nov 10, 2011 at 6:07 PM, Raj V  wrote:
>>>> Joey
>>>> I think I know the answer. I am using CDH3B3 (0.20.2+737) and this does
>>>> not
>>>> seem to support bzip splitting. I should have looked before shooting off
>>>> the
>>>> email :-(
>>>> To answer your second question, I created a completely new set of input
>>>> files with dfs.block.size=32MB and used this as the input data
>>>> Raj
>>>>
>>>>
>>>> 
>>>> From: Joey Echeverria 
>>>> To: cdh-u...@cloudera.org
>>>> Sent: Thursday, November 10, 2011 3:02 PM
>>>> Subject: Re: Input split for a streaming job!
>>>>
>>>> It depends on the version of hadoop that you're using. Also, when you
>>>> changed the block size, did you do it on the actual files, or just the
>>>> default for new files?
>>>>
>>>> -Joey
>>>>
>>>> On Thu, Nov 10, 2011 at 5:52 PM, Raj V  wrote:
>>>>

Re: Input split for a streaming job!

2011-11-11 Thread bejoy.hadoop
Hi Raj
   AFAIK 0.21 is an unstable release and I doubt anyone would recommend it 
for production. You can play around with it, but a better approach would be 
patching your CDH3u1 with the required patches for splittable BZip2; make 
sure that your new patch doesn't break anything else.
 
Regards
Bejoy K S

-Original Message-
From: Raj V 
Date: Fri, 11 Nov 2011 10:34:18 
To: Tim Broberg; 
common-user@hadoop.apache.org
Reply-To: common-user@hadoop.apache.org
Subject: Re: Input split for a streaming job!

Tim

I am using CDH3 U1 (0.20.2+923), which does not have the patch.

I will try and use 0.21

Raj



>
>From: Tim Broberg 
>To: "common-user@hadoop.apache.org" ; Raj V 
>; Joey Echeverria 
>Sent: Friday, November 11, 2011 10:25 AM
>Subject: RE: Input split for a streaming job!
>
>
> 
>What version of hadoop are you using?
> 
>We just stumbled on the Jira item for BZIP2 splitting, and it appears to have 
>been added in 0.21.
> 
>When I diff 0.20.205 vs trunk, I see
>< public class BZip2Codec implements
>< org.apache.hadoop.io.compress.CompressionCodec {
>---
>> @InterfaceAudience.Public
>> @InterfaceStability.Evolving
>> public class BZip2Codec implements SplittableCompressionCodec {
>So, it appears you need at least 0.21 to play with splittability in BZIP2. 
> 
> - Tim.
>
>
>From: Raj V [rajv...@yahoo.com]
>Sent: Friday, November 11, 2011 9:18 AM
>To: Joey Echeverria
>Cc: common-user@hadoop.apache.org
>Subject: Re: Input split for a streaming job!
>
>Joey,Anirudh, Bejoy
>
>I am using TextInputFormat Class. (org.apache.hadoop.mapred.TextInputFormat).
>
>And the input files were created using 32MB block size and the files are bzip2.
>
>So all things point to my input files being splittable.
>
>I  will continue poking around.
>
>- best regards
>
>Raj
>
>
>
>>
>>From: Joey Echeverria 
>>To: Raj V 
>>Sent: Friday, November 11, 2011 2:56 AM
>>Subject: Re: Input split for a streaming job!
>>
>>U1 should be able to split the bzip2 files. What input format are you using?
>>
>>-Joey
>>
>>On Thu, Nov 10, 2011 at 9:06 PM, Raj V  wrote:
>>> Sorry to bother you offline.
>>> From the release notes for CDH3U1
>>> ( http://archive.cloudera.com/cdh/3/hadoop-0.20.2+923.97.releasenotes.html)
>>> I understand that split of the bzip files was available.
>>> But returning to my old problem I still see 73 mappers. Did I misunderstand
>>> something?
>>> If necessary, I can re-post the mail to the group.
>>>
>>> 
>>> From: Joey Echeverria 
>>> To: rajv...@yahoo.com
>>> Sent: Thursday, November 10, 2011 3:11 PM
>>> Subject: Re: Input split for a streaming job!
>>>
>>> No problem. Out of curiosity, why are you still using B3?
>>>
>>> -Joey
>>>
>>> On Thu, Nov 10, 2011 at 6:07 PM, Raj V  wrote:
>>>> Joey
>>>> I think I know the answer. I am using CDH3B3 (0.20.2+737) and this does
>>>> not
>>>> seem to support bzip splitting. I should have looked before shooting off
>>>> the
>>>> email :-(
>>>> To answer your second question, I created a completely new set of input
>>>> files with dfs.block.size=32MB and used this as the input data
>>>> Raj
>>>>
>>>>
>>>> 
>>>> From: Joey Echeverria 
>>>> To: cdh-u...@cloudera.org
>>>> Sent: Thursday, November 10, 2011 3:02 PM
>>>> Subject: Re: Input split for a streaming job!
>>>>
>>>> It depends on the version of hadoop that you're using. Also, when you
>>>> changed the block size, did you do it on the actual files, or just the
>>>> default for new files?
>>>>
>>>> -Joey
>>>>
>>>> On Thu, Nov 10, 2011 at 5:52 PM, Raj V  wrote:
>>>>> Hi Joey,
>>>>> I always thought bzip was splittable.
>>>>> Raj
>>>>>
>>>>> 
>>>>> From: Joey Echeverria 
>>>>> To: cdh-u...@cloudera.org
>>>>> Sent: Thursday, November 10, 2011 2:43 PM
>>>>> Subject: Re: Input split for a streaming job!
>>>>>
>>>>> Gzip and bzip2 compressed files aren't splittable, so you'll always
>>>>> get one mapper pe

Re: Input split for a streaming job!

2011-11-11 Thread Raj V
Tim

I am using CDH3 U1 (0.20.2+923), which does not have the patch.

I will try and use 0.21

Raj



>
>From: Tim Broberg 
>To: "common-user@hadoop.apache.org" ; Raj V 
>; Joey Echeverria 
>Sent: Friday, November 11, 2011 10:25 AM
>Subject: RE: Input split for a streaming job!
>
>
> 
>What version of hadoop are you using?
> 
>We just stumbled on the Jira item for BZIP2 splitting, and it appears to have 
>been added in 0.21.
> 
>When I diff 0.20.205 vs trunk, I see
>< public class BZip2Codec implements
>< org.apache.hadoop.io.compress.CompressionCodec {
>---
>> @InterfaceAudience.Public
>> @InterfaceStability.Evolving
>> public class BZip2Codec implements SplittableCompressionCodec {
>So, it appears you need at least 0.21 to play with splittability in BZIP2. 
> 
> - Tim.
>
>
>From: Raj V [rajv...@yahoo.com]
>Sent: Friday, November 11, 2011 9:18 AM
>To: Joey Echeverria
>Cc: common-user@hadoop.apache.org
>Subject: Re: Input split for a streaming job!
>
>Joey,Anirudh, Bejoy
>
>I am using TextInputFormat Class. (org.apache.hadoop.mapred.TextInputFormat).
>
>And the input files were created using 32MB block size and the files are bzip2.
>
>So all things point to my input files being splittable.
>
>I  will continue poking around.
>
>- best regards
>
>Raj
>
>
>
>>
>>From: Joey Echeverria 
>>To: Raj V 
>>Sent: Friday, November 11, 2011 2:56 AM
>>Subject: Re: Input split for a streaming job!
>>
>>U1 should be able to split the bzip2 files. What input format are you using?
>>
>>-Joey
>>
>>On Thu, Nov 10, 2011 at 9:06 PM, Raj V  wrote:
>>> Sorry to bother you offline.
>>> From the release notes for CDH3U1
>>> ( http://archive.cloudera.com/cdh/3/hadoop-0.20.2+923.97.releasenotes.html)
>>> I understand that split of the bzip files was available.
>>> But returning to my old problem I still see 73 mappers. Did I misunderstand
>>> something?
>>> If necessary, I can re-post the mail to the group.
>>>
>>> 
>>> From: Joey Echeverria 
>>> To: rajv...@yahoo.com
>>> Sent: Thursday, November 10, 2011 3:11 PM
>>> Subject: Re: Input split for a streaming job!
>>>
>>> No problem. Out of curiosity, why are you still using B3?
>>>
>>> -Joey
>>>
>>> On Thu, Nov 10, 2011 at 6:07 PM, Raj V  wrote:
>>>> Joey
>>>> I think I know the answer. I am using CDH3B3 (0.20.2+737) and this does
>>>> not
>>>> seem to support bzip splitting. I should have looked before shooting off
>>>> the
>>>> email :-(
>>>> To answer your second question, I created a completely new set of input
>>>> files with dfs.block.size=32MB and used this as the input data
>>>> Raj
>>>>
>>>>
>>>> 
>>>> From: Joey Echeverria 
>>>> To: cdh-u...@cloudera.org
>>>> Sent: Thursday, November 10, 2011 3:02 PM
>>>> Subject: Re: Input split for a streaming job!
>>>>
>>>> It depends on the version of hadoop that you're using. Also, when you
>>>> changed the block size, did you do it on the actual files, or just the
>>>> default for new files?
>>>>
>>>> -Joey
>>>>
>>>> On Thu, Nov 10, 2011 at 5:52 PM, Raj V  wrote:
>>>>> Hi Joey,
>>>>> I always thought bzip was splittable.
>>>>> Raj
>>>>>
>>>>> 
>>>>> From: Joey Echeverria 
>>>>> To: cdh-u...@cloudera.org
>>>>> Sent: Thursday, November 10, 2011 2:43 PM
>>>>> Subject: Re: Input split for a streaming job!
>>>>>
>>>>> Gzip and bzip2 compressed files aren't splittable, so you'll always
>>>>> get one mapper per file.
>>>>>
>>>>> -Joey
>>>>>
>>>>> On Thu, Nov 10, 2011 at 5:40 PM, Raj V  wrote:
>>>>>> All
>>>>>> I assumed that the input splits for a streaming job will follow the same
>>>>>> logic as a map reduce java job but I seem to be wrong.
>>>>>> I started out with 73 gzipped files that vary between 23MB to 255MB in
>>>>>> size.
>>>>>> My default block size was

RE: Input split for a streaming job!

2011-11-11 Thread Tim Broberg
What version of hadoop are you using?

We just stumbled on the Jira item for BZIP2 splitting, and it appears to have 
been added in 0.21.

When I diff 0.20.205 vs trunk, I see

< public class BZip2Codec implements
< org.apache.hadoop.io.compress.CompressionCodec {
---
> @InterfaceAudience.Public
> @InterfaceStability.Evolving
> public class BZip2Codec implements SplittableCompressionCodec {

So, it appears you need at least 0.21 to play with splittability in BZIP2.

 - Tim.
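[Not part of the original thread] The class change in the diff above is the whole story: an input format treats a compressed file as splittable only when its codec implements SplittableCompressionCodec. A toy Python model of that decision (class names mirror Hadoop's; this is a sketch, not Hadoop code):

```python
# Toy model of the 0.21+ splittability check; class names mirror Hadoop's.
class CompressionCodec:
    pass

class SplittableCompressionCodec(CompressionCodec):
    pass

class GzipCodec(CompressionCodec):             # gzip is never splittable
    pass

class BZip2Codec(SplittableCompressionCodec):  # splittable only from 0.21 on
    pass

def is_splittable(codec):
    # Uncompressed input (no codec) always splits; compressed input splits
    # only when the codec advertises SplittableCompressionCodec.
    return codec is None or isinstance(codec, SplittableCompressionCodec)

print(is_splittable(GzipCodec()))   # False
print(is_splittable(BZip2Codec()))  # True
```

On 0.20.x, where BZip2Codec implements only CompressionCodec, the same check comes out False, which is why the mapper count stays pinned at one per file.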


From: Raj V [rajv...@yahoo.com]
Sent: Friday, November 11, 2011 9:18 AM
To: Joey Echeverria
Cc: common-user@hadoop.apache.org
Subject: Re: Input split for a streaming job!

Joey,Anirudh, Bejoy

I am using TextInputFormat Class. (org.apache.hadoop.mapred.TextInputFormat).

And the input files were created using 32MB block size and the files are bzip2.

So all things point to my input files being splittable.

I  will continue poking around.

- best regards

Raj



>
>From: Joey Echeverria 
>To: Raj V 
>Sent: Friday, November 11, 2011 2:56 AM
>Subject: Re: Input split for a streaming job!
>
>U1 should be able to split the bzip2 files. What input format are you using?
>
>-Joey
>
>On Thu, Nov 10, 2011 at 9:06 PM, Raj V  wrote:
>> Sorry to bother you offline.
>> From the release notes for CDH3U1
>> ( http://archive.cloudera.com/cdh/3/hadoop-0.20.2+923.97.releasenotes.html)
>> I understand that split of the bzip files was available.
>> But returning to my old problem I still see 73 mappers. Did I misunderstand
>> something?
>> If necessary, I can re-post the mail to the group.
>>
>> 
>> From: Joey Echeverria 
>> To: rajv...@yahoo.com
>> Sent: Thursday, November 10, 2011 3:11 PM
>> Subject: Re: Input split for a streaming job!
>>
>> No problem. Out of curiosity, why are you still using B3?
>>
>> -Joey
>>
>> On Thu, Nov 10, 2011 at 6:07 PM, Raj V  wrote:
>>> Joey
>>> I think I know the answer. I am using CDH3B3 (0.20.2+737) and this does
>>> not
>>> seem to support bzip splitting. I should have looked before shooting off
>>> the
>>> email :-(
>>> To answer your second question, I created a completely new set of input
>>> files with dfs.block.size=32MB and used this as the input data
>>> Raj
>>>
>>>
>>> 
>>> From: Joey Echeverria 
>>> To: cdh-u...@cloudera.org
>>> Sent: Thursday, November 10, 2011 3:02 PM
>>> Subject: Re: Input split for a streaming job!
>>>
>>> It depends on the version of hadoop that you're using. Also, when you
>>> changed the block size, did you do it on the actual files, or just the
>>> default for new files?
>>>
>>> -Joey
>>>
>>> On Thu, Nov 10, 2011 at 5:52 PM, Raj V  wrote:
>>>> Hi Joey,
>>>> I always thought bzip was splittable.
>>>> Raj
>>>>
>>>> 
>>>> From: Joey Echeverria 
>>>> To: cdh-u...@cloudera.org
>>>> Sent: Thursday, November 10, 2011 2:43 PM
>>>> Subject: Re: Input split for a streaming job!
>>>>
>>>> Gzip and bzip2 compressed files aren't splittable, so you'll always
>>>> get one mapper per file.
>>>>
>>>> -Joey
>>>>
>>>> On Thu, Nov 10, 2011 at 5:40 PM, Raj V  wrote:
>>>>> All
>>>>> I assumed that the input splits for a streaming job will follow the same
>>>>> logic as a map reduce java job but I seem to be wrong.
>>>>> I started out with 73 gzipped files that vary between 23MB to 255MB in
>>>>> size.
>>>>> My default block size was 128MB.  8 of the 73 files are larger than 128
>>>>> MB
>>>>> When I ran my streaming job, it ran, as expected,  73 mappers ( No
>>>>> reducers
>>>>> for this job).
>>>>> Since I have 128 Nodes in my cluster , I thought I would use more
>>>>> systems
>>>>> in
>>>>> the cluster by increasing the number of mappers. I changed all the gzip
>>>>> files into bzip2 files. I expected the number of mappers to increase to
>>>>> 81.
>>>>> The mappers remained at 73.
>>>>> I tried a second experiment- I changed my dfs.block.size to 32MB. That
>>>>> should have increased my mappers to about ~250. It remains steadfast at
>>>>> 73.
>>&

Re: Input split for a streaming job!

2011-11-11 Thread Raj V
Joey,Anirudh, Bejoy

I am using TextInputFormat Class. (org.apache.hadoop.mapred.TextInputFormat).

And the input files were created using 32MB block size and the files are bzip2.

So all things point to my input files being splittable.

I  will continue poking around.

- best regards

Raj



>
>From: Joey Echeverria 
>To: Raj V 
>Sent: Friday, November 11, 2011 2:56 AM
>Subject: Re: Input split for a streaming job!
>
>U1 should be able to split the bzip2 files. What input format are you using?
>
>-Joey
>
>On Thu, Nov 10, 2011 at 9:06 PM, Raj V  wrote:
>> Sorry to bother you offline.
>> From the release notes for CDH3U1
>> ( http://archive.cloudera.com/cdh/3/hadoop-0.20.2+923.97.releasenotes.html)
>> I understand that split of the bzip files was available.
>> But returning to my old problem I still see 73 mappers. Did I misunderstand
>> something?
>> If necessary, I can re-post the mail to the group.
>>
>> 
>> From: Joey Echeverria 
>> To: rajv...@yahoo.com
>> Sent: Thursday, November 10, 2011 3:11 PM
>> Subject: Re: Input split for a streaming job!
>>
>> No problem. Out of curiosity, why are you still using B3?
>>
>> -Joey
>>
>> On Thu, Nov 10, 2011 at 6:07 PM, Raj V  wrote:
>>> Joey
>>> I think I know the answer. I am using CDH3B3 (0.20.2+737) and this does
>>> not
>>> seem to support bzip splitting. I should have looked before shooting off
>>> the
>>> email :-(
>>> To answer your second question, I created a completely new set of input
>>> files with dfs.block.size=32MB and used this as the input data
>>> Raj
>>>
>>>
>>> 
>>> From: Joey Echeverria 
>>> To: cdh-u...@cloudera.org
>>> Sent: Thursday, November 10, 2011 3:02 PM
>>> Subject: Re: Input split for a streaming job!
>>>
>>> It depends on the version of hadoop that you're using. Also, when you
>>> changed the block size, did you do it on the actual files, or just the
>>> default for new files?
>>>
>>> -Joey
>>>
>>> On Thu, Nov 10, 2011 at 5:52 PM, Raj V  wrote:
>>>> Hi Joey,
>>>> I always thought bzip was splittable.
>>>> Raj
>>>>
>>>> 
>>>> From: Joey Echeverria 
>>>> To: cdh-u...@cloudera.org
>>>> Sent: Thursday, November 10, 2011 2:43 PM
>>>> Subject: Re: Input split for a streaming job!
>>>>
>>>> Gzip and bzip2 compressed files aren't splittable, so you'll always
>>>> get one mapper per file.
>>>>
>>>> -Joey
>>>>
>>>> On Thu, Nov 10, 2011 at 5:40 PM, Raj V  wrote:
>>>>> All
>>>>> I assumed that the input splits for a streaming job will follow the same
>>>>> logic as a map reduce java job but I seem to be wrong.
>>>>> I started out with 73 gzipped files that vary between 23MB to 255MB in
>>>>> size.
>>>>> My default block size was 128MB.  8 of the 73 files are larger than 128
>>>>> MB
>>>>> When I ran my streaming job, it ran, as expected,  73 mappers ( No
>>>>> reducers
>>>>> for this job).
>>>>> Since I have 128 Nodes in my cluster , I thought I would use more
>>>>> systems
>>>>> in
>>>>> the cluster by increasing the number of mappers. I changed all the gzip
>>>>> files into bzip2 files. I expected the number of mappers to increase to
>>>>> 81.
>>>>> The mappers remained at 73.
>>>>> I tried a second experiment- I changed my dfs.block.size to 32MB. That
>>>>> should have increased my mappers to about ~250. It remains steadfast at
>>>>> 73.
>>>>> Is my understanding wrong? With a smaller block size and bzipped files,
>>>>> should I not get more mappers?
>>>>> Raj
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Joseph Echeverria
>>>> Cloudera, Inc.
>>>> 443.305.9434
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Joseph Echeverria
>>> Cloudera, Inc.
>>> 443.305.9434
>>>
>>>
>>>
>>
>>
>>
>> --
>> Joseph Echeverria
>> Cloudera, Inc.
>> 443.305.9434
>>
>>
>>
>
>
>
>-- 
>Joseph Echeverria
>Cloudera, Inc.
>443.305.9434
>
>
>

Re: Input split for a streaming job!

2011-11-11 Thread Bejoy KS
Hi Raj
  Is your streaming job using WholeFileInputFormat or some custom
InputFormat that reads files as a whole? If that is the case, then this is
the expected behavior.
  Also, you mentioned you changed dfs.block.size to 32 MB. AFAIK this
value applies only to new files written into HDFS; existing files keep
their previous block size. To test such scenarios you don't need to change
the block size at the cluster level; you can specify it per file while
copying into HDFS:
hadoop dfs -D dfs.block.size=16777216 -copyFromLocal /src/file /dest/file
Did you do it that way, and does the number of mappers still not change?

AFAIK bzip2 is splittable. Please correct me if I'm wrong.

On Fri, Nov 11, 2011 at 2:07 PM, Anirudh Jhina wrote:

> Raj,
>
> What InputFormat are you using? Gzip-compressed files are not splittable, so
> if you have 73 gzip files, there will be 73 corresponding mappers, one per
> file. Look at the TextInputFormat.isSplitable() description.
>
> Thanks,
> ~Anirudh
>
> On Thu, Nov 10, 2011 at 2:40 PM, Raj V  wrote:
>
> > All
> >
> > I assumed that the input splits for a streaming job will follow the same
> > logic as a map reduce java job but I seem to be wrong.
> >
> > I started out with 73 gzipped files that vary between 23MB to 255MB in
> > size. My default block size was 128MB.  8 of the 73 files are larger than
> > 128 MB
> >
> > When I ran my streaming job, it ran, as expected,  73 mappers ( No
> > reducers for this job).
> >
> > Since I have 128 Nodes in my cluster , I thought I would use more systems
> > in the cluster by increasing the number of mappers. I changed all the
> gzip
> > files into bzip2 files. I expected the number of mappers to increase to
> 81.
> > The mappers remained at 73.
> >
> > I tried a second experiment- I changed my dfs.block.size to 32MB. That
> > should have increased my mappers to about ~250. It remains steadfast at
> 73.
> >
> > Is my understanding wrong? With a smaller block size and bzipped files,
> > should I not get more mappers?
> >
> > Raj
>


Re: Input split for a streaming job!

2011-11-11 Thread Anirudh Jhina
Raj,

What InputFormat are you using? Gzip-compressed files are not splittable, so
if you have 73 gzip files, there will be 73 corresponding mappers, one per
file. Look at the TextInputFormat.isSplitable() description.

Thanks,
~Anirudh

On Thu, Nov 10, 2011 at 2:40 PM, Raj V  wrote:

> All
>
> I assumed that the input splits for a streaming job will follow the same
> logic as a map reduce java job but I seem to be wrong.
>
> I started out with 73 gzipped files that vary between 23MB to 255MB in
> size. My default block size was 128MB.  8 of the 73 files are larger than
> 128 MB
>
> When I ran my streaming job, it ran, as expected,  73 mappers ( No
> reducers for this job).
>
> Since I have 128 Nodes in my cluster , I thought I would use more systems
> in the cluster by increasing the number of mappers. I changed all the gzip
> files into bzip2 files. I expected the number of mappers to increase to 81.
> The mappers remained at 73.
>
> I tried a second experiment- I changed my dfs.block.size to 32MB. That
> should have increased my mappers to about ~250. It remains steadfast at 73.
>
> Is my understanding wrong? With a smaller block size and bzipped files,
> should I not get more mappers?
>
> Raj


Input split for a streaming job!

2011-11-10 Thread Raj V
All

I assumed that the input splits for a streaming job will follow the same logic 
as a map reduce java job but I seem to be wrong. 

I started out with 73 gzipped files that vary between 23MB to 255MB in size. My 
default block size was 128MB.  8 of the 73 files are larger than 128 MB

When I ran my streaming job, it ran, as expected,  73 mappers ( No reducers for 
this job).

Since I have 128 Nodes in my cluster , I thought I would use more systems in 
the cluster by increasing the number of mappers. I changed all the gzip files 
into bzip2 files. I expected the number of mappers to increase to 81. The 
mappers remained at 73.

I tried a second experiment- I changed my dfs.block.size to 32MB. That should 
have increased my mappers to about ~250. It remains steadfast at 73.

Is my understanding wrong? With a smaller block size and bzipped files, should 
I not get more mappers?

Raj
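[Not part of the original thread] A rough sketch of the arithmetic behind the mapper counts discussed above. The two rules (one mapper per unsplittable file; roughly ceil(size / block size) splits per splittable file) are standard FileInputFormat behavior; the per-file sizes below are made up to match the thread's totals:

```python
import math

# Hypothetical sizes: 65 files under 128 MB, 8 files of ~200 MB above it,
# 73 files total, matching the shape of the data set in the thread.
def num_mappers(sizes_mb, block_mb, splittable):
    if not splittable:
        return len(sizes_mb)                      # one mapper per whole file
    # one mapper per block-sized split of each file
    return sum(math.ceil(s / block_mb) for s in sizes_mb)

sizes = [100] * 65 + [200] * 8

print(num_mappers(sizes, 128, splittable=False))  # gzip: 73
print(num_mappers(sizes, 128, splittable=True))   # splittable bzip2: 81
```

With a codec the distro treats as unsplittable, neither recompressing to bzip2 nor shrinking the block size changes the count, which is exactly the behavior Raj observed.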