Re: Are hadoop fs commands serial or parallel

2011-05-20 Thread Dieter Plaetinck
What do you mean, clunky?
IMHO this is a quite elegant, simple, working solution.
Sure, this spawns multiple processes, but it beats any
API over-complication, IMHO.

Dieter


On Wed, 18 May 2011 11:39:36 -0500
Patrick Angeles  wrote:

> kinda clunky but you could do this via shell:
> 
> for FILE in $LIST_OF_FILES ; do
>   hadoop fs -copyFromLocal $FILE $DEST_PATH &
> done
> 
> If doing this via the Java API, then, yes you will have to use
> multiple threads.
> 
> On Wed, May 18, 2011 at 1:04 AM, Mapred Learn
> wrote:
> 
> > Thanks harsh !
> > That means basically both APIs as well as hadoop client commands
> > allow only serial writes.
> > I was wondering what could be other ways to write data in parallel
> > to HDFS other than using multiple parallel threads.
> >
> > Thanks,
> > JJ
> >
> > Sent from my iPhone
> >
> > On May 17, 2011, at 10:59 PM, Harsh J  wrote:
> >
> > > Hello,
> > >
> > > Adding to Joey's response, copyFromLocal's current implementation
> > > is
> > serial
> > > given a list of files.
> > >
> > > On Wed, May 18, 2011 at 9:57 AM, Mapred Learn
> > >  wrote:
> > >> Thanks Joey !
> > >> I will try to find out abt copyFromLocal. Looks like Hadoop Apis
> > >> write
> > > serially as you pointed out.
> > >>
> > >> Thanks,
> > >> -JJ
> > >>
> > >> On May 17, 2011, at 8:32 PM, Joey Echeverria 
> > >> wrote:
> > >>
> > >>> The sequence file writer definitely does it serially as you can
> > >>> only ever write to the end of a file in Hadoop.
> > >>>
> > >>> Doing copyFromLocal could write multiple files in parallel (I'm
> > >>> not sure if it does or not), but a single file would be written
> > >>> serially.
> > >>>
> > >>> -Joey
> > >>>
> > >>> On Tue, May 17, 2011 at 5:44 PM, Mapred Learn
> > >>> 
> > > wrote:
> >  Hi,
> >  My question is when I run a command from hdfs client, for eg.
> >  hadoop
> > fs
> >  -copyFromLocal or create a sequence file writer in java code
> >  and
> > append
> >  key/values to it through Hadoop APIs, does it internally
> > transfer/write
> > > data
> >  to HDFS serially or in parallel ?
> > 
> >  Thanks in advance,
> >  -JJ
> > 
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Joseph Echeverria
> > >>> Cloudera, Inc.
> > >>> 443.305.9434
> > >>
> > >
> > > --
> > > Harsh J
> >



Re: outputCollector vs. Localfile

2011-05-20 Thread Harsh J
Mark,

On Fri, May 20, 2011 at 10:17 AM, Mark question  wrote:
> This is puzzling me ...
>
>  With a mapper producing output of size ~ 400 MB ... which one is supposed
> to be faster?
>
>  1) output collector: which will write to local file then copy to HDFS since
> I don't have reducers.

A regular map-only job does not write to the local FS; it writes to
HDFS directly (i.e., to a local DN if one is found).

-- 
Harsh J


Why Only 1 Reducer is running ??

2011-05-20 Thread praveenesh kumar
Hello everyone,

I am using the wordcount application to test my Hadoop cluster of 5 nodes.
The file size is around 5 GB.
It's taking around 2 min 40 sec to execute.
But when I check the JobTracker web portal, I see only one
reducer running. Why so?
How can I change the code so that it runs multiple reducers as well?

Thanks,
Praveenesh


Re: Why Only 1 Reducer is running ??

2011-05-20 Thread James Seigel Tynt
The job could be designed to use only one reducer.

On 2011-05-20, at 7:19 AM, praveenesh kumar  wrote:

> Hello everyone,
> 
> I am using wordcount application to test on my hadoop cluster of 5 nodes.
> The file size is around 5 GB.
> Its taking around 2 min - 40 sec for execution.
> But when I am checking the JobTracker web portal, I am seeing only one
> reducer is running. Why so  ??
> How can I change the code so that I will run multiple reducers also ??
> 
> Thanks,
> Praveenesh


Re: Why Only 1 Reducer is running ??

2011-05-20 Thread praveenesh kumar
I am using the wordcount example that comes with Hadoop.
How can I configure it to use multiple reducers?
I guess multiple reducers will make it run faster... Does it?


On Fri, May 20, 2011 at 6:51 PM, James Seigel Tynt  wrote:

> The job could be designed to use one reducer
>
> On 2011-05-20, at 7:19 AM, praveenesh kumar  wrote:
>
> > Hello everyone,
> >
> > I am using wordcount application to test on my hadoop cluster of 5 nodes.
> > The file size is around 5 GB.
> > Its taking around 2 min - 40 sec for execution.
> > But when I am checking the JobTracker web portal, I am seeing only one
> > reducer is running. Why so  ??
> > How can I change the code so that I will run multiple reducers also ??
> >
> > Thanks,
> > Praveenesh
>


Re: Why Only 1 Reducer is running ??

2011-05-20 Thread modemide
What does your mapred-site.xml file say?

I've used wordcount and had close to 12 reducers running on a
6-datanode cluster on a 3 GB file.

I have a configuration in there which says:
mapred.reduce.tasks = 12

The reason I chose 12 is that it was recommended to use 2x the
number of tasktrackers.





On 5/20/11, praveenesh kumar  wrote:
> Hello everyone,
>
> I am using wordcount application to test on my hadoop cluster of 5 nodes.
> The file size is around 5 GB.
> Its taking around 2 min - 40 sec for execution.
> But when I am checking the JobTracker web portal, I am seeing only one
> reducer is running. Why so  ??
> How can I change the code so that I will run multiple reducers also ??
>
> Thanks,
> Praveenesh
>


Re: Why Only 1 Reducer is running ??

2011-05-20 Thread praveenesh kumar
Hi,

My mapred-site.xml is pretty simple.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>ub13:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If
    "local", then jobs are run in-process as a single map and reduce task.
    </description>
  </property>
</configuration>
Where should I put the settings that you are suggesting?



On Fri, May 20, 2011 at 6:55 PM, modemide  wrote:

> what does your mapred-site.xml file say?
>
> I've used wordcount and had close to 12 reduces running on a 6
> datanode cluster on a 3 GB file.
>
>
> I have a configuration in there which says:
> mapred.reduce.tasks = 12
>
> The reason I chose 12 was because it was recommended that I choose 2x
> number of tasktrackers.
>
>
>
>
>
> On 5/20/11, praveenesh kumar  wrote:
> > Hello everyone,
> >
> > I am using wordcount application to test on my hadoop cluster of 5 nodes.
> > The file size is around 5 GB.
> > Its taking around 2 min - 40 sec for execution.
> > But when I am checking the JobTracker web portal, I am seeing only one
> > reducer is running. Why so  ??
> > How can I change the code so that I will run multiple reducers also ??
> >
> > Thanks,
> > Praveenesh
> >
>


RE: Why Only 1 Reducer is running ??

2011-05-20 Thread Evert Lammerts
Hi Praveenesh,

* You can set the maximum number of reducers per node in your mapred-site.xml
using mapred.tasktracker.reduce.tasks.maximum (default: 2).
* You can set the default number of reduce tasks with mapred.reduce.tasks
(default: 1 - this is what causes your single reducer).
* Your job can override this setting by calling Job.setNumReduceTasks(int)
(http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html#setNumReduceTasks(int)).
A short sketch follows below.
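
For example, a minimal sketch (the value 10 is purely illustrative; roughly 2x the
number of tasktrackers is the usual rule of thumb). In mapred-site.xml, inside the
<configuration> element:

  <property>
    <name>mapred.reduce.tasks</name>
    <value>10</value>
  </property>

or, equivalently, in the job driver before submission:

  job.setNumReduceTasks(10);  // overrides mapred.reduce.tasks for this job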

Cheers,
Evert


> -Original Message-
> From: modemide [mailto:modem...@gmail.com]
> Sent: vrijdag 20 mei 2011 15:26
> To: common-user@hadoop.apache.org
> Subject: Re: Why Only 1 Reducer is running ??
>
> what does your mapred-site.xml file say?
>
> I've used wordcount and had close to 12 reduces running on a 6
> datanode cluster on a 3 GB file.
>
>
> I have a configuration in there which says:
> mapred.reduce.tasks = 12
>
> The reason I chose 12 was because it was recommended that I choose 2x
> number of tasktrackers.
>
>
>
>
>
> On 5/20/11, praveenesh kumar  wrote:
> > Hello everyone,
> >
> > I am using wordcount application to test on my hadoop cluster of 5
> nodes.
> > The file size is around 5 GB.
> > Its taking around 2 min - 40 sec for execution.
> > But when I am checking the JobTracker web portal, I am seeing only
> one
> > reducer is running. Why so  ??
> > How can I change the code so that I will run multiple reducers also
> ??
> >
> > Thanks,
> > Praveenesh
> >


Re: Are hadoop fs commands serial or parallel

2011-05-20 Thread Brian Bockelman

On May 20, 2011, at 6:10 AM, Dieter Plaetinck wrote:

> What do you mean clunky?
> IMHO this is a quite elegant, simple, working solution.

Try giving it to a user; watch them feed it a list of 10,000 files; watch the 
machine swap to death and the disks uselessly thrash.

> Sure this spawns multiple processes, but it beats any
> api-overcomplications, imho.
> 

Simple doesn't imply scalable, unfortunately.
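
A middle ground, as a sketch rather than a recommendation: bound the concurrency
instead of backgrounding every copy at once. This assumes GNU xargs and a file
FILE_LIST with one source path per line; DEST_PATH is a placeholder as before.

  # at most 4 copies in flight at any time
  cat "$FILE_LIST" | xargs -P 4 -I{} hadoop fs -copyFromLocal {} "$DEST_PATH"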

Brian

> Dieter
> 
> 
> On Wed, 18 May 2011 11:39:36 -0500
> Patrick Angeles  wrote:
> 
>> kinda clunky but you could do this via shell:
>> 
>> for FILE in $LIST_OF_FILES ; do
>>  hadoop fs -copyFromLocal $FILE $DEST_PATH &
>> done
>> 
>> If doing this via the Java API, then, yes you will have to use
>> multiple threads.
>> 
>> On Wed, May 18, 2011 at 1:04 AM, Mapred Learn
>> wrote:
>> 
>>> Thanks harsh !
>>> That means basically both APIs as well as hadoop client commands
>>> allow only serial writes.
>>> I was wondering what could be other ways to write data in parallel
>>> to HDFS other than using multiple parallel threads.
>>> 
>>> Thanks,
>>> JJ
>>> 
>>> Sent from my iPhone
>>> 
>>> On May 17, 2011, at 10:59 PM, Harsh J  wrote:
>>> 
 Hello,
 
 Adding to Joey's response, copyFromLocal's current implementation
 is
>>> serial
 given a list of files.
 
 On Wed, May 18, 2011 at 9:57 AM, Mapred Learn
  wrote:
> Thanks Joey !
> I will try to find out abt copyFromLocal. Looks like Hadoop Apis
> write
 serially as you pointed out.
> 
> Thanks,
> -JJ
> 
> On May 17, 2011, at 8:32 PM, Joey Echeverria 
> wrote:
> 
>> The sequence file writer definitely does it serially as you can
>> only ever write to the end of a file in Hadoop.
>> 
>> Doing copyFromLocal could write multiple files in parallel (I'm
>> not sure if it does or not), but a single file would be written
>> serially.
>> 
>> -Joey
>> 
>> On Tue, May 17, 2011 at 5:44 PM, Mapred Learn
>> 
 wrote:
>>> Hi,
>>> My question is when I run a command from hdfs client, for eg.
>>> hadoop
>>> fs
>>> -copyFromLocal or create a sequence file writer in java code
>>> and
>>> append
>>> key/values to it through Hadoop APIs, does it internally
>>> transfer/write
 data
>>> to HDFS serially or in parallel ?
>>> 
>>> Thanks in advance,
>>> -JJ
>>> 
>> 
>> 
>> 
>> --
>> Joseph Echeverria
>> Cloudera, Inc.
>> 443.305.9434
> 
 
 --
 Harsh J
>>> 





Configuring jvm metrics in hadoop-0.20.203.0

2011-05-20 Thread Matyas Markovics

Hi,
I am trying to get JVM metrics from the new version of Hadoop.
I have read the migration instructions and came up with the following
content for hadoop-metrics2.properties:

*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
jvm.sink.file.period=2
jvm.sink.file.filename=/home/ec2-user/jvmmetrics.log

Any help would be appreciated even if you have a different approach to
get memory usage from reducers.

Thanks in advance.
-- 
Best Regards,
Matyas Markovics


Re: outputCollector vs. Localfile

2011-05-20 Thread Mark question
I thought it was, because of the FileBytesWritten counter. Thanks for the
clarification.
Mark

On Fri, May 20, 2011 at 4:23 AM, Harsh J  wrote:

> Mark,
>
> On Fri, May 20, 2011 at 10:17 AM, Mark question 
> wrote:
> > This is puzzling me ...
> >
> >  With a mapper producing output of size ~ 400 MB ... which one is
> supposed
> > to be faster?
> >
> >  1) output collector: which will write to local file then copy to HDFS
> since
> > I don't have reducers.
>
> A regular map-only job does not write to the local FS, it writes to
> the HDFS directly (i.e., a local DN if one is found).
>
> --
> Harsh J
>


What's the easiest way to count the number of pairs in a directory?

2011-05-20 Thread W.P. McNeill
I've got a directory with a bunch of MapReduce data in it.  I want to know
how many key/value pairs it contains.  I could write a mapper-only
process that takes key/value pairs as input and updates a
counter, but it seems like this utility should already exist.  Does it, or
do I have to roll my own?

Bonus question: is there a way to count the number of key/value pairs
without deserializing the values?  This can be expensive for the data I'm
working with.


Can I number output results with a Counter?

2011-05-20 Thread Mark Kerzner
Hi, can I use a Counter to give each record in all reducers a consecutive
number? Currently I am using a single Reducer, but that is an anti-pattern.
I need to assign consecutive numbers to all output records in all
reducers, and it does not matter how, as long as each gets its own number.

If it IS possible, then how do multiple processes access those counters
without creating race conditions?

Thank you,

Mark


Re: What's the easiest way to count the number of pairs in a directory?

2011-05-20 Thread Joey Echeverria
What format is the input data in?

At first glance, I would run an identity mapper and use a
NullOutputFormat so you don't get any data written. The built-in
counters already count the number of key/value pairs read in by the
mappers.
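
A minimal sketch of such a job (new mapreduce API, 0.20-style; the class name,
input path argument and SequenceFileInputFormat are placeholders - substitute
whatever InputFormat actually matches your data):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class CountPairs {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "count-pairs");
    job.setJarByClass(CountPairs.class);
    job.setMapperClass(Mapper.class);                        // identity mapper
    job.setNumReduceTasks(0);                                // map-only
    job.setInputFormatClass(SequenceFileInputFormat.class);  // placeholder input format
    job.setOutputFormatClass(NullOutputFormat.class);        // discard all output
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.waitForCompletion(true);
    // The "Map input records" counter of the finished job is the number of key/value pairs.
  }
}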

-Joey

On Fri, May 20, 2011 at 9:34 AM, W.P. McNeill  wrote:
> I've got a directory with a bunch of MapReduce data in it.  I want to know
> how many  pairs it contains.  I could write a mapper-only
> process that takes  pairs as input and updates a
> counter, but it seems like this utility should already exist.  Does it, or
> do I have to roll my own?
>
> Bonus question, is there a way to count the number of  pairs
> without deserializing the values?  This can be expensive for the data I'm
> working with.
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: What's the easiest way to count the number of pairs in a directory?

2011-05-20 Thread James Seigel
The cheapest way would be to check the counters as you write them in
the first place and keep a running score. :)

Sent from my mobile. Please excuse the typos.

On 2011-05-20, at 10:35 AM, "W.P. McNeill"  wrote:

> I've got a directory with a bunch of MapReduce data in it.  I want to know
> how many  pairs it contains.  I could write a mapper-only
> process that takes  pairs as input and updates a
> counter, but it seems like this utility should already exist.  Does it, or
> do I have to roll my own?
>
> Bonus question, is there a way to count the number of  pairs
> without deserializing the values?  This can be expensive for the data I'm
> working with.


Re: Can I number output results with a Counter?

2011-05-20 Thread Joey Echeverria
To make sure I understand you correctly: you need a globally unique,
one-up counter for each output record?

If you have an upper bound on the number of records a single reducer
can output and you can afford gaps, you could just use the task id,
multiply it by the max number of records, and count up from there.

If that doesn't work for you, then you'll need some kind of
central service for allocating numbers, which could become a
bottleneck.
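
A rough sketch of the first option (new mapreduce API; MAX_PER_REDUCER is an
assumed upper bound, and the Text value types are placeholders):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class NumberingReducer extends Reducer<Text, Text, LongWritable, Text> {
  private static final long MAX_PER_REDUCER = 10000000L; // assumed per-reducer upper bound
  private long next;

  @Override
  protected void setup(Context context) {
    // Each reduce task starts its range at taskId * MAX_PER_REDUCER, so ranges never overlap.
    int taskId = context.getTaskAttemptID().getTaskID().getId();
    next = (long) taskId * MAX_PER_REDUCER;
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      context.write(new LongWritable(next++), value);
    }
  }
}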

-Joey

On Fri, May 20, 2011 at 9:55 AM, Mark Kerzner  wrote:
> Hi, can I use a Counter to give each record in all reducers a consecutive
> number? Currently I am using a single Reducer, but it is an anti-pattern.
> But I need to assign consecutive numbers to all output records in all
> reducers, and it does not matter how, as long as each gets its own number.
>
> If it IS possible, then how are multiple processes accessing those counters
> without creating race conditions.
>
> Thank you,
>
> Mark
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: Can I number output results with a Counter?

2011-05-20 Thread Mark Kerzner
Joey,

You understood me perfectly well. I see your first suggestion, but I am not
allowed to have gaps. A central service is something I may consider if
the single reducer becomes a worse bottleneck than the service itself.

But what are counters for? They seem to be exactly that.

Mark

On Fri, May 20, 2011 at 12:01 PM, Joey Echeverria  wrote:

> To make sure I understand you correctly, you need a globally unique
> one up counter for each output record?
>
> If you had an upper bound on the number of records a single reducer
> could output and you can afford to have gaps, you could just use the
> task id and multiply that by the max number of records and then one up
> from there.
>
> If that doesn't work for you, then you'll need to use some kind of
> central service for allocating numbers which could become a
> bottleneck.
>
> -Joey
>
> On Fri, May 20, 2011 at 9:55 AM, Mark Kerzner 
> wrote:
> > Hi, can I use a Counter to give each record in all reducers a consecutive
> > number? Currently I am using a single Reducer, but it is an anti-pattern.
> > But I need to assign consecutive numbers to all output records in all
> > reducers, and it does not matter how, as long as each gets its own
> number.
> >
> > If it IS possible, then how are multiple processes accessing those
> counters
> > without creating race conditions.
> >
> > Thank you,
> >
> > Mark
> >
>
>
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
>


Re: What's the easiest way to count the number of pairs in a directory?

2011-05-20 Thread W.P. McNeill
The keys are Text and the values are large custom data structures serialized
with Avro.

I also have counters from the job that generates these files that give me
this information, but sometimes... well, it's a long story.  Suffice it to say
that it's nice to have a post-hoc method too.  :-)

The identity mapper sounds like the way to go.


Re: Can I number output results with a Counter?

2011-05-20 Thread Joey Echeverria
Counters are a way to get status from your running job. They don't
increment a global state. They locally save increments and
periodically report those increments to the central counter. That
means that the final count will be correct, but you can't use them to
coordinate counts while your job is running.

-Joey

On Fri, May 20, 2011 at 10:17 AM, Mark Kerzner  wrote:
> Joey,
>
> You understood me perfectly well. I see your first advice, but I am not
> allowed to have gaps. A central service is something I may consider if
> single reducer becomes a worse bottleneck than it.
>
> But what are counters for? They seem to be exactly that.
>
> Mark
>
> On Fri, May 20, 2011 at 12:01 PM, Joey Echeverria  wrote:
>
>> To make sure I understand you correctly, you need a globally unique
>> one up counter for each output record?
>>
>> If you had an upper bound on the number of records a single reducer
>> could output and you can afford to have gaps, you could just use the
>> task id and multiply that by the max number of records and then one up
>> from there.
>>
>> If that doesn't work for you, then you'll need to use some kind of
>> central service for allocating numbers which could become a
>> bottleneck.
>>
>> -Joey
>>
>> On Fri, May 20, 2011 at 9:55 AM, Mark Kerzner 
>> wrote:
>> > Hi, can I use a Counter to give each record in all reducers a consecutive
>> > number? Currently I am using a single Reducer, but it is an anti-pattern.
>> > But I need to assign consecutive numbers to all output records in all
>> > reducers, and it does not matter how, as long as each gets its own
>> number.
>> >
>> > If it IS possible, then how are multiple processes accessing those
>> counters
>> > without creating race conditions.
>> >
>> > Thank you,
>> >
>> > Mark
>> >
>>
>>
>>
>> --
>> Joseph Echeverria
>> Cloudera, Inc.
>> 443.305.9434
>>
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: What's the easiest way to count the number of pairs in a directory?

2011-05-20 Thread Joey Echeverria
Are you storing the data in sequence files?

-Joey

On Fri, May 20, 2011 at 10:33 AM, W.P. McNeill  wrote:
> The keys are Text and the values are large custom data structures serialized
> with Avro.
>
> I also have counters for the job that generates these files that gives me
> this information but sometimes...Well, it's a long story.  Suffice to say
> that it's nice to have a post-hoc method too.  :-)
>
> The identity mapper sounds like the way to go.
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: What's the easiest way to count the number of pairs in a directory?

2011-05-20 Thread W.P. McNeill
No.


Re: Can I number output results with a Counter?

2011-05-20 Thread Kai Voigt
Also, with speculative execution enabled, you might see a higher count than you
expect while the same task is running multiple times in parallel. When a task
gets killed because another instance was quicker, its counters will be
removed from the global count, though.

Kai

Am 20.05.2011 um 19:34 schrieb Joey Echeverria:

> Counters are a way to get status from your running job. They don't
> increment a global state. They locally save increments and
> periodically report those increments to the central counter. That
> means that the final count will be correct, but you can't use them to
> coordinate counts while your job is running.
> 
> -Joey
> 
> On Fri, May 20, 2011 at 10:17 AM, Mark Kerzner  wrote:
>> Joey,
>> 
>> You understood me perfectly well. I see your first advice, but I am not
>> allowed to have gaps. A central service is something I may consider if
>> single reducer becomes a worse bottleneck than it.
>> 
>> But what are counters for? They seem to be exactly that.
>> 
>> Mark
>> 
>> On Fri, May 20, 2011 at 12:01 PM, Joey Echeverria  wrote:
>> 
>>> To make sure I understand you correctly, you need a globally unique
>>> one up counter for each output record?
>>> 
>>> If you had an upper bound on the number of records a single reducer
>>> could output and you can afford to have gaps, you could just use the
>>> task id and multiply that by the max number of records and then one up
>>> from there.
>>> 
>>> If that doesn't work for you, then you'll need to use some kind of
>>> central service for allocating numbers which could become a
>>> bottleneck.
>>> 
>>> -Joey
>>> 
>>> On Fri, May 20, 2011 at 9:55 AM, Mark Kerzner 
>>> wrote:
 Hi, can I use a Counter to give each record in all reducers a consecutive
 number? Currently I am using a single Reducer, but it is an anti-pattern.
 But I need to assign consecutive numbers to all output records in all
 reducers, and it does not matter how, as long as each gets its own
>>> number.
 
 If it IS possible, then how are multiple processes accessing those
>>> counters
 without creating race conditions.
 
 Thank you,
 
 Mark
 
>>> 
>>> 
>>> 
>>> --
>>> Joseph Echeverria
>>> Cloudera, Inc.
>>> 443.305.9434
>>> 
>> 
> 
> 
> 
> -- 
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
> 

-- 
Kai Voigt
k...@123.org






Re: Can I number output results with a Counter?

2011-05-20 Thread Mark Kerzner
Thank you, Kai and Joey, for the explanation. That's what I thought about
them, but I did not want to miss a "magical" replacement for a central
service hidden in the counters. No, there is no magic, just reality.

Mark

On Fri, May 20, 2011 at 12:39 PM, Kai Voigt  wrote:

> Also, with speculative execution enabled, you might see a higher count as
> you expect while the same task is running multiple times in parallel. When a
> task gets killed because another instance was quicker, those counters will
> be removed from the global count though.
>
> Kai
>
> Am 20.05.2011 um 19:34 schrieb Joey Echeverria:
>
> > Counters are a way to get status from your running job. They don't
> > increment a global state. They locally save increments and
> > periodically report those increments to the central counter. That
> > means that the final count will be correct, but you can't use them to
> > coordinate counts while your job is running.
> >
> > -Joey
> >
> > On Fri, May 20, 2011 at 10:17 AM, Mark Kerzner 
> wrote:
> >> Joey,
> >>
> >> You understood me perfectly well. I see your first advice, but I am not
> >> allowed to have gaps. A central service is something I may consider if
> >> single reducer becomes a worse bottleneck than it.
> >>
> >> But what are counters for? They seem to be exactly that.
> >>
> >> Mark
> >>
> >> On Fri, May 20, 2011 at 12:01 PM, Joey Echeverria 
> wrote:
> >>
> >>> To make sure I understand you correctly, you need a globally unique
> >>> one up counter for each output record?
> >>>
> >>> If you had an upper bound on the number of records a single reducer
> >>> could output and you can afford to have gaps, you could just use the
> >>> task id and multiply that by the max number of records and then one up
> >>> from there.
> >>>
> >>> If that doesn't work for you, then you'll need to use some kind of
> >>> central service for allocating numbers which could become a
> >>> bottleneck.
> >>>
> >>> -Joey
> >>>
> >>> On Fri, May 20, 2011 at 9:55 AM, Mark Kerzner 
> >>> wrote:
>  Hi, can I use a Counter to give each record in all reducers a
> consecutive
>  number? Currently I am using a single Reducer, but it is an
> anti-pattern.
>  But I need to assign consecutive numbers to all output records in all
>  reducers, and it does not matter how, as long as each gets its own
> >>> number.
> 
>  If it IS possible, then how are multiple processes accessing those
> >>> counters
>  without creating race conditions.
> 
>  Thank you,
> 
>  Mark
> 
> >>>
> >>>
> >>>
> >>> --
> >>> Joseph Echeverria
> >>> Cloudera, Inc.
> >>> 443.305.9434
> >>>
> >>
> >
> >
> >
> > --
> > Joseph Echeverria
> > Cloudera, Inc.
> > 443.305.9434
> >
>
> --
> Kai Voigt
> k...@123.org
>
>
>
>
>


Problem: Unknown scheme hdfs. It should correspond to a JournalType enumeration value

2011-05-20 Thread Eduardo Dario Ricci
Hi people,

I'm getting started with Hadoop Common and ran into a problem while trying to
set up a cluster.

I'm following the steps on this page:
http://hadoop.apache.org/common/docs/r0.21.0/cluster_setup.html

I did everything, but when I try to format HDFS, the error below happens.

I searched for something to help me, but didn't find anything.

If someone could help me, I would be thankful.



Re-format filesystem in /fontes/cluster/namedir ? (Y or N) Y
11/05/20 16:41:40 ERROR namenode.NameNode: java.io.IOException: Unknown
scheme hdfs. It should correspond to a JournalType enumeration value
at
org.apache.hadoop.hdfs.server.namenode.FSImage.checkSchemeConsistency(FSImage.java:269)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.setStorageDirectories(FSImage.java:222)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.<init>(FSImage.java:178)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1240)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1348)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368)

11/05/20 16:41:40 INFO namenode.NameNode: SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down NameNode at hadoop/192.168.217.134
/




-- 
 
 Eduardo Dario Ricci
   Cel: 14-81354813
 MSN: thenigma...@hotmail.com


Re: Problem: Unknown scheme hdfs. It should correspond to a JournalType enumeration value

2011-05-20 Thread Todd Lipcon
Hi Eduardo,

Sounds like you've configured your dfs.name.dir entries to point at HDFS
(an hdfs:// URI) instead of at local file paths.
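
For example (a sketch; the path below just reuses the one from your log output),
dfs.name.dir in hdfs-site.xml should be a plain local path:

  <property>
    <name>dfs.name.dir</name>
    <value>/fontes/cluster/namedir</value>
  </property>

The hdfs:// URI belongs in fs.default.name in core-site.xml instead,
e.g. hdfs://hadoop:9000.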

-Todd

On Fri, May 20, 2011 at 2:20 PM, Eduardo Dario Ricci  wrote:
> Hy People
>
> I'm starting in hadoop commom.. and got some problem to try using a
> cluster..
>
> I'm following the steps of this page:
> http://hadoop.apache.org/common/docs/r0.21.0/cluster_setup.html
>
> I done everything, but when I will format the HDFS, this error happens:
>
> I searched for something to help-me, but didn't find nothing.
>
>
>  If some guy could help-me, I will be thankfull.
>
>
>
> Re-format filesystem in /fontes/cluster/namedir ? (Y or N) Y
> 11/05/20 16:41:40 ERROR namenode.NameNode: java.io.IOException: Unknown
> scheme hdfs. It should correspond to a JournalType enumeration value
>        at
> org.apache.hadoop.hdfs.server.namenode.FSImage.checkSchemeConsistency(FSImage.java:269)
>        at
> org.apache.hadoop.hdfs.server.namenode.FSImage.setStorageDirectories(FSImage.java:222)
>        at
> org.apache.hadoop.hdfs.server.namenode.FSImage.<init>(FSImage.java:178)
>        at
> org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1240)
>        at
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1348)
>        at
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368)
>
> 11/05/20 16:41:40 INFO namenode.NameNode: SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NameNode at hadoop/192.168.217.134
> /
>
>
>
>
> --
>  
>             Eduardo Dario Ricci
>               Cel: 14-81354813
>     MSN: thenigma...@hotmail.com
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Using df instead of du to calculate datanode space

2011-05-20 Thread Joe Stein
I came up with a nice little hack to trick hadoop into calculating disk
usage with df instead of du

http://allthingshadoop.com/2011/05/20/faster-datanodes-with-less-wait-io-using-df-instead-of-du/

I am running this in production; it works like a charm and I'm already
seeing the benefit, woot!

I hope it works well for others too.

/*
Joe Stein
http://www.twitter.com/allthingshadoop
*/


Re: Applications creates bigger output than input?

2011-05-20 Thread elton sky
Thanks Robert, Niels.

Yes, I think text manipulation, especially n-grams, is a good application for
me.
Cheers

On Fri, May 20, 2011 at 12:57 AM, Robert Evans  wrote:

> I'm not sure if this has been mentioned or not but in Machine Learning with
> text based documents, the first stage is often a glorified word count
> action.  Except much of the time they will do N-Gram.  So
>
> Map Input:
> "Hello this is a test"
>
> Map Output:
> "Hello"
> "This"
> "is"
> "a"
> "test"
> "Hello" "this"
> "this" "is"
> "is" "a"
> "a" "test"
> ...
>
>
> You may also be extracting all kinds of other features form the text, but
> the tokenization/n-gram is not that CPU intensive.
>
> --Bobby Evans
>
> On 5/19/11 3:06 AM, "elton sky"  wrote:
>
> Hello,
> I pick up this topic again, because what I am looking for is something not
> CPU bound. Augmenting data for ETL and generating index are good examples.
> Neither of them requires too much cpu time on map side. The main bottle
> neck
> for them is shuffle and merge.
>
> Market basket analysis is cpu intensive in map phase, for sampling all
> possible combinations of items.
>
> I am still looking for more applications, which creates bigger output and
> not CPU bound.
> Any further idea? I appreciate.
>
>
> On Tue, May 3, 2011 at 3:10 AM, Steve Loughran  wrote:
>
> > On 30/04/2011 05:31, elton sky wrote:
> >
> >> Thank you for suggestions:
> >>
> >> Weblog analysis, market basket analysis and generating search index.
> >>
> >> I guess for these applications we need more reduces than maps, for
> >> handling
> >> large intermediate output, isn't it. Besides, the input split for map
> >> should
> >> be smaller than usual,  because the workload for spill and merge on
> map's
> >> local disk is heavy.
> >>
> >
> > any form of rendering can generate very large images
> >
> > see: http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf
> >
> >
> >
>
>
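
(A minimal sketch of the unigram + bigram emission Bobby describes, using the
new mapreduce API; whitespace tokenization is a simplification:)

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NGramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text gram = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] tokens = value.toString().split("\\s+");
    for (int i = 0; i < tokens.length; i++) {
      if (tokens[i].isEmpty()) continue;
      gram.set(tokens[i]);                          // unigram
      context.write(gram, ONE);
      if (i + 1 < tokens.length && !tokens[i + 1].isEmpty()) {
        gram.set(tokens[i] + " " + tokens[i + 1]);  // bigram: output grows beyond the input
        context.write(gram, ONE);
      }
    }
  }
}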


How to see block information on NameNode ?

2011-05-20 Thread praveenesh kumar
Hey!

I have a question.
If I copy a file onto the HDFS file system, it gets split into blocks and
the NameNode keeps all of that metadata.
How can I see that info?
I copied a 5 GB file on the NameNode, but I see that file only on the NameNode;
it does not seem to get split into blocks.
How can I see whether my file is getting split into blocks and which
DataNode is keeping which block?

Thanks,
Praveenesh
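
(For what it's worth, a quick way to inspect block placement from the command
line - assuming the stock HDFS tools, with /path/to/file as a placeholder - is
fsck:

  hadoop fsck /path/to/file -files -blocks -locations

which lists each block of the file and the DataNodes holding its replicas.)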