RE: Is there an additional overhead when storing data in HDFS?

2012-11-21 Thread WangRamon
Thanks guys, great job.
 From: donta...@gmail.com
Date: Wed, 21 Nov 2012 13:23:08 +0530
Subject: Re: Is there an additional overhead when storing data in HDFS?
To: user@hadoop.apache.org

Hello Ramon,
 Why don't you go through this link once:
http://www.aosabook.org/en/hdfs.html
Suresh and guys have explained everything beautifully.


HTH
Regards,
Mohammad Tariq



On Wed, Nov 21, 2012 at 12:58 PM, Suresh Srinivas  
wrote:


Namenode will have a trivial amount of data stored in the journal/fsimage. 

On Tue, Nov 20, 2012 at 11:21 PM, WangRamon  wrote:






Thanks. Besides the checksum data, is there anything else? Data in the name node?
 
Date: Tue, 20 Nov 2012 23:14:06 -0800
Subject: Re: Is there an additional overhead when storing data in HDFS?



From: sur...@hortonworks.com
To: user@hadoop.apache.org

HDFS uses 4GB for the file + checksum data.



The default is that for every 512 bytes of data, 4 bytes of checksum are stored. In this 
case that is an additional 32MB of data.
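
As a rough sketch of that arithmetic, assuming the 2GB file and replication factor 2 
from the original question:

  2GB * (4 bytes of checksum / 512 bytes of data) = 16MB of checksum per replica
  2 replicas * (2GB + 16MB)                       = 4GB + 32MB of raw storage in total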

On Tue, Nov 20, 2012 at 11:00 PM, WangRamon  wrote:







Hi All
 
I'm wondering if there is any additional overhead when storing some data in 
HDFS. For example, I have a 2GB file and the replication factor of HDFS is 2. When 
the file is uploaded to HDFS, should HDFS use 4GB to store it, or more than 4GB? 
If it takes more than 4GB of space, why?




 
Thanks
Ramon 
  


-- 
 http://hortonworks.com/download/


  


-- 
 http://hortonworks.com/download/



  

Re: reducer not starting

2012-11-21 Thread praveenesh kumar
Sometimes it's a network issue: reducers are not able to find the hostnames or IPs
of the other machines. Make sure your /etc/hosts entries and hostnames are
correct.
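
As an illustration only (the hostnames and IPs below are made-up placeholders), each 
node's /etc/hosts would list every machine in the cluster, and a node's own hostname 
should not resolve only to 127.0.0.1:

192.168.1.10   master.example.com   master
192.168.1.11   slave1.example.com   slave1
192.168.1.12   slave2.example.com   slave2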

Regards,
Praveenesh

On Tue, Nov 20, 2012 at 10:46 PM, Harsh J  wrote:

> Your mappers are failing (possibly a user-side error or an
> environmental one) and are being reattempted by the framework (default
> behavior, attempts 4 times to avoid transient failure scenario).
>
> Visit your job's logs in the JobTracker web UI, to find more
> information on why your tasks fail.
>
> On Tue, Nov 20, 2012 at 10:22 PM, jamal sasha 
> wrote:
> >
> >
> >
> > I am not sure whats happening, but I wrote a simple mapper and reducer
> > script.
> >
> >
> >
> > And I am testing it against a small dataset (like few lines long).
> >
> >
> >
> > For some reason reducer is just not starting.. and mapper is executing
> again
> > and again?
> >
> >
> >
> > 12/11/20 09:21:18 INFO streaming.StreamJob:  map 0%  reduce 0%
> >
> > 12/11/20 09:22:05 INFO streaming.StreamJob:  map 50%  reduce 0%
> >
> > 12/11/20 09:22:10 INFO streaming.StreamJob:  map 100%  reduce 0%
> >
> > 12/11/20 09:32:05 INFO streaming.StreamJob:  map 50%  reduce 0%
> >
> > 12/11/20 09:32:11 INFO streaming.StreamJob:  map 0%  reduce 0%
> >
> > 12/11/20 09:32:20 INFO streaming.StreamJob:  map 50%  reduce 0%
> >
> > 12/11/20 09:32:31 INFO streaming.StreamJob:  map 100%  reduce 0%
> >
> > 12/11/20 09:42:20 INFO streaming.StreamJob:  map 50%  reduce 0%
> >
> > 12/11/20 09:42:31 INFO streaming.StreamJob:  map 0%  reduce 0%
> >
> > 12/11/20 09:42:32 INFO streaming.StreamJob:  map 50%  reduce 0%
> >
> > 12/11/20 09:42:50 INFO streaming.StreamJob:  map 100%  reduce 0%
> >
> >
> >
> >
> >
> > Let me know if you want the code also.
> >
> > Any clues of where I am going wrong?
> >
> > Thanks
> >
> >
> >
> >
> >
> >
>
>
>
> --
> Harsh J
>


Re: Hadoop Web Interface Security

2012-11-21 Thread Harsh J
Yes, see http://hadoop.apache.org/docs/current/hadoop-auth/Configuration.html
and also see http://hadoop.apache.org/docs/stable/HttpAuthentication.html

On Wed, Nov 21, 2012 at 3:34 PM, Visioner Sadak
 wrote:
> Hi as we knw that by using hadoop's web UI  at   http://namenode-ip/50070
> anyone can access the hdfs details can we secure it only to certain
> authorized users and not publicly to all.. in production



-- 
Harsh J


Re: reducer not starting

2012-11-21 Thread Jean-Marc Spaggiari
Just FYI, you don't need to stop the job, update the host, and retry.

Just update the host while the job is running and it should retry and restart.

I had a similar issue with one of my nodes where the hosts file was
not updated. After the update it automatically resumed the work...

JM

2012/11/21, praveenesh kumar :
> Sometimes its network issue, reducers are not able to find hostnames or IPs
> of the other machines. Make sure your /etc/hosts entries and hostnames are
> correct.
>
> Regards,
> Praveenesh
>
> On Tue, Nov 20, 2012 at 10:46 PM, Harsh J  wrote:
>
>> Your mappers are failing (possibly a user-side error or an
>> environmental one) and are being reattempted by the framework (default
>> behavior, attempts 4 times to avoid transient failure scenario).
>>
>> Visit your job's logs in the JobTracker web UI, to find more
>> information on why your tasks fail.
>>
>> On Tue, Nov 20, 2012 at 10:22 PM, jamal sasha 
>> wrote:
>> >
>> >
>> >
>> > I am not sure whats happening, but I wrote a simple mapper and reducer
>> > script.
>> >
>> >
>> >
>> > And I am testing it against a small dataset (like few lines long).
>> >
>> >
>> >
>> > For some reason reducer is just not starting.. and mapper is executing
>> again
>> > and again?
>> >
>> >
>> >
>> > 12/11/20 09:21:18 INFO streaming.StreamJob:  map 0%  reduce 0%
>> >
>> > 12/11/20 09:22:05 INFO streaming.StreamJob:  map 50%  reduce 0%
>> >
>> > 12/11/20 09:22:10 INFO streaming.StreamJob:  map 100%  reduce 0%
>> >
>> > 12/11/20 09:32:05 INFO streaming.StreamJob:  map 50%  reduce 0%
>> >
>> > 12/11/20 09:32:11 INFO streaming.StreamJob:  map 0%  reduce 0%
>> >
>> > 12/11/20 09:32:20 INFO streaming.StreamJob:  map 50%  reduce 0%
>> >
>> > 12/11/20 09:32:31 INFO streaming.StreamJob:  map 100%  reduce 0%
>> >
>> > 12/11/20 09:42:20 INFO streaming.StreamJob:  map 50%  reduce 0%
>> >
>> > 12/11/20 09:42:31 INFO streaming.StreamJob:  map 0%  reduce 0%
>> >
>> > 12/11/20 09:42:32 INFO streaming.StreamJob:  map 50%  reduce 0%
>> >
>> > 12/11/20 09:42:50 INFO streaming.StreamJob:  map 100%  reduce 0%
>> >
>> >
>> >
>> >
>> >
>> > Let me know if you want the code also.
>> >
>> > Any clues of where I am going wrong?
>> >
>> > Thanks
>> >
>> >
>> >
>> >
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>


Not able to change the priority of job using fair scheduler

2012-11-21 Thread Chunky Gupta
Hi,

I have enabled the fair scheduler and everything is set to default with
only a few configuration changes. It is working fine and multiple users can
run queries simultaneously.
But I am not able to change the priority from "http:///scheduler".
The Priority column is coming as simple text, not as a drop-down column.
Same thing with the Pool column. Which configuration have I missed for this?

http://hadoop.apache.org/docs/r0.20.2/fair_scheduler.html

Above link doesn't say anything about this. Please help me.

Thanks,
Chunky.


Re: When speculative execution is true, there is a data loss issue with multpleoutputs

2012-11-21 Thread Radim Kolar
It's not data loss; the problem is that MultipleOutputs does not work 
with the standard committer if you do not write into a subdirectory of the main 
job output.


Re: reducer not starting

2012-11-21 Thread jamal sasha
Hi
  Thanks for the insights.
I noticed that these restarts of the mappers were because in the shebang I had
/usr/bin/env instead of /usr/bin/env python.
Any clue what was going on with the reducers not starting but the mappers being
executed again and again?
Probably a very naive question, but I am a newbie you see :)


On Wednesday, November 21, 2012, Jean-Marc Spaggiari <
jean-m...@spaggiari.org> wrote:
> Just FYI, you don't need to stop the job, update the host, and retry.
>
> Just update the host while the job is running and it should retry and
restart.
>
> I had a similar issue with one of my node where the hosts file were
> not updated. After the updated it has automatically resume the work...
>
> JM
>
> 2012/11/21, praveenesh kumar :
>> Sometimes its network issue, reducers are not able to find hostnames or
IPs
>> of the other machines. Make sure your /etc/hosts entries and hostnames
are
>> correct.
>>
>> Regards,
>> Praveenesh
>>
>> On Tue, Nov 20, 2012 at 10:46 PM, Harsh J  wrote:
>>
>>> Your mappers are failing (possibly a user-side error or an
>>> environmental one) and are being reattempted by the framework (default
>>> behavior, attempts 4 times to avoid transient failure scenario).
>>>
>>> Visit your job's logs in the JobTracker web UI, to find more
>>> information on why your tasks fail.
>>>
>>> On Tue, Nov 20, 2012 at 10:22 PM, jamal sasha 
>>> wrote:
>>> >
>>> >
>>> >
>>> > I am not sure whats happening, but I wrote a simple mapper and reducer
>>> > script.
>>> >
>>> >
>>> >
>>> > And I am testing it against a small dataset (like few lines long).
>>> >
>>> >
>>> >
>>> > For some reason reducer is just not starting.. and mapper is executing
>>> again
>>> > and again?
>>> >
>>> >
>>> >
>>> > 12/11/20 09:21:18 INFO streaming.StreamJob:  map 0%  reduce 0%
>>> >
>>> > 12/11/20 09:22:05 INFO streaming.StreamJob:  map 50%  reduce 0%
>>> >
>>> > 12/11/20 09:22:10 INFO streaming.StreamJob:  map 100%  reduce 0%
>>> >
>>> > 12/11/20 09:32:05 INFO streaming.StreamJob:  map 50%  reduce 0%
>>> >
>>> > 12/11/20 09:32:11 INFO streaming.StreamJob:  map 0%  reduce 0%
>>> >
>>> > 12/11/20 09:32:20 INFO streaming.StreamJob:  map 50%  reduce 0%
>>> >
>>> > 12/11/20 09:32:31 INFO streaming.StreamJob:  map 100%  reduce 0%
>>> >
>>> > 12/11/20 09:42:20 INFO streaming.StreamJob:  map 50%  reduce 0%
>>> >
>>> > 12/11/20 09:42:31 INFO streaming.StreamJob:  map 0%  reduce 0%
>>> >
>>> > 12/11/20 09:42:32 INFO streaming.StreamJob:  map 50%  reduce 0%
>>> >
>>> > 12/11/20 09:42:50 INFO streaming.StreamJob:  map 100%  reduce 0%
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > Let me know if you want the code also.
>>> >
>>> > Any clues of where I am going wrong?
>>> >
>>> > Thanks
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>


Re: reducer not starting

2012-11-21 Thread bharath vissapragada
As Harsh suggested, you might want to check the task logs on the slaves (you
can do it through the web UI by clicking on the map/reduce task links) and see if
there are any exceptions.


On Wed, Nov 21, 2012 at 8:06 PM, jamal sasha  wrote:

> Hi
>   Thanks for the insights.
> I noticed that these restarts of mappers were because in the shebang i had
> Usr/env/bin instead of usr/env/bin python
> Any clue of what was going on with reducers not starting but mappers being
> executed again and again.
> Probably a very naive question but i am newbie you see :)
>
>
>
> On Wednesday, November 21, 2012, Jean-Marc Spaggiari <
> jean-m...@spaggiari.org> wrote:
> > Just FYI, you don't need to stop the job, update the host, and retry.
> >
> > Just update the host while the job is running and it should retry and
> restart.
> >
> > I had a similar issue with one of my node where the hosts file were
> > not updated. After the updated it has automatically resume the work...
> >
> > JM
> >
> > 2012/11/21, praveenesh kumar :
> >> Sometimes its network issue, reducers are not able to find hostnames or
> IPs
> >> of the other machines. Make sure your /etc/hosts entries and hostnames
> are
> >> correct.
> >>
> >> Regards,
> >> Praveenesh
> >>
> >> On Tue, Nov 20, 2012 at 10:46 PM, Harsh J  wrote:
> >>
> >>> Your mappers are failing (possibly a user-side error or an
> >>> environmental one) and are being reattempted by the framework (default
> >>> behavior, attempts 4 times to avoid transient failure scenario).
> >>>
> >>> Visit your job's logs in the JobTracker web UI, to find more
> >>> information on why your tasks fail.
> >>>
> >>> On Tue, Nov 20, 2012 at 10:22 PM, jamal sasha 
> >>> wrote:
> >>> >
> >>> >
> >>> >
> >>> > I am not sure whats happening, but I wrote a simple mapper and
> reducer
> >>> > script.
> >>> >
> >>> >
> >>> >
> >>> > And I am testing it against a small dataset (like few lines long).
> >>> >
> >>> >
> >>> >
> >>> > For some reason reducer is just not starting.. and mapper is
> executing
> >>> again
> >>> > and again?
> >>> >
> >>> >
> >>> >
> >>> > 12/11/20 09:21:18 INFO streaming.StreamJob:  map 0%  reduce 0%
> >>> >
> >>> > 12/11/20 09:22:05 INFO streaming.StreamJob:  map 50%  reduce 0%
> >>> >
> >>> > 12/11/20 09:22:10 INFO streaming.StreamJob:  map 100%  reduce 0%
> >>> >
> >>> > 12/11/20 09:32:05 INFO streaming.StreamJob:  map 50%  reduce 0%
> >>> >
> >>> > 12/11/20 09:32:11 INFO streaming.StreamJob:  map 0%  reduce 0%
> >>> >
> >>> > 12/11/20 09:32:20 INFO streaming.StreamJob:  map 50%  reduce 0%
> >>> >
> >>> > 12/11/20 09:32:31 INFO streaming.StreamJob:  map 100%  reduce 0%
> >>> >
> >>> > 12/11/20 09:42:20 INFO streaming.StreamJob:  map 50%  reduce 0%
> >>> >
> >>> > 12/11/20 09:42:31 INFO streaming.StreamJob:  map 0%  reduce 0%
> >>> >
> >>> > 12/11/20 09:42:32 INFO streaming.StreamJob:  map 50%  reduce 0%
> >>> >
> >>> > 12/11/20 09:42:50 INFO streaming.StreamJob:  map 100%  reduce 0%
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > Let me know if you want the code also.
> >>> >
> >>> > Any clues of where I am going wrong?
> >>> >
> >>> > Thanks
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Harsh J
> >>>
> >>
> >
>



-- 
Regards,
Bharath .V
w:http://researchweb.iiit.ac.in/~bharath.v


Re: When speculative execution is true, there is a data loss issue with multpleoutputs

2012-11-21 Thread AnilKumar B
Thanks Radim.

Yes, as you said, we are not writing into a sub-directory of the main job output. I will
try making them sub-directories of the output dir.

But one question: when I turn off speculative execution it works
fine with the same multiple-output directory structure. May I know how exactly
it works in this case?

When we change the speculative execution flag, why exactly is there a
difference in the output data?

Thanks,
B Anil Kumar.



On Wed, Nov 21, 2012 at 8:01 PM, Radim Kolar  wrote:

> its not data loss, problem is caused that multipleoutputs do not work with
> standard committer if you do not write into subdirectory of main job output.
>


Re: When speculative execution is true, there is a data loss issue with multpleoutputs

2012-11-21 Thread Radim Kolar

On 21.11.2012 16:07, AnilKumar B wrote:

Thanks Radim.

Yes, as you said we are not writing into sub-directory of main job. I 
will try by making them as sub-directories of output dir.


But one question, when I turn of speculative execution then it is 
working fine with same multiple output directory structure. May I 
know, how exactly it working in this case?


When we change the speculative execution flag, why exactly there is a 
difference in output data?
Because if you are not using MultipleOutputs you are not writing to 
the real file, but to a file with a name generated from its task attempt in a tmp 
subdirectory. Those files do not overwrite each other. In HDFS you can have 
only one writer per file.
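
As a rough sketch of that layout (the job and attempt IDs below are only placeholders), 
the standard committer has each attempt write under an attempt-specific temporary path 
and only promotes the committed attempt into the job output:

<output dir>/_temporary/_attempt_201211211408_0001_r_000000_0/part-r-00000   (attempt 0)
<output dir>/_temporary/_attempt_201211211408_0001_r_000000_1/part-r-00000   (speculative attempt 1)
<output dir>/part-r-00000                                                    (after commit)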


Re: When speculative execution is true, there is a data loss issue with multpleoutputs

2012-11-21 Thread Radim Kolar
This is another problem with the FileOutputFormat committer; it is related to 
yours.


https://issues.apache.org/jira/browse/MAPREDUCE-3772

It works like this: if the MultipleOutputs path is relative to the job output, then 
there is a workaround to make it work with the committer, and the outputs from 
multiple tasks do not clash with each other. The problem mentioned in the ticket 
cheats the relative-vs-absolute output path detection, and all output is 
lost on task commit.


But if the output is an absolute path, then it is written directly to the output file, 
which fails because writers from multiple attempts clash with each other.
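
A minimal sketch of the working, relative-path case (new-API MultipleOutputs; the class 
name and the "bytype" subdirectory are only illustrative). The base output path stays 
relative, so it lands in a subdirectory of the job output and the committer can keep 
speculative attempts apart:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitReducer extends Reducer<Text, Text, Text, Text> {

  private MultipleOutputs<Text, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, Text>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // Relative base path => written under the job output directory,
      // so the committer keeps speculative attempts apart until commit.
      mos.write(key, value, "bytype/" + key.toString() + "/part");
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
  }
}

An absolute base path (starting with "/") bypasses that per-attempt temporary 
directory, which is where the clash described above comes from.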


io.file.buffer.size

2012-11-21 Thread Kartashov, Andy
Guys,

I've read that increasing the above (default 4KB) value to, say, 128KB, might speed 
things up.

My input is 40 million serialised records coming from an RDBMS, and I noticed that with 
the increased I/O setting my job actually runs a tiny bit slower. Is that possible?

P.S. I've got two questions:
1. During a Sqoop import I see that two additional files are generated in the 
HDFS folder, namely
.../_log/history/...conf.xml
.../_log/history/...sqoop_generated_class.jar
Is there a way to redirect these files to a different directory? I cannot find 
an answer.

2. I run multiple reducers and each generates its own output. If I were to merge 
all the output, would running either of the below commands be recommended?

hadoop dfs -getmerge  
or
hadoop dfs -cat output/* > output_All
hadoop dfs -get output_All 

Thanks,
AK




guessing number of reducers.

2012-11-21 Thread jamal sasha
By default the number of reducers is set to 1.
Is there a good way to guess the optimal number of reducers?
Or let's say I have TBs worth of data... mappers are of the order of 5000 or so...
But ultimately I am calculating, let's say, some average over the whole data...
say the average transaction occurring...
Now the output will be just one line in one "part"... the rest of them will be
empty. So I am guessing I need loads of reducers, but then most of them will
be empty, yet at the same time one reducer won't suffice.
What's the best way to solve this?
How do I guess the optimal number of reducers?
Thanks


RE: guessing number of reducers.

2012-11-21 Thread Kartashov, Andy
Jamal,

This is what I am using...

After you start your job, visit the JobTracker's web UI (port 50030)
and look for the Cluster Summary. Reduce Task Capacity should hint at what to 
optimally set your number to. I could be wrong but it works for me. :)
Cluster Summary (Heap Size is *** MB/966.69 MB), whose columns are:
Running Map Tasks | Running Reduce Tasks | Total Submissions | Nodes |
Occupied Map Slots | Occupied Reduce Slots | Reserved Map Slots | Reserved Reduce Slots |
Map Task Capacity | Reduce Task Capacity | Avg. Tasks/Node | Blacklisted Nodes | Excluded Nodes



Rgds,
AK47

From: jamal sasha [mailto:jamalsha...@gmail.com]
Sent: Wednesday, November 21, 2012 11:39 AM
To: user@hadoop.apache.org
Subject: guessing number of reducers.

By default the number of reducers is set to 1..
Is there a good way to guess optimal number of reducers
Or let's say i have tbs worth of data... mappers are of order 5000 or so...
But ultimately i am calculating , let's say, some average of whole data... say 
average transaction occurring...
Now the output will be just one line in one "part"... rest of them will be 
empty.So i am guessing i need loads of reducers but then most of them will be 
empty but at the same time one reducer won't suffice..
What's the best way to solve this..
How to guess optimal number of reducers..
Thanks


Re: guessing number of reducers.

2012-11-21 Thread Bejoy KS
Hi Sasha

In general the number of reduce tasks is chosen mainly based on the data volume 
to reduce phase. In tools like hive and pig by default for every 1GB of map 
output there will be a reducer. So if you have 100 gigs of map output then 100 
reducers.
If your tasks are more CPU intensive then you need lesser volume of data per 
reducer for better performance results. 

In general it is better to have the number of reduce tasks slightly less than 
the number of available reduce slots in the cluster.


Regards
Bejoy KS

Sent from handheld, please excuse typos.

-Original Message-
From: jamal sasha 
Date: Wed, 21 Nov 2012 11:38:38 
To: user@hadoop.apache.org
Reply-To: user@hadoop.apache.org
Subject: guessing number of reducers.

By default the number of reducers is set to 1..
Is there a good way to guess optimal number of reducers
Or let's say i have tbs worth of data... mappers are of order 5000 or so...
But ultimately i am calculating , let's say, some average of whole data...
say average transaction occurring...
Now the output will be just one line in one "part"... rest of them will be
empty.So i am guessing i need loads of reducers but then most of them will
be empty but at the same time one reducer won't suffice..
What's the best way to solve this..
How to guess optimal number of reducers..
Thanks



RE: guessing number of reducers.

2012-11-21 Thread Kartashov, Andy
Bejoy,

I've read somewhere about keeping the number of mapred.reduce.tasks below the 
reduce task capacity. Here is what I just tested:

Output 25GB. 8-DN cluster with 16 Map and Reduce Task Capacity:

1 Reducer   - 22 mins
4 Reducers  - 11.5 mins
8 Reducers  - 5 mins
10 Reducers - 7 mins
12 Reducers - 6.5 mins
16 Reducers - 5.5 mins

8 reducers won the race, but reducers at the max capacity were very close. :)

AK47


From: Bejoy KS [mailto:bejoy.had...@gmail.com]
Sent: Wednesday, November 21, 2012 11:51 AM
To: user@hadoop.apache.org
Subject: Re: guessing number of reducers.

Hi Sasha

In general the number of reduce tasks is chosen mainly based on the data volume 
to reduce phase. In tools like hive and pig by default for every 1GB of map 
output there will be a reducer. So if you have 100 gigs of map output then 100 
reducers.
If your tasks are more CPU intensive then you need lesser volume of data per 
reducer for better performance results.

In general it is better to have the number of reduce tasks slightly less than 
the number of available reduce slots in the cluster.
Regards
Bejoy KS

Sent from handheld, please excuse typos.

From: jamal sasha 
Date: Wed, 21 Nov 2012 11:38:38 -0500
To: user@hadoop.apache.org
ReplyTo: user@hadoop.apache.org
Subject: guessing number of reducers.

By default the number of reducers is set to 1..
Is there a good way to guess optimal number of reducers
Or let's say i have tbs worth of data... mappers are of order 5000 or so...
But ultimately i am calculating , let's say, some average of whole data... say 
average transaction occurring...
Now the output will be just one line in one "part"... rest of them will be 
empty.So i am guessing i need loads of reducers but then most of them will be 
empty but at the same time one reducer won't suffice..
What's the best way to solve this..
How to guess optimal number of reducers..
Thanks


Re: guessing number of reducers.

2012-11-21 Thread Manoj Babu
Hi,

How do I set the number of reducers in the job conf dynamically?
For example, some days I am getting 500GB of data on heavy traffic and some
days only 100GB.

Thanks in advance!

Cheers!
Manoj.



On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy wrote:

>  Bejoy,
>
>
>
> I’ve read somethere about keeping number of mapred.reduce.tasks below the
> reduce task capcity. Here is what I just tested:
>
>
>
> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:
>
>
>
> 1 Reducer   – 22mins
>
> 4 Reducers – 11.5mins
>
> 8 Reducers – 5mins
>
> 10 Reducers – 7mins
>
> 12 Reducers – 6:5mins
>
> 16 Reducers – 5.5mins
>
>
>
> 8 Reducers have won the race. But Reducers at the max capacity was very
> clos. J
>
>
>
> AK47
>
>
>
>
>
> *From:* Bejoy KS [mailto:bejoy.had...@gmail.com]
> *Sent:* Wednesday, November 21, 2012 11:51 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: guessing number of reducers.
>
>
>
> Hi Sasha
>
> In general the number of reduce tasks is chosen mainly based on the data
> volume to reduce phase. In tools like hive and pig by default for every 1GB
> of map output there will be a reducer. So if you have 100 gigs of map
> output then 100 reducers.
> If your tasks are more CPU intensive then you need lesser volume of data
> per reducer for better performance results.
>
> In general it is better to have the number of reduce tasks slightly less
> than the number of available reduce slots in the cluster.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>  --
>
> *From: *jamal sasha 
>
> *Date: *Wed, 21 Nov 2012 11:38:38 -0500
>
> *To: *user@hadoop.apache.org
>
> *ReplyTo: *user@hadoop.apache.org
>
> *Subject: *guessing number of reducers.
>
>
>
> By default the number of reducers is set to 1..
> Is there a good way to guess optimal number of reducers
> Or let's say i have tbs worth of data... mappers are of order 5000 or so...
> But ultimately i am calculating , let's say, some average of whole data...
> say average transaction occurring...
> Now the output will be just one line in one "part"... rest of them will be
> empty.So i am guessing i need loads of reducers but then most of them will
> be empty but at the same time one reducer won't suffice..
> What's the best way to solve this..
> How to guess optimal number of reducers..
> Thanks
>


Get the name of node where mapper is running

2012-11-21 Thread Eduard Skaley

Hello guys,

how can I find out on which node a mapper is running?

Thx
Eduard


Re: guessing number of reducers.

2012-11-21 Thread Mohammad Tariq
Hello Jamal,

   I use a different approach based on the number of cores. If you have, say, a
4-core machine then you can have (0.75 * number of cores) MR slots.
For example, if you have 4 physical cores, i.e. 8 virtual cores, then you can
have 0.75*8 = 6 MR slots. You can then set 3M+3R or 4M+2R and so on as per
your requirement.

Regards,
Mohammad Tariq



On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy wrote:

>  Bejoy,
>
>
>
> I’ve read somethere about keeping number of mapred.reduce.tasks below the
> reduce task capcity. Here is what I just tested:
>
>
>
> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:
>
>
>
> 1 Reducer   – 22mins
>
> 4 Reducers – 11.5mins
>
> 8 Reducers – 5mins
>
> 10 Reducers – 7mins
>
> 12 Reducers – 6:5mins
>
> 16 Reducers – 5.5mins
>
>
>
> 8 Reducers have won the race. But Reducers at the max capacity was very
> clos. J
>
>
>
> AK47
>
>
>
>
>
> *From:* Bejoy KS [mailto:bejoy.had...@gmail.com]
> *Sent:* Wednesday, November 21, 2012 11:51 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: guessing number of reducers.
>
>
>
> Hi Sasha
>
> In general the number of reduce tasks is chosen mainly based on the data
> volume to reduce phase. In tools like hive and pig by default for every 1GB
> of map output there will be a reducer. So if you have 100 gigs of map
> output then 100 reducers.
> If your tasks are more CPU intensive then you need lesser volume of data
> per reducer for better performance results.
>
> In general it is better to have the number of reduce tasks slightly less
> than the number of available reduce slots in the cluster.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>  --
>
> *From: *jamal sasha 
>
> *Date: *Wed, 21 Nov 2012 11:38:38 -0500
>
> *To: *user@hadoop.apache.org
>
> *ReplyTo: *user@hadoop.apache.org
>
> *Subject: *guessing number of reducers.
>
>
>
> By default the number of reducers is set to 1..
> Is there a good way to guess optimal number of reducers
> Or let's say i have tbs worth of data... mappers are of order 5000 or so...
> But ultimately i am calculating , let's say, some average of whole data...
> say average transaction occurring...
> Now the output will be just one line in one "part"... rest of them will be
> empty.So i am guessing i need loads of reducers but then most of them will
> be empty but at the same time one reducer won't suffice..
> What's the best way to solve this..
> How to guess optimal number of reducers..
> Thanks
>


Re: Get the name of node where mapper is running

2012-11-21 Thread Kai Voigt
Hello,

the JobTracker has a built-in Web UI (http://hostname_of_jobtracker:50030/) 
where you can get details for all completed and running jobs. For the map 
phase, it will tell you on which physical hosts the tasks were executed.

Kai

Am 21.11.2012 um 19:04 schrieb Eduard Skaley :

> Hallo guys,
> 
> how can i find out on which node a mapper is running ?
> 
> Thx
> Eduard
> 

-- 
Kai Voigt
k...@123.org






Re: Facebook corona compatibility

2012-11-21 Thread Robert Molina
Hi Amit,
There is a mention here to Start in the hadoop-20 parent path :
https://github.com/facebook/hadoop-20/wiki/Corona-Single-Node-Setup

Regards,
Rob

On Mon, Nov 12, 2012 at 8:01 AM, Amit Sela  wrote:

> Hi everyone,
>
> Anyone knows if the new corona tools (Facebook just released as open
> source) are compatible with hadoop 1.0.x ? or just 0.20.x ?
>
> Thanks.
>


Re: guessing number of reducers.

2012-11-21 Thread Bejoy KS
Hi Andy

It is usually so because if you have more reduce tasks than the reduce slots in 
your cluster then a few of the reduce tasks will be in the queue waiting for their 
turn. So it is better to keep the number of reduce tasks slightly less than the 
reduce task capacity so that all reduce tasks run at once in parallel.

But in some cases each reducer can process only certain volume of data due to 
some constraints, like data beyond a certain limit may lead to OOMs. In such 
cases you may need to configure the number of reducers totally based on your 
data and not based on slots.


Regards
Bejoy KS

Sent from handheld, please excuse typos.

-Original Message-
From: "Kartashov, Andy" 
Date: Wed, 21 Nov 2012 17:49:50 
To: user@hadoop.apache.org; 
bejoy.had...@gmail.com
Subject: RE: guessing number of reducers.

Bejoy,

I've read somethere about keeping number of mapred.reduce.tasks below the 
reduce task capcity. Here is what I just tested:

Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:

1 Reducer   - 22mins
4 Reducers - 11.5mins
8 Reducers - 5mins
10 Reducers - 7mins
12 Reducers - 6:5mins
16 Reducers - 5.5mins

8 Reducers have won the race. But Reducers at the max capacity was very clos. :)

AK47


From: Bejoy KS [mailto:bejoy.had...@gmail.com]
Sent: Wednesday, November 21, 2012 11:51 AM
To: user@hadoop.apache.org
Subject: Re: guessing number of reducers.

Hi Sasha

In general the number of reduce tasks is chosen mainly based on the data volume 
to reduce phase. In tools like hive and pig by default for every 1GB of map 
output there will be a reducer. So if you have 100 gigs of map output then 100 
reducers.
If your tasks are more CPU intensive then you need lesser volume of data per 
reducer for better performance results.

In general it is better to have the number of reduce tasks slightly less than 
the number of available reduce slots in the cluster.
Regards
Bejoy KS

Sent from handheld, please excuse typos.

From: jamal sasha 
Date: Wed, 21 Nov 2012 11:38:38 -0500
To: user@hadoop.apache.org
ReplyTo: user@hadoop.apache.org
Subject: guessing number of reducers.

By default the number of reducers is set to 1..
Is there a good way to guess optimal number of reducers
Or let's say i have tbs worth of data... mappers are of order 5000 or so...
But ultimately i am calculating , let's say, some average of whole data... say 
average transaction occurring...
Now the output will be just one line in one "part"... rest of them will be 
empty.So i am guessing i need loads of reducers but then most of them will be 
empty but at the same time one reducer won't suffice..
What's the best way to solve this..
How to guess optimal number of reducers..
Thanks



Re: guessing number of reducers.

2012-11-21 Thread jamal sasha
Thanks for the input guys. This helps a lot
:)

On Wednesday, November 21, 2012, Bejoy KS  wrote:
> Hi Andy
>
> It is usually so because if you have more reduce tasks than the reduce
slots in your cluster then a few of the reduce tasks will be in queue
waiting for its turn. So it is better to keep the num of reduce tasks
slightly less than the reduce task capacity so that all reduce tasks run at
once in parallel.
>
> But in some cases each reducer can process only certain volume of data
due to some constraints, like data beyond a certain limit may lead to OOMs.
In such cases you may need to configure the number of reducers totally
based on your data and not based on slots.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> 
> From: "Kartashov, Andy" 
> Date: Wed, 21 Nov 2012 17:49:50 +
> To: user@hadoop.apache.org; bejoy.had...@gmail.com

> Subject: RE: guessing number of reducers.
>
> Bejoy,
>
>
>
> I’ve read somethere about keeping number of mapred.reduce.tasks below the
reduce task capcity. Here is what I just tested:
>
>
>
> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:
>
>
>
> 1 Reducer   – 22mins
>
> 4 Reducers – 11.5mins
>
> 8 Reducers – 5mins
>
> 10 Reducers – 7mins
>
> 12 Reducers – 6:5mins
>
> 16 Reducers – 5.5mins
>
>
>
> 8 Reducers have won the race. But Reducers at the max capacity was very
clos. J
>
>
>
> AK47
>
>
>
>
>
> From: Bejoy KS [mailto:bejoy.had...@gmail.com]
> Sent: Wednesday, November 21, 2012 11:51 AM
> To: user@hadoop.apache.org
> Subject: Re: guessing number of reducers.
>
>
>
> Hi Sasha
>
> In general the number of reduce tasks is chosen mainly based on the data
volume to reduce phase. In tools like hive and pig by default for every 1GB
of map output there will be a reducer. So if you have 100 gigs of map
output then 100 reducers.
> If your tasks are more CPU intensive then you need lesser volume of data
per reducer for better performance results.
>
> In general it is better to have the number of reduce tasks slightly less
than the number of available reduce slots in the cluster.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> 
>
> From: jamal sasha 
>
> Date: Wed, 21 Nov 2012 11:38:38 -0500
>
> To: user@hadoop.apache.org


Re: guessing number of reducers.

2012-11-21 Thread Bejoy KS
Hi Manoj

If you intend to calculate the number of reducers based on the input size, then 
in your driver class you should get the size of the input dir in HDFS and, say 
you intend to give n bytes to a reducer, then the number of reducers can be 
computed as
total input size / bytes per reducer.

You can round this value and use it to set the number of reducers in the conf 
programmatically.
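
A minimal driver-side sketch of that idea for the 1.x mapreduce API (the input path 
argument and the ~1GB-per-reducer figure are only placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path inputDir = new Path(args[0]);            // job input directory in HDFS

    // Total size of the input directory in bytes.
    long inputBytes = FileSystem.get(conf).getContentSummary(inputDir).getLength();

    long bytesPerReducer = 1024L * 1024 * 1024;   // e.g. roughly 1GB per reducer
    int reducers = (int) Math.max(1, inputBytes / bytesPerReducer);

    Job job = new Job(conf, "my job");
    job.setNumReduceTasks(reducers);
    // ... set mapper/reducer classes and input/output paths, then submit the job.
  }
}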

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-Original Message-
From: Manoj Babu 
Date: Wed, 21 Nov 2012 23:28:00 
To: 
Cc: bejoy.had...@gmail.com
Subject: Re: guessing number of reducers.

Hi,

How to set no of reducers in job conf dynamically?
For example some days i am getting 500GB of data on heavy traffic and some
days 100GB only.

Thanks in advance!

Cheers!
Manoj.



On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy wrote:

>  Bejoy,
>
>
>
> I’ve read somethere about keeping number of mapred.reduce.tasks below the
> reduce task capcity. Here is what I just tested:
>
>
>
> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:
>
>
>
> 1 Reducer   – 22mins
>
> 4 Reducers – 11.5mins
>
> 8 Reducers – 5mins
>
> 10 Reducers – 7mins
>
> 12 Reducers – 6:5mins
>
> 16 Reducers – 5.5mins
>
>
>
> 8 Reducers have won the race. But Reducers at the max capacity was very
> clos. J
>
>
>
> AK47
>
>
>
>
>
> *From:* Bejoy KS [mailto:bejoy.had...@gmail.com]
> *Sent:* Wednesday, November 21, 2012 11:51 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: guessing number of reducers.
>
>
>
> Hi Sasha
>
> In general the number of reduce tasks is chosen mainly based on the data
> volume to reduce phase. In tools like hive and pig by default for every 1GB
> of map output there will be a reducer. So if you have 100 gigs of map
> output then 100 reducers.
> If your tasks are more CPU intensive then you need lesser volume of data
> per reducer for better performance results.
>
> In general it is better to have the number of reduce tasks slightly less
> than the number of available reduce slots in the cluster.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>  --
>
> *From: *jamal sasha 
>
> *Date: *Wed, 21 Nov 2012 11:38:38 -0500
>
> *To: *user@hadoop.apache.org
>
> *ReplyTo: *user@hadoop.apache.org
>
> *Subject: *guessing number of reducers.
>
>
>
> By default the number of reducers is set to 1..
> Is there a good way to guess optimal number of reducers
> Or let's say i have tbs worth of data... mappers are of order 5000 or so...
> But ultimately i am calculating , let's say, some average of whole data...
> say average transaction occurring...
> Now the output will be just one line in one "part"... rest of them will be
> empty.So i am guessing i need loads of reducers but then most of them will
> be empty but at the same time one reducer won't suffice..
> What's the best way to solve this..
> How to guess optimal number of reducers..
> Thanks
>  NOTICE: This e-mail message and any attachments are confidential, subject
> to copyright and may be privileged. Any unauthorized use, copying or
> disclosure is prohibited. If you are not the intended recipient, please
> delete and contact the sender immediately. Please consider the environment
> before printing this e-mail. AVIS : le présent courriel et toute pièce
> jointe qui l'accompagne sont confidentiels, protégés par le droit d'auteur
> et peuvent être couverts par le secret professionnel. Toute utilisation,
> copie ou divulgation non autorisée est interdite. Si vous n'êtes pas le
> destinataire prévu de ce courriel, supprimez-le et contactez immédiatement
> l'expéditeur. Veuillez penser à l'environnement avant d'imprimer le présent
> courriel
>



MapReduce logs

2012-11-21 Thread Jean-Marc Spaggiari
Hi,

When we run a MapReduce job, the logs are stored on all the tasktracker nodes.

Is there an easy way to aggregate all those logs together and see them
in a single place instead of going to the tasks one by one and opening
the files?

Thanks,

JM


Re: MapReduce logs

2012-11-21 Thread Dino Kečo
Hi,

We had a similar requirement and built a small Java application which gets
information about the task nodes from the JobTracker and downloads the logs into one
file using the URLs of each task tracker.

For huge logs this becomes slow and time consuming.

Hope this helps.

Regards,
Dino Kečo
msn: xdi...@hotmail.com
mail: dino.k...@gmail.com
skype: dino.keco
phone: +387 61 507 851


On Wed, Nov 21, 2012 at 7:55 PM, Jean-Marc Spaggiari <
jean-m...@spaggiari.org> wrote:

> Hi,
>
> When we run a MapReduce job, the logs are stored on all the tasktracker
> nodes.
>
> Is there an easy way to agregate all those logs together and see them
> in a single place instead of going to the tasks one by one and open
> the file?
>
> Thanks,
>
> JM
>


fundamental doubt

2012-11-21 Thread jamal sasha
Hi..
I guess I am asking a lot of fundamental questions, but I thank you guys for
taking out time to explain my doubts.
So I am able to write map reduce jobs, but here is my doubt:
As of now I am writing mappers which emit a key and a value.
This key value is then captured at the reducer end and then I process the key
and value there.
Let's say I want to calculate the average...
Key1 value1
Key2 value2
Key1 value3

So the output is something like
Key1 = average of value1 and value3
Key2 = value2

Right now in the reducer I have to create a dictionary with the original keys
as keys and a list as the value:
Data = defaultdict(list)  // python user
But I thought that
the mapper takes in the key value pairs and outputs key: (v1, v2) and
the reducer takes in this key and a list of values and returns
key, new value..

So why is the input of the reducer the simple output of the mapper and not the list
of all the values for a particular key, or did I misunderstand something?
Am I making any sense??


Re: MapReduce logs

2012-11-21 Thread Jean-Marc Spaggiari
Thanks for the info.

I have quickly drafted this bash script in case it can help someone...
You just need to make sure the IP inside is replaced.
To call it, you need to give it the job tasks page.

./showLogs.sh "http://192.168.23.7:50030/jobtasks.jsp?jobid=job_201211211408_0001&type=map&pagenum=1"

Then you can redirect the output, or do what ever you want.

I was wondering if there was a "nicer" solution...

:~/test$ cat showLogs.sh
#!/bin/bash
# Fetch the job's task list page (URL passed as $1).
rm -f tasks.html
wget --quiet --output-document tasks.html $1
# Loop over every taskdetails link found on the task list page.
for i in `cat tasks.html | grep taskdetails | cut -d"\"" -f2 | grep taskdetails`; do
  rm -f tasksdetails.html
  wget --quiet --output-document tasksdetails.html http://192.168.23.7:50030/$i
  # Each task details page links to the full ("all=true") task log.
  for j in `cat tasksdetails.html | grep "all=true" | cut -d"\"" -f6`; do
    printf "*"%.0s {1..80}
    echo
    echo $j
    printf "*"%.0s {1..80}
    echo
    rm -f logs.txt
    wget --quiet --output-document logs.txt $j
    # Strip the HTML header and footer lines around the log body.
    tail -n +31 logs.txt | head -n -2
  done
done
rm -f tasks.html
rm -f tasksdetails.html
rm -f logs.txt


2012/11/21, Dino Kečo :
> Hi,
>
> We had similar requirement and we built small Java application which gets
> information about task nodes from Job Tracker and download logs into one
> file using URLs of each task tracker.
>
> For huge logs this becomes slow and time consuming.
>
> Hope this helps.
>
> Regards,
> Dino Kečo
> msn: xdi...@hotmail.com
> mail: dino.k...@gmail.com
> skype: dino.keco
> phone: +387 61 507 851
>
>
> On Wed, Nov 21, 2012 at 7:55 PM, Jean-Marc Spaggiari <
> jean-m...@spaggiari.org> wrote:
>
>> Hi,
>>
>> When we run a MapReduce job, the logs are stored on all the tasktracker
>> nodes.
>>
>> Is there an easy way to agregate all those logs together and see them
>> in a single place instead of going to the tasks one by one and open
>> the file?
>>
>> Thanks,
>>
>> JM
>>
>


Re: fundamental doubt

2012-11-21 Thread Mohammad Tariq
Hello Jamal,

 For efficient processing, all the values associated with the same key
get sorted and go to the same reducer. As a result the reducer gets a key and a
list of values as its input. To me your assumption seems correct.

Regards,
Mohammad Tariq



On Thu, Nov 22, 2012 at 1:20 AM, jamal sasha  wrote:

> Hi..
> I guess i am asking alot of fundamental questions but i thank you guys for
> taking out time to explain my doubts.
> So i am able to write map reduce jobs but here is my mydoubt
> As of now i am writing mappers which emit key and a value
> This key value is then captured at reducer end and then i process the key
> and value there.
> Let's say i want to calculate the average...
> Key1 value1
> Key2 value 2
> Key 1 value 3
>
> So the output is something like
> Key1 average of value  1 and value 3
> Key2 average 2 = value 2
>
> Right now in reducer i have to create a dictionary with key as original
> keys and value is a list.
> Data = defaultdict(list) == // python usrr
> But i thought that
> Mapper takes in the key value pairs and outputs key: ( v1,v2)and
> Reducer takes in this key and list of values and returns
> Key , new value..
>
> So why is the input of reducer the simple output of mapper and not the
> list of all the values to a particular key or did i  understood something.
> Am i making any sense ??


Re: fundamental doubt

2012-11-21 Thread Bejoy KS
Hi Jamal

It is performed at the framework level: the map emits key-value pairs and the 
framework collects and groups all the values corresponding to a key from all 
the map tasks. Now the reducer takes as input a key and a collection of 
values only. The reduce method signature defines it.
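
A minimal Java (new-API) sketch of that signature for the averaging example, assuming 
numeric values. Note that in streaming the framework only sorts the lines by key, so a 
Python reducer has to detect the key boundaries itself, which is why the dictionary 
approach is needed there:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  @Override
  protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
      throws IOException, InterruptedException {
    // The framework has already grouped every value emitted for this key.
    double sum = 0;
    long count = 0;
    for (DoubleWritable value : values) {
      sum += value.get();
      count++;
    }
    context.write(key, new DoubleWritable(sum / count));
  }
}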


Regards
Bejoy KS

Sent from handheld, please excuse typos.

-Original Message-
From: jamal sasha 
Date: Wed, 21 Nov 2012 14:50:51 
To: user@hadoop.apache.org
Reply-To: user@hadoop.apache.org
Subject: fundamental doubt

Hi..
I guess i am asking alot of fundamental questions but i thank you guys for
taking out time to explain my doubts.
So i am able to write map reduce jobs but here is my mydoubt
As of now i am writing mappers which emit key and a value
This key value is then captured at reducer end and then i process the key
and value there.
Let's say i want to calculate the average...
Key1 value1
Key2 value 2
Key 1 value 3

So the output is something like
Key1 average of value  1 and value 3
Key2 average 2 = value 2

Right now in reducer i have to create a dictionary with key as original
keys and value is a list.
Data = defaultdict(list) == // python usrr
But i thought that
Mapper takes in the key value pairs and outputs key: ( v1,v2)and
Reducer takes in this key and list of values and returns
Key , new value..

So why is the input of reducer the simple output of mapper and not the list
of all the values to a particular key or did i  understood something.
Am i making any sense ??



Re: fundamental doubt

2012-11-21 Thread jamal sasha
got it.
thanks for clarification


On Wed, Nov 21, 2012 at 3:03 PM, Bejoy KS  wrote:

> **
> Hi Jamal
>
> It is performed at a frame work level map emits key value pairs and the
> framework collects and groups all the values corresponding to a key from
> all the map tasks. Now the reducer takes the input as a key and a
> collection of values only. The reduce method signature defines it.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> --
> *From: * jamal sasha 
> *Date: *Wed, 21 Nov 2012 14:50:51 -0500
> *To: *user@hadoop.apache.org
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *fundamental doubt
>
> Hi..
> I guess i am asking alot of fundamental questions but i thank you guys for
> taking out time to explain my doubts.
> So i am able to write map reduce jobs but here is my mydoubt
> As of now i am writing mappers which emit key and a value
> This key value is then captured at reducer end and then i process the key
> and value there.
> Let's say i want to calculate the average...
> Key1 value1
> Key2 value 2
> Key 1 value 3
>
> So the output is something like
> Key1 average of value  1 and value 3
> Key2 average 2 = value 2
>
> Right now in reducer i have to create a dictionary with key as original
> keys and value is a list.
> Data = defaultdict(list) == // python usrr
> But i thought that
> Mapper takes in the key value pairs and outputs key: ( v1,v2)and
> Reducer takes in this key and list of values and returns
> Key , new value..
>
> So why is the input of reducer the simple output of mapper and not the
> list of all the values to a particular key or did i  understood something.
> Am i making any sense ??
>


Re: Pentaho

2012-11-21 Thread Harsh J
A better place to ask this at, is at the Pentaho's own community
http://wiki.pentaho.com/display/BAD/Pentaho+Big+Data+Community+Home.
At a glance, they have forums and IRC you could use to ask your
questions about their product.

On Wed, Nov 21, 2012 at 11:40 PM, suneel hadoop
 wrote:
>
> Hi all,
> Any material available on pentaho kettle
> Thanks,
> Suneel...



-- 
Harsh J


Re: Facebook corona compatibility

2012-11-21 Thread Harsh J
IIRC, Facebook's own hadoop branch (Github: facebook/hadoop I guess),
does not support or carry any security features, which Apache Hadoop
0.20.203 -> 1.1.x now carries. So out of the box, I expect it to be
incompatible with any of the recent Apache releases.

On Mon, Nov 12, 2012 at 9:31 PM, Amit Sela  wrote:
> Hi everyone,
>
> Anyone knows if the new corona tools (Facebook just released as open source)
> are compatible with hadoop 1.0.x ? or just 0.20.x ?
>
> Thanks.



-- 
Harsh J


Re: guessing number of reducers.

2012-11-21 Thread Manoj Babu
Thank you for the info Bejoy.

Cheers!
Manoj.



On Thu, Nov 22, 2012 at 12:04 AM, Bejoy KS  wrote:

> **
> Hi Manoj
>
> If you intend to calculate the number of reducers based on the input size,
> then in your driver class you should get the size of the input dir in hdfs
> and say you intended to give n bytes to a reducer then the number of
> reducers can be computed as
> Total input size/ bytes per reducer.
>
> You can round this value and use it to set the number of reducers in conf
> programatically.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> --
> *From: * Manoj Babu 
> *Date: *Wed, 21 Nov 2012 23:28:00 +0530
> *To: *
> *Cc: *bejoy.had...@gmail.com
> *Subject: *Re: guessing number of reducers.
>
> Hi,
>
> How to set no of reducers in job conf dynamically?
> For example some days i am getting 500GB of data on heavy traffic and some
> days 100GB only.
>
> Thanks in advance!
>
> Cheers!
> Manoj.
>
>
>
> On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy 
> wrote:
>
>>  Bejoy,
>>
>>
>>
>> I’ve read somethere about keeping number of mapred.reduce.tasks below the
>> reduce task capcity. Here is what I just tested:
>>
>>
>>
>> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:
>>
>>
>>
>> 1 Reducer   – 22mins
>>
>> 4 Reducers – 11.5mins
>>
>> 8 Reducers – 5mins
>>
>> 10 Reducers – 7mins
>>
>> 12 Reducers – 6:5mins
>>
>> 16 Reducers – 5.5mins
>>
>>
>>
>> 8 Reducers have won the race. But Reducers at the max capacity was very
>> clos. J
>>
>>
>>
>> AK47
>>
>>
>>
>>
>>
>> *From:* Bejoy KS [mailto:bejoy.had...@gmail.com]
>> *Sent:* Wednesday, November 21, 2012 11:51 AM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: guessing number of reducers.
>>
>>
>>
>> Hi Sasha
>>
>> In general the number of reduce tasks is chosen mainly based on the data
>> volume to reduce phase. In tools like hive and pig by default for every 1GB
>> of map output there will be a reducer. So if you have 100 gigs of map
>> output then 100 reducers.
>> If your tasks are more CPU intensive then you need lesser volume of data
>> per reducer for better performance results.
>>
>> In general it is better to have the number of reduce tasks slightly less
>> than the number of available reduce slots in the cluster.
>>
>> Regards
>> Bejoy KS
>>
>> Sent from handheld, please excuse typos.
>>  --
>>
>> *From: *jamal sasha 
>>
>> *Date: *Wed, 21 Nov 2012 11:38:38 -0500
>>
>> *To: *user@hadoop.apache.org
>>
>> *ReplyTo: *user@hadoop.apache.org
>>
>> *Subject: *guessing number of reducers.
>>
>>
>>
>> By default the number of reducers is set to 1..
>> Is there a good way to guess optimal number of reducers
>> Or let's say i have tbs worth of data... mappers are of order 5000 or
>> so...
>> But ultimately i am calculating , let's say, some average of whole
>> data... say average transaction occurring...
>> Now the output will be just one line in one "part"... rest of them will
>> be empty.So i am guessing i need loads of reducers but then most of them
>> will be empty but at the same time one reducer won't suffice..
>> What's the best way to solve this..
>> How to guess optimal number of reducers..
>> Thanks
>>
>
>


Re: Hadoop Web Interface Security

2012-11-21 Thread Visioner Sadak
Thanks Harsh. Any hints on how to give user.name in the configuration files
for simple authentication? Is that given as a property?
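
A rough sketch of the kind of core-site.xml entries those docs describe (the values 
below are only placeholders; see the linked pages for the signature secret, token 
validity and cookie domain properties). With simple auth and anonymous access 
disallowed, users then pass ?user.name=<name> on the URL:

<property>
  <name>hadoop.http.filter.initializers</name>
  <value>org.apache.hadoop.security.AuthenticationFilterInitializer</value>
</property>
<property>
  <name>hadoop.http.authentication.type</name>
  <value>simple</value>
</property>
<property>
  <name>hadoop.http.authentication.simple.anonymous.allowed</name>
  <value>false</value>
</property>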

On Wed, Nov 21, 2012 at 5:52 PM, Harsh J  wrote:

> Yes, see
> http://hadoop.apache.org/docs/current/hadoop-auth/Configuration.html
> and also see http://hadoop.apache.org/docs/stable/HttpAuthentication.html
>
> On Wed, Nov 21, 2012 at 3:34 PM, Visioner Sadak
>  wrote:
> > Hi as we knw that by using hadoop's web UI  at
> http://namenode-ip/50070
> > anyone can access the hdfs details can we secure it only to certain
> > authorized users and not publicly to all.. in production
>
>
>
> --
> Harsh J
>


RE: HADOOP UPGRADE ISSUE

2012-11-21 Thread Uma Maheswara Rao G
start-all.sh will not carry any arguments to pass to the nodes.

Start with start-dfs.sh,

or start the namenode directly with the upgrade option: ./hadoop namenode -upgrade



Regards,

Uma


From: yogesh dhari [yogeshdh...@live.com]
Sent: Thursday, November 22, 2012 12:23 PM
To: hadoop helpforoum
Subject: HADOOP UPGRADE ISSUE

Hi All,

I am trying to upgrade Apache hadoop-0.20.2 to hadoop-1.0.4.
I have given the same dfs.name.dir, etc., in hadoop-1.0.4's conf files as were 
in hadoop-0.20.2.
Now I am starting DFS and MapRed using

start-all.sh -upgrade

but the namenode and datanode fail to run.

1) Namenode's logs shows::

ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem 
initialization failed.
java.io.IOException:
File system image contains an old layout version -18.
An upgrade to version -32 is required.
Please restart NameNode with -upgrade option.
.
.
ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException:
File system image contains an old layout version -18.
An upgrade to version -32 is required.
Please restart NameNode with -upgrade option.


2) Datanode's logs shows::

WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Invalid directory in 
dfs.data.dir: Incorrect permission for /opt/hadoop_newdata_dirr, expected: 
rwxr-xr-x, while actual: rwxrwxrwx
(how are these file permissions showing warnings?)

2012-11-22 12:05:21,157 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
All directories in dfs.data.dir are invalid.

Please suggest

Thanks & Regards
Yogesh Kumar