RE: Is there an additional overhead when storing data in HDFS?
Thanks guys, great job.

From: donta...@gmail.com
Date: Wed, 21 Nov 2012 13:23:08 +0530
Subject: Re: Is there an additional overhead when storing data in HDFS?
To: user@hadoop.apache.org

Hello Ramon, why don't you go through this link once: http://www.aosabook.org/en/hdfs.html Suresh and the guys have explained everything beautifully. HTH
Regards,
Mohammad Tariq

On Wed, Nov 21, 2012 at 12:58 PM, Suresh Srinivas wrote: The Namenode will have a trivial amount of data stored in the journal/fsimage.

On Tue, Nov 20, 2012 at 11:21 PM, WangRamon wrote: Thanks, besides the checksum data is there anything else? Data in the name node?

Date: Tue, 20 Nov 2012 23:14:06 -0800
Subject: Re: Is there an additional overhead when storing data in HDFS?
From: sur...@hortonworks.com
To: user@hadoop.apache.org

HDFS uses 4GB for the file + checksum data. The default is that for every 512 bytes of data, 4 bytes of checksum are stored. In this case that is an additional 32MB of data.

On Tue, Nov 20, 2012 at 11:00 PM, WangRamon wrote: Hi All, I'm wondering if there is an additional overhead when storing some data in HDFS. For example, I have a 2GB file and the replication factor of HDFS is 2. When the file is uploaded to HDFS, should HDFS use 4GB to store it, or more than 4GB? If it takes more than 4GB of space, why? Thanks, Ramon

--
http://hortonworks.com/download/
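The figures in Suresh's reply can be reproduced with a quick sketch. This is back-of-the-envelope only, using the 512-byte/4-byte checksum defaults he mentions, and ignoring NameNode metadata:

```python
# Rough sketch of HDFS raw storage cost for a file: every replica stores
# the full data, plus 4 bytes of checksum per 512 bytes of data
# (the io.bytes.per.checksum default described in the thread).
def hdfs_storage_bytes(file_bytes, replication=2,
                       bytes_per_checksum=512, checksum_bytes=4):
    data = file_bytes * replication                 # full copy per replica
    checksums = data * checksum_bytes // bytes_per_checksum
    return data, checksums

# Ramon's 2GB file with replication factor 2:
data, checksums = hdfs_storage_bytes(2 * 1024**3)
print(data // 1024**3, "GB data +", checksums // 1024**2, "MB checksum")
```

This reproduces the numbers from the thread: 4GB of replicated data plus 32MB of checksums.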
Re: reducer not starting
Sometimes it's a network issue: reducers are not able to find the hostnames or IPs of the other machines. Make sure your /etc/hosts entries and hostnames are correct.

Regards,
Praveenesh

On Tue, Nov 20, 2012 at 10:46 PM, Harsh J wrote:
> Your mappers are failing (possibly a user-side error or an
> environmental one) and are being reattempted by the framework (default
> behavior: it attempts 4 times to avoid transient failure scenarios).
>
> Visit your job's logs in the JobTracker web UI to find more
> information on why your tasks fail.
>
> On Tue, Nov 20, 2012 at 10:22 PM, jamal sasha wrote:
> > I am not sure what's happening, but I wrote a simple mapper and reducer
> > script. And I am testing it against a small dataset (a few lines long).
> > For some reason the reducer is just not starting, and the mapper is
> > executing again and again?
> >
> > 12/11/20 09:21:18 INFO streaming.StreamJob: map 0% reduce 0%
> > 12/11/20 09:22:05 INFO streaming.StreamJob: map 50% reduce 0%
> > 12/11/20 09:22:10 INFO streaming.StreamJob: map 100% reduce 0%
> > 12/11/20 09:32:05 INFO streaming.StreamJob: map 50% reduce 0%
> > 12/11/20 09:32:11 INFO streaming.StreamJob: map 0% reduce 0%
> > 12/11/20 09:32:20 INFO streaming.StreamJob: map 50% reduce 0%
> > 12/11/20 09:32:31 INFO streaming.StreamJob: map 100% reduce 0%
> > 12/11/20 09:42:20 INFO streaming.StreamJob: map 50% reduce 0%
> > 12/11/20 09:42:31 INFO streaming.StreamJob: map 0% reduce 0%
> > 12/11/20 09:42:32 INFO streaming.StreamJob: map 50% reduce 0%
> > 12/11/20 09:42:50 INFO streaming.StreamJob: map 100% reduce 0%
> >
> > Let me know if you want the code also.
> > Any clues of where I am going wrong?
> > Thanks
>
> --
> Harsh J
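A quick way to sanity-check the hostname-resolution advice above on each node (a hypothetical helper, not from the thread):

```python
# Check that this machine's own hostname resolves; a hostname that does not
# resolve (or resolves to the wrong address) is a common reason reducers
# cannot fetch map output from other nodes.
import socket

host = socket.gethostname()
try:
    ip = socket.gethostbyname(host)
    print(f"ok: {host} resolves to {ip}")
except socket.gaierror:
    print(f"warning: {host} does not resolve; check /etc/hosts on every node")
```

Run it on every node and confirm each reported IP is the address the other machines can actually reach.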
Re: Hadoop Web Interface Security
Yes, see http://hadoop.apache.org/docs/current/hadoop-auth/Configuration.html and also see http://hadoop.apache.org/docs/stable/HttpAuthentication.html

On Wed, Nov 21, 2012 at 3:34 PM, Visioner Sadak wrote:
> Hi, as we know, by using Hadoop's web UI at http://namenode-ip:50070
> anyone can access the HDFS details. Can we secure it to only certain
> authorized users, rather than publicly to all, in production?

--
Harsh J
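Following the HttpAuthentication document linked above, the web consoles can be put behind Kerberos/SPNEGO with core-site.xml properties along these lines (a sketch only; the realm and keytab path are placeholders you must adapt):

```xml
<!-- core-site.xml: secure the HTTP web consoles (sketch; adjust realm/keytab). -->
<property>
  <name>hadoop.http.authentication.type</name>
  <value>kerberos</value> <!-- default is "simple", which lets anyone connect -->
</property>
<property>
  <name>hadoop.http.authentication.simple.anonymous.allowed</name>
  <value>false</value>
</property>
<property>
  <name>hadoop.http.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>hadoop.http.authentication.kerberos.keytab</name>
  <value>/etc/security/keytab/http.keytab</value>
</property>
```

Check the exact property names against the HttpAuthentication page for your Hadoop version before deploying.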
Re: reducer not starting
Just FYI, you don't need to stop the job, update the host, and retry. Just update the host while the job is running and it should retry and restart. I had a similar issue with one of my nodes where the hosts file was not updated. After the update it automatically resumed the work...

JM

2012/11/21, praveenesh kumar:
> Sometimes it's a network issue: reducers are not able to find the hostnames or IPs
> of the other machines. Make sure your /etc/hosts entries and hostnames are
> correct.
>
> Regards,
> Praveenesh
Not able to change the priority of job using fair scheduler
Hi, I have enabled the fair scheduler and everything is set to default with only a few configuration changes. It is working fine and multiple users can run queries simultaneously. But I am not able to change the priority from "http:///scheduler". The Priority column is coming up as simple text, not as a drop-down, and the same goes for the Pool column. Which configuration have I missed? http://hadoop.apache.org/docs/r0.20.2/fair_scheduler.html The above link doesn't say anything about this. Please help me. Thanks, Chunky.
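One setting worth checking, assuming your Hadoop build ships the later 0.20-era fair scheduler, is the advanced web interface flag in mapred-site.xml; verify the property exists in your version's fair scheduler documentation before relying on it:

```xml
<!-- mapred-site.xml: enable the fair scheduler's administration controls
     (priority/pool drop-downs) on the /scheduler page. This property name
     is from later 0.20-era fair schedulers; confirm it in your build. -->
<property>
  <name>mapred.fairscheduler.webinterface.advanced</name>
  <value>true</value>
</property>
```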
Re: When speculative execution is true, there is a data loss issue with multpleoutputs
It's not data loss; the problem is that MultipleOutputs does not work with the standard committer if you do not write into a subdirectory of the main job output.
Re: reducer not starting
Hi,
Thanks for the insights. I noticed that these restarts of the mappers were because in the shebang I had Usr/env/bin instead of usr/env/bin python.
Any clue what was going on with the reducers not starting but the mappers being executed again and again?
Probably a very naive question, but I am a newbie, you see :)

On Wednesday, November 21, 2012, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:
> Just FYI, you don't need to stop the job, update the host, and retry.
>
> Just update the host while the job is running and it should retry and restart.
>
> I had a similar issue with one of my nodes where the hosts file was
> not updated. After the update it automatically resumed the work...
>
> JM
Re: reducer not starting
As Harsh suggested, you might want to check the task logs on the slaves (you can do it through the web UI by clicking on the map/reduce task links) and see if there are any exceptions.

On Wed, Nov 21, 2012 at 8:06 PM, jamal sasha wrote:
> Hi
> Thanks for the insights.
> I noticed that these restarts of mappers were because in the shebang i had
> Usr/env/bin instead of usr/env/bin python
> Any clue of what was going on with reducers not starting but mappers being
> executed again and again.
> Probably a very naive question but i am newbie you see :)

--
Regards,
Bharath .V
w: http://researchweb.iiit.ac.in/~bharath.v
Re: When speculative execution is true, there is a data loss issue with multpleoutputs
Thanks Radim. Yes, as you said, we are not writing into a subdirectory of the main job output. I will try making them subdirectories of the output dir.

But one question: when I turn off speculative execution, it works fine with the same multiple-output directory structure. May I know how exactly it works in that case? When we change the speculative execution flag, why exactly is there a difference in the output data?

Thanks,
B Anil Kumar.

On Wed, Nov 21, 2012 at 8:01 PM, Radim Kolar wrote:
> It's not data loss; the problem is that MultipleOutputs does not work with the
> standard committer if you do not write into a subdirectory of the main job output.
Re: When speculative execution is true, there is a data loss issue with multpleoutputs
On 21.11.2012 16:07, AnilKumar B wrote:
> Thanks Radim. Yes, as you said, we are not writing into a subdirectory of the main job output. I will try making them subdirectories of the output dir.
> But one question: when I turn off speculative execution, it works fine with the same multiple-output directory structure. May I know how exactly it works in this case?

Because if you are not using MultipleOutputs, you are not writing to the real file but to a file whose name is generated from its task attempt, in a tmp subdirectory. The attempts do not overwrite each other. In HDFS you can have only one writer per file.
Re: When speculative execution is true, there is a data loss issue with multpleoutputs
This is another problem with the FileOutputFormat committer; it is related to yours: https://issues.apache.org/jira/browse/MAPREDUCE-3772

It works like this: if the MultipleOutputs path is relative to the job output, there is a workaround that makes it work with the committer, and the outputs from multiple tasks do not clash with each other; the problem mentioned in the ticket defeats that relative-vs-absolute output path detection, and all output is lost on task commit. But if the output is an absolute path, it is written directly to the output file, which fails because the writers from multiple attempts clash with each other.
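Radim's point can be illustrated with a path-only sketch (a hypothetical layout mirroring how the committer scopes each attempt to its own temporary directory; not Hadoop code):

```python
# Why relative output paths survive speculative execution: each task attempt
# writes under its own _temporary/<attempt-id>/ directory, so two attempts of
# the same task never share a file. An absolute path bypasses this scoping,
# so both attempts open the same HDFS file and clash.
def attempt_path(job_output, attempt_id, base="part-r-00000"):
    if base.startswith("/"):              # absolute path: no per-attempt scoping
        return base
    return f"{job_output}/_temporary/{attempt_id}/{base}"

a0 = attempt_path("/out", "attempt_0001_r_000000_0")
a1 = attempt_path("/out", "attempt_0001_r_000000_1")
assert a0 != a1                           # speculative attempts are isolated

b0 = attempt_path("/out", "attempt_0001_r_000000_0", "/other/dir/part")
b1 = attempt_path("/out", "attempt_0001_r_000000_1", "/other/dir/part")
assert b0 == b1                           # absolute paths collide
print("relative paths isolated; absolute paths clash")
```

With speculative execution off there is only one attempt per task, which is why the absolute-path layout appears to work in that case.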
io.file.buffer.size
Guys,

I've read that increasing the above (default 4KB) value to, say, 128KB might speed things up. My input is 40 million serialised records coming from an RDBMS, and I noticed that with the increased I/O buffer my job actually runs a tiny bit slower. Is that possible?

P.S. I've got two questions:

1. During a Sqoop import I see that two additional files are generated in the HDFS folder, namely .../_log/history/...conf.xml and .../_log/history/...sqoop_generated_class.jar. Is there a way to redirect these files to a different directory? I cannot find an answer.

2. I run multiple reducers and each generates its own output. If I were to merge all the output, would running either of the commands below be recommended?

hadoop dfs -getmerge
or
hadoop dfs -cat output/* > output_All
hadoop dfs -get output_All

Thanks,
AK

NOTICE: This e-mail message and any attachments are confidential, subject to copyright and may be privileged. Any unauthorized use, copying or disclosure is prohibited. If you are not the intended recipient, please delete and contact the sender immediately. Please consider the environment before printing this e-mail. AVIS : le présent courriel et toute pièce jointe qui l'accompagne sont confidentiels, protégés par le droit d'auteur et peuvent être couverts par le secret professionnel. Toute utilisation, copie ou divulgation non autorisée est interdite. Si vous n'êtes pas le destinataire prévu de ce courriel, supprimez-le et contactez immédiatement l'expéditeur. Veuillez penser à l'environnement avant d'imprimer le présent courriel
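On question 2 above, both approaches just concatenate the part files; `-getmerge` does it in one step on the client side. A local Python sketch of the merge semantics (illustration only, not Hadoop code; part files are concatenated in sorted name order):

```python
# Local illustration of what "hadoop dfs -getmerge output/ output_All" does:
# concatenate the part-* files of a job's output directory in name order.
import pathlib
import tempfile

def getmerge(src_dir, dst_file):
    parts = sorted(pathlib.Path(src_dir).glob("part-*"))
    with open(dst_file, "wb") as out:
        for p in parts:
            out.write(p.read_bytes())

with tempfile.TemporaryDirectory() as d:
    (pathlib.Path(d) / "part-00000").write_text("a\n")
    (pathlib.Path(d) / "part-00001").write_text("b\n")
    merged = pathlib.Path(d) / "output_All"
    getmerge(d, merged)
    print(merged.read_text())  # the two parts, concatenated in order
```

The `-cat`-then-`-get` route produces the same bytes but stages the merged file in HDFS first, so it costs extra cluster storage and a second copy step.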
guessing number of reducers.
By default the number of reducers is set to 1. Is there a good way to guess the optimal number of reducers? Let's say I have TBs worth of data and the mappers are of the order of 5000 or so. But ultimately I am calculating, let's say, some average over the whole data, say the average transaction occurring. Now the output will be just one line in one "part" file and the rest of them will be empty. So I am guessing I need loads of reducers, but then most of them will be empty, while at the same time one reducer won't suffice. What's the best way to solve this? How do I guess the optimal number of reducers? Thanks
RE: guessing number of reducers.
Jamal,

This is what I am using. After you start your job, visit the JobTracker's web UI (port 50030) and look for the Cluster Summary. "Reduce Task Capacity" should hint at what to optimally set your number to. I could be wrong, but it works for me. :)

The Cluster Summary (Heap Size is *** MB/966.69 MB) columns are: Running Map Tasks, Running Reduce Tasks, Total Submissions, Nodes, Occupied Map Slots, Occupied Reduce Slots, Reserved Map Slots, Reserved Reduce Slots, Map Task Capacity, Reduce Task Capacity, Avg. Tasks/Node, Blacklisted Nodes, Excluded Nodes.

Rgds,
AK47

From: jamal sasha [mailto:jamalsha...@gmail.com]
Sent: Wednesday, November 21, 2012 11:39 AM
To: user@hadoop.apache.org
Subject: guessing number of reducers.
Re: guessing number of reducers.
Hi Sasha

In general the number of reduce tasks is chosen mainly based on the data volume to the reduce phase. In tools like Hive and Pig, by default there will be one reducer for every 1GB of map output; so if you have 100 gigs of map output, then 100 reducers. If your tasks are more CPU intensive, you need a smaller volume of data per reducer for better performance results.

In general it is better to have the number of reduce tasks slightly less than the number of available reduce slots in the cluster.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: jamal sasha
Date: Wed, 21 Nov 2012 11:38:38
To: user@hadoop.apache.org
Reply-To: user@hadoop.apache.org
Subject: guessing number of reducers.
RE: guessing number of reducers.
Bejoy,

I've read somewhere about keeping the number of mapred.reduce.tasks below the reduce task capacity. Here is what I just tested, with 25GB of output on an 8-DN cluster with 16 Map and Reduce Task Capacity:

1 Reducer - 22 mins
4 Reducers - 11.5 mins
8 Reducers - 5 mins
10 Reducers - 7 mins
12 Reducers - 6.5 mins
16 Reducers - 5.5 mins

8 Reducers won the race, but reducers at the max capacity were very close. :)

AK47

From: Bejoy KS [mailto:bejoy.had...@gmail.com]
Sent: Wednesday, November 21, 2012 11:51 AM
To: user@hadoop.apache.org
Subject: Re: guessing number of reducers.
Re: guessing number of reducers.
Hi,

How do I set the number of reducers in the job conf dynamically? For example, on some days I am getting 500GB of data on heavy traffic, and on some days only 100GB.

Thanks in advance!
Cheers!
Manoj.

On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy wrote:
> Bejoy,
>
> I've read somewhere about keeping the number of mapred.reduce.tasks below the
> reduce task capacity. Here is what I just tested:
>
> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:
>
> 1 Reducer - 22mins
> 4 Reducers - 11.5mins
> 8 Reducers - 5mins
> 10 Reducers - 7mins
> 12 Reducers - 6.5mins
> 16 Reducers - 5.5mins
>
> 8 Reducers have won the race.
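One answer the thread suggests: combine Bejoy's roughly-1GB-per-reducer rule with Andy's keep-below-slot-capacity observation, and compute the count per job from the day's input size. A sketch (the bytes-per-reducer and slot figures here are illustrative, not from the thread):

```python
import math

# Pick a reducer count from the day's input size: roughly one reducer per
# gigabyte of (estimated) map output, capped just below the cluster's
# reduce slot capacity so all reducers can run in a single wave.
def pick_reducers(input_bytes, bytes_per_reducer=1024**3, reduce_slots=16):
    wanted = max(1, math.ceil(input_bytes / bytes_per_reducer))
    return min(wanted, reduce_slots - 1)

print(pick_reducers(100 * 1024**3))  # a 100GB day
print(pick_reducers(500 * 1024**3))  # a 500GB day
```

The chosen value could then be passed per job, e.g. with `-D mapred.reduce.tasks=<n>` on the command line or `conf.setNumReduceTasks(n)` in the driver.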
Get the name of node where mapper is running
Hello guys,

how can I find out on which node a mapper is running?

Thx
Eduard
Re: guessing number of reducers.
Hello Jamal,

I use a different approach, based on the number of cores. If you have, say, a 4-core machine, you can have (0.75 * number of cores) MR slots. For example, with 4 physical cores, i.e. 8 virtual cores, you can have 0.75 * 8 = 6 MR slots. You can then set 3M+3R or 4M+2R and so on, as per your requirement.

Regards,
Mohammad Tariq

On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy wrote:
> Bejoy,
>
> I've read somewhere about keeping the number of mapred.reduce.tasks below the
> reduce task capacity. Here is what I just tested:
>
> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:
>
> 1 Reducer - 22mins
> 4 Reducers - 11.5mins
> 8 Reducers - 5mins
> 10 Reducers - 7mins
> 12 Reducers - 6.5mins
> 16 Reducers - 5.5mins
>
> 8 Reducers have won the race.
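Tariq's cores-based heuristic, written out as a sketch (the 50/50 map/reduce split is an illustrative assumption, not from his message):

```python
# 0.75 * (virtual cores) total MR slots, split between map and reduce slots.
def mr_slots(virtual_cores, map_fraction=0.5):
    total = int(0.75 * virtual_cores)
    maps = max(1, round(total * map_fraction))
    return maps, total - maps

print(mr_slots(8))  # 8 virtual cores -> 6 slots total, e.g. 3 map + 3 reduce
```

Tariq's 4M+2R variant corresponds to raising `map_fraction` for map-heavy workloads.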
Re: Get the name of node where mapper is running
Hello, the JobTracker has a built-in web UI (http://hostname_of_jobtracker:50030/) where you can get details for all completed and running jobs. For the map phase, it will tell you on which physical hosts the tasks were executed. Kai

On 21.11.2012 at 19:04, Eduard Skaley wrote:
> Hello guys,
>
> how can I find out on which node a mapper is running?
>
> Thx
> Eduard

--
Kai Voigt
k...@123.org
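Besides the web UI, a task can also report its own host at runtime. For a streaming job, a mapper could tag each record with the node name (a hypothetical snippet, not from the thread):

```python
#!/usr/bin/env python
# Streaming mapper sketch: prefix every output record with the worker's
# hostname, so the job output shows where each map task ran.
import socket
import sys

def main(lines, out=sys.stdout):
    host = socket.gethostname()
    for line in lines:
        out.write(f"{host}\t{line.rstrip()}\n")

if __name__ == "__main__":
    main(sys.stdin)
```

The part files then carry the producing host per record; the JobTracker UI remains the authoritative view of task placement.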
Re: Facebook corona compatibility
Hi Amit,

There is a mention here of starting in the hadoop-20 parent path: https://github.com/facebook/hadoop-20/wiki/Corona-Single-Node-Setup

Regards,
Rob

On Mon, Nov 12, 2012 at 8:01 AM, Amit Sela wrote:
> Hi everyone,
>
> Does anyone know if the new Corona tools (which Facebook just released as open
> source) are compatible with Hadoop 1.0.x, or just 0.20.x?
>
> Thanks.
Re: guessing number of reducers.
Hi Andy It is usually so because if you have more reduce tasks than the reduce slots in your cluster then a few of the reduce tasks will be in queue waiting for its turn. So it is better to keep the num of reduce tasks slightly less than the reduce task capacity so that all reduce tasks run at once in parallel. But in some cases each reducer can process only certain volume of data due to some constraints, like data beyond a certain limit may lead to OOMs. In such cases you may need to configure the number of reducers totally based on your data and not based on slots. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: "Kartashov, Andy" Date: Wed, 21 Nov 2012 17:49:50 To: user@hadoop.apache.org; bejoy.had...@gmail.com Subject: RE: guessing number of reducers. Bejoy, I've read somethere about keeping number of mapred.reduce.tasks below the reduce task capcity. Here is what I just tested: Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity: 1 Reducer - 22mins 4 Reducers - 11.5mins 8 Reducers - 5mins 10 Reducers - 7mins 12 Reducers - 6:5mins 16 Reducers - 5.5mins 8 Reducers have won the race. But Reducers at the max capacity was very clos. :) AK47 From: Bejoy KS [mailto:bejoy.had...@gmail.com] Sent: Wednesday, November 21, 2012 11:51 AM To: user@hadoop.apache.org Subject: Re: guessing number of reducers. Hi Sasha In general the number of reduce tasks is chosen mainly based on the data volume to reduce phase. In tools like hive and pig by default for every 1GB of map output there will be a reducer. So if you have 100 gigs of map output then 100 reducers. If your tasks are more CPU intensive then you need lesser volume of data per reducer for better performance results. In general it is better to have the number of reduce tasks slightly less than the number of available reduce slots in the cluster. Regards Bejoy KS Sent from handheld, please excuse typos. 
From: jamal sasha Date: Wed, 21 Nov 2012 11:38:38 -0500 To: user@hadoop.apache.org ReplyTo: user@hadoop.apache.org Subject: guessing number of reducers. By default the number of reducers is set to 1.. Is there a good way to guess optimal number of reducers Or let's say i have tbs worth of data... mappers are of order 5000 or so... But ultimately i am calculating , let's say, some average of whole data... say average transaction occurring... Now the output will be just one line in one "part"... rest of them will be empty. So i am guessing i need loads of reducers but then most of them will be empty but at the same time one reducer won't suffice.. What's the best way to solve this.. How to guess optimal number of reducers.. Thanks NOTICE: This e-mail message and any attachments are confidential, subject to copyright and may be privileged. Any unauthorized use, copying or disclosure is prohibited. If you are not the intended recipient, please delete and contact the sender immediately. Please consider the environment before printing this e-mail. AVIS : le présent courriel et toute pièce jointe qui l'accompagne sont confidentiels, protégés par le droit d'auteur et peuvent être couverts par le secret professionnel. Toute utilisation, copie ou divulgation non autorisée est interdite. Si vous n'êtes pas le destinataire prévu de ce courriel, supprimez-le et contactez immédiatement l'expéditeur. Veuillez penser à l'environnement avant d'imprimer le présent courriel
Re: guessing number of reducers.
Thanks for the input guys. This helps a lot :) On Wednesday, November 21, 2012, Bejoy KS wrote: > Hi Andy > > It is usually so because if you have more reduce tasks than the reduce slots in your cluster then a few of the reduce tasks will be in queue waiting for its turn. So it is better to keep the num of reduce tasks slightly less than the reduce task capacity so that all reduce tasks run at once in parallel. > > But in some cases each reducer can process only certain volume of data due to some constraints, like data beyond a certain limit may lead to OOMs. In such cases you may need to configure the number of reducers totally based on your data and not based on slots. > > Regards > Bejoy KS > > Sent from handheld, please excuse typos. > > From: "Kartashov, Andy" > Date: Wed, 21 Nov 2012 17:49:50 + > To: user@hadoop.apache.org; bejoy.had...@gmail.com > Subject: RE: guessing number of reducers. > > Bejoy, > > > > I've read somewhere about keeping the number of mapred.reduce.tasks below the reduce task capacity. Here is what I just tested: > > > > Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity: > > > > 1 Reducer – 22mins > > 4 Reducers – 11.5mins > > 8 Reducers – 5mins > > 10 Reducers – 7mins > > 12 Reducers – 6.5mins > > 16 Reducers – 5.5mins > > > > 8 Reducers have won the race. But the Reducers at max capacity were very close. :) > > > > AK47 > > > > > > From: Bejoy KS [mailto:bejoy.had...@gmail.com] > Sent: Wednesday, November 21, 2012 11:51 AM > To: user@hadoop.apache.org > Subject: Re: guessing number of reducers. > > > > Hi Sasha > > In general the number of reduce tasks is chosen mainly based on the data volume to reduce phase. In tools like hive and pig by default for every 1GB of map output there will be a reducer. So if you have 100 gigs of map output then 100 reducers. > If your tasks are more CPU intensive then you need lesser volume of data per reducer for better performance results. 
> > In general it is better to have the number of reduce tasks slightly less than the number of available reduce slots in the cluster. > > Regards > Bejoy KS > > Sent from handheld, please excuse typos. > > > > From: jamal sasha > > Date: Wed, 21 Nov 2012 11:38:38 -0500 > > To: user@hadoop.apache.org
Re: guessing number of reducers.
Hi Manoj If you intend to calculate the number of reducers based on the input size, then in your driver class you should get the size of the input dir in hdfs and, say you intended to give n bytes to a reducer, then the number of reducers can be computed as Total input size / bytes per reducer. You can round this value and use it to set the number of reducers in conf programmatically. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Manoj Babu Date: Wed, 21 Nov 2012 23:28:00 To: Cc: bejoy.had...@gmail.com Subject: Re: guessing number of reducers. Hi, How to set the no of reducers in job conf dynamically? For example some days i am getting 500GB of data on heavy traffic and some days 100GB only. Thanks in advance! Cheers! Manoj. On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy wrote: > Bejoy, > > > > I've read somewhere about keeping the number of mapred.reduce.tasks below the > reduce task capacity. Here is what I just tested: > > > > Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity: > > > > 1 Reducer – 22mins > > 4 Reducers – 11.5mins > > 8 Reducers – 5mins > > 10 Reducers – 7mins > > 12 Reducers – 6.5mins > > 16 Reducers – 5.5mins > > > > 8 Reducers have won the race. But the Reducers at max capacity were very > close. :) > > > > AK47 > > > > > > *From:* Bejoy KS [mailto:bejoy.had...@gmail.com] > *Sent:* Wednesday, November 21, 2012 11:51 AM > *To:* user@hadoop.apache.org > *Subject:* Re: guessing number of reducers. > > > > Hi Sasha > > In general the number of reduce tasks is chosen mainly based on the data > volume to reduce phase. In tools like hive and pig by default for every 1GB > of map output there will be a reducer. So if you have 100 gigs of map > output then 100 reducers. > If your tasks are more CPU intensive then you need lesser volume of data > per reducer for better performance results. 
> > In general it is better to have the number of reduce tasks slightly less > than the number of available reduce slots in the cluster. > > Regards > Bejoy KS > > Sent from handheld, please excuse typos. > -- > > *From: *jamal sasha > > *Date: *Wed, 21 Nov 2012 11:38:38 -0500 > > *To: *user@hadoop.apache.org > > *ReplyTo: *user@hadoop.apache.org > > *Subject: *guessing number of reducers. > > > > By default the number of reducers is set to 1.. > Is there a good way to guess optimal number of reducers > Or let's say i have tbs worth of data... mappers are of order 5000 or so... > But ultimately i am calculating , let's say, some average of whole data... > say average transaction occurring... > Now the output will be just one line in one "part"... rest of them will be > empty.So i am guessing i need loads of reducers but then most of them will > be empty but at the same time one reducer won't suffice.. > What's the best way to solve this.. > How to guess optimal number of reducers.. > Thanks >
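Bejoy's sizing rule above can be turned into a small sketch. This is illustrative only: the one-slot headroom and the bytes-per-reducer threshold are assumptions, and in a real Java driver you would read the input size with FileSystem.getContentSummary(path).getLength() and apply the result via job.setNumReduceTasks():

```python
# Sketch of the rule from the thread: one reducer per n bytes of input,
# kept slightly below the cluster's reduce-slot capacity. The numbers
# below are illustrative assumptions, not Hadoop defaults.

def reducers_for_input(total_input_bytes, bytes_per_reducer, reduce_slots):
    """Total input size / bytes per reducer, rounded, at least 1,
    and capped slightly under the available reduce slots."""
    wanted = max(1, round(total_input_bytes / bytes_per_reducer))
    return min(wanted, max(1, reduce_slots - 1))

# 500 GB at 1 GB per reducer would want 500 reducers, but a 16-slot
# cluster caps it at 15; 10 GB on the same cluster stays at 10.
print(reducers_for_input(500 * 2**30, 2**30, 16))  # 15
print(reducers_for_input(10 * 2**30, 2**30, 16))   # 10
```

This matches Manoj's varying-traffic case: the same driver code yields a different reducer count on a 500GB day than on a 100GB day.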
MapReduce logs
Hi, When we run a MapReduce job, the logs are stored on all the tasktracker nodes. Is there an easy way to aggregate all those logs together and see them in a single place instead of going to the tasks one by one and opening the file? Thanks, JM
Re: MapReduce logs
Hi, We had a similar requirement and built a small Java application which gets information about the task nodes from the JobTracker and downloads the logs into one file using the URLs of each tasktracker. For huge logs this becomes slow and time consuming. Hope this helps. Regards, Dino Kečo msn: xdi...@hotmail.com mail: dino.k...@gmail.com skype: dino.keco phone: +387 61 507 851 On Wed, Nov 21, 2012 at 7:55 PM, Jean-Marc Spaggiari < jean-m...@spaggiari.org> wrote: > Hi, > > When we run a MapReduce job, the logs are stored on all the tasktracker > nodes. > > Is there an easy way to agregate all those logs together and see them > in a single place instead of going to the tasks one by one and open > the file? > > Thanks, > > JM >
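A rough sketch of the approach Dino describes, in Python for brevity (his tool was Java). The classic MR1 TaskTracker serves each attempt's log from a /tasklog servlet on port 50060; treat the host, port, and attempt id below as placeholders for your cluster:

```python
from urllib.parse import urlencode

def tasklog_url(tracker_host, attempt_id, port=50060):
    """Build the classic (MR1) TaskTracker tasklog URL for one task attempt."""
    query = urlencode({"attemptid": attempt_id, "all": "true"})
    return f"http://{tracker_host}:{port}/tasklog?{query}"

# Fetching each such URL (e.g. with urllib.request.urlopen) and appending
# the bodies to one file gives the single aggregated log Dino mentions.
print(tasklog_url("node1.example.com", "attempt_201211211408_0001_m_000000_0"))
```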
fundamental doubt
Hi.. I guess I am asking a lot of fundamental questions, but I thank you guys for taking the time to explain my doubts. So I am able to write map reduce jobs, but here is my doubt. As of now I am writing mappers which emit a key and a value. This key-value pair is then captured at the reducer end, and then I process the key and value there. Let's say I want to calculate the average... Key1 value1 Key2 value2 Key1 value3 So the output is something like Key1 = average of value1 and value3 Key2 = value2 (average of a single value) Right now in the reducer I have to create a dictionary with the original keys as keys and a list as the value: Data = defaultdict(list) // python user But I thought that the Mapper takes in key-value pairs and outputs key: (v1, v2), and the Reducer takes in this key and list of values and returns Key, new value.. So why is the input of the reducer the simple output of the mapper and not the list of all the values for a particular key, or did I misunderstand something? Am I making any sense??
Re: MapReduce logs
Thanks for the info. I quickly drafted this bash script in case it can help someone... You just need to make sure the IP inside is replaced. To call it, you need to give it the job task page: ./showLogs.sh "http://192.168.23.7:50030/jobtasks.jsp?jobid=job_201211211408_0001&type=map&pagenum=1" Then you can redirect the output, or do whatever you want. I was wondering if there was a "nicer" solution...
:~/test$ cat showLogs.sh
#!/bin/bash
rm -f tasks.html
wget --quiet --output-document tasks.html "$1"
for i in `cat tasks.html | grep taskdetails | cut -d"\"" -f2 | grep taskdetails`; do
  rm -f tasksdetails.html
  wget --quiet --output-document tasksdetails.html "http://192.168.23.7:50030/$i"
  for j in `cat tasksdetails.html | grep "all=true" | cut -d"\"" -f6`; do
    printf "*"%.0s {1..80}
    echo
    echo "$j"
    printf "*"%.0s {1..80}
    echo
    rm -f logs.txt
    wget --quiet --output-document logs.txt "$j"
    tail -n +31 logs.txt | head -n -2
  done
done
rm -f tasks.html
rm -f tasksdetails.html
rm -f logs.txt
2012/11/21, Dino Kečo : > Hi, > > We had similar requirement and we built small Java application which gets > information about task nodes from Job Tracker and download logs into one > file using URLs of each task tracker. > > For huge logs this becomes slow and time consuming. > > Hope this helps. > > Regards, > Dino Kečo > msn: xdi...@hotmail.com > mail: dino.k...@gmail.com > skype: dino.keco > phone: +387 61 507 851 > > > On Wed, Nov 21, 2012 at 7:55 PM, Jean-Marc Spaggiari < > jean-m...@spaggiari.org> wrote: > >> Hi, >> >> When we run a MapReduce job, the logs are stored on all the tasktracker >> nodes. >> >> Is there an easy way to agregate all those logs together and see them >> in a single place instead of going to the tasks one by one and open >> the file? >> >> Thanks, >> >> JM >> >
Re: fundamental doubt
Hello Jamal, For efficient processing, all the values associated with the same key get sorted and go to the same reducer. As a result the reducer gets a key and a list of values as its input. To me your assumption seems correct. Regards, Mohammad Tariq On Thu, Nov 22, 2012 at 1:20 AM, jamal sasha wrote: > Hi.. > I guess i am asking alot of fundamental questions but i thank you guys for > taking out time to explain my doubts. > So i am able to write map reduce jobs but here is my mydoubt > As of now i am writing mappers which emit key and a value > This key value is then captured at reducer end and then i process the key > and value there. > Let's say i want to calculate the average... > Key1 value1 > Key2 value 2 > Key 1 value 3 > > So the output is something like > Key1 average of value 1 and value 3 > Key2 average 2 = value 2 > > Right now in reducer i have to create a dictionary with key as original > keys and value is a list. > Data = defaultdict(list) == // python usrr > But i thought that > Mapper takes in the key value pairs and outputs key: ( v1,v2)and > Reducer takes in this key and list of values and returns > Key , new value.. > > So why is the input of reducer the simple output of mapper and not the > list of all the values to a particular key or did i understood something. > Am i making any sense ??
Re: fundamental doubt
Hi Jamal It is performed at the framework level: map emits key-value pairs, and the framework collects and groups all the values corresponding to a key from all the map tasks. Now the reducer takes as input a key and a collection of values only. The reduce method signature defines it. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: jamal sasha Date: Wed, 21 Nov 2012 14:50:51 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: fundamental doubt Hi.. I guess i am asking alot of fundamental questions but i thank you guys for taking out time to explain my doubts. So i am able to write map reduce jobs but here is my mydoubt As of now i am writing mappers which emit key and a value This key value is then captured at reducer end and then i process the key and value there. Let's say i want to calculate the average... Key1 value1 Key2 value 2 Key 1 value 3 So the output is something like Key1 average of value 1 and value 3 Key2 average 2 = value 2 Right now in reducer i have to create a dictionary with key as original keys and value is a list. Data = defaultdict(list) == // python usrr But i thought that Mapper takes in the key value pairs and outputs key: ( v1,v2)and Reducer takes in this key and list of values and returns Key , new value.. So why is the input of reducer the simple output of mapper and not the list of all the values to a particular key or did i understood something. Am i making any sense ??
Re: fundamental doubt
got it. thanks for clarification On Wed, Nov 21, 2012 at 3:03 PM, Bejoy KS wrote: > ** > Hi Jamal > > It is performed at a frame work level map emits key value pairs and the > framework collects and groups all the values corresponding to a key from > all the map tasks. Now the reducer takes the input as a key and a > collection of values only. The reduce method signature defines it. > > Regards > Bejoy KS > > Sent from handheld, please excuse typos. > -- > *From: * jamal sasha > *Date: *Wed, 21 Nov 2012 14:50:51 -0500 > *To: *user@hadoop.apache.org > *ReplyTo: * user@hadoop.apache.org > *Subject: *fundamental doubt > > Hi.. > I guess i am asking alot of fundamental questions but i thank you guys for > taking out time to explain my doubts. > So i am able to write map reduce jobs but here is my mydoubt > As of now i am writing mappers which emit key and a value > This key value is then captured at reducer end and then i process the key > and value there. > Let's say i want to calculate the average... > Key1 value1 > Key2 value 2 > Key 1 value 3 > > So the output is something like > Key1 average of value 1 and value 3 > Key2 average 2 = value 2 > > Right now in reducer i have to create a dictionary with key as original > keys and value is a list. > Data = defaultdict(list) == // python usrr > But i thought that > Mapper takes in the key value pairs and outputs key: ( v1,v2)and > Reducer takes in this key and list of values and returns > Key , new value.. > > So why is the input of reducer the simple output of mapper and not the > list of all the values to a particular key or did i understood something. > Am i making any sense ?? >
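The confusion in this thread most likely comes from Hadoop Streaming: the Java Reducer really does receive a (key, iterable-of-values) pair, but a streaming reducer just reads lines sorted by key on stdin and has to detect key boundaries itself — no defaultdict needed. A minimal averaging reducer for the example above, sketched under the usual tab-separated key/value streaming convention:

```python
# Streaming-style reducer for the per-key average example in this
# thread. Hadoop Streaming delivers the map output as lines sorted by
# key, so consecutive lines with the same key form one reduce group.

def reduce_averages(lines):
    """Yield (key, average) for each consecutive run of the same key."""
    current_key, total, count = None, 0.0, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                yield current_key, total / count
            current_key, total, count = key, 0.0, 0
        total += float(value)
        count += 1
    if current_key is not None:
        yield current_key, total / count

# In a real streaming job this would run over sys.stdin:
#   for key, avg in reduce_averages(sys.stdin):
#       print(f"{key}\t{avg}")
```

So for the input Key1/1, Key1/3, Key2/2 (already sorted by the framework), the reducer emits Key1 with the average of 1 and 3, and Key2 with 2 — exactly the grouping Tariq and Bejoy describe.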
Re: Pentaho
A better place to ask this is Pentaho's own community: http://wiki.pentaho.com/display/BAD/Pentaho+Big+Data+Community+Home. At a glance, they have forums and IRC you could use to ask your questions about their product. On Wed, Nov 21, 2012 at 11:40 PM, suneel hadoop wrote: > > Hi all, > Any material available on pentaho kettle > Thanks, > Suneel... -- Harsh J
Re: Facebook corona compatibility
IIRC, Facebook's own hadoop branch (GitHub: facebook/hadoop, I guess) does not support or carry any of the security features that Apache Hadoop 0.20.203 -> 1.1.x now carries. So out of the box, I expect it to be incompatible with any of the recent Apache releases. On Mon, Nov 12, 2012 at 9:31 PM, Amit Sela wrote: > Hi everyone, > > Anyone knows if the new corona tools (Facebook just released as open source) > are compatible with hadoop 1.0.x ? or just 0.20.x ? > > Thanks. -- Harsh J
Re: guessing number of reducers.
Thank you for the info Bejoy. Cheers! Manoj. On Thu, Nov 22, 2012 at 12:04 AM, Bejoy KS wrote: > ** > Hi Manoj > > If you intend to calculate the number of reducers based on the input size, > then in your driver class you should get the size of the input dir in hdfs > and say you intended to give n bytes to a reducer then the number of > reducers can be computed as > Total input size/ bytes per reducer. > > You can round this value and use it to set the number of reducers in conf > programatically. > > Regards > Bejoy KS > > Sent from handheld, please excuse typos. > -- > *From: * Manoj Babu > *Date: *Wed, 21 Nov 2012 23:28:00 +0530 > *To: * > *Cc: *bejoy.had...@gmail.com > *Subject: *Re: guessing number of reducers. > > Hi, > > How to set no of reducers in job conf dynamically? > For example some days i am getting 500GB of data on heavy traffic and some > days 100GB only. > > Thanks in advance! > > Cheers! > Manoj. > > > > On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy > wrote: > >> Bejoy, >> >> >> >> I’ve read somethere about keeping number of mapred.reduce.tasks below the >> reduce task capcity. Here is what I just tested: >> >> >> >> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity: >> >> >> >> 1 Reducer – 22mins >> >> 4 Reducers – 11.5mins >> >> 8 Reducers – 5mins >> >> 10 Reducers – 7mins >> >> 12 Reducers – 6:5mins >> >> 16 Reducers – 5.5mins >> >> >> >> 8 Reducers have won the race. But Reducers at the max capacity was very >> clos. J >> >> >> >> AK47 >> >> >> >> >> >> *From:* Bejoy KS [mailto:bejoy.had...@gmail.com] >> *Sent:* Wednesday, November 21, 2012 11:51 AM >> *To:* user@hadoop.apache.org >> *Subject:* Re: guessing number of reducers. >> >> >> >> Hi Sasha >> >> In general the number of reduce tasks is chosen mainly based on the data >> volume to reduce phase. In tools like hive and pig by default for every 1GB >> of map output there will be a reducer. So if you have 100 gigs of map >> output then 100 reducers. 
>> If your tasks are more CPU intensive then you need lesser volume of data >> per reducer for better performance results. >> >> In general it is better to have the number of reduce tasks slightly less >> than the number of available reduce slots in the cluster. >> >> Regards >> Bejoy KS >> >> Sent from handheld, please excuse typos. >> -- >> >> *From: *jamal sasha >> >> *Date: *Wed, 21 Nov 2012 11:38:38 -0500 >> >> *To: *user@hadoop.apache.org >> >> *ReplyTo: *user@hadoop.apache.org >> >> *Subject: *guessing number of reducers. >> >> >> >> By default the number of reducers is set to 1.. >> Is there a good way to guess optimal number of reducers >> Or let's say i have tbs worth of data... mappers are of order 5000 or >> so... >> But ultimately i am calculating , let's say, some average of whole >> data... say average transaction occurring... >> Now the output will be just one line in one "part"... rest of them will >> be empty.So i am guessing i need loads of reducers but then most of them >> will be empty but at the same time one reducer won't suffice.. >> What's the best way to solve this.. >> How to guess optimal number of reducers.. >> Thanks >> > >
Re: Hadoop Web Interface Security
Thanks Harsh. Any hints on how to give user.name in the configuration files for simple authentication? Is that given as a property? On Wed, Nov 21, 2012 at 5:52 PM, Harsh J wrote: > Yes, see > http://hadoop.apache.org/docs/current/hadoop-auth/Configuration.html > and also see http://hadoop.apache.org/docs/stable/HttpAuthentication.html > > On Wed, Nov 21, 2012 at 3:34 PM, Visioner Sadak > wrote: > > Hi as we knw that by using hadoop's web UI at > http://namenode-ip/50070 > > anyone can access the hdfs details can we secure it only to certain > > authorized users and not publicly to all.. in production > > > > -- > Harsh J >
RE: HADOOP UPGRADE ISSUE
start-all.sh will not pass any arguments to the nodes. Start with start-dfs.sh, or start the namenode directly with the upgrade option: ./hadoop namenode -upgrade Regards, Uma From: yogesh dhari [yogeshdh...@live.com] Sent: Thursday, November 22, 2012 12:23 PM To: hadoop helpforoum Subject: HADOOP UPGRADE ISSUE Hi All, I am trying to upgrade apache hadoop-0.20.2 to hadoop-1.0.4. I have given the same dfs.name.dir, etc. in hadoop-1.0.4's conf files as were in hadoop-0.20.2. Now I am starting dfs and mapred using start-all.sh -upgrade, but the namenode and datanode fail to run. 1) Namenode's logs show:: ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed. java.io.IOException: File system image contains an old layout version -18. An upgrade to version -32 is required. Please restart NameNode with -upgrade option. . . ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException: File system image contains an old layout version -18. An upgrade to version -32 is required. Please restart NameNode with -upgrade option. 2) Datanode's logs show:: WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Invalid directory in dfs.data.dir: Incorrect permission for /opt/hadoop_newdata_dirr, expected: rwxr-xr-x, while actual: rwxrwxrwx (why are these file permissions showing warnings?) 2012-11-22 12:05:21,157 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: All directories in dfs.data.dir are invalid. Please suggest Thanks & Regards Yogesh Kumar
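On the datanode warning above: the datanode refuses dfs.data.dir directories whose permissions are wider than the expected rwxr-xr-x (0755), and /opt/hadoop_newdata_dirr is world-writable (rwxrwxrwx). The equivalent of `chmod 755` on that directory clears the check. A small Python sketch, demonstrated on a scratch directory (the real path comes from the log):

```python
import os
import stat
import tempfile

def tighten_to_0755(path):
    """Drop the group/other write bits so the directory matches rwxr-xr-x."""
    os.chmod(path, 0o755)
    return stat.filemode(os.stat(path).st_mode)

# Demonstrated on a temp directory; on the real node this would be
# /opt/hadoop_newdata_dirr (i.e. `chmod 755 /opt/hadoop_newdata_dirr`).
scratch = tempfile.mkdtemp()
os.chmod(scratch, 0o777)  # reproduce the rwxrwxrwx state from the log
print(tighten_to_0755(scratch))  # drwxr-xr-x
```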