Re: Running terasort with 1 map task
http://wiki.apache.org/hadoop/HowManyMapsAndReduces

It is possible to have a single mapper if the input is not splittable, but that is rarely considered a feature. One could ask why you want to use a platform for distributed computing for a job that shouldn't be distributed.

Regards,
Bertrand

On Tue, Feb 26, 2013 at 12:09 PM, Arindam Choudhury arindamchoudhu...@gmail.com wrote:

  Hi all,

  I am trying to run terasort using one map and one reduce task. So I generated the input data using:

    hadoop jar hadoop-examples-1.0.4.jar teragen -Dmapred.map.tasks=1 -Dmapred.reduce.tasks=1 3200 /user/hadoop/input32mb1map

  Then I launched the terasort job using:

    hadoop jar hadoop-examples-1.0.4.jar terasort -Dmapred.map.tasks=1 -Dmapred.reduce.tasks=1 /user/hadoop/input32mb1map /user/hadoop/output1

  I thought it would run the job with 1 map and 1 reduce, but when I inspected the job statistics I found:

    hadoop job -history /user/hadoop/output1

    Task Summary
    Kind     Total  Successful  Failed  Killed  StartTime             FinishTime
    Setup    1      1           0       0       26-Feb-2013 10:57:47  26-Feb-2013 10:57:55 (8sec)
    Map      24     24          0       0       26-Feb-2013 10:57:57  26-Feb-2013 11:05:37 (7mins, 40sec)
    Reduce   1      1           0       0       26-Feb-2013 10:58:21  26-Feb-2013 11:08:31 (10mins, 10sec)
    Cleanup  1      1           0       0       26-Feb-2013 11:08:32  26-Feb-2013 11:08:36 (4sec)

  So although I asked for one map task, 24 of them were launched. How can I tell Hadoop to launch only one map?

  Thanks,
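[Editorial note, not from the thread: for FileInputFormat-based jobs the map count is driven by the number of input splits, and mapred.map.tasks is only a hint. One knob sometimes used is the minimum split size; whether TeraSort's own input format in 1.0.4 honors mapred.min.split.size is an assumption here, so treat this only as a sketch:]

    # sketch: try to force one split per input file by making the minimum split
    # size (4 GB here, larger than the ~3.2 GB of data) exceed the input size
    hadoop jar hadoop-examples-1.0.4.jar terasort \
      -Dmapred.min.split.size=4294967296 \
      -Dmapred.reduce.tasks=1 \
      /user/hadoop/input32mb1map /user/hadoop/output1

[Even then this yields one map per input file, so a directory containing several part files would still get several maps.]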
Re: Running terasort with 1 map task
Maybe your goal is to have a baseline for performance measurement? In that case, you might want to consider running only one TaskTracker: you would still have multiple tasks, but they would all run on a single machine. You could also make the mappers run serially by configuring only one map slot on your one-node cluster. Nevertheless, I agree with Bertrand that this is not really a realistic use case (or maybe you can give us more clues).

Julien

2013/2/26 Bertrand Dechoux decho...@gmail.com:
> It is possible to have a single mapper if the input is not splittable, but that is rarely considered a feature. [...]
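[Editorial note: a minimal sketch of what Julien's single-map-slot suggestion could look like in mapred-site.xml on a Hadoop 1.x TaskTracker; values are illustrative:]

    <!-- mapred-site.xml on the lone TaskTracker: one map slot and one reduce slot -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>1</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>1</value>
    </property>

[With a single slot, the 24 map tasks would still be created but would execute one at a time.]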
Re: Running terasort with 1 map task
Thanks. As Julien said, I want to do a performance measurement. Actually,

  hadoop jar hadoop-examples-1.0.4.jar teragen -Dmapred.map.tasks=1 -Dmapred.reduce.tasks=1 3200 /user/hadoop/input32mb1map

has generated:

  Total size: 3200029737 B
  Total dirs: 3
  Total files: 5
  Total blocks (validated): 27 (avg. block size 118519619 B)

That's why there are so many maps.

On Tue, Feb 26, 2013 at 12:46 PM, Julien Muller julien.mul...@ezako.com wrote:
> Maybe your goal is to have a baseline for performance measurement? [...]
Re: Running terasort with 1 map task
In my $HADOOP_HOME/conf/hdfs-site.xml I have set the data-block size:

  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
    <final>true</final>
  </property>

While running teragen I am specifying it again, to be sure:

  hadoop jar /opt/hadoop-1.0.4/hadoop-examples-1.0.4.jar teragen -Dmapred.map.tasks=1 -Dmapred.reduce.tasks=1 -Ddfs.block.size=134217728 32 /user/hadoop/input

but it generates 3 blocks:

  hadoop fsck -blocks -files -locations /user/hadoop/input
  Status: HEALTHY
  Total size: 32029543 B
  Total dirs: 3
  Total files: 4
  Total blocks (validated): 3 (avg. block size 10676514 B)
  Minimally replicated blocks: 3 (100.0 %)

What am I doing wrong? How can I generate only one block?

On Tue, Feb 26, 2013 at 12:52 PM, Arindam Choudhury arindamchoudhu...@gmail.com wrote:
> Thanks. As Julien said, I want to do a performance measurement. [...]
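[Editorial note, a diagnostic suggestion not from the thread: fsck on the whole directory counts blocks of every file under it, including _SUCCESS and the job-history files under _logs, so the three blocks may not all belong to the data file. Checking the generated part file alone would tell; the file name below is hypothetical:]

    # sketch: inspect only the generated data file rather than the whole directory
    hadoop fsck /user/hadoop/input/part-00000 -files -blocks -locations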
Re: Running terasort with 1 map task
Sorry, my bad, it is solved.

On Tue, Feb 26, 2013 at 1:22 PM, Arindam Choudhury arindamchoudhu...@gmail.com wrote:
> In my $HADOOP_HOME/conf/hdfs-site.xml I have set the data-block size... What am I doing wrong? How can I generate only one block? [...]
Re: Hadoop efficient resource isolation
CapacityScheduler has features that allow a user to specify the amount of virtual memory per map/reduce task, and the TaskTracker monitors all tasks and their process trees to ensure fork-bombs don't kill the node.

On Feb 25, 2013, at 8:27 PM, Marcin Mejran wrote:

  That won't stop a bad job (say a fork bomb or a massive memory leak in a streaming script) from taking out a node, which is what I believe Dhanasekaran was asking about. He wants to physically isolate certain jobs to certain non-critical nodes. I don't believe this is possible, and data would still be spread to those nodes, assuming they are datanodes, which would still cause cluster-wide issues (and if the data is isolated, why not have two separate clusters?). I've read references in the docs to some kind of memory-based constraints in Hadoop, but I don't know the details. Does anyone know how they work?

  Also, I believe there are tools in Linux that can kill processes in case of memory issues and otherwise restrict what a certain user can do. These seem like a more flexible solution, although they won't cover all potential issues.

  -Marcin

  On Feb 25, 2013, at 7:20 PM, Arun C Murthy a...@hortonworks.com wrote:

    CapacityScheduler is what you want...

    On Feb 21, 2013, at 5:16 AM, Dhanasekaran Anbalagan wrote:

      Hi Guys,

      Is it possible to isolate job submissions in a Hadoop cluster? We currently run a 48-machine cluster, and we have found that Hadoop does not provide efficient resource isolation. In our case we run a "tech" pool and a "research" pool. When a tech job had a memory leak, it occupied the whole cluster; by the time we figured out the issue with the tech job, it had screwed up the whole Hadoop cluster and 10 datanodes were dead.

      Is there any way to prevent this with more efficient resource allocation at job submission, so that when something goes wrong in a particular job it only affects its own pool and not the other jobs? My idea is: when a tech user submits a job, the job runs only on, in my case, 24 machines; the other machines are reserved for research users. That would contain the memory-leak problem. Please guide me, guys.

      -Dhanasekaran.

      Did I learn something today? If not, I wasted it.

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
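[Editorial note: a rough sketch of the memory limits Arun refers to on Hadoop 1.x with the capacity scheduler; the property names are from the 1.x capacity scheduler documentation as I recall them, so verify them against your version before relying on this:]

    <!-- mapred-site.xml: virtual-memory-based slots (illustrative values) -->
    <property>
      <name>mapred.cluster.map.memory.mb</name>       <!-- size of one map slot -->
      <value>2048</value>
    </property>
    <property>
      <name>mapred.cluster.max.map.memory.mb</name>   <!-- largest per-map memory a job may request -->
      <value>4096</value>
    </property>
    <property>
      <name>mapred.job.map.memory.mb</name>           <!-- what a given job requests per map task -->
      <value>2048</value>
    </property>

[With these set, the TaskTracker's memory monitor kills tasks whose process tree exceeds the requested limit, which is what contains the fork-bomb/leak scenario.]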
HDFS Backup for Hadoop Update
Hello guys,

I am starting the update from Hadoop 0.20 to a newer version that changes the HDFS format (2.0). I read a lot of tutorials and they say that data loss is possible (as expected). In order to avoid HDFS data loss I will probably back up the whole HDFS structure (7TB per node). However, this is a huge amount of data and it will take a lot of time, during which my service would be unavailable.

I was thinking about a simple approach: copying all files to a different place. I tried to find a parallel file compactor to speed up the process, but could not find one.

How did you guys do it? Is there some trick?

Thank you in advance,
Pablo Musa
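[Editorial note, not from the thread: the usual tool for bulk HDFS-to-HDFS copies is distcp, which runs the copy as a MapReduce job and parallelizes it across the cluster. It does assume you have somewhere to copy to (a second cluster or at least a separate HDFS path); hostnames and ports below are illustrative:]

    # sketch: copy a tree to a backup cluster before the upgrade
    hadoop distcp hdfs://prod-nn:8020/user/data hdfs://backup-nn:8020/backup/user/data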
Re: Running terasort with 1 map task
Does passing dfs.block.size=134217728 resolve your issue, or was it something else that fixed your problem?

On Tue, Feb 26, 2013 at 6:04 PM, Arindam Choudhury arindamchoudhu...@gmail.com wrote:
> Sorry, my bad, it is solved. [...]
Re: JobTracker security
I am trying not to use Kerberos... Is there another option?

Thanks,
Serge

On Tue, Feb 26, 2013 at 3:31 PM, Patai Sangbutsarakum patai.sangbutsara...@turn.com wrote:

  Kerberos

  From: Serge Blazhievsky hadoop...@gmail.com
  Reply-To: user@hadoop.apache.org
  Date: Tue, 26 Feb 2013 15:29:08 -0800
  To: user@hadoop.apache.org
  Subject: JobTracker security

  Hi all,

  Is there a way to restrict job monitoring and management only to the jobs started by each individual user?

  The basic scenario is:
  1. Start a job as user1
  2. Log in as user2
  3. hadoop job -list to retrieve the job id
  4. hadoop job -kill job_id
  5. The job gets terminated

  Is there something that needs to be enabled to prevent that from happening?

  Thanks,
  Serge
Re: HDFS Backup for Hadoop Update
Following the idea of copying the data structure, I thought about rsync. I could run rsync while the server is ON and later just apply the diff, which would be much faster and reduce the time the system is off-line. But I do not know whether Hadoop makes a lot of changes to the data structure (blocks).

Thanks again,
Pablo

On 02/26/2013 07:39 PM, Pablo Musa wrote:
> I am starting the update from Hadoop 0.20 to a newer version that changes the HDFS format (2.0). [...]
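[Editorial note: a sketch of that two-pass rsync idea for one datanode's block directory; paths and hostnames are illustrative, and the second pass should only run after the datanode process is stopped so nothing changes underneath it:]

    # pass 1: bulk copy while the datanode is still running
    rsync -a /data/dfs/dn/ backuphost:/backup/dfs/dn/
    # stop the datanode, then pass 2: transfer only what changed since pass 1
    rsync -a --delete /data/dfs/dn/ backuphost:/backup/dfs/dn/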
Re: JobTracker security
Maybe restrict access to the Hadoop file(s) to user1?

2013/2/26 Serge Blazhievsky hadoop...@gmail.com:
> I am trying not to use Kerberos... Is there another option? [...]
Re: JobTracker security
Hi Jean,

Do you mean the input files for Hadoop, or the Hadoop directory?

Serge

On Tue, Feb 26, 2013 at 4:38 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
> Maybe restrict access to the Hadoop file(s) to user1? [...]
Re: QJM HA and ClusterID
Anybody here?

Thanks!

On Tue, Feb 26, 2013 at 9:57 AM, Azuryy Yu azury...@gmail.com wrote:

  Hi all,

  I've been stuck on this question for several days. I want to upgrade my cluster from hadoop-1.0.3 to hadoop-2.0.3-alpha, and I've configured QJM successfully.

  How can I set the clusterID myself? It generates a random clusterID now, and it doesn't work when I run:

    start-dfs.sh -upgrade -clusterId 12345-test

  Thanks!
Re: JobTracker security
I mean the executable files, or even the entire Hadoop directory. People might still be able to install a local copy of Hadoop, configure it to point to the same trackers, and then do the kill, but at least that will complicate things a bit. If user1 and user2 are in different groups, that might also allow you to block some user2 actions against user1 processes.

Also, you should take a look at the Security chapter in Hadoop: The Definitive Guide and at the hadoop-policy.xml file (I never looked at that file, so maybe it's not related at all).

2013/2/26 Serge Blazhievsky hadoop...@gmail.com:
> Do you mean the input files for Hadoop, or the Hadoop directory? [...]
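[Editorial note, not from the thread, and only as strong as your authentication (without Kerberos a user can often impersonate another): MapReduce job ACLs control who may view or modify a running job. A rough sketch for Hadoop 1.x; verify the exact property names against your version:]

    <!-- mapred-site.xml: enable ACL checks cluster-wide -->
    <property>
      <name>mapred.acls.enabled</name>
      <value>true</value>
    </property>
    <!-- set per job (e.g. in the job configuration): only these users/groups may kill or modify it -->
    <property>
      <name>mapreduce.job.acl-modify-job</name>
      <value>user1</value>
    </property>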
Re: JobTracker security
All right! Thanks for the advice!

Serge

On Tue, Feb 26, 2013 at 4:57 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
> I mean the executable files, or even the entire Hadoop directory. [...]
Re: QJM HA and ClusterID
It looks like start-dfs.sh has a bug: it only passes the -upgrade option through and ignores -clusterId. Consider running the underlying command (which is what start-dfs.sh calls) directly:

  bin/hdfs namenode -upgrade -clusterId <your cluster ID>

Please file a bug, if you can, for start-dfs.sh ignoring the additional parameters.

On Tue, Feb 26, 2013 at 4:50 PM, Azuryy Yu azury...@gmail.com wrote:
> Anybody here? [...] How can I set the clusterID myself? [...]
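[Editorial note, not from the thread: after the upgrade, the cluster ID the NameNode is actually using can be checked in its storage directory; the path below is illustrative and should be whatever dfs.namenode.name.dir points to:]

    # the VERSION file records the clusterID in use
    grep clusterID /data/dfs/nn/current/VERSION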
Concatenate adjacent lines with hadoop
Hi,

Please find below the issue I need to solve. Thank you in advance for your help/tips.

I have log files where log lines are sometimes split (this happens when the log line exceeds a specific length):

  Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3] LOGTAG<TAB>FIELD-0<TAB>...<TAB>FIELD-MAX
  Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3] LOGTAG<TAB>FIELD-0<TAB>...<TAB>FIELD-N   <=== log line is being split
  Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3] FIELD-N<TAB>FIELD-N+1 ... FIELD-MAX

Can I reconcile/concatenate split log lines with a Hadoop MapReduce job? In other words, using a MapReduce job, can I concatenate the two following adjacent lines (provided that I 'detect' them):

  Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3] LOGTAG<TAB>FIELD-0<TAB>...<TAB>FIELD-N   <=== log line is being split
  Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3] FIELD-N<TAB>FIELD-N+1 ... FIELD-MAX

into

  Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3] LOGTAG<TAB>FIELD-0<TAB>...<TAB>FIELD-N<TAB>FIELD-N+1 ... FIELD-MAX

Thank you!
Re: Datanodes shutdown and HBase's regionservers not working
Hi Davey,

So were you able to find the issue?

JM

2013/2/25 Davey Yan davey@gmail.com:

Hi Nicolas,

I think I found what led to the shutdown of all of the datanodes, but I am not completely certain. I will return to this mailing list when my cluster is stable again.

On Mon, Feb 25, 2013 at 8:01 PM, Nicolas Liochon nkey...@gmail.com wrote:

Network error messages are not always friendly, especially if there is a misconfiguration. That said, "connection refused" says that the network connection was made, but that the remote port was not open on the remote box, i.e. the process was dead. It could be useful to pastebin the whole logs as well...

On Mon, Feb 25, 2013 at 12:44 PM, Davey Yan davey@gmail.com wrote:

But... there was no log like "network unreachable".

On Mon, Feb 25, 2013 at 6:07 PM, Nicolas Liochon nkey...@gmail.com wrote:

I agree. Then for HDFS, the first thing to check is the network, I would say.

On Mon, Feb 25, 2013 at 10:46 AM, Davey Yan davey@gmail.com wrote:

Thanks for the reply, Nicolas. My question: what can lead to the shutdown of all of the datanodes? I believe that the regionservers will be OK if HDFS is OK.

On Mon, Feb 25, 2013 at 5:31 PM, Nicolas Liochon nkey...@gmail.com wrote:

Ok, what's your question? When you say the datanodes went down, was it the datanode processes or the machines, with both the datanodes and the regionservers?

The NameNode pings its datanodes every 3 seconds. However, it will internally mark the datanodes as dead only after 10:30 minutes (even if in the GUI you see 'no answer for x minutes'). HBase monitoring is done by ZooKeeper: by default, a regionserver is considered dead after 180s with no answer, and before that it is considered live. When you stop a regionserver, it tries to flush its data to disk (i.e. HDFS, i.e. the datanodes). That's why, if you have no datanodes, or if a high ratio of your datanodes are dead, it can't shut down. The "connection refused" and socket timeouts come from the fact that before the 10:30 minutes HDFS does not declare the nodes dead, so HBase tries to use them (and, obviously, fails). Note that there is now an intermediate state for HDFS datanodes, called "stale": the datanode is then used only if you have to (i.e. it's the only datanode with a block replica you need). It will be documented in HBase for the 0.96 release. But if all your datanodes are down it won't change much.

Cheers,
Nicolas

On Mon, Feb 25, 2013 at 10:10 AM, Davey Yan davey@gmail.com wrote:

Hey guys,

We have a cluster with 5 nodes (1 NN and 4 DNs) that has been running for more than 1 year, and it works fine. But the datanodes got shut down twice in the last month.

When the datanodes got shut down, all of them became "Dead Nodes" in the NN web admin UI (http://ip:50070/dfshealth.jsp), but the regionservers of HBase were still live in the HBase web admin (http://ip:60010/master-status); of course, they were zombies. All of the JVM processes were still running, including hmaster/namenode/regionserver/datanode.

When the datanodes got shut down, the load (using the top command) of the slaves became very high, more than 10, higher than during normal running. From the top command, we saw that the datanode and regionserver processes were consuming CPU.

We could not stop the HBase or Hadoop cluster through the normal commands (stop-*.sh / *-daemon.sh stop *). So we stopped the datanodes and regionservers with kill -9 PID; then the load of the slaves returned to a normal level, and we started the cluster again.

Log of the NN at the shutdown point (all of the DNs were removed):

  2013-02-22 11:10:02,278 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.1.152:50010
  2013-02-22 11:10:02,278 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.1.149:50010
  2013-02-22 11:10:02,693 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.1.149:50010
  2013-02-22 11:10:02,693 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.1.150:50010
  2013-02-22 11:10:03,004 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.1.150:50010
  2013-02-22 11:10:03,004 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.1.148:50010
  2013-02-22 11:10:03,339 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.1.148:50010

Logs in the DNs indicated there were many IOExceptions and SocketTimeoutExceptions:

  2013-02-22 11:02:52,354 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
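[Editorial note, not from the thread: the 10:30-minute figure Nicolas mentions comes from the NameNode's dead-node heuristic, which in Hadoop 1.x is derived from two configuration properties. The sketch below shows the usual defaults; check your version's hdfs-default.xml for the exact names before relying on them:]

    <!-- hdfs-site.xml (defaults shown): a datanode is declared dead after roughly
         2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval
         = 2 * 300s + 10 * 3s = 630s, i.e. 10 minutes 30 seconds -->
    <property>
      <name>heartbeat.recheck.interval</name>
      <value>300000</value>   <!-- milliseconds -->
    </property>
    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value>        <!-- seconds -->
    </property>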
Re: Concatenate adjacent lines with hadoop
That's easy. In your example: map output key: FIELD-N; map output value: just the original value.

In the reducer: if LOGTAG<TAB> is present in the value, then this is the first log entry; if not, this is a split log entry, so just take a substring and concatenate it onto the first log entry.

Did I explain it clearly?

On Wed, Feb 27, 2013 at 9:36 AM, Matthieu Labour matth...@actionx.com wrote:
> I have log files where log lines are sometimes split (this happens when the log line exceeds a specific length). [...] Can I reconcile/concatenate split log lines with a Hadoop MapReduce job? [...]
Re: Concatenate adjacent lines with hadoop
Thank you for your answer. I am not sure I understand fully; my email was most likely not very clear. Here is an example of a log line. Please note the beginning of the log line, YSLOGROW, and that the second line should be concatenated with the first line.

  Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3] YSLOGROW 20121216T214720.345Z remote-addr=166.137.156.155user-agent=Mozilla%2F5.0+%28Linux%3B+U%3B+Android+4.0.4%3B+en-us%3B+SAMSUNG-SGH-I717+Build%2FIMM76D%29+AppleWebKit%2F534.30+%28KHTML%2C+like+Gecko%29+Version%2F4.0+Mobile+Safari%2F534.30referrer=http%3A%2F%2Flp.mydas.mobi%2F%2Frich%2Ffoundation%2FdynamicInterstitial%2Fint_launch.php%3Fmm_urid%3DWBNMMG9h4XmbJBUHbDrNWWWm%26mm_ipaddress%3D166.137.156.155%26mm_handset%3D8440%26mm_carrier%3D2%26mm_apid%3D78683%26mm_acid%3D1050500%26mm_osid%3D14%26mm_uip%3D166.137.156.155%26mm_ua%3DMozilla%252F5.0%2B%2528Linux%253B%2BU%253B%2BAndroid%2B4.0.4%253B%2Ben-us%253B%2BSAMSUNG-SGH-I717%2BBuild%252FIMM76D%2529%2BAppleWebKit%252F534.30%2B%2528KHTML%252C%2Blike%2BGecko%2529%2BVersion%252F4.0%2BMobile%2BSafari%252F534.30SAMSUNG-SGH-I717%26mtpid%3DUNKNOWN%26mm_msuid%3DUNKNOWN%26mm_mmisdk%3D4.6.0-12.07.16.a%26mm_mxsdk%3DUNKNOWN%26mm_dv%3DAndroid4.0.4%26mm_adtype%3DMMFullScreenAdTransition%26mm_hswd%3DUNKNOWN%26mm_dm%3DSAMSUNG-SGH-I717%26mm_hsht%3DUNKNOWN%26mm_auid%3Dmmi
  Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3] d_bd6b33dc569994102eaa60a060987d99e9_013b35a758bd%26mm_accelerometer%3Dtrue%26mm_lat%3DUNKNOWN%26mm_long%3DUNKNOWN%26mm_hpx%3D1280%26mm_wpx%3D800%26mm_density%3D2.0%26mm_dpi%3DUNKNOWN%26mm_campaignid%3D45695%26autoExpand%3Dtruequery-string=ncid%3DWBNMMG9h4XmbJBUHbDrNWWWm tr7y MLNL 1009 10034 3401 t4fx 10034 click

On Tue, Feb 26, 2013 at 9:39 PM, Azuryy Yu azury...@gmail.com wrote:
> That's easy. In your example: map output key: FIELD-N; map output value: just the original value. [...]

--
Matthieu Labour, Engineering | ActionX | 584 Broadway, Suite 1002 – NY, NY 10012
415-994-3480 (m)
Re: Concatenate adjacent lines with hadoop
I just noticed that your two lines both start with:

  Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app

Is that prefix different for other lines? If your answer is yes, then just use this prefix as the map output key.

On Wed, Feb 27, 2013 at 1:01 PM, Matthieu Labour matth...@actionx.com wrote:
> Here is an example of a log line. Please note the beginning of the log line, YSLOGROW, and that the second line should be concatenated with the first line. [...]
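[Editorial note: to make the suggestion concrete, here is a rough sketch (mine, not from the thread) of a mapper/reducer pair in the new Hadoop API that keys each line on its "timestamp + dyno" prefix and, in the reducer, emits the YSLOGROW line followed by the continuation fragment. It assumes the prefix uniquely identifies one logical record and that at most one continuation line follows; class names are made up for illustration:]

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch only: assumes each continuation line shares the
    // "Dec 16 21:47:20 d.<dyno> app[web.3]" prefix with its YSLOGROW head line.
    public class LogLineJoiner {

      public static class PrefixMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String line = value.toString();
          // prefix = everything up to and including "app[web.3]" (first ']' in the line)
          int cut = line.indexOf(']') + 1;
          if (cut <= 0) return;                       // not a line we recognize
          context.write(new Text(line.substring(0, cut)), new Text(line));
        }
      }

      public static class ConcatReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          String head = null;
          StringBuilder tail = new StringBuilder();
          int prefixLength = key.toString().length();
          for (Text v : values) {
            String line = v.toString();
            if (line.contains("YSLOGROW")) {
              head = line;                            // the head fragment of the record
            } else {
              // continuation fragment: keep only what follows the shared prefix
              tail.append(line.substring(prefixLength).trim());
            }
          }
          if (head != null) {
            context.write(new Text(head + tail), NullWritable.get());
          }
        }
      }
    }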
java.lang.NumberFormatException and Thanks to Hemanth and Harsh
Hi all,

First, I would like to thank you all, especially Hemanth and Harsh. I solved my problem: it was exactly the Java version and Hadoop version incompatibility, and now I can run my compiled and jarred MapReduce program.

I have a different question now. I wrote code that finds each IP's packet count in a given time interval for NetFlow data. However, I am getting a java.lang.NumberFormatException.

1. Here is my code in Java:

  package org.myorg;

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.NoSuchElementException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.*;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.mapreduce.*;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class MapReduce extends Configured implements Tool {

      public int run(String[] args) throws Exception {
          System.out.println("Debug1");
          if (args.length != 2) {
              System.err.println("Usage: MapReduce <input path> <output path>");
              ToolRunner.printGenericCommandUsage(System.err);
          }

          Job job = new Job();
          job.setJarByClass(MapReduce.class);
          System.out.println("Debug2");
          job.setJobName("MaximumPacketFlowIP");
          System.out.println("Debug3");

          FileInputFormat.addInputPath(job, new Path(args[0]));
          System.out.println("Debug8");
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.out.println("Debug9");

          job.setMapperClass(FlowPortMapper.class);
          System.out.println("Debug6");
          job.setReducerClass(FlowPortReducer.class);
          System.out.println("Debug7");

          job.setOutputKeyClass(Text.class);
          System.out.println("Debug4");
          job.setOutputValueClass(IntWritable.class);
          System.out.println("Debug5");

          // System.exit(job.waitForCompletion(true) ? 0 : 1);
          return job.waitForCompletion(true) ? 0 : 1;
      }

      /* -------------------- main -------------------- */
      public static void main(String[] args) throws Exception {
          int exitCode = ToolRunner.run(new MapReduce(), args);
          System.exit(exitCode);
      }

      /* -------------------- Mapper -------------------- */
      static class FlowPortMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String flow = value.toString();
              long starttime = 0;
              long endtime = 0;
              long time1 = 1357289339;
              long time2 = 1357289342;

              StringTokenizer line = new StringTokenizer(flow);
              String internalip = "i";

              // Getting the internalip from the flow
              if (line.hasMoreTokens())
                  internalip = line.nextToken();

              // Getting the starttime and endtime from the flow
              for (int i = 0; i < 9; i++)
                  if (line.hasMoreTokens())
                      starttime = Long.parseLong(line.nextToken());
              if (line.hasMoreTokens())
                  endtime = Long.parseLong(line.nextToken());

              // If the time is in the given interval then emit 1
              if (starttime >= time1 && endtime <= time2)
                  context.write(new Text(internalip), new IntWritable(1));
          }
      }

      /* -------------------- Reducer -------------------- */
      static class FlowPortReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterable<IntWritable> values, Context context) throws