Re: setNumTasks
Could someone please help me answer this question?

On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
What is the corresponding system property for setNumTasks? Can it be used explicitly as a system property, like mapred.tasks.?
Re: hadoop permission guideline
Can you please take this discussion to the CDH mailing list?

On Mar 22, 2012, at 7:51 AM, Michael Wang michael.w...@meredith.com wrote:
I have installed Cloudera Hadoop (CDH), using its Cloudera Manager to install all needed packages. The installation was done as root, and it created some users such as hdfs, hive, mapred, hue, hbase... After the installation, should we change the permissions or ownership of some directories/files? For example, to use Hive: it works fine as the root user, since the metastore directory belongs to root, but in order to let other users use Hive I had to change the metastore ownership to a specific non-root user; then it works. Is that the best practice? Another example is start-all.sh and stop-all.sh; they both belong to root. Should I change them to another user? I guess there are more cases...

Thanks,
Re: setNumTasks
Sorry, I meant setNumMapTasks. What is mapred.map.tasks for? It's confusing what its purpose is. I tried setting it for my job, but I still see more map tasks running than mapred.map.tasks.

On Thu, Mar 22, 2012 at 7:53 AM, Harsh J ha...@cloudera.com wrote:
There isn't such an API as setNumTasks. There is, however, setNumReduceTasks, which sets mapred.reduce.tasks. Does this answer your question?

On Thu, Mar 22, 2012 at 8:21 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
Could someone please help me answer this question?

On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
What is the corresponding system property for setNumTasks? Can it be used explicitly as a system property, like mapred.tasks.?

--
Harsh J
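For illustration, a minimal sketch (old mapred API of the Hadoop 1.x era; the class name and counts are made up) of the distinction Harsh draws — the reduce count is authoritative, while the map count is only a hint:

    import org.apache.hadoop.mapred.JobConf;

    public class TaskCounts {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setNumReduceTasks(8); // sets mapred.reduce.tasks; the framework honors this
        conf.setNumMapTasks(8);    // sets mapred.map.tasks; a hint only -- the actual
                                   // number of maps is driven by the InputFormat's splits
      }
    }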
Re: setNumTasks
Hi Mohit
The number of map tasks is determined by the number of input splits and the InputFormat used by your MR job; setting this value won't let you control it. AFAIK it only takes effect if the value in mapred.map.tasks is greater than the number of tasks calculated by the job based on the splits and InputFormat.
Regards
Bejoy KS

On Thu, Mar 22, 2012 at 8:28 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
Sorry, I meant setNumMapTasks. What is mapred.map.tasks for? It's confusing what its purpose is. I tried setting it for my job, but I still see more map tasks running than mapred.map.tasks.

On Thu, Mar 22, 2012 at 7:53 AM, Harsh J ha...@cloudera.com wrote:
There isn't such an API as setNumTasks. There is, however, setNumReduceTasks, which sets mapred.reduce.tasks. Does this answer your question?

On Thu, Mar 22, 2012 at 8:21 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
Could someone please help me answer this question?

On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
What is the corresponding system property for setNumTasks? Can it be used explicitly as a system property, like mapred.tasks.?

--
Harsh J
Re: hadoop permission guideline
Hi Michael,
Am moving your question to the scm-us...@cloudera.org group, which is home to the community of Cloudera Manager users. You will get better responses there. In case you wish to browse or subscribe to this group, visit https://groups.google.com/a/cloudera.org/forum/#!forum/scm-users (BCC'd common-user@)

On Thu, Mar 22, 2012 at 8:21 PM, Michael Wang michael.w...@meredith.com wrote:
I have installed Cloudera Hadoop (CDH), using its Cloudera Manager to install all needed packages. The installation was done as root, and it created some users such as hdfs, hive, mapred, hue, hbase... After the installation, should we change the permissions or ownership of some directories/files? For example, to use Hive: it works fine as the root user, since the metastore directory belongs to root, but in order to let other users use Hive I had to change the metastore ownership to a specific non-root user; then it works. Is that the best practice? Another example is start-all.sh and stop-all.sh; they both belong to root. Should I change them to another user? I guess there are more cases...

Thanks,

--
Harsh J
Re: setNumTasks
If you want to control the number of input splits at fine granularity, you could customize NLineInputFormat. You need to determine the number of lines per split, so you first need to know the number of lines in your input data; for instance, hadoop fs -text /input/dir/* | wc -l will give you a number, let's assume it is N.

If you have K nodes and each node has C cores, you can basically start K*C mapper tasks. If you further assume each mapper should process 2 splits (in case some tasks finish earlier), the optimal number of lines for NLineInputFormat is around N/(2*K*C). This might give you an optimal job balance.

Remember, NLineInputFormat usually takes longer than other input formats to initialize, and the line-based split only considers the number of lines; it is unaware of the content length of each line. Thus, in sequence data analysis, if some lines are significantly longer than others, the mappers assigned the longer lines will be much slower than those assigned the shorter ones, so randomly mixing short and long lines before splitting is preferable.

Shi

On 3/22/2012 10:01 AM, Bejoy Ks wrote:
Hi Mohit
The number of map tasks is determined by the number of input splits and the InputFormat used by your MR job; setting this value won't let you control it. AFAIK it only takes effect if the value in mapred.map.tasks is greater than the number of tasks calculated by the job based on the splits and InputFormat.
Regards
Bejoy KS

On Thu, Mar 22, 2012 at 8:28 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
Sorry, I meant setNumMapTasks. What is mapred.map.tasks for? It's confusing what its purpose is. I tried setting it for my job, but I still see more map tasks running than mapred.map.tasks.

On Thu, Mar 22, 2012 at 7:53 AM, Harsh J ha...@cloudera.com wrote:
There isn't such an API as setNumTasks. There is, however, setNumReduceTasks, which sets mapred.reduce.tasks. Does this answer your question?

On Thu, Mar 22, 2012 at 8:21 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
Could someone please help me answer this question?

On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
What is the corresponding system property for setNumTasks? Can it be used explicitly as a system property, like mapred.tasks.?

--
Harsh J
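A sketch of the arithmetic Shi describes, using the old-API NLineInputFormat (the values for N, K, and C are made up; you would measure them for your own data and cluster, and the config key shown is the one the Hadoop 1.x old-API class reads):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;

    public class NLineSetup {
      public static void main(String[] args) {
        long totalLines = 10000000L; // N: e.g. from `hadoop fs -text /input/dir/* | wc -l`
        int nodes = 10;              // K: number of worker nodes (illustrative)
        int coresPerNode = 8;        // C: map slots/cores per node (illustrative)
        // N / (2*K*C): roughly two splits per concurrently running mapper
        int linesPerSplit = (int) (totalLines / (2L * nodes * coresPerNode));

        JobConf conf = new JobConf();
        conf.setInputFormat(NLineInputFormat.class);
        conf.setInt("mapred.line.input.format.linespermap", linesPerSplit);
      }
    }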
Re: rack awareness and safemode
I restarted the cluster yesterday with rack awareness enabled. Things went well; I can confirm there were no issues at all. Thank you all again.

On Tue, Mar 20, 2012 at 4:19 PM, Patai Sangbutsarakum silvianhad...@gmail.com wrote:
Thank you all.

On Tue, Mar 20, 2012 at 2:44 PM, Harsh J ha...@cloudera.com wrote:
John has already addressed your concern. I'd only like to add that fixing replication violations does not require your NN to be in safe mode, and it won't be. Your worry can hence be voided :)

On Wed, Mar 21, 2012 at 2:08 AM, Patai Sangbutsarakum patai.sangbutsara...@turn.com wrote:
Thanks for your reply and script. Hopefully it still applies to 0.20.203. As far as I have played with the test cluster, the balancer takes care of replica placement. I just don't want to fall into a situation where HDFS sits in safe mode for hours and users can't use Hadoop and start yelping. Let's hear from others.
Thanks
Patai

On 3/20/12 1:27 PM, John Meagher john.meag...@gmail.com wrote:
Here's the script I used (all sorts of caveats about it assuming a replication factor of 3 and no real error handling, etc)...

for f in `hadoop fsck / | grep 'Replica placement policy is violated' | head -n8 | awk -F: '{print $1}'`; do
  hadoop fs -setrep -w 4 $f
  hadoop fs -setrep 3 $f
done

--
Harsh J
Re: rack awareness and safemode
Make sure you run hadoop fsck /. It should report a lot of blocks with the replication policy violated. In the short term this isn't anything to worry about, and everything will work fine even with those errors. Run the script I sent out earlier to fix the errors and bring everything into compliance with the new rack awareness setup.

On Thu, Mar 22, 2012 at 13:36, Patai Sangbutsarakum silvianhad...@gmail.com wrote:
I restarted the cluster yesterday with rack awareness enabled. Things went well; I can confirm there were no issues at all. Thank you all again.

On Tue, Mar 20, 2012 at 4:19 PM, Patai Sangbutsarakum silvianhad...@gmail.com wrote:
Thank you all.

On Tue, Mar 20, 2012 at 2:44 PM, Harsh J ha...@cloudera.com wrote:
John has already addressed your concern. I'd only like to add that fixing replication violations does not require your NN to be in safe mode, and it won't be. Your worry can hence be voided :)

On Wed, Mar 21, 2012 at 2:08 AM, Patai Sangbutsarakum patai.sangbutsara...@turn.com wrote:
Thanks for your reply and script. Hopefully it still applies to 0.20.203. As far as I have played with the test cluster, the balancer takes care of replica placement. I just don't want to fall into a situation where HDFS sits in safe mode for hours and users can't use Hadoop and start yelping. Let's hear from others.
Thanks
Patai

On 3/20/12 1:27 PM, John Meagher john.meag...@gmail.com wrote:
Here's the script I used (all sorts of caveats about it assuming a replication factor of 3 and no real error handling, etc)...

for f in `hadoop fsck / | grep 'Replica placement policy is violated' | head -n8 | awk -F: '{print $1}'`; do
  hadoop fs -setrep -w 4 $f
  hadoop fs -setrep 3 $f
done

--
Harsh J
Re: tasktracker/jobtracker.. expectation..
Hi Patai
The JobTracker automatically handles this situation by attempting the task on different nodes. Could you verify the number of attempts these failed tasks made? Was it just one? If more, were all the task attempts triggered on the same node or not? Did all of them fail with the same error? You can get this information from the JobTracker web UI: drill down to the task level and then into a failed task.
Regards
Bejoy

On Thu, Mar 22, 2012 at 11:25 PM, Patai Sangbutsarakum silvianhad...@gmail.com wrote:
Hi all,
I had a job fail this morning because 2 tasks were trying to write to a disk that had somehow turned read-only. Originally I was thinking/dreaming that in this case those 2 tasks would somehow be moved automatically to another dn/tt that also has the required data block, and wouldn't fail. I strongly believe that Hadoop can do that, but I just didn't know it well enough to enable it.

/dev/sdj1 /hadoop10 ext3 ro,noatime,data=ordered 0 0

Error initializing attempt_201203211854_2633_m_17_0:
EROFS: Read-only file system
        at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:496)
        at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:319)
        at org.apache.hadoop.mapred.JobLocalizer.createLocalDirs(JobLocalizer.java:144)
        at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:190)
        at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1199)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1174)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1089)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2257)
        at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2221)

Hope this makes sense.
Patai
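For reference, a minimal sketch (old mapred API; the values shown are illustrative, not recommendations) of the job-level settings behind the retry behavior Bejoy describes — how many times a failed task is re-attempted, and when a misbehaving TaskTracker is blacklisted for the job so attempts go elsewhere:

    import org.apache.hadoop.mapred.JobConf;

    public class RetryKnobs {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setMaxMapAttempts(4);            // mapred.map.max.attempts
        conf.setMaxReduceAttempts(4);         // mapred.reduce.max.attempts
        conf.setMaxTaskFailuresPerTracker(3); // mapred.max.tracker.failures: after this
                                              // many failures, the tracker is blacklisted
                                              // for this job
      }
    }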
Re: Number of retries
Mohit
If you are writing to a db directly from a job, this can pop up. You can avoid it only by disabling speculative execution. Drilling down from the web UI to the task level will show you the tasks that had multiple attempts.

--Original Message--
From: Mohit Anchlia
To: common-user@hadoop.apache.org
ReplyTo: common-user@hadoop.apache.org
Subject: Number of retries
Sent: Mar 23, 2012 01:21

I am seeing a weird problem where I am seeing duplicate rows in the database. I am wondering if this is because of some internal retries. Is there a way to see which tasks were retried? I am not sure what else might cause this, because when I look at the output data I don't see any duplicates in the file.

Regards
Bejoy KS
Sent from handheld, please excuse typos.
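A minimal sketch (old mapred API; the class name is made up) of the fix Bejoy suggests:

    import org.apache.hadoop.mapred.JobConf;

    public class DisableSpeculation {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setMapSpeculativeExecution(false);    // mapred.map.tasks.speculative.execution
        conf.setReduceSpeculativeExecution(false); // mapred.reduce.tasks.speculative.execution
        // With speculation off, only one attempt of each task runs (barring failures),
        // so a non-idempotent sink such as a database won't see duplicate writes from
        // speculative attempts.
      }
    }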
Re: Number of retries
Hi Mohit
To add on, duplicates won't occur if your output is written to an HDFS file, because once one attempt of a task completes, only that attempt's output file is copied to the final output destination; the files generated by other task attempts that get killed are simply ignored.

Regards
Bejoy KS
Sent from handheld, please excuse typos.

-Original Message-
From: Bejoy KS bejoy.had...@gmail.com
Date: Thu, 22 Mar 2012 19:55:55
To: common-user@hadoop.apache.org
Reply-To: bejoy.had...@gmail.com
Subject: Re: Number of retries

Mohit
If you are writing to a db directly from a job, this can pop up. You can avoid it only by disabling speculative execution. Drilling down from the web UI to the task level will show you the tasks that had multiple attempts.

--Original Message--
From: Mohit Anchlia
To: common-user@hadoop.apache.org
ReplyTo: common-user@hadoop.apache.org
Subject: Number of retries
Sent: Mar 23, 2012 01:21

I am seeing a weird problem where I am seeing duplicate rows in the database. I am wondering if this is because of some internal retries. Is there a way to see which tasks were retried? I am not sure what else might cause this, because when I look at the output data I don't see any duplicates in the file.

Regards
Bejoy KS
Sent from handheld, please excuse typos.
number of partitions
I wrote a custom partitioner, but when I work in standalone or pseudo-distributed mode, the number of partitions is always 1. I set the number of reducers to 4, but the numOfPartitions parameter of the custom partitioner is still 1, and all four of my mappers' results go to 1 reducer; the other reducers yield empty files. How can I set the number of partitions in standalone or pseudo-distributed mode? Thanks for your help.
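One likely factor, as far as I recall: the LocalJobRunner used in standalone mode in Hadoop 1.x runs at most one reducer regardless of the configured count, which would explain the standalone behavior. Under a real JobTracker (pseudo-distributed mode), a setup along these lines (a sketch; the types and hashing scheme are illustrative) should yield four partitions:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class ModPartitioner implements Partitioner<Text, IntWritable> {
      public void configure(JobConf job) {
        // no per-job setup needed for this sketch
      }

      public int getPartition(Text key, IntWritable value, int numPartitions) {
        // numPartitions is the job's reduce task count at runtime
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

    // In the job driver (illustrative):
    //   JobConf conf = new JobConf();
    //   conf.setNumReduceTasks(4); // getPartition() then sees numPartitions == 4
    //   conf.setPartitionerClass(ModPartitioner.class);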
hadoop on cygwin : tasktracker is throwing error : need help
I have installed Hadoop on Cygwin to help me write MR code in Eclipse on Windows.

2012-03-22 22:19:57,896 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.io.IOException: Failed to set permissions of path: \tmp\hadoop-uygwin\mapred\local\ttprivate to 0700
        at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:682)
        at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:655)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
        at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
        at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
        at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:726)
        at org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:1457)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3716)
2012-03-22 22:19:57,897 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:

Config details - OS: Win 7, Hadoop: hadoop-1.0.1

Please let me know if you can help.
-Santosh