Re: How to stop a mapreduce job from terminal running on Hadoop Cluster?
Also: mapred job -kill job_id

On Sun, Apr 12, 2015 at 11:07 AM, Shahab Yunus shahab.yu...@gmail.com wrote: You can kill it by using the following yarn command: yarn application -kill application_id https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YarnCommands.html Or use the old hadoop job command: http://stackoverflow.com/questions/11458519/how-to-kill-hadoop-jobs Regards, Shahab

On Sun, Apr 12, 2015 at 2:03 PM, Answer Agrawal yrsna.tse...@gmail.com wrote: To run a job we use the command $ hadoop jar example.jar inputpath outputpath. If the job is taking too long and we want to stop it midway, which command is used? Or is there any other way to do that? Thanks,
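For completeness, the same kill can also be issued programmatically through the YarnClient API (Hadoop 2.2+). A minimal sketch - the application id passed on the command line is a made-up placeholder:

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.ConverterUtils;

    public class KillApp {
        public static void main(String[] args) throws Exception {
            // e.g. args[0] = "application_1428858400000_0042" (placeholder id)
            ApplicationId appId = ConverterUtils.toApplicationId(args[0]);

            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();
            yarnClient.killApplication(appId); // same effect as `yarn application -kill`
            yarnClient.stop();
        }
    }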
Re: Custom FileInputFormat.class
Can you expand on your use case a little bit, please? It may be that you're duplicating functionality. You can take a look at CombineFileInputFormat for inspiration. If this is indeed taking a long time, one cheap-to-implement thing you can do is to parallelize the calls that fetch block locations. Another question to ask yourself is whether it is worth optimizing this portion at all. In many use cases (certainly mine), the bottleneck is the running job itself, so the launch overhead is comparatively minimal. Hope this helps. Pradeep

On Mon Dec 01 2014 at 8:38:30 AM 胡斐 hufe...@gmail.com wrote: Hi, I want to customize FileInputFormat. In order to determine which host a specific part of a file belongs to, I need to open the file in HDFS and read some information. It takes me nearly 500ms to open a file and get the information I need, but I have thousands of files to deal with, so it would take a long time to process all of them this way. Is there any better solution to reduce the time when the number of files is large? Thanks in advance! Fei
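On the parallelization suggestion: the per-file metadata calls are independent, so a thread pool shrinks the wall-clock cost roughly linearly until the namenode becomes the bottleneck. A minimal sketch, assuming a shared FileSystem handle and a list of input paths (the pool size of 32 is an arbitrary starting point):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ParallelLocations {
        static Map<Path, BlockLocation[]> locateAll(final FileSystem fs, List<Path> files)
                throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(32); // tune for your namenode
            Map<Path, Future<BlockLocation[]>> pending = new HashMap<Path, Future<BlockLocation[]>>();
            for (final Path file : files) {
                pending.put(file, pool.submit(new Callable<BlockLocation[]>() {
                    public BlockLocation[] call() throws Exception {
                        // One ~500ms round trip per file, now overlapped with the others.
                        FileStatus status = fs.getFileStatus(file);
                        return fs.getFileBlockLocations(status, 0, status.getLen());
                    }
                }));
            }
            Map<Path, BlockLocation[]> located = new HashMap<Path, BlockLocation[]>();
            for (Map.Entry<Path, Future<BlockLocation[]>> e : pending.entrySet()) {
                located.put(e.getKey(), e.getValue().get());
            }
            pool.shutdown();
            return located;
        }
    }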
Re: [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects
I agree with the answers suggested above. 3. B 4. D 5. C

On Mon, Oct 6, 2014 at 2:58 PM, Ulul had...@ulul.org wrote: Hi. No, Pig is a data manipulation language for data already in Hadoop. The question is about importing data from an OLTP DB (e.g. Oracle, MySQL...) to Hadoop, which is what Sqoop is for (SQL to Hadoop). I'm not certain the certification guys are happy with their exam questions ending up on blogs and mailing lists :-) Ulul

On 06/10/2014 13:54, unmesha sreeveni wrote: What about the last one? The answer is correct. Pig. Isn't it?

On Mon, Oct 6, 2014 at 4:29 PM, adarsh deshratnam adarsh.deshrat...@gmail.com wrote: For question 3 the answer should be B and for question 4 the answer should be D. Thanks, Adarsh D, Consultant - BigData and Cloud http://in.linkedin.com/in/adarshdeshratnam

On Mon, Oct 6, 2014 at 2:25 PM, unmesha sreeveni unmeshab...@gmail.com wrote: Hi, for the 5th question, can it be Sqoop?

On Mon, Oct 6, 2014 at 1:24 PM, unmesha sreeveni unmeshab...@gmail.com wrote: Yes

On Mon, Oct 6, 2014 at 1:22 PM, Santosh Kumar skumar.bigd...@hotmail.com wrote: Are you preparing for the Cloudera certification exam? Thanks and Regards, Santosh Kumar SINHA http://www.linkedin.com/in/sinhasantosh (510) 936-2650 Sr Data Consultant - BigData Implementations.

From: unmesha sreeveni [mailto:unmeshab...@gmail.com] Sent: Monday, October 06, 2014 12:45 AM To: User - Hive; User Hadoop; User Pig Subject: [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects http://www.unmeshasreeveni.blogspot.in/2014/09/what-do-you-think-of-these-three.html -- Thanks & Regards, Unmesha Sreeveni U.B, Hadoop, Bigdata Developer, Center for Cyber Security | Amrita Vishwa Vidyapeetham http://www.unmeshasreeveni.blogspot.in/
Re: datanode down, disk replaced, /etc/fstab changed. Can't bring it back up. Missing lock file?
Looks like you're facing the same problem as this SO question: http://stackoverflow.com/questions/10705140/hadoop-datanode-fails-to-start-throwing-org-apache-hadoop-hdfs-server-common-sto Try the suggested fix.

On Fri, Oct 3, 2014 at 6:57 PM, Colin Kincaid Williams disc...@uw.edu wrote: We had a datanode go down, and our datacenter guy swapped out the disk. We had moved to using UUIDs in /etc/fstab, but he wanted to use the /dev/id format. He didn't back up the fstab; however, I'm not sure that's the issue. I read in the log below that the namenode has a lock on the disk? I don't know how that works; I thought the lock file would belong to the datanode itself. How do I remove the namenode's lock to bring the datanode back up? If that's not the issue, how can I bring the datanode back up? Help would be greatly appreciated.

2014-10-03 18:28:18,121 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = us3sm2hb027r09.comp.prod.local/10.51.28.172
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.3.0-cdh5.0.1
STARTUP_MSG:   classpath =
Re: Block placement without rack aware
It appears to be randomly chosen. When no topology script is configured, every datanode ends up in /default-rack, so apart from the first replica (which is written to the local datanode when the writer runs on one) the replica targets are effectively picked at random. I just came across this blog post from Lars George about HBase file locality in HDFS: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html

On Thu, Oct 2, 2014 at 4:12 PM, SF Hadoop sfhad...@gmail.com wrote: What is the block placement policy Hadoop follows when rack awareness is not enabled? Does it just round-robin? Thanks.
Rolling upgrades
Hi All, Is it possible to do a rolling upgrade from Hadoop 2.2 to 2.4? Thanks, Pradeep
Re: YARN creates only 1 container
I believe it's behaving as expected. It will spawn 64 containers per node because that's how much memory you have available (65536 MB / 1024 MB = 64). The vcores setting isn't strictly enforced by default, since CPU can be elastic. This blog from Cloudera explains how to enforce CPU limits using cgroups: http://blog.cloudera.com/blog/2013/12/managing-multiple-resources-in-hadoop-2-with-yarn/

On Tue, May 27, 2014 at 8:56 PM, hari harib...@gmail.com wrote: The issue was not related to the container configuration. Due to a misconfiguration, the application master was not able to contact the resourcemanager, which caused the 1-container problem. However, the total number of containers allocated is still not as expected. The configuration settings should have resulted in 16 containers per node, but 64 containers per node are being allocated. Reiterating the config parameters here:

mapred-site.xml
mapreduce.map.cpu.vcores = 1
mapreduce.reduce.cpu.vcores = 1
mapreduce.map.memory.mb = 1024
mapreduce.reduce.memory.mb = 1024
mapreduce.map.java.opts = -Xmx1024m
mapreduce.reduce.java.opts = -Xmx1024m

yarn.xml
yarn.nodemanager.resource.memory-mb = 65536
yarn.nodemanager.resource.cpu-vcores = 16
yarn.scheduler.minimum-allocation-mb = 1024
yarn.scheduler.maximum-allocation-mb = 2048
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 1

Is there anything else that might be causing this problem? thanks, hari

On Tue, May 27, 2014 at 3:31 AM, hari harib...@gmail.com wrote: Hi, when using YARN 2.2.0, only 1 container is created for an application in the entire cluster. The single container is created on an arbitrary node for every run. This happens when running any application from the examples jar (e.g., wordcount). Currently only one application is run at a time. The input data size is 200 GB. I am setting custom values that affect the concurrent container count. These config parameters were mostly taken from: http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/ There wasn't much description elsewhere on how the container count is decided. The settings are:

mapred-site.xml
mapreduce.map.cpu.vcores = 1
mapreduce.reduce.cpu.vcores = 1
mapreduce.map.memory.mb = 1024
mapreduce.reduce.memory.mb = 1024
mapreduce.map.java.opts = -Xmx1024m
mapreduce.reduce.java.opts = -Xmx1024m

yarn.xml
yarn.nodemanager.resource.memory-mb = 65536
yarn.nodemanager.resource.cpu-vcores = 16

From these settings, each node should be running 16 containers. Let me know if there might be something else affecting the container count. thanks, hari
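A footnote on the 64-vs-16 discrepancy above: the schedulers size containers by memory only by default (DefaultResourceCalculator), which is exactly why 65536 MB / 1024 MB = 64 containers fit per node. If the cluster runs the CapacityScheduler, vcores can be brought into the calculation by switching the calculator in capacity-scheduler.xml (shown in the same key = value shorthand used above):

yarn.scheduler.capacity.resource-calculator = org.apache.hadoop.yarn.util.resource.DominantResourceCalculator

With that in place, the per-node limit becomes min(65536 / 1024, 16 / 1) = 16 containers, matching the expected count.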
Re: change the Yarn application container memory size when it is running
I'm not sure I understand the use case for something like that. I'm pretty sure the YARN API doesn't support it, though. What you might be able to do is tear down your existing container and request a new one (a sketch follows below).

On Mon, Feb 10, 2014 at 10:28 AM, Thomas Bentsen t...@bentzn.com wrote: I am no YARN expert at all - in fact I have never run it; I am still an absolute beginner with Hadoop. But... provided YARN is running in _one_ JVM (Java process) and does not fork processes on the OS, you cannot change the memory settings once it's started. But why would you want to change them? The JVM is able to use more or less memory as needed if you set the startup params -Xmx and -Xms correctly. Just be careful not to overallocate with -Xmx in different processes at the same time; it can do weird things when it starts using swap. best /th

On Mon, 2014-02-10 at 12:32 -0500, ricky l wrote: Hi All, can I change the allocated memory size of a container after the container has started (is running)? For example, in the beginning I want to allocate 1GB of memory to container A, later 2GB, and later 512MB. Is that a possible scenario? thx.
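If the tear-down-and-replace route is taken from inside an application master, the AMRMClient API has the needed calls. A minimal sketch - `amRMClient` and `oldContainer` are assumed to already exist in the AM:

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    // Give the 1 GB container back to the RM...
    amRMClient.releaseAssignedContainer(oldContainer.getId());

    // ...and ask for a 2 GB replacement (no node/rack constraints here).
    Resource bigger = Resource.newInstance(2048, 1);
    amRMClient.addContainerRequest(
        new ContainerRequest(bigger, null, null, Priority.newInstance(0)));

The obvious caveat is that any state inside the old container is lost, so the application has to checkpoint or re-derive it.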
Re: even possible?
Don't fix it if it ain't broken =P There shouldn't be any reason why you couldn't change it (back) to the standard way that Cloudera distributions are set up. Off the top of my head, I can't think of anything that you're missing. But at the same time, if your cluster is working as is, why change it?

On Wed, Oct 16, 2013 at 2:24 PM, Patai Sangbutsarakum silvianhad...@gmail.com wrote: Question is on cdh3u4. The cluster was set up before I owned it, and somehow every server process (namenode/jobtracker/datanode/tasktracker) is run by a user named foo, and all jobs are launched and run by the foo user, including ownership of the HDFS directory/file structure; basically foo is everywhere. Today I started thinking about correcting this by having the namenode + datanode run by the hdfs user and the jobtracker + tasktracker run by the mapred user. So far I have a very short list of things that need to change, which I will try out in the test cluster, e.g.: create the hdfs and mapred users everywhere; change ownership of dfs.name.dir, dfs.data.dir, and fs.checkpoint.dir to hdfs; change ownership of mapred.local.dir to mapred; restart the cluster with hdfs for the HDFS side and mapred for the MapReduce side. I am 100% sure that I missed certain things that have to be taken care of, and I would really appreciate all input. However, the original question I would love to ask is: is it even feasible, and does it make sense, to try to change this? Thanks P
Re: Yarn killing my Application Master
Thank you so much Jian. While refactoring my code, I accidentally deleted the line that registered my app with the RM. Not sure how I missed it... doh!

On Sat, Oct 12, 2013 at 9:14 PM, Jian He j...@hortonworks.com wrote: Looking at the logs, it shows your application was killed because it did not successfully register with the RM.

On Fri, Oct 11, 2013 at 3:53 PM, Pradeep Gollakota pradeep...@gmail.com wrote: All, I have a YARN application that launches a single container. The container completes successfully, but the application fails because the node manager is killing my application master for some reason. The application was working at one point; I'm not sure what code changes I made that broke it. I've attached the appropriate logs. Thanks in advance for any help. - Pradeep
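For anyone who hits the same symptom: the registration call that was missing looks roughly like the sketch below (synchronous AMRMClient; the host, port, and tracking URL arguments are placeholders). Without it, the RM never hears from the AM and eventually kills the attempt.

    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    AMRMClient<ContainerRequest> amRMClient = AMRMClient.createAMRMClient();
    amRMClient.init(new YarnConfiguration());
    amRMClient.start();

    // The easy line to lose in a refactor: tell the RM this AM is alive.
    amRMClient.registerApplicationMaster("", 0, "");

    // ... request containers, do work ...

    amRMClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");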
Re: State of Art in Hadoop Log aggregation
There are plenty of log aggregation tools, both open source and commercial off the shelf. Here are some: http://devopsangle.com/2012/04/19/8-splunk-alternatives/ My personal recommendation is Logstash.

On Thu, Oct 10, 2013 at 10:38 PM, Raymond Tay raymondtay1...@gmail.com wrote: You can try Chukwa, which is one of the incubating projects under Apache. I've tried it before and liked it for aggregating logs.

On 11 Oct, 2013, at 1:36 PM, Sagar Mehta sagarme...@gmail.com wrote: Hi Guys, we have a fairly decent-sized Hadoop cluster of about 200 nodes, and I was wondering what the state of the art is if I want to aggregate and visualize Hadoop ecosystem logs, particularly 1. Tasktracker logs 2. Datanode logs 3. HBase RegionServer logs. One way is to use something like Flume on each node to aggregate the logs and then something like Kibana - http://www.elasticsearch.org/overview/kibana/ - to visualize the logs and make them searchable. However, I don't want to write another ETL for the hadoop/hbase logs themselves. We currently log in to each machine individually to 'tail -F' logs when there is a Hadoop problem on a particular node. We want a better, centralized way to look at the Hadoop logs when there is an issue, without having to log in to 100 different machines, and I was wondering what the state of the art is in this regard. Suggestions/pointers are very welcome!! Sagar
Re: Improving MR job disk IO
Actually... I believe that is expected behavior. Since your CPU is pegged at 100%, you're not going to be IO bound. Jobs typically tend to be either CPU bound or IO bound: if you're CPU bound, you expect to see low IO throughput; if you're IO bound, you expect to see low CPU usage.

On Thu, Oct 10, 2013 at 11:05 AM, Xuri Nagarin secs...@gmail.com wrote: Hi, I have a simple Grep job (from the bundled examples) that I am running on an 11-node cluster. Each node has 2x8-core Intel Xeons (shows 32 CPUs with HT on), 64GB RAM and 8 x 1TB disks. I have mappers set to 20 per node. When I run the Grep job, I notice that the CPU gets pegged to 100% on multiple cores, but disk throughput remains a dismal 1-2 MB/sec on a single disk on each node. So I guess the cluster is performing poorly in terms of disk IO. Running Terasort, I see each disk put out 25-35 MB/sec with a total cluster throughput above 1.5 GB/sec. How do I go about re-configuring or re-writing the job to utilize maximum disk IO? TIA, Xuri
Re: Improving MR job disk IO
I don't think it necessarily means that the job is a bad candidate for MR; it's just a different type of workload. Hortonworks has a great article on the different types of workloads you might see and how they affect your provisioning choices: http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html I have not looked at the Grep code, so I'm not sure why it's behaving the way it is. Still, it's curious that streaming gets higher IO throughput and lower CPU usage. It may have to do with the fact that /bin/grep is a native implementation, while Grep (Hadoop) is probably using the Java Pattern/Matcher API.

On Thu, Oct 10, 2013 at 12:29 PM, Xuri Nagarin secs...@gmail.com wrote: Thanks Pradeep. Does it mean this job is a bad candidate for MR? Interestingly, running the command-line '/bin/grep' under a streaming job provides (1) much better disk throughput and (2) CPU load that is almost evenly spread across all cores/threads (no CPU gets pegged to 100%).
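If the Pattern/Matcher overhead is the suspect, one cheap experiment is a mapper that compiles the regex once per task in setup() rather than per record, and reuses its output objects. A minimal sketch - the config key and default pattern are placeholders:

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CheapGrepMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text word = new Text();
        private Pattern pattern;

        @Override
        protected void setup(Context context) {
            // Compile once per task, not once per record.
            pattern = Pattern.compile(context.getConfiguration().get("grep.pattern", "ERROR"));
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Matcher matcher = pattern.matcher(value.toString());
            while (matcher.find()) {
                word.set(matcher.group()); // reuse the Text to avoid per-match allocation
                context.write(word, ONE);
            }
        }
    }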
Re: modify HDFS
Since hadoop 3.0 is 2 major versions higher, it will be significantly different from hadoop 1.1.2. The hadoop-1.1 branch is available on SVN at http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1/

On Tue, Oct 1, 2013 at 11:30 PM, Karim Awara karim.aw...@kaust.edu.sa wrote: Hi all, my previous web surfing led me to steps that I executed successfully. However, my question is: what version of hadoop is this? (I believe it is hadoop 3.0, since it supports the maven build.) I want to modify a stable release (hadoop 1.1.2, which does not have maven build support). Will working with hadoop 3.0 make a lot of difference to me? Secondly, my goal is to modify the block placement strategy in HDFS in a distributed environment and test those changes myself. Now, assuming I am successful in modifying the HDFS code, how do I test the modifications (since this hadoop version is missing the configuration files and so on that make it work in a distributed environment)? -- Best Regards, Karim Ahmed Awara

On Wed, Oct 2, 2013 at 1:13 AM, Ravi Prakash ravi...@ymail.com wrote: Karim! You should read BUILDING.txt. I usually generate the eclipse files using mvn eclipse:eclipse. Then I can import all the projects into eclipse as eclipse projects. This is useful for code navigation and completion etc.; however, I still build using the command line: mvn -Pdist -Dmaven.javadoc.skip -DskipTests install HTH Ravi

From: Jagat Singh jagatsi...@gmail.com To: user@hadoop.apache.org Sent: Tuesday, October 1, 2013 3:44 PM Subject: Re: modify HDFS Hi, what issue are you having? I wrote about it here; it might help you. Import it into eclipse as a maven project: http://jugnu-life.blogspot.com.au/2013/09/build-and-compile-hadoop-from-source.html?m=1 Thanks

On 01/10/2013 11:56 PM, Karim Awara karim.aw...@kaust.edu.sa wrote: Hi, I want to modify the source code of HDFS for my own use, but I can't find any handy resources for developing HDFS in eclipse. (I tried the hdfs developer mailing list, but they are unresponsive.) Can you guide me? -- Best Regards, Karim Ahmed Awara
Re: IncompatibleClassChangeError
I believe it's a difference between the version your code was compiled against and the version you're running against. (In Hadoop 1.x, org.apache.hadoop.mapreduce.JobContext was a class; in Hadoop 2.x it became an interface, so code compiled against one fails with exactly this error when run against the other.) Make sure that you're not packaging hadoop jars into your jar, and make sure you're compiling against the correct version as well.

On Sun, Sep 29, 2013 at 7:27 PM, lei liu liulei...@gmail.com wrote: I use CDH-4.3.1 and mr1. When I run one job, I get the following error:

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:152)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1063)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1080)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:992)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:945)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:945)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:566)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596)
    at com.taobao.hbase.test.RandomKVGenerater.main(RandomKVGenerater.java:248)

How can I handle the error? Thanks, LiuLei
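If the job is built with Maven, one way to honor the "don't package hadoop jars" advice is the provided scope, so Hadoop's classes come from the cluster at runtime instead of from your job jar. A sketch - the artifact and the CDH version string are assumptions to be matched to your cluster (CDH artifacts live in Cloudera's own repository):

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.0.0-mr1-cdh4.3.1</version>
      <scope>provided</scope>
    </dependency>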
Re: IncompatibleClassChangeError
I'm not entirely sure what the differences are... but according to the Cloudera documentation, upgrading from CDH3 to CDH4 does involve a recompile: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Release-Notes/cdh4ki_topic_1_6.html

On Sun, Sep 29, 2013 at 8:29 PM, lei liu liulei...@gmail.com wrote: Yes, my job is compiled against CDH3u3, and I run the job on CDH4.3.1, but I use the mr1 of CDH4.3.1 to run the job. What are the differences between mr1 of cdh4 and mr of cdh3? Thanks, LiuLei
Re: Help writing a YARN application
Hi Arun, thanks so much for your reply. I looked at the app you linked; it's way easier to understand. Unfortunately, I can't use the YarnClient class: it doesn't seem to be in the 2.0.0 branch, our production environment is using cdh-4.x, and it looks like they're tracking the 2.0.x branch. However, I have looked at other implementations (notably Kitten and Giraph) for inspiration and was able to get my containers launching successfully. Thanks for the help! - Pradeep

On Mon, Sep 23, 2013 at 4:46 PM, Arun C Murthy a...@hortonworks.com wrote: I looked at your code; it's using really old apis/protocols which are significantly different now. See the hadoop-2.1.0-beta release for the latest apis/protocols. Also, you should really be using the yarn client module rather than the raw protocols. See https://github.com/hortonworks/simple-yarn-app for a simple example. thanks, Arun

On Sep 20, 2013, at 11:24 AM, Pradeep Gollakota pradeep...@gmail.com wrote: Hi All, I've been trying to write a YARN application and I'm completely lost. I'm using Hadoop 2.0.0-cdh4.4.0 (Cloudera distribution). I've uploaded my sample code to github at https://github.com/pradeepg26/sample-yarn The problem is that my application master is exiting with a status of 1 (I'm expecting that, since my code isn't complete yet). But I have no logs that I can examine, so I'm not sure if the error I'm getting is the error I'm expecting. I've attached the nodemanager and resourcemanager logs for your reference as well. How can I get started on writing YARN applications beyond the initial tutorial? Thanks for any help/pointers! Pradeep

-- Arun C. Murthy, Hortonworks Inc. http://hortonworks.com/
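For readers on 2.2+ where YarnClient is available, the submission path Arun points to is short. A minimal sketch modeled on simple-yarn-app - the AM class name and memory sizes are placeholders:

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.ApplicationConstants;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SubmitApp {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
            appContext.setApplicationName("sample-yarn-app");

            // The AM is just a command the NM runs; stdout/stderr go to the container
            // logs, which is where to look when the AM exits with status 1.
            ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                null, null,
                Collections.singletonList("$JAVA_HOME/bin/java -Xmx256m com.example.MyAppMaster"
                    + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
                    + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"),
                null, null, null);
            appContext.setAMContainerSpec(amContainer);
            appContext.setResource(Resource.newInstance(512, 1));

            ApplicationId appId = yarnClient.submitApplication(appContext);
            System.out.println("Submitted " + appId);
        }
    }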
Re: How to best decide mapper output/reducer input for a huge string?
Pavan, it's hard to tell whether there's anything wrong with your design, since you haven't given us specific enough details. The best thing you can do is instrument your code and see what is taking a long time. Rahul mentioned a problem that I myself have seen before, with only one region (or a couple) having any data. So even if you have 21 regions, only one mapper might be doing the heavy lifting.

A combiner is hugely helpful in terms of reducing the data output of mappers. Writing a combiner is a best practice, and you should almost always have one. Compression can be turned on by setting the following properties in your job config (a programmatic version follows at the end of this thread):

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc. depending on your use case. Gzip is really slow but gets the best compression ratios; Snappy/Lzo are a lot faster but don't compress as well. If your computations are CPU bound, you'd probably want to use Snappy/Lzo; if your computations are I/O bound and your CPUs are idle, you can use Gzip. You'll have to experiment and find the best settings for you.

There are a lot of other tweaks that you can try to get the best performance out of your cluster. One of the best things you can do is to install Ganglia (or some other similar tool) on your cluster and monitor resource usage while your job is running; this will tell you whether your job is I/O bound or CPU bound. Take a look at this paper by Intel about optimizing your Hadoop cluster and see if it fits your deployment: http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf If your cluster is already optimized and your job is not I/O bound, then there might be a problem with your algorithm, which might warrant a redesign. Hope this helps! - Pradeep

On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: One mapper is spawned per HBase table region. You can use the admin UI of HBase to find the number of regions per table. It might happen that all the data is sitting in a single region, so a single mapper is spawned and you are not getting enough parallel work done. If that is the case, you can recreate the tables with predefined splits to create more regions. Thanks, Rahul

On Sun, Sep 22, 2013 at 4:38 AM, John Lilley john.lil...@redpoint.net wrote: Pavan, how large are the rows in HBase? 22 million rows is not very much, but you mentioned "huge strings". Can you tell which part of the processing is the limiting factor (read from HBase, mapper output, reducers)? John

From: Pavan Sudheendra [mailto:pavan0...@gmail.com] Sent: Saturday, September 21, 2013 2:17 AM To: user@hadoop.apache.org Subject: Re: How to best decide mapper output/reducer input for a huge string? No, I don't have a combiner in place. Is it necessary? How do I make my map output compressed? Yes, the tables in HBase are compressed. Although there's no single obvious bottleneck, the time it takes to process the entire table is huge, and I have to keep checking whether I can optimize it somehow. Oh okay, I'll implement a custom Writable. Apart from that, do you see anything wrong with my design? Does it require any kind of rework? Thank you so much for helping.
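On Pavan's follow-up about how to enable map output compression: the same two properties can be set programmatically on the job's Configuration. A minimal sketch (the job name is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
                  GzipCodec.class, CompressionCodec.class);
    // Hadoop 2.x style; on older releases use `new Job(conf, "analyze")` instead.
    Job job = Job.getInstance(conf, "analyze");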
Re: How to best decide mapper output/reducer input for a huge string?
I'm sorry, but I don't understand your question. Is the output of the mapper you're describing the key portion? If it is the key, then your data should already be sorted by HouseHoldId, since it occurs first in your key.

The SortComparator tells Hadoop how to sort your data, so you use it if you need a non-lexical sort order. The GroupingComparator tells Hadoop how to group your data for the reducer: all KV-pairs from the same group will be given to the same Reducer. If your reduce computation needs all the KV-pairs for the same HouseHoldId, then you will need to write a GroupingComparator (see the sketch below).

Also, have you considered using a higher-level abstraction on Hadoop such as Pig, Hive, Cascading, etc.? Sorting/grouping tasks are a LOT easier to write in those languages. Hope this helps! - Pradeep

On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra pavan0...@gmail.com wrote: I need to improve my MR job, which uses HBase as source as well as sink. Basically, I'm reading data from 3 HBase tables in the mapper and writing it out as one huge string for the reducer to do some computation on and dump into an HBase table.

Table1 ~ 19 million rows. Table2 ~ 2 million rows. Table3 ~ 900,000 rows.

The output of the mapper is something like this: HouseHoldId contentID name duration genre type channelId personId televisionID timestamp

I'm interested in sorting on the basis of the HouseHoldId value, so I'm using this technique. I'm not interested in the V part of the <K,V> pair, so I'm kind of ignoring it. My mapper class is defined as follows:

public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

My MR job takes 22 hours to complete, which is not desirable at all. I'm supposed to optimize this somehow to run a lot faster.

scan.setCaching(750);
scan.setCacheBlocks(false);
TableMapReduceUtil.initTableMapperJob(
    "Table1",              // input HBase table name
    scan,
    AnalyzeMapper.class,   // mapper
    Text.class,            // mapper output key
    IntWritable.class,     // mapper output value
    job);
TableMapReduceUtil.initTableReducerJob(
    "OutputTable",               // output table
    AnalyzeReducerTable.class,   // reducer class
    job);
job.setNumReduceTasks(RegionCount);

My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running an 8-node Cloudera cluster. Should I use a custom SortComparator or a GroupComparator? -- Regards, Pavan
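The grouping comparator mentioned above could look like the following sketch. It assumes the key stays a Text whose first space-delimited token is the HouseHoldId, matching the layout in the question:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class HouseHoldGroupingComparator extends WritableComparator {
        protected HouseHoldGroupingComparator() {
            super(Text.class, true); // true = instantiate keys so compare() gets real objects
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            // Group on the first token (HouseHoldId) only; ordering within the
            // group is still whatever the sort comparator produced.
            String idA = a.toString().split(" ", 2)[0];
            String idB = b.toString().split(" ", 2)[0];
            return idA.compareTo(idB);
        }
    }

Wired up with job.setGroupingComparatorClass(HouseHoldGroupingComparator.class).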
Re: How to best decide mapper output/reducer input for a huge string?
One thing that comes to mind is that your keys are Strings, which are highly inefficient. You might get a lot better performance if you write a custom Writable for your key object using the appropriate data types (a sketch follows at the end of this thread). For example, use a long (LongWritable) for timestamps. This should make (de)serialization a lot faster, and if HouseHoldId is an integer, your comparison speed for sorting will also go up.

Ensure that your map outputs are being compressed. Are your tables in HBase compressed? Do you have a combiner? Have you been able to profile your code to see where the bottlenecks are?

On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra pavan0...@gmail.com wrote: Hi Pradeep, yes, basically I'm only writing the key part as the map output; the V of the <K,V> pair is not of much use to me, but I'm hoping to change that if it leads to faster execution. I'm kind of a newbie, so I'm looking to make the map/reduce job run a lot faster. Also, yes, it gets sorted by the HouseHoldId, which is what I needed. But it seems that if I write a map output for each and every row of a 19-million-row HBase table, it takes nearly a day to complete (21 mappers and 21 reducers). I have looked at both Pig and Hive to do the job, but I'm supposed to do this via an MR job, so I cannot use either of those. Do you recommend I try something if I have the data in that format? -- Regards, Pavan
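And a sketch of the custom key suggested above, assuming HouseHoldId fits in an int and the timestamp in a long (field names are illustrative). Numeric reads and compares replace string parsing on every spill and merge:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    public class HouseHoldKey implements WritableComparable<HouseHoldKey> {
        private int houseHoldId;
        private long timestamp;

        public HouseHoldKey() {} // no-arg constructor required for deserialization

        public HouseHoldKey(int houseHoldId, long timestamp) {
            this.houseHoldId = houseHoldId;
            this.timestamp = timestamp;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(houseHoldId);
            out.writeLong(timestamp);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            houseHoldId = in.readInt();
            timestamp = in.readLong();
        }

        @Override
        public int compareTo(HouseHoldKey other) {
            // Primary sort on HouseHoldId, secondary on timestamp.
            if (houseHoldId != other.houseHoldId) {
                return houseHoldId < other.houseHoldId ? -1 : 1;
            }
            return timestamp < other.timestamp ? -1
                 : (timestamp == other.timestamp ? 0 : 1);
        }
    }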
Help writing a YARN application
Hi All, I've been trying to write a YARN application and I'm completely lost. I'm using Hadoop 2.0.0-cdh4.4.0 (Cloudera distribution). I've uploaded my sample code to github at https://github.com/pradeepg26/sample-yarn The problem is that my application master is exiting with a status of 1 (I'm expecting that, since my code isn't complete yet). But I have no logs that I can examine, so I'm not sure if the error I'm getting is the error I'm expecting. I've attached the nodemanager and resourcemanager logs for your reference as well. How can I get started on writing YARN applications beyond the initial tutorial? Thanks for any help/pointers! Pradeep

Attachments: yarn-yarn-nodemanager-pradeep-gollakota.vm.lithium.com.log, yarn-yarn-resourcemanager-pradeep-gollakota.vm.lithium.com.log