Re: How to stop a MapReduce job running on a Hadoop cluster from the terminal?

2015-04-12 Thread Pradeep Gollakota
Also, mapred job -kill job_id
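
For example (the job and application IDs below are just placeholders):

$ mapred job -list
$ mapred job -kill job_1428861810716_0002
$ yarn application -kill application_1428861810716_0002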

On Sun, Apr 12, 2015 at 11:07 AM, Shahab Yunus shahab.yu...@gmail.com
wrote:

 You can kill it by using the following yarn command:

 yarn application -kill application_id

 https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YarnCommands.html

 Or use the old hadoop job command:
 http://stackoverflow.com/questions/11458519/how-to-kill-hadoop-jobs

 Regards,
 Shahab

 On Sun, Apr 12, 2015 at 2:03 PM, Answer Agrawal yrsna.tse...@gmail.com
 wrote:

 To run a job we use the command
 $ hadoop jar example.jar inputpath outputpath
 If the job takes a long time and we want to stop it midway, which
 command should be used? Or is there any other way to do that?

 Thanks,






Re: Custom FileInputFormat.class

2014-12-01 Thread Pradeep Gollakota
Can you expand on your use case a little bit please? It may be that you're
duplicating functionality.

You can take a look at CombineFileInputFormat for inspiration. If this
is indeed taking a long time, one cheap-to-implement option is to
parallelize the calls that fetch block locations.
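
A minimal sketch of that idea (not CombineFileInputFormat's own code; the
method name, and where fs/files come from, are assumptions about your
InputFormat):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;

// Called from your custom getSplits(); 'fs' and 'files' come from your own listing logic.
static List<BlockLocation[]> lookupBlockLocations(final FileSystem fs,
                                                  List<FileStatus> files) throws Exception {
  ExecutorService pool = Executors.newFixedThreadPool(32);
  try {
    List<Future<BlockLocation[]>> futures = new ArrayList<Future<BlockLocation[]>>();
    for (final FileStatus file : files) {
      futures.add(pool.submit(new Callable<BlockLocation[]>() {
        public BlockLocation[] call() throws IOException {
          // one NameNode round trip per file; these now run concurrently
          return fs.getFileBlockLocations(file, 0, file.getLen());
        }
      }));
    }
    List<BlockLocation[]> results = new ArrayList<BlockLocation[]>();
    for (Future<BlockLocation[]> f : futures) {
      results.add(f.get());  // use these to build your InputSplits
    }
    return results;
  } finally {
    pool.shutdown();
  }
}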

Another question to ask yourself is whether it is worth optimizing this
portion at all. In many use cases (certainly mine), the bottleneck is the
running job itself, so the launch overhead is comparatively minimal.

Hope this helps.
Pradeep

On Mon Dec 01 2014 at 8:38:30 AM 胡斐 hufe...@gmail.com wrote:

 Hi,

  I want to customize FileInputFormat.class. In order to determine which host
 the specific part of a file belongs to, I need to open the file in HDFS and
 read some information. It takes nearly 500 ms to open a file and get the
 information I need, but I now have thousands of files to deal with, so
 processing all of them this way would take a long time.

 Is there any better solution to reduce the time when the number of files
 is large?

 Thanks in advance!
 Fei




Re: [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects

2014-10-06 Thread Pradeep Gollakota
I agree with the answers suggested above.

3. B
4. D
5. C

On Mon, Oct 6, 2014 at 2:58 PM, Ulul had...@ulul.org wrote:

  Hi

 No, Pig is a data manipulation language for data already in Hadoop.
 The question is about importing data from an OLTP database (e.g. Oracle, MySQL...) into
 Hadoop, which is what Sqoop is for (SQL to Hadoop).

 I'm not certain certification guys are happy with their exam questions
 ending up on blogs and mailing lists :-)

 Ulul

  On 06/10/2014 at 13:54, unmesha sreeveni wrote:

  What about the last one? The answer is correct: Pig. Isn't it?

 On Mon, Oct 6, 2014 at 4:29 PM, adarsh deshratnam 
 adarsh.deshrat...@gmail.com wrote:

 For question 3 answer should be B and for question 4 answer should be D.

  Thanks,
 Adarsh D

 Consultant - BigData and Cloud


 On Mon, Oct 6, 2014 at 2:25 PM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

  Hi

  Could the answer to the 5th question be Sqoop?

 On Mon, Oct 6, 2014 at 1:24 PM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

  Yes

 On Mon, Oct 6, 2014 at 1:22 PM, Santosh Kumar 
 skumar.bigd...@hotmail.com wrote:

  Are you preparing for the Cloudera certification exam?





 Thanks and Regards,

 Santosh Kumar SINHA http://www.linkedin.com/in/sinhasantosh

 (510) 936-2650

 Sr Data Consultant - BigData Implementations.








 *From:* unmesha sreeveni [mailto:unmeshab...@gmail.com]
 *Sent:* Monday, October 06, 2014 12:45 AM
 *To:* User - Hive; User Hadoop; User Pig
 *Subject:* [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects




 http://www.unmeshasreeveni.blogspot.in/2014/09/what-do-you-think-of-these-three.html



 --

 *Thanks & Regards *



 *Unmesha Sreeveni U.B*

 *Hadoop, Bigdata Developer*

 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*

 http://www.unmeshasreeveni.blogspot.in/














Re: datanode down, disk replaced, /etc/fstab changed. Can't bring it back up. Missing lock file?

2014-10-03 Thread Pradeep Gollakota
Looks like you're facing the same problem as this SO question:
http://stackoverflow.com/questions/10705140/hadoop-datanode-fails-to-start-throwing-org-apache-hadoop-hdfs-server-common-sto

Try the suggested fix.

On Fri, Oct 3, 2014 at 6:57 PM, Colin Kincaid Williams disc...@uw.edu
wrote:

 We had a datanode go down, and our datacenter guy swapped out the disk. We
 had moved to using UUIDs in /etc/fstab, but he wanted to use the
 /dev/id format. He didn't back up the fstab; however, I'm not sure that's the
 issue.

 Am I reading the log below correctly that the namenode has a lock on the disk? I
 don't know how that works. I thought the lock file would belong to the
 datanode itself. How do I remove the lock from the namenode to bring the
 datanode back up?

 If that's not the issue, how can I bring the datanode back up? Help would
 be greatly appreciated.




 2014-10-03 18:28:18,121 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting DataNode
 STARTUP_MSG:   host = us3sm2hb027r09.comp.prod.local/10.51.28.172
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 2.3.0-cdh5.0.1
 STARTUP_MSG:   classpath =
 

Re: Block placement without rack aware

2014-10-02 Thread Pradeep Gollakota
It appears to be essentially random. With no topology configured, the first
replica still goes to the local datanode (when the writer runs on one) and the
remaining replicas are chosen at random. I just came across this blog post from
Lars George about HBase file locality in HDFS:
http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html

On Thu, Oct 2, 2014 at 4:12 PM, SF Hadoop sfhad...@gmail.com wrote:

 What is the block placement policy Hadoop follows when rack awareness is not
 enabled?

 Does it just round robin?

 Thanks.



Rolling upgrades

2014-08-01 Thread Pradeep Gollakota
Hi All,

Is it possible to do a rolling upgrade from Hadoop 2.2 to 2.4?

Thanks,
Pradeep


Re: YARN creates only 1 container

2014-05-27 Thread Pradeep Gollakota
I believe it's behaving as expected. It will spawn 64 containers because
that's how much memory you have available (65536 MB per node / 1024 MB per
container). The vcore setting isn't strictly enforced by default, since CPU can
be elastic. This blog post from Cloudera explains how to enforce CPU limits
using CGroups:
http://blog.cloudera.com/blog/2013/12/managing-multiple-resources-in-hadoop-2-with-yarn/
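
For the record, if you do want the 16-per-node cap that the vcores settings
imply, the usual knob (assuming the CapacityScheduler; this is a sketch, not
something taken from the blog above) is the scheduler's resource calculator,
plus the CGroups pieces for real CPU isolation:

capacity-scheduler.xml
yarn.scheduler.capacity.resource-calculator = org.apache.hadoop.yarn.util.resource.DominantResourceCalculator

yarn-site.xml (CPU isolation via CGroups; additional cgroup mount/hierarchy settings also apply)
yarn.nodemanager.container-executor.class = org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
yarn.nodemanager.linux-container-executor.resources-handler.class = org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler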


On Tue, May 27, 2014 at 8:56 PM, hari harib...@gmail.com wrote:

 The issue was not related to the container configuration. Due to a
 misconfiguration, the ApplicationMaster was not able to contact the
 ResourceManager, which caused the 1-container problem.

 However, the total number of containers allocated is still not as expected.
 The configuration settings should have resulted in 16 containers per node,
 but it is allocating 64 containers per node.

 Reiterating the config parameters here again:

 mapred-site.xml
 mapreduce.map.cpu.vcores = 1
 mapreduce.reduce.cpu.vcores = 1
 mapreduce.map.memory.mb = 1024
 mapreduce.reduce.memory.mb = 1024
 mapreduce.map.java.opts = -Xmx1024m
 mapreduce.reduce.java.opts = -Xmx1024m

 yarn.xml
 yarn.nodemanager.resource.memory-mb = 65536
 yarn.nodemanager.resource.cpu-vcores = 16
 yarn.scheduler.minimum-allocation-mb = 1024
 yarn.scheduler.maximum-allocation-mb  = 2048
 yarn.scheduler.minimum-allocation-vcores = 1
 yarn.scheduler.maximum-allocation-vcores = 1

 Is there anything else that might be causing this problem ?

 thanks,
 hari





 On Tue, May 27, 2014 at 3:31 AM, hari harib...@gmail.com wrote:

 Hi,

 When using YARN 2.2.0 version, only 1 container is created
 for an application in the entire cluster.
 The single container is created at an arbitrary node
 for every run. This happens when running any application from
 the examples jar (e.g., wordcount). Currently only one application is
 run at a time. The input datasize is  200GB.

 I am setting custom values that affect concurrent container count.
 These config parameters were mostly taken from:

 http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
 There wasn't much description elsewhere on how the container count is
 decided.

 The settings are:

 mapred-site.xml
 mapreduce.map.cpu.vcores = 1
 mapreduce.reduce.cpu.vcores = 1
 mapreduce.map.memory.mb = 1024
 mapreduce.reduce.memory.mb = 1024
 mapreduce.map.java.opts = -Xmx1024m
 mapreduce.reduce.java.opts = -Xmx1024m

 yarn.xml
 yarn.nodemanager.resource.memory-mb = 65536
 yarn.nodemanager.resource.cpu-vcores = 16

 From these settings, each node should be running 16 containers.

 Let me know if there might be something else affecting the container
 count.

 thanks,
 hari









Re: change the Yarn application container memory size when it is running

2014-02-10 Thread Pradeep Gollakota
I'm not sure I understand the use case for something like that. I'm pretty
sure the YARN API doesn't support it though. What you might be able to do
is to tear down your existing container and request a new one.
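
A rough sketch of that tear-down-and-re-request approach from inside an
ApplicationMaster, assuming the Hadoop 2.x AMRMClient API (amRMClient and
oldContainer here are placeholders for objects your AM already holds):

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

// Release the container that no longer has the right size...
amRMClient.releaseAssignedContainer(oldContainer.getId());

// ...and ask the RM for a replacement with the new resource profile.
Resource bigger = Resource.newInstance(2048 /* MB */, 1 /* vcores */);
ContainerRequest request = new ContainerRequest(
    bigger, null /* nodes */, null /* racks */, Priority.newInstance(0));
amRMClient.addContainerRequest(request);
// The replacement container shows up in a later allocate()/heartbeat response.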


On Mon, Feb 10, 2014 at 10:28 AM, Thomas Bentsen t...@bentzn.com wrote:

 I am no Yarn expert at all - in fact I have never run it. I am still an
 absolute beginner with Hadoop.
 But... provided Yarn is running in _one_ VM (java process) and does not
 fork processes on the OS, you cannot change the memory settings once it's
 started.
 But why would you want to change it? The JVM is able to use more or less
 memory as needed if you set the startup params Xmx and Xms correctly.
 Just be careful not to overallocate with Xmx in different processes at
 the same time; it can do weird things when it starts using swap.


 best
 /th




 On Mon, 2014-02-10 at 12:32 -0500, ricky l wrote:
  Hi All,
 
  Can I change the allocated memory size of a container after a
  container has started (running)? For example, in the beginning, I want
  to allocate 1GB of memory to a container A and later I want to
  allocate 2GB, and later 512MB. Is it a possible scenario? thx.





Re: even possible?

2013-10-16 Thread Pradeep Gollakota
Don't fix it if it ain't broken =P

There shouldn't be any reason why you couldn't change it (back) to the
standard way that cloudera distributions are set up. Off the top of my
head, I can't think of anything that you're missing. But at the same time,
if your cluster is working as is, why change it?


On Wed, Oct 16, 2013 at 2:24 PM, Patai Sangbutsarakum 
silvianhad...@gmail.com wrote:

 The question is on cdh3u4. The cluster was set up before I owned it, and
 somehow every server process (namenode/jobtracker/datanode/tasktracker) is
 run by a user named foo, all jobs are launched and run by the foo user, and
 the HDFS directory/file ownership is foo as well; basically foo is
 everywhere.

 Today I started thinking about correcting this by having:
 the namenode + datanode run by the hdfs user
 the jobtracker + tasktracker run by the mapred user

 So far, I have a very short list of things that need to be changed, and I
 will try them out in the test cluster, e.g.:
 create the hdfs and mapred users everywhere
 change ownership of dfs.name.dir, dfs.data.dir, fs.checkpoint.dir to hdfs
 change ownership of mapred.local.dir to mapred
 restart the cluster with hdfs for the HDFS side, and mapred for the MapReduce
 side.

 I am 100% sure that I have missed certain things that need to be taken care
 of; I would really appreciate all the input.

 However, the original question I would love to ask is whether it is even
 feasible, or makes sense, to try to change this.


 Thanks
 P



Re: Yarn killing my Application Master

2013-10-14 Thread Pradeep Gollakota
Thank you so much Jian.

While refactoring my code, I accidentally deleted the line that registered
my app with the RM. Not sure how I missed it... doh!
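
For anyone else who hits this: the registration step looks roughly like the
sketch below with the AMRMClient API (package locations moved around a bit
between 2.0.x and 2.2+, and the host/port/tracking-URL values here are
placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

AMRMClient<ContainerRequest> amRMClient = AMRMClient.createAMRMClient();
amRMClient.init(new Configuration());
amRMClient.start();

// Without this call the RM never hears from the AM and eventually kills it.
RegisterApplicationMasterResponse response =
    amRMClient.registerApplicationMaster("appmaster-host", 0, "");

// ... request/launch containers, heartbeat via allocate() ...

amRMClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");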


On Sat, Oct 12, 2013 at 9:14 PM, Jian He j...@hortonworks.com wrote:

 Looking at the logs, it shows your application was killed and did not
 successfully register with the RM.


 On Fri, Oct 11, 2013 at 3:53 PM, Pradeep Gollakota 
 pradeep...@gmail.comwrote:

 All,

 I have a Yarn application that is launching a single container. The
 container completes successfully but the application fails because the node
 manager is killing my application master for some reason. The application
 was working at one point; I'm not sure what code changes I made that broke
 it.

 I've attached the appropriate logs.

 Thanks in advance for any help.

 - Pradeep





Re: State of Art in Hadoop Log aggregation

2013-10-11 Thread Pradeep Gollakota
There are plenty of log aggregation tools, both open source and commercial
off the shelf. Here are some:
http://devopsangle.com/2012/04/19/8-splunk-alternatives/

My personal recommendation is LogStash.


On Thu, Oct 10, 2013 at 10:38 PM, Raymond Tay raymondtay1...@gmail.com wrote:

 You can try Chukwa which is part of the incubating projects under Apache.
 Tried it before and liked it for aggregating logs.

 On 11 Oct, 2013, at 1:36 PM, Sagar Mehta sagarme...@gmail.com wrote:

 Hi Guys,

 We have a fairly decent sized Hadoop cluster of about 200 nodes and I was
 wondering what the state of the art is if I want to aggregate and visualize
 Hadoop ecosystem logs, particularly

1. Tasktracker logs
2. Datanode logs
3. Hbase RegionServer logs

 One way is to use something like a Flume on each node to aggregate the
 logs and then use something like Kibana -
 http://www.elasticsearch.org/overview/kibana/ to visualize the logs and
 make them searchable.

 However I don't want to write another ETL for the hadoop/hbase logs
 themselves. We currently log in to each machine individually to 'tail -F
 logs' when there is a Hadoop problem on a particular node.

 We want a better way to look at the Hadoop logs themselves in a
 centralized way when there is an issue, without having to log in to 100
 different machines, and was wondering what the state of the art is in this
 regard.

 Suggestions/Pointers are very welcome!!

 Sagar





Re: Improving MR job disk IO

2013-10-10 Thread Pradeep Gollakota
Actually... I believe that is expected behavior. Since your CPU is pegged
at 100% you're not going to be IO bound. Typically jobs tend to be CPU
bound or IO bound. If you're CPU bound you expect to see low IO throughput.
If you're IO bound, you expect to see low CPU usage.


On Thu, Oct 10, 2013 at 11:05 AM, Xuri Nagarin secs...@gmail.com wrote:

 Hi,

 I have a simple Grep job (from bundled examples) that I am running on a
 11-node cluster. Each node is 2x8-core Intel Xeons (shows 32 CPUs with HT
 on), 64GB RAM and 8 x 1TB disks. I have mappers set to 20 per node.

 When I run the Grep job, I notice that CPU gets pegged to 100% on multiple
 cores but disk throughput remains a dismal 1-2 Mbytes/sec on a single disk
 on each node. So I guess, the cluster is poorly performing in terms of disk
 IO. Running Terasort, I see each disk puts out 25-35 Mbytes/sec with a
 total cluster throughput of above 1.5 Gbytes/sec.

 How do I go about re-configuring or re-writing the job to utilize maximum
 disk IO?

 TIA,

 Xuri





Re: Improving MR job disk IO

2013-10-10 Thread Pradeep Gollakota
I don't think it necessarily means that the job is a bad candidate for MR.
It's a different type of workload. Hortonworks has a great article on the
different types of workloads you might see and how that affects your
provisioning choices at
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html

I have not looked at the Grep code so I'm not sure why it's behaving the
way it is. It's still curious that streaming has a higher IO throughput and
lower CPU usage. It may have to do with the fact that /bin/grep is a native
implementation while Grep (Hadoop) is probably using the Java Pattern/Matcher API.


On Thu, Oct 10, 2013 at 12:29 PM, Xuri Nagarin secs...@gmail.com wrote:

 Thanks Pradeep. Does it mean this job is a bad candidate for MR?

 Interestingly, running the cmdline '/bin/grep' under a streaming job
 provides (1) Much better disk throughput and, (2) CPU load is almost evenly
 spread across all cores/threads (no CPU gets pegged to 100%).




 On Thu, Oct 10, 2013 at 11:15 AM, Pradeep Gollakota 
 pradeep...@gmail.comwrote:

 Actually... I believe that is expected behavior. Since your CPU is pegged
 at 100% you're not going to be IO bound. Typically jobs tend to be CPU
 bound or IO bound. If you're CPU bound you expect to see low IO throughput.
 If you're IO bound, you expect to see low CPU usage.


 On Thu, Oct 10, 2013 at 11:05 AM, Xuri Nagarin secs...@gmail.com wrote:

 Hi,

 I have a simple Grep job (from bundled examples) that I am running on a
 11-node cluster. Each node is 2x8-core Intel Xeons (shows 32 CPUs with HT
 on), 64GB RAM and 8 x 1TB disks. I have mappers set to 20 per node.

 When I run the Grep job, I notice that CPU gets pegged to 100% on
 multiple cores but disk throughput remains a dismal 1-2 Mbytes/sec on a
 single disk on each node. So I guess, the cluster is poorly performing in
 terms of disk IO. Running Terasort, I see each disk puts out 25-35
 Mbytes/sec with a total cluster throughput of above 1.5 Gbytes/sec.

 How do I go about re-configuring or re-writing the job to utilize
 maximum disk IO?

 TIA,

 Xuri







Re: modify HDFS

2013-10-02 Thread Pradeep Gollakota
Since hadoop 3.0 is 2 major versions higher, it will be significantly
different from working with hadoop 1.1.2. The hadoop-1.1 branch is
available on SVN at
http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1/
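
If it helps, getting and building that branch looks roughly like this (a
sketch; branch-1 builds with Ant/Ivy rather than Maven, and the exact targets
are listed in build.xml at the top of the checkout):

$ svn checkout http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1/ hadoop-branch-1.1
$ cd hadoop-branch-1.1
$ ant    # runs the default target; see build.xml for the full list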




On Tue, Oct 1, 2013 at 11:30 PM, Karim Awara karim.aw...@kaust.edu.sa wrote:

 Hi all,

 My previous web surfing led me to steps that I executed successfully.
 However, my issue is: what version of hadoop is this? (I believe it is
 hadoop 3.0 since it supports a maven build.) I want to modify a stable
 release (hadoop 1.1.2, which does not have maven build support). Will
 working with hadoop 3.0 make a lot of difference to me?

 Secondly, my goal is to modify the block placement strategy in HDFS in a
 distributed environment, and test such changes myself. Now, assuming I am
 successful in modifying the HDFS code, how do I test such modifications
 (since this hadoop version is actually missing the configuration files and
 so on that make it work in a distributed environment)?

 --
 Best Regards,
 Karim Ahmed Awara


 On Wed, Oct 2, 2013 at 1:13 AM, Ravi Prakash ravi...@ymail.com wrote:

 Karim!

 You should read BUILDING.txt. I usually generate the eclipse files using

 mvn eclipse:eclipse

 Then I can import all the projects into eclipse as eclipse projects. This
 is useful for code navigation and completion etc.; however, I still build
 using the command line:
 mvn -Pdist -Dmaven.javadoc.skip -DskipTests install

 HTH
 Ravi

   --
  *From:* Jagat Singh jagatsi...@gmail.com
 *To:* user@hadoop.apache.org
 *Sent:* Tuesday, October 1, 2013 3:44 PM
 *Subject:* Re: modify HDFS

 Hi,
 What issue are you having?
 I wrote about it here; it might help you.
 Import it into eclipse as a maven project:

 http://jugnu-life.blogspot.com.au/2013/09/build-and-compile-hadoop-from-source.html?m=1
 Thanks
 On 01/10/2013 11:56 PM, Karim Awara karim.aw...@kaust.edu.sa wrote:

 Hi,

 I want to modify the source code of HDFS for my usage, but I can't find any
 handy sources for development of hdfs in eclipse. (I tried the hdfs
 developer mailing list, but they are unresponsive.) Could you guide me?

 --
 Best Regards,
 Karim Ahmed Awara









Re: IncompatibleClassChangeError

2013-09-29 Thread Pradeep Gollakota
I believe it's a difference between the version that your code was compiled
against vs. the version that you're running against. Make sure that you're
not packaging Hadoop jars into your jar, and make sure you're compiling
against the correct version as well.
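
If you build with Maven, the usual way to do both at once is to depend on the
client artifact that matches your cluster and mark it provided, so it never
gets packaged into your job jar. A sketch (the version string is only an
example; use whatever your cluster actually runs):

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <!-- example only: pick the version matching your cluster -->
  <version>2.0.0-mr1-cdh4.3.1</version>
  <scope>provided</scope>
</dependency>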


On Sun, Sep 29, 2013 at 7:27 PM, lei liu liulei...@gmail.com wrote:

 I use CDH-4.3.1 and MR1; when I run one job, I am getting the
 following error.

 Exception in thread main java.lang.IncompatibleClassChangeError: Found 
 interface org.apache.hadoop.mapreduce.JobContext, but class was expected

 at 
 org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:152)

 at 
 org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1063)

 at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1080)

 at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:992)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:945)
 at java.security.AccessController.doPrivileged(Nativ
 e Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)

 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)

 at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:945)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:566)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596)

 at 
 com.taobao.hbase.test.RandomKVGenerater.main(RandomKVGenerater.java:248)


 How can I handle the error?

 Thanks,

 LiuLei



Re: IncompatibleClassChangeError

2013-09-29 Thread Pradeep Gollakota
I'm not entirely sure what the differences are... but according to Cloudera
documentation, upgrading from CDH3 to CDH4 does involve a recompile.
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Release-Notes/cdh4ki_topic_1_6.html



On Sun, Sep 29, 2013 at 8:29 PM, lei liu liulei...@gmail.com wrote:

 Yes, my job was compiled against CDH3u3, and I run the job on CDH4.3.1, but
 I use the MR1 of CDH4.3.1 to run the job.

 What are the differences between the MR1 of CDH4 and the MR of CDH3?

 Thanks,

 LiuLei


 2013/9/30 Pradeep Gollakota pradeep...@gmail.com

 I believe it's a difference between the version that your code was
 compiled against vs the version that you're running against. Make sure that
 you're not packaging hadoop jar's into your jar and make sure you're
 compiling against the correct version as well.


 On Sun, Sep 29, 2013 at 7:27 PM, lei liu liulei...@gmail.com wrote:

 I use the CDH-4.3.1 and mr1, when I run one job, I am getting the
 following error.

 Exception in thread main java.lang.IncompatibleClassChangeError: Found 
 interface org.apache.hadoop.mapreduce.JobContext, but class was expected

 at 
 org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:152)

 at 
 org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1063)

 at 
 org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1080)

 at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:992)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:945)
 at java.security.AccessController.doPrivileged(Nativ
 e Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)

 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)

 at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:945)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:566)

 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596)

 at 
 com.taobao.hbase.test.RandomKVGenerater.main(RandomKVGenerater.java:248)


 How can I handle the error?

 Thanks,

 LiuLei






Re: Help writing a YARN application

2013-09-23 Thread Pradeep Gollakota
Hi Arun,

Thanks so much for your reply. I looked at the app you linked; it's way
easier to understand. Unfortunately, I can't use the YarnClient class. It
doesn't seem to be in the 2.0.0 branch. Our production environment is using
cdh-4.x and it looks like they're tracking the 2.0.x branch.

However, I have looked at other implementations (notably Kitten and Giraph)
for inspiration and was able to get my containers launching successfully.

Thanks for the help!
- Pradeep


On Mon, Sep 23, 2013 at 4:46 PM, Arun C Murthy a...@hortonworks.com wrote:

 I looked at your code; it's using really old APIs/protocols which are
 significantly different now. See the hadoop-2.1.0-beta release for the
 latest APIs/protocols.

 Also, you should really be using the YARN client module rather than the raw
 protocols.

 See https://github.com/hortonworks/simple-yarn-app for a simple example.

 thanks,
 Arun

 On Sep 20, 2013, at 11:24 AM, Pradeep Gollakota pradeep...@gmail.com
 wrote:

 Hi All,

 I've been trying to write a Yarn application and I'm completely lost. I'm
 using Hadoop 2.0.0-cdh4.4.0 (Cloudera distribution). I've uploaded my
 sample code to github at https://github.com/pradeepg26/sample-yarn

 The problem is that my application master is exiting with a status of 1
 (I'm expecting that since my code isn't complete yet). But I have no logs
 that I can examine. So I'm not sure if the error I'm getting is the error
 I'm expecting. I've attached the nodemanger and resourcemanager logs for
 your reference as well.

 How can I get started on writing YARN applications beyond the initial
 tutorial?

 Thanks for any help/pointers!

 Pradeep
 yarn-yarn-nodemanager-pradeep-gollakota.vm.lithium.com.log
 yarn-yarn-resourcemanager-pradeep-gollakota.vm.lithium.com.log


 --
 Arun C. Murthy
 Hortonworks Inc.
 http://hortonworks.com/





Re: How to best decide mapper output/reducer input for a huge string?

2013-09-22 Thread Pradeep Gollakota
Pavan,

It's hard to tell whether there's anything wrong with your design or not
since you haven't given us specific enough details. The best thing you can
do is instrument your code and see what is taking a long time. Rahul
mentioned a problem that I myself have seen before, with only one region
(or a couple) having any data. So even if you have 21 regions, only one
mapper might be doing the heavy lifting.

A combiner is hugely helpful in terms of reducing the data output of
mappers. Writing a combiner is a best practice and you should almost always
have one. Compression can be turned on by setting the following properties
in your job config.
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc.
depending on your use cases. Gzip is really slow but gets the best
compression ratios. Snappy/Lzo are a lot faster but don't have as good of a
compression ratio. If your computations are CPU bound, then you'd probably
want to use Snappy/Lzo. If your computations are I/O bound, and your CPUs
are idle, you can use Gzip. You'll have to experiment and find the best
settings for you. There are a lot of other tweaks that you can try to get
the best performance out of your cluster.
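
Going back to the combiner point above: since your map output is
Text/IntWritable, a minimal combiner (a sketch that only applies if your
IntWritable values are counts or otherwise safe to pre-aggregate) is just a
local sum, wired in with job.setCombinerClass:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: sums the IntWritable values for each key on the map side.
public static class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);  // far fewer records leave the map task
  }
}

// In the driver: job.setCombinerClass(SumCombiner.class);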

One of the best things you can do is to install Ganglia (or some other
similar tool) on your cluster and monitor usage of resources while your job
is running. This will tell you if your job is I/O bound or CPU bound.

Take a look at this paper by Intel about optimizing your Hadoop cluster and
see if that fits your deployment.
http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf

If your cluster is already optimized and your job is not I/O bound, then
there might be a problem with your algorithm and might warrant a redesign.

Hope this helps!
- Pradeep


On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee 
rahul.rec@gmail.com wrote:

 One mapper is spawned per HBase table region. You can use the HBase admin
 UI to find the number of regions per table. It might happen that all the
 data is sitting in a single region, so a single mapper is spawned and you
 are not getting enough parallel work done.

 If that is the case, then you can recreate the tables with predefined
 splits to create more regions.

 Thanks,
 Rahul


 On Sun, Sep 22, 2013 at 4:38 AM, John Lilley john.lil...@redpoint.net wrote:

  Pavan,

 How large are the rows in HBase?  22 million rows is not very much but
 you mentioned “huge strings”.  Can you tell which part of the processing is
 the limiting factor (read from HBase, mapper output, reducers)?

 John


 *From:* Pavan Sudheendra [mailto:pavan0...@gmail.com]
 *Sent:* Saturday, September 21, 2013 2:17 AM
 *To:* user@hadoop.apache.org
 *Subject:* Re: How to best decide mapper output/reducer input for a huge
 string?


 No, I don't have a combiner in place. Is it necessary? How do I make my
 map output compressed? Yes, the tables in HBase are compressed.

 Although there's no real bottleneck, the time it takes to process the
 entire table is huge. I have to constantly check if I can optimize it
 somehow.

 Oh okay, I'll implement a custom Writable. Apart from that, do you see
 anything wrong with my design? Does it require any kind of re-work? Thank
 you so much for helping.


 On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota pradeep...@gmail.com
 wrote:

 One thing that comes to mind is that your keys are Strings which are
 highly inefficient. You might get a lot better performance if you write a
 custom writable for your Key object using the appropriate data types. For
 example, use a long (LongWritable) for timestamps. This should make
 (de)serialization a lot faster. If HouseHoldId is an integer, your speed of
 comparisons for sorting will also go up.


 Ensure that your map output's are being compressed. Are your tables in
 HBase compressed? Do you have a combiner?


 Have you been able to profile your code to see where the bottlenecks are?
 


 On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra pavan0...@gmail.com
 wrote:

 Hi Pradeep,

 Yes.. Basically i'm only writing the key part as the map output.. The V
 of K,V is not of much use to me.. But i'm hoping to change that if it
 leads to faster execution.. I'm kind of a newbie so looking to make the
 map/reduce job run a lot faster.. 

 Also, yes. It gets sorted by the HouseHoldID which is what i needed.. But
 seems if i write a map output for each and every row of a 19 m row HBase
 table, its taking nearly a day to complete.. (21 mappers and 21 reducers)
 


 I have looked at both Pig/Hive to do the job but i'm supposed to do this
 via a MR job.. So, cannot use either

Re: How to best decide mapper output/reducer input for a huge string?

2013-09-21 Thread Pradeep Gollakota
I'm sorry but I don't understand your question. Is the output of the mapper
you're describing the key portion? If it is the key, then your data should
already be sorted by HouseHoldId since it occurs first in your key.

The SortComparator tells Hadoop how to sort your data, so you use it if you
need a non-lexical sort order. The GroupingComparator tells Hadoop how to
group your data for the reducer: all KV-pairs in the same group are passed
to a single reduce() call.

If your reduce computation needs all the KV-pairs for the same HouseHoldId,
then you will need to write a GroupingComparator.
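
A bare-bones sketch of such a comparator (assuming a composite key type with
a getHouseHoldId() accessor; HouseholdKey and that accessor are made-up
names, not something from your code):

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sketch: group records by HouseHoldId only, even if the full key carries more fields.
public static class HouseHoldGroupingComparator extends WritableComparator {
  protected HouseHoldGroupingComparator() {
    super(HouseholdKey.class, true);  // true => instantiate keys for compare()
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    HouseholdKey k1 = (HouseholdKey) a;
    HouseholdKey k2 = (HouseholdKey) b;
    return Long.compare(k1.getHouseHoldId(), k2.getHouseHoldId());
  }
}

// In the driver: job.setGroupingComparatorClass(HouseHoldGroupingComparator.class);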

Also, have you considered using a higher level abstraction on Hadoop such
as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT
easier to write in these languages.

Hope this helps!
- Pradeep


On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra pavan0...@gmail.com wrote:

 I need to improve my MR jobs, which use HBase as source as well as sink.

 Basically, I'm reading data from 3 HBase tables in the mapper, writing them
 out as one huge string for the reducer to do some computation and dump into
 an HBase table.

 Table1 ~ 19 million rows. Table2 ~ 2 million rows. Table3 ~ 900,000 rows.

 The output of the mapper is something like this :

 HouseHoldId contentID name duration genre type channelId personId 
 televisionID timestamp

 I'm interested in sorting it on the basis of the HouseHoldID value, so I'm
 using this technique. I'm not interested in the V part of the pair, so I'm
 kind of ignoring it. My mapper class is defined as follows:

 public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

 My MR job takes 22 hours to complete, which is not desirable at all. I'm
 supposed to optimize this somehow so it runs a lot faster.

 scan.setCaching(750);
 scan.setCacheBlocks(false);
 TableMapReduceUtil.initTableMapperJob(
     Table1,                // input HBase table name
     scan,
     AnalyzeMapper.class,   // mapper
     Text.class,            // mapper output key
     IntWritable.class,     // mapper output value
     job);

 TableMapReduceUtil.initTableReducerJob(
     OutputTable,                // output table
     AnalyzeReducerTable.class,  // reducer class
     job);
 job.setNumReduceTasks(RegionCount);

 My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running
 an 8-node Cloudera cluster.

 Should I use a custom SortComparator or a GroupingComparator?


 --
 Regards-
 Pavan



Re: How to best decide mapper output/reducer input for a huge string?

2013-09-21 Thread Pradeep Gollakota
One thing that comes to mind is that your keys are Strings, which are highly
inefficient. You might get a lot better performance if you write a custom
writable for your Key object using the appropriate data types. For example,
use a long (LongWritable) for timestamps. This should make
(de)serialization a lot faster. If HouseHoldId is an integer, your speed of
comparisons for sorting will also go up.
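
Something along these lines, for example (a sketch only; the HouseholdKey
name, fields and types are guesses at your schema, not code from your job):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: binary fields instead of one big delimited String.
public static class HouseholdKey implements WritableComparable<HouseholdKey> {
  private long houseHoldId;
  private long timestamp;

  public void set(long houseHoldId, long timestamp) {
    this.houseHoldId = houseHoldId;
    this.timestamp = timestamp;
  }

  public long getHouseHoldId() { return houseHoldId; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(houseHoldId);
    out.writeLong(timestamp);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    houseHoldId = in.readLong();
    timestamp = in.readLong();
  }

  @Override
  public int compareTo(HouseholdKey other) {
    int cmp = Long.compare(houseHoldId, other.houseHoldId);
    return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
  }

  @Override
  public int hashCode() {  // used by the default HashPartitioner
    return (int) (houseHoldId ^ (houseHoldId >>> 32));
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof HouseholdKey)) return false;
    HouseholdKey k = (HouseholdKey) o;
    return houseHoldId == k.houseHoldId && timestamp == k.timestamp;
  }
}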

Ensure that your map outputs are being compressed. Are your tables in
HBase compressed? Do you have a combiner?

Have you been able to profile your code to see where the bottlenecks are?


On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra pavan0...@gmail.com wrote:

 Hi Pradeep,
 Yes, basically I'm only writing the key part as the map output; the V of
 K,V is not of much use to me, but I'm hoping to change that if it leads to
 faster execution. I'm kind of a newbie, so I'm looking to make the
 map/reduce job run a lot faster.

 Also, yes, it gets sorted by the HouseHoldID, which is what I needed. But
 it seems that if I write a map output for each and every row of a 19M-row
 HBase table, it takes nearly a day to complete (21 mappers and 21 reducers).

 I have looked at both Pig/Hive to do the job, but I'm supposed to do this
 via an MR job, so I cannot use either of them. Do you recommend me to try
 something if I have the data in that format?


 On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota 
 pradeep...@gmail.com wrote:

 I'm sorry but I don't understand your question. Is the output of the
 mapper you're describing the key portion? If it is the key, then your data
 should already be sorted by HouseHoldId since it occurs first in your key.

 The SortComparator will tell Hadoop how to sort your data. So you use
 this if you have a need for a non lexical sort order. The
 GroupingComparator will tell Hadoop how to group your data for the reducer.
 All KV-pairs from the same group will be given to the same Reducer.

 If your reduce computation needs all the KV-pairs for the same
 HouseHoldId, then you will need to write a GroupingComparator.

 Also, have you considered using a higher level abstraction on Hadoop such
 as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT
 easier to write in these languages.

 Hope this helps!
 - Pradeep


 On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra 
 pavan0...@gmail.com wrote:

 I need to improve my MR jobs which uses HBase as source as well as
 sink..

 Basically, i'm reading data from 3 HBase Tables in the mapper, writing
 them out as one huge string for the reducer to do some computation and dump
 into a HBase Table..

 Table1 ~ 19 million rows. Table2 ~ 2 million rows. Table3 ~ 900,000 rows.

 The output of the mapper is something like this :

 HouseHoldId contentID name duration genre type channelId personId 
 televisionID timestamp

 I'm interested in sorting it on the basis of the HouseHoldID value so
 i'm using this technique. I'm not interested in the V part of pair so i'm
 kind of ignoring it. My mapper class is defined as follows:

 public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

 For my MR job to be completed, it takes 22 hours to complete which is
 not desirable at all. I'm supposed to optimize this somehow to run a lot
 faster somehow..

 scan.setCaching(750);
 scan.setCacheBlocks(false);
 TableMapReduceUtil.initTableMapperJob(
     Table1,                // input HBase table name
     scan,
     AnalyzeMapper.class,   // mapper
     Text.class,            // mapper output key
     IntWritable.class,     // mapper output value
     job);

 TableMapReduceUtil.initTableReducerJob(
     OutputTable,                // output table
     AnalyzeReducerTable.class,  // reducer class
     job);
 job.setNumReduceTasks(RegionCount);

 My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running
 an 8-node Cloudera cluster.

 Should I use a custom SortComparator or a GroupingComparator?


 --
 Regards-
 Pavan





 --
 Regards-
 Pavan



Help writing a YARN application

2013-09-20 Thread Pradeep Gollakota
Hi All,

I've been trying to write a Yarn application and I'm completely lost. I'm
using Hadoop 2.0.0-cdh4.4.0 (Cloudera distribution). I've uploaded my
sample code to github at https://github.com/pradeepg26/sample-yarn

The problem is that my application master is exiting with a status of 1
(I'm expecting that since my code isn't complete yet). But I have no logs
that I can examine. So I'm not sure if the error I'm getting is the error
I'm expecting. I've attached the nodemanger and resourcemanager logs for
your reference as well.

How can I get started on writing YARN applications beyond the initial
tutorial?

Thanks for any help/pointers!

Pradeep


yarn-yarn-nodemanager-pradeep-gollakota.vm.lithium.com.log
Description: Binary data


yarn-yarn-resourcemanager-pradeep-gollakota.vm.lithium.com.log
Description: Binary data