Big Data Weekly Quiz

2015-09-07 Thread Adam Kawa
Hi Guys! If you read the Hadoop Weekly newsletter, we warmly encourage you to take our quiz http://getindata.com/big-data-weekly-quiz-1, which has questions about topics covered in the newsletter. The goal is to sharpen your knowledge and learn new information in

Re: Start standby namenode using bootstrapStandby hangs

2014-09-20 Thread Adam Kawa
Hi, After throwing the above warning, the execution of the command hangs there and does not print any other warning/error messages any more. Have you tried to use jstack to check what the NameNode is really doing (e.g. whether it is blocked or waiting for something)? $ sudo -u hdfs jstack
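A sketch of how to capture this (the PID placeholder and output path are hypothetical):
$ sudo -u hdfs jstack <namenode-pid> > /tmp/nn-jstack-1.txt
Taking two or three dumps a minute apart and diffing them usually shows whether a thread is stuck on a lock or just waiting.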

Switching from Java 6 to Java 7 on Hadoop 2.2.0

2014-09-15 Thread Adam Kawa
Hi All! We are about to upgrade the Java version on our Hadoop cluster (Hadoop 2.2.0). I would just like to ask about your recommendations and experience: (A) Should we schedule a downtime of the whole cluster and then upgrade Java everywhere (all Hadoop projects, e.g. HDFS, YARN, Pig, Hive, Sqoop

Re: CPU utilization

2014-09-12 Thread Adam Kawa
Hi, With these settings, you are able to start at most 2 containers per NodeManager (yarn.nodemanager.resource.memory-mb = 2048). The size of your containers is between 768 and 1024 MB (not sure what your value of yarn.nodemanager.resource.cpu-vcores is). Have you tried to run more (or bigger)

Re: MultipleTextOutputFormat in new api of 1.2.1?

2014-09-12 Thread Adam Kawa
AFAIK, dynamic partitions in the new mapreduce API are not supported (please read http://grepalex.com/2013/07/16/multipleoutputs-part2/ and http://stackoverflow.com/questions/25503034/dynamic-key-based-names-of-output-files-in-new-hadoop-api ). If you don't want to use the old mapred API,

Re: CPU utilization

2014-09-12 Thread Adam Kawa
Your NodeManager can use 2048 MB (yarn.nodemanager.resource.memory-mb) for allocating containers. If you run a map task, you need 768 MB (mapreduce.map.memory.mb). If you run a reduce task, you need 1024 MB (mapreduce.reduce.memory.mb). If you run the MapReduce app master, you need 1024 MB
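To make the math concrete (a back-of-the-envelope sketch that ignores rounding to yarn.scheduler.minimum-allocation-mb): the AM (1024 MB) plus one map task (768 MB) already reserves 1792 MB of the 2048 MB limit, so a second map task (another 768 MB) would need 2560 MB and cannot be allocated on that node.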

Re: CPU utilization

2014-09-12 Thread Adam Kawa
Adam, how did you come to the conclusion that it is memory-bounded? I mean the number of containers running on your NodeManager, not the job itself.

Re: YARN Logs

2014-07-15 Thread Adam Kawa
IMHO, $ yarn logs looks for aggregated logs at a remote location. 2014-07-15 16:49 GMT+02:00 Brian C. Huffman bhuff...@etinternational.com: All, I am running a small cluster with hadoop-2.2.0 installed on an NFS shared directory. Since all nodes can access it, I do not want to enable log

Re: YARN Logs

2014-07-15 Thread Adam Kawa
${yarn.nodemanager.remote-app-log-dir}/${username}/${yarn.nodemanager.remote-app-log-dir-suffix}/${application-id} 2014-07-15 18:08 GMT+02:00 Adam Kawa kawa.a...@gmail.com: IMHO, $ yarn logs looks for aggregated logs at a remote location. 2014-07-15 16:49 GMT+02:00 Brian C. Huffman bhuff...@etinternational.com : All, I am running
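For example, once log aggregation is enabled and the application has finished, the aggregated logs can be fetched with (the application id below is a placeholder):
$ yarn logs -applicationId application_1405431754306_0001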

Re: changing split size in Hadoop configuration

2014-07-14 Thread Adam Kawa
It sounds like a JobTracker setting, so a restart looks to be required. You can verify it in pseudo-distributed mode by setting it to a very low value, restarting the JT and seeing if you get the exception that prints this new value. On 14 jul 2014, at 16:03, Jan Warchoł

Re: The number of simultaneous map tasks is unexpected.

2014-07-10 Thread Adam Kawa
With yarn.nodemanager.resource.memory-mb = 2370 MiB and yarn.nodemanager.resource.cpu-vcores = 2, you cannot run more than 8 containers on your setup (according to your settings, each container consumes 1 GB and 1 vcore). Considering that I have 8 cores in my cluster and not 16 as I thought at

Re: The number of simultaneous map tasks is unexpected.

2014-07-09 Thread Adam Kawa
Tomasz Guziałek 2014-07-09 0:56 GMT+02:00 Adam Kawa kawa.a...@gmail.com: If you run an application (e.g. a MapReduce job) on a YARN cluster, first the ApplicationMaster is started on some slave node to coordinate the execution of all tasks within the job. The ApplicationMaster and tasks

Re: listing a 530k files directory

2014-07-09 Thread Adam Kawa
You can try snakebite https://github.com/spotify/snakebite. $ snakebite ls -R path I just ran it to list 705K files and it went fine. 2014-05-30 20:42 GMT+02:00 Harsh J ha...@cloudera.com: HADOOP_OPTS gets overridden by HADOOP_CLIENT_OPTS for FsShell utilities. The right way to extend
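A minimal sketch of getting it running (assuming pip is available; the path is a placeholder):
$ pip install snakebite
$ snakebite ls -R /some/big/dir | wc -l
Since snakebite talks to the NameNode directly over protobuf RPC and does not start a JVM, it sidesteps the client-side heap limits of hadoop fs.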

Re: issue about remove yarn jobs history logs

2014-07-09 Thread Adam Kawa
Have you restarted your Job History Server? 2014-05-30 4:56 GMT+02:00 ch huang justlo...@gmail.com: hi, maillist: i want to remove job history logs, and i configured the following in yarn-site.xml, but it seems to have no effect. why? (i use CDH4.4 YARN, i configured it on each datanode, and

Re: debugging class path issues with containers.

2014-07-09 Thread Adam Kawa
You might need to set yarn.application.classpath in yarn-site.xml: <property> <name>yarn.application.classpath</name>

Re: how to access configuration properties on a remote Hadoop cluster

2014-07-09 Thread Adam Kawa
Instead of <Resource-Manager-WebApp-Address>/conf, if you have the application id and job id, you can query the Resource Manager for the configuration of this particular application. You can use the HTTP or Java API for that. 2014-07-09 21:42 GMT+02:00 Geoff Thompson ge...@bearpeak.com: Hello, Is
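For example, every Hadoop daemon exposes its configuration via the /conf servlet, and the MapReduce HistoryServer has a per-job conf endpoint (hosts, ports and the job id are placeholders; double-check the REST path against your Hadoop version):
$ curl http://<rm-host>:8088/conf
$ curl http://<history-host>:19888/ws/v1/history/mapreduce/jobs/<job-id>/conf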

Re: The number of simultaneous map tasks is unexpected.

2014-07-08 Thread Adam Kawa
Isn't your MapReduce AppMaster occupying one slot? On 8 jul 2014, at 13:01, Tomasz Guziałek tomaszguzia...@gmail.com wrote: Hello all, I am running a 4-node CDH5 cluster on Amazon EC2. The instances used are m1.large, so I have 4 cores (2 cores x 2 units) per node.

Re: OIV Tool

2014-07-08 Thread Adam Kawa
I used OIV (Offline Image Viewer) with a couple of versions of Hadoop and I have not seen any incompatibility related to it. Currently, I use it with Hadoop 2.2.0. 2014-07-08 21:06 GMT+02:00 Ashish Dobhal dobhalashish...@gmail.com: Hey everyone, Could anyone tell me which versions of hadoop
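A minimal example (paths are placeholders; the set of processors, e.g. Ls/XML/Indented, may differ between versions):
$ hdfs oiv -i /path/to/fsimage -o /tmp/fsimage.txt -p Indented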

Re: How to limit MRJob's stdout/stderr size(yarn2.3)

2014-07-08 Thread Adam Kawa
There is a setting like: <property> <name>mapreduce.task.userlog.limit.kb</name> <value>0</value> <description>The maximum size of user-logs of each task in KB. 0 disables the cap.</description> </property> but I have not tried it on YARN. If your disks are full because you run many applications+tasks

Re: Compare Yarn with V1

2014-07-05 Thread Adam Kawa
I wrote a blog post that indirectly answers this question: http://www.ibm.com/developerworks/library/bd-yarn-intro/index.html 2014-06-13 2:16 GMT+02:00 Mohit Anchlia mohitanch...@gmail.com: Is there a good resource that draws similarity

Re: How to make hdfs data rack aware

2014-07-03 Thread Adam Kawa
You can run $ sudo -u hdfs hdfs dfsadmin -report | grep Hostname -A 1 2014-07-02 7:33 GMT+02:00 hadoop hive hadooph...@gmail.com: Try running fsck, it will also validate the block placement as well as replication. On Jun 27, 2014 6:49 AM, Kilaru, Sambaiah sambaiah_kil...@intuit.com wrote:
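If you then need to enable rack awareness, here is a sketch of a topology script (pointed to by net.topology.script.file.name in core-site.xml, or topology.script.file.name on Hadoop 1; the subnets and rack names are made up):
#!/bin/bash
# Hadoop passes one or more IPs/hostnames as arguments;
# the script must print one rack name per argument.
for host in "$@"; do
  case $host in
    10.0.1.*) echo "/rack1" ;;
    10.0.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done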

Re: YARN creates only 1 container

2014-07-02 Thread Adam Kawa
You also might want to increase the values of mapreduce.{map,reduce}.memory.mb to 1280 or 1536 or so (assuming that mapreduce.{map,reduce}.java.opts = -Xmx1024m). mapreduce.{map,reduce}.memory.mb is the logical size of the container and it should be larger than mapreduce.{map,reduce}.java.opts, which
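For example, for a single job (values are only illustrative; assumes your driver uses ToolRunner):
$ hadoop jar your.jar DriverClass -Dmapreduce.map.memory.mb=1536 -Dmapreduce.map.java.opts=-Xmx1024m input output
The gap between the two values leaves headroom for non-heap memory (JVM overhead, native buffers).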

Re: Hadoop Pi Example in Yarn

2013-12-18 Thread Adam Kawa
A map task is created for each input split in your dataset. By default, an input split corresponds to a block in HDFS, i.e. if a file consists of 1 HDFS block, then 1 map task will be started; if a file consists of N blocks, then N map tasks will be started for that file (obviously, assuming a default
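For example, with a 128 MB block size, a 1 GB file is stored as 8 blocks and therefore gets 8 map tasks, while a 10 MB file still gets its own single map task.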

Re: multinode hadoop cluster on vmware

2013-12-18 Thread Adam Kawa
Maybe you can try Serengeti http://www.projectserengeti.org/ or Vagrant ( http://java.dzone.com/articles/setting-hadoop-virtual-cluster, http://blog.cloudera.com/blog/2013/04/how-to-use-vagrant-to-set-up-a-virtual-hadoop-cluster/ )? 2013/12/18 navaz navaz@gmail.com Hi I want to set up a

Re: Hadoop setup doubts

2013-12-15 Thread Adam Kawa
Hi, 2. How does log aggregation work? http://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/ 4. What is the purpose of the webproxy? Is it really required? http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html 5.

Re: Hadoop setup

2013-12-14 Thread Adam Kawa
In general, it is a very open question and there are many possibilities depending on your workload (e.g. CPU-bound, IO-bound etc). If it is your first Hadoop cluster and you do not know much about what types of jobs you will be running, I would recommend just collecting any available machines

Re: hadoop fs -text OutOfMemoryError

2013-12-14 Thread Adam Kawa
Since snappy is a non-splittable format (to decompress a snappy file, you need to read it from the beginning to the end), does the append operation handle it well on a plain text file? I guess that it might be problematic. Snappy is recommended for use with a container format, like Sequence

Re: Hadoop setup

2013-12-14 Thread Adam Kawa
...@gmail.com: what makes a difference in H/W selection when we choose YARN to install, and is it necessary? On 12/14/13, Adam Kawa kawa.a...@gmail.com wrote: In general, it is a very open question and there are many possibilities depending on your workload (e.g. CPU-bound, IO-bound etc

Re: hadoop fs -text OutOfMemoryError

2013-12-13 Thread Adam Kawa
Hi, What is the value of HADOOP_CLIENT_OPTS in your hadoop-env.sh file? We had similar problems with running OOM with the hadoop fs command (I do not remember if they were exactly related to -text + snappy) when we decreased the heap to some small value. With a higher value, e.g. 1 or 2 GB, we were
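For example (2g is only an illustrative value):
$ export HADOOP_CLIENT_OPTS="-Xmx2g"
$ hadoop fs -text /path/to/file.snappy
Note that HADOOP_CLIENT_OPTS affects only client-side commands such as the fs shell, not the daemons.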

Re: Yarn -- one of the daemons getting killed

2013-12-13 Thread Adam Kawa
If you are interested, please read how we ran into an OOM-killer issue that was killing our TaskTrackers: http://hakunamapdata.com/two-memory-related-issues-on-the-apache-hadoop-cluster/ (+ one issue related to heavy swapping). 2013/12/13 Vinod Kumar Vavilapalli vino...@hortonworks.com: Yes, that

Re: how to handle the corrupt block in HDFS?

2013-12-11 Thread Adam Kawa
but if the block has just 1 corrupt replica, hdfs fsck cannot tell you which block of which file has a corrupted replica; fsck is only useful when all of a block's replicas are bad. On Wed, Dec 11, 2013 at 10:01 AM, Adam Kawa kawa.a...@gmail.com wrote: When you identify a file with corrupt block(s), then you

Re: Job stuck in running state on Hadoop 2.2.0

2013-12-11 Thread Adam Kawa
to the previous launch errors?? Thanks in advance :) On 11 December 2013 00:29, Adam Kawa kawa.a...@gmail.com wrote: It sounds like the job was successfully submitted to the cluster, but there was some problem when starting/running the AM, so no progress is made. It happened to me once, when I

Re: Why is Hadoop always running just 4 tasks?

2013-12-11 Thread Adam Kawa
mapred.map.tasks is rather a hint to the InputFormat (http://wiki.apache.org/hadoop/HowManyMapsAndReduces) and it is ignored in your case. You process gz files, and the InputFormat has an isSplitable method which returns false for gz files, so each map task processes a whole file (this is related
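For example, 4 gz input files always yield exactly 4 map tasks, no matter how large each file is; to get more parallelism you need to split the data into more files or use a splittable codec/format (e.g. bzip2, or LZO with an index).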

Re: empty file

2013-12-11 Thread Adam Kawa
I have never seen something like that. Can you read that file? $ hadoop fs -text /tmp/corrupt_lzo/lc_hadoop16.1386270004881.lzo 2013/12/11 chenchun chenchun.f...@gmail.com: Hi, I found some files on hdfs which the command “hadoop fs -ls” says are not empty, but the command “fsck” says that

Re: Why is Hadoop always running just 4 tasks?

2013-12-11 Thread Adam Kawa
they should each uncompress individually

Re: issue about Shuffled Maps in MR job summary

2013-12-11 Thread Adam Kawa
Why does increasing the reducer number sometimes not decrease the job completion time? Apart from the valid information that Yong wrote in the previous post, please note that: 1) You do not want to have very short-lived (seconds) reduce tasks, because the overhead of coordinating them, starting JVMs,

Re: secondary namenode is hang at post

2013-12-11 Thread Adam Kawa
It looks like it cannot copy the new checkpoint to the NameNode. Can you copy-paste what jstack says? $ sudo -u hdfs jstack <snn-pid> 2013/12/11 Patai Sangbutsarakum silvianhad...@gmail.com: It just happened without changing anything in the cluster. The secondary namenode has been working

Re: how to corrupt a replica of a block by manually?

2013-12-11 Thread Adam Kawa
Hmmm.. I guess that you can try to read this file using $ hadoop fs -cat file-with-corrupt-replica to detect that. When reading a file, its checksums are calculated and compared to the checksums that were calculated during the write operation. If verification fails, the NN is notified about the corrupt

Re: issue about running example job use custom mapreduce var

2013-12-10 Thread Adam Kawa
Please try 2013/12/10 ch huang justlo...@gmail.com: hi, maillist: i try to assign the reduce number on the command line but it seems to have no effect. i run terasort like this: # hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /alex/terasort/1G-input
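For example (a sketch; the key detail is that the -D generic option must come before the positional arguments, which works here because the examples jar uses ToolRunner; on newer versions the property is mapreduce.job.reduces):
# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort -Dmapred.reduce.tasks=10 /alex/terasort/1G-input /alex/terasort/1G-output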

Re: Versioninfo and platformName issue.

2013-12-10 Thread Adam Kawa
Hi, Do you have the Hadoop libs properly installed? Does the $ hadoop version command run successfully? If so, then it sounds like some classpath issue... 2013/12/10 Manish Bhoge manishbh...@rocketmail.com

Re: Job stuck in running state on Hadoop 2.2.0

2013-12-10 Thread Adam Kawa
It sounds like the job was successfully submitted to the cluster, but there was some problem when starting/running the AM, so no progress is made. It happened to me once, when I was playing with YARN on a cluster consisting of very small machines, and I mis-configured YARN to allocate to the AM more

Re: multiusers in hadoop through LDAP

2013-12-10 Thread Adam Kawa
Please have a look at the hadoop.security.group.mapping.ldap.* settings, as Hardik Pandya suggests. Just to share our story related to LDAP + hadoop.security.group.mapping.ldap.*, in case you run into the same limitation as we did: in many cases hadoop.security.group.mapping.ldap.*

Re: how to handle the corrupt block in HDFS?

2013-12-10 Thread Adam Kawa
Maybe this can work for you: $ sudo -u hdfs hdfs fsck / -list-corruptfileblocks 2013/12/11 ch huang justlo...@gmail.com: thanks for the reply. what i do not know is how i can locate the block which has the corrupt replica (so i can observe how long it takes until the corrupt replica is removed and a new

Re: how to handle the corrupt block in HDFS?

2013-12-10 Thread Adam Kawa
When you identify a file with corrupt block(s), then you can locate the machines that store its blocks by typing $ sudo -u hdfs hdfs fsck path-to-file -files -blocks -locations 2013/12/11 Adam Kawa kawa.a...@gmail.com: Maybe this can work for you $ sudo -u hdfs hdfs fsck / -list

Re: Write a file to local disks on all nodes of a YARN cluster.

2013-12-08 Thread Adam Kawa
I believe that you could do that through Puppet, or any tool that can remotely execute some command (e.g. pssh). 2013/12/8 Jay Vyas jayunit...@gmail.com: I want to put a file on all nodes of my cluster that is locally readable (not in HDFS), assuming that I can't guarantee a FUSE mount or
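A sketch with pssh (binary names vary by distro, e.g. parallel-ssh/parallel-scp; hosts.txt and the paths are placeholders):
$ pscp -h hosts.txt /local/path/lookup.dat /tmp/lookup.dat
$ pssh -h hosts.txt 'ls -l /tmp/lookup.dat'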

Re: mapreduce.jobtracker.expire.trackers.interval no effect

2013-12-05 Thread Adam Kawa
So I tried the deprecated parameter mapred.tasktracker.expiry.interval in my configuration and voila, it works! Hansi, this is exactly the parameter that I told you about in a previous post ;)

Re: issue about capacity scheduler

2013-12-05 Thread Adam Kawa
The heap of the application master is controlled via yarn.app.mapreduce.am.command-opts and its default value is -Xmx1024m (http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml). yarn.scheduler.minimum-allocation-mb is completely different
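For example, to override it for a single job (512m is only illustrative; make sure it still fits within yarn.app.mapreduce.am.resource.mb):
$ hadoop jar your.jar DriverClass -Dyarn.app.mapreduce.am.command-opts=-Xmx512m input output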

Re: Debugging/Modifying HDFS from Eclipse

2013-12-05 Thread Adam Kawa
One blog post is here: http://grepalex.com/2012/10/20/hadoop-unit-testing-with-minimrcluster/ When I was playing with miniDFSCluster, and miniMRCluster, I was using them via HBaseTestingUtility (it can take a configuration object in a constructor

Re: Hadoop 2.2.0 from source configuration

2013-12-03 Thread Adam Kawa
Daniel, It looks like you can only communicate with the NameNode to do metadata-only operations (e.g. listing, creating a dir or an empty file)... Did you format the NameNode correctly? A quite similar issue is described here: http://www.manning-sandbox.com/thread.jspa?messageID=126741. The last reply

Re: issure about MR job on yarn framework

2013-12-03 Thread Adam Kawa
What command are you using to submit the job? If your job uses ToolRunner, then you can use $ hadoop jar your.jar DriverClass -Dmapreduce.reduce.java.opts=-Xmx1024m input-dir output-dir We have two settings for controlling the memory of map or reduce tasks, e.g. for a map task

Re: Decommissioning a node

2013-12-03 Thread Adam Kawa
I have overridden the InputFormat and set isSplitable to return false. Does my entire file have to go to one mapper only, as I set isSplitable to false? Yes. 1) Could you double-check that you have only 1 input file in the input directory. 2) Did you configure your job to use your custom InputFormat

Re: Any reference for upgrade hadoop from 1.x to 2.2

2013-12-03 Thread Adam Kawa
@Nirmal, And later, you need to make a decision to finalize the upgrade or roll back. 2013/12/3 Adam Kawa kawa.a...@gmail.com: @Nirmal, You need to run the NameNode with the upgrade option, e.g. $ /usr/lib/hadoop/sbin/hadoop-daemon.sh start namenode -upgrade 2013/12/3 Nirmal Kumar nirmal.ku
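For example (a sketch; finalize only after you have verified the upgraded cluster, because afterwards a rollback is no longer possible):
$ hdfs dfsadmin -finalizeUpgrade
or, to go back to the pre-upgrade state instead:
$ /usr/lib/hadoop/sbin/hadoop-daemon.sh start namenode -rollback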

Re: Unable to use third party jar

2013-12-02 Thread Adam Kawa
Could you please try: $ export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:json-simple-1.1.1.jar $ hadoop jar domain_gold.jar org.select.Driver -libjars json-simple-1.1.1.jar $INPUT1 $INPUT2 $OUTPUT 2013/10/24 jamal sasha jamalsha...@gmail.com OOps..forgot the code: http://pastebin.com/7XnyVnkv

Re: du reserved can ran into problems with reserved disk capacity by tune2fs

2013-12-01 Thread Adam Kawa
We ran into this issue as well on our cluster. +1 for a JIRA for that. Alexander, could you please create a JIRA in https://issues.apache.org/jira/browse/HDFS for that (it is your observation, so you should get the credit ;)). Otherwise, I can do that. 2013/2/12 Alexander Fahlke

Re: Capacity Scheduler Issue

2013-11-28 Thread Adam Kawa
I see that you have different settings for the ACLs: <property> <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name> <value>yarn,mapred</value> </property> (acls = SUBMIT_APPLICATIONS:mapred,yarn ADMINISTER_QUEUE: [configured ACLs]) <name>yarn.scheduler.capacity.root.dev

Re: mapred.tasktacker.reduce.tasks.maximum issue

2013-11-27 Thread Adam Kawa
It looks like you have a typo in the names of the configuration properties, so Hadoop ignores them and uses the default values (2 map and 2 reduce tasks per node). It should be mapred.tasktracker.reduce.tasks.maximum, not mapred.tasktacker.reduce.tasks.maximum (tasktRacker, not tasktacker) - the same

Re: org.apache.hadoop.mapred.TaskTracker: Caught exception: java.net.UnknownHostException: Invalid host name:

2013-11-27 Thread Adam Kawa
As far as I remember (we ran into such an issue ~6 months ago), the TaskTracker can cache the hostname of the JobTracker. Try restarting the TaskTrackers to check if they connect correctly. Please let me know if the restart of the TTs helped. 2013/11/15 kumar y ykk1...@gmail.com: Hi, we changed the

Re: NN stopped and cannot recover with error There appears to be a gap in the edit log

2013-11-27 Thread Adam Kawa
Maybe you can play with the offline edits viewer. I have never run into such an issue, thus I have never played with the offline edits viewer on production datasets, but it has some options that could perhaps be useful for troubleshooting and fixing. [kawaa@localhost Desktop]$ hdfs oev Usage:
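For example, to dump an edits file to XML, inspect or fix it, and convert it back (file names are placeholders):
$ hdfs oev -i <edits-file> -o edits.xml -p xml
$ hdfs oev -i edits.xml -o <edits-file-fixed> -p binary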