Re: Drawbacks of Hadoop Pipes
Hi there, I've been working with Pipes for some months and I've finally managed to get it working as I wanted with some legacy code I had. However, I had many issues, regarding not only my implementation (it had to be adapted in several ways to fit Pipes, which is very restrictive) but Pipes itself (bugs, obscure errors, and a lack of proper logging, with the subsequent mad debugging). I also tried Streaming, but I found it even more complex to debug, and I hit some deal-breaker errors regarding buffering and such that I couldn't overcome. I also tried a SWIG interface to wrap my code into a Java library; I'd never recommend that, for you might end up introducing a lot of memory issues and potential bugs into your already-working code, and you basically don't get anything useful from it. I've never worked with CUDA, though, but it shouldn't be any different from my Hadoop Pipes deployment besides the specific libraries you need. Be prepared to deal with configuration issues and many esoteric logs, nevertheless. My advice, based on my experience, is that you should be 99% sure that your original code is solid before migrating to Hadoop Pipes; you will have enough problems there anyway. Good luck with your work :) Regards, Silvina

On 3 March 2014 16:11, Basu,Indrashish indrash...@ufl.edu wrote:
Hello, can anyone help with the query below? Regards, Indrashish

On Sat, 01 Mar 2014 13:52:11 -0500, Basu,Indrashish wrote:
Hello, I am trying to execute a CUDA benchmark in a Hadoop framework, using Hadoop Pipes to invoke the CUDA code, which sits behind a C++ interface, from the Hadoop framework. I am interested in knowing what the drawbacks of using Hadoop Pipes for this might be, and whether Hadoop Streaming or a JNI interface would be a better choice. I am a bit unclear on this, so if anyone can throw some light on it and clarify, I would appreciate it. Regards, Indrashish

-- Indrashish Basu Graduate Student Department of Electrical and Computer Engineering University of Florida
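For concreteness on the SWIG/JNI route discussed above: on the Java side a JNI binding looks roughly like the sketch below. This is a hypothetical example — all names are invented for illustration — and the native library (here libcudakernel.so) still has to be built from the existing C++/CUDA code and placed on java.library.path.

    // Hypothetical JNI binding, for illustration only.
    public class CudaKernelBinding {
        static {
            // Loads libcudakernel.so on Linux; throws UnsatisfiedLinkError at
            // class-load time if the library is not on java.library.path.
            System.loadLibrary("cudakernel");
        }

        // Implemented in native code (e.g. via a javah- or SWIG-generated stub).
        // Every array crossing this boundary is copied or pinned by the JVM,
        // which is exactly where the memory issues mentioned above tend to creep in.
        public native float[] run(float[] input);
    }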
Re: Unable to export hadoop trunk into eclipse
Yes, I installed it. mvn clean install -DskipTests was successful. Only the import into Eclipse is failing.

On Tue, Mar 4, 2014 at 12:51 PM, Azuryy Yu azury...@gmail.com wrote:
Have you installed protobuf on your computer? https://code.google.com/p/protobuf/downloads/list

On Tue, Mar 4, 2014 at 3:08 PM, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote:
Hi Ted, I didn't do that earlier. Now I ran mvn eclipse:eclipse and tried importing the same projects into Eclipse. This is now throwing the following errors:
1. No marketplace entries found to handle Execution compile-protoc, in hadoop-common/pom.xml in Eclipse. Please see Help for more information.
2. No marketplace entries found to handle Execution compile-protoc, in hadoop-hdfs/src/contrib/bkjournal/pom.xml in Eclipse. Please see Help for more information.
Any idea?

On Tue, Mar 4, 2014 at 10:59 AM, Ted Yu yuzhih...@gmail.com wrote:
Have you run the following command under the root of your workspace?
mvn eclipse:eclipse

On Mar 3, 2014, at 9:18 PM, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote:
Hi, I checked out the hadoop trunk from http://svn.apache.org/repos/asf/hadoop/common/trunk. I set up protobuf-2.5.0 and then did a Maven build. mvn clean install -DskipTests worked well; the Maven build was successful. So I tried importing the project into Eclipse. It is showing errors in the pom.xml of the hadoop-common project. Below are the errors. Can someone help me here?
Plugin execution not covered by lifecycle configuration: org.apache.hadoop:hadoop-maven-plugins:3.0.0-SNAPSHOT:version-info (execution: version-info, phase: generate-resources)
The error is at line 299 of pom.xml in the hadoop-common project:
<execution>
  <id>version-info</id>
  <phase>generate-resources</phase>
  <goals>
    <goal>version-info</goal>
  </goals>
  <configuration>
    <source>
      <directory>${basedir}/src/main</directory>
      <includes>
        <include>java/**/*.java</include>
        <include>proto/**/*.proto</include>
      </includes>
    </source>
  </configuration>
</execution>
There are multiple projects which failed with that error; hadoop-common is one such project. Regards, Nagarjuna K
decommissioning a node
Our cluster has a node that reboots randomly. So I've gone to Ambari, decommissioned its HDFS service, stopped all services, and deleted the node from the cluster. I expected an fsck to immediately show under-replicated blocks, but everything comes up fine. How do I tell the cluster that this node is really gone, and that it should start replicating the missing blocks? Thanks John
RE: decommissioning a node
OK, after restarting all services, fsck now shows under-replication. Was it the NameNode restart? John

From: John Lilley [mailto:john.lil...@redpoint.net] Sent: Tuesday, March 04, 2014 5:47 AM To: user@hadoop.apache.org Subject: decommissioning a node
Our cluster has a node that reboots randomly. So I've gone to Ambari, decommissioned its HDFS service, stopped all services, and deleted the node from the cluster. I expected an fsck to immediately show under-replicated blocks, but everything comes up fine. How do I tell the cluster that this node is really gone, and that it should start replicating the missing blocks? Thanks John
Re: class org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto overrides final method getUnknownFields
Thank you for the reply, I got it to work.

[hduser@vm38 ~]$ /usr/lib/hadoop-yarn/bin/yarn version
Hadoop 2.2.0.2.0.6.0-101
Subversion g...@github.com:hortonworks/hadoop.git -r b07b2906c36defd389c8b5bd22bebc1bead8115b
Compiled by jenkins on 2014-01-09T05:18Z
Compiled with protoc 2.5.0
From source with checksum 704f1e463ebc4fb89353011407e965
This command was run using /usr/lib/hadoop/hadoop-common-2.2.0.2.0.6.0-101.jar
[hduser@vm38 ~]$

The main problem, I think, was that I had the yarn binary in two places and used the wrong one, which didn't pick up my yarn-site.xml. Every time I looked into .staging/job.../job.xml there were values from <source>yarn-default.xml</source> even though I had set them in yarn-site.xml. Typical mess-up :)

Tervitades, Margus (Margusja) Roo
+372 51 48 780
http://margus.roo.ee
http://ee.linkedin.com/in/margusroo
skype: margusja
ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)
-BEGIN PUBLIC KEY-
MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCvbeg7LwEC2SCpAEewwpC3ajxE
5ZsRMCB77L8bae9G7TslgLkoIzo9yOjPdx2NN6DllKbV65UjTay43uUDyql9g3tl
RhiJIcoAExkSTykWqAIPR88LfilLy1JlQ+0RD8OXiWOVVQfhOHpQ0R/jcAkM2lZa
BjM8j36yJvoBVsfOHQIDAQAB
-END PUBLIC KEY-

On 04/03/14 05:14, Rohith Sharma K S wrote:
Hi
The reason for "class org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet" is that Hadoop is compiled with protoc 2.5.0, but a lower version of protobuf is present in the classpath.
1. Check the MRAppMaster classpath to see which version of protobuf is there; 2.5.0 is expected.
Thanks & Regards
Rohith Sharma K S

-Original Message-
From: Margusja [mailto:mar...@roo.ee]
Sent: 03 March 2014 22:45
To: user@hadoop.apache.org
Subject: Re: class org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto overrides final method getUnknownFields

Hi
2.2.0 and 2.3.0 gave me the same container log. A little more detail: I am using an external Java client that submits the job.
Some lines from the maven pom.xml file:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.3.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
</dependency>
Lines from the external client: ...
2014-03-03 17:36:01 INFO FileInputFormat:287 - Total input paths to process : 1
2014-03-03 17:36:02 INFO JobSubmitter:396 - number of splits:1
2014-03-03 17:36:03 INFO JobSubmitter:479 - Submitting tokens for job: job_1393848686226_0018
2014-03-03 17:36:04 INFO YarnClientImpl:166 - Submitted application application_1393848686226_0018
2014-03-03 17:36:04 INFO Job:1289 - The url to track the job: http://vm38.dbweb.ee:8088/proxy/application_1393848686226_0018/
2014-03-03 17:36:04 INFO Job:1334 - Running job: job_1393848686226_0018
2014-03-03 17:36:10 INFO Job:1355 - Job job_1393848686226_0018 running in uber mode : false
2014-03-03 17:36:10 INFO Job:1362 - map 0% reduce 0%
2014-03-03 17:36:10 INFO Job:1375 - Job job_1393848686226_0018 failed with state FAILED due to: Application application_1393848686226_0018 failed 2 times due to AM Container for appattempt_1393848686226_0018_02 exited with exitCode: 1 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
...
Lines from the namenode:
...
14/03/03 19:12:42 INFO namenode.FSEditLog: Number of transactions: 900 Total time for transactions(ms): 69 Number of transactions batched in Syncs: 0 Number of syncs: 542 SyncTimes(ms): 9783
14/03/03 19:12:42 INFO BlockStateChange: BLOCK* addToInvalidates: blk_1073742050_1226 90.190.106.33:50010
14/03/03 19:12:42 INFO hdfs.StateChange: BLOCK* allocateBlock: /user/hduser/input/data666.noheader.data. BP-802201089-90.190.106.33-1393506052071 blk_1073742056_1232{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[90.190.106.33:50010|RBW]]}
14/03/03 19:12:44 INFO hdfs.StateChange: BLOCK* InvalidateBlocks: ask 90.190.106.33:50010 to delete
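A side note on Rohith's suggestion to check which protobuf version the MRAppMaster sees: one quick way is to ask the JVM where it loaded the protobuf classes from. A minimal sketch, to be run with the same classpath the container uses; UnknownFieldSet is the class named in the error above:

    import java.security.CodeSource;

    public class WhichProtobuf {
        public static void main(String[] args) throws Exception {
            Class<?> c = Class.forName("com.google.protobuf.UnknownFieldSet");
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Prints the jar the class was resolved from, e.g. a stray
            // protobuf-java-2.4.x jar on the classpath would show up here.
            System.out.println(src == null
                    ? "loaded from the bootstrap classpath"
                    : "loaded from: " + src.getLocation());
        }
    }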
Need help: fsck FAILs, refuses to clean up corrupt fs
I have a file system with some missing/corrupt blocks. However, running hdfs fsck -delete also fails with errors. How do I get around this? Thanks John

[hdfs@metallica yarn]$ hdfs fsck -delete /rpdm/tmp/ProjectTemp_461_40/TempFolder_4/data00012_00.dld
Connecting to namenode via http://anthrax.office.datalever.com:50070
FSCK started by hdfs (auth:SIMPLE) from /192.168.57.110 for path /rpdm/tmp/ProjectTemp_461_40/TempFolder_4/data00012_00.dld at Tue Mar 04 06:05:40 MST 2014
.
/rpdm/tmp/ProjectTemp_461_40/TempFolder_4/data00012_00.dld: CORRUPT blockpool BP-1827033441-192.168.57.112-1384284857542 block blk_1074200714
/rpdm/tmp/ProjectTemp_461_40/TempFolder_4/data00012_00.dld: CORRUPT blockpool BP-1827033441-192.168.57.112-1384284857542 block blk_1074200741
/rpdm/tmp/ProjectTemp_461_40/TempFolder_4/data00012_00.dld: CORRUPT blockpool BP-1827033441-192.168.57.112-1384284857542 block blk_1074200778
/rpdm/tmp/ProjectTemp_461_40/TempFolder_4/data00012_00.dld: MISSING 3 blocks of total size 299116266 B. Status: CORRUPT
 Total size: 299116266 B
 Total dirs: 0
 Total files: 1
 Total symlinks: 0
 Total blocks (validated): 3 (avg. block size 99705422 B)
 CORRUPT FILES: 1
 MISSING BLOCKS: 3
 MISSING SIZE: 299116266 B
 CORRUPT BLOCKS: 3
 Minimally replicated blocks: 0 (0.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 0.0
 Corrupt blocks: 3
 Missing replicas: 0
 Number of data-nodes: 8
 Number of racks: 1
FSCK ended at Tue Mar 04 06:05:40 MST 2014 in 1 milliseconds
fsck encountered internal errors!
Fsck on path '/rpdm/tmp/ProjectTemp_461_40/TempFolder_4/data00012_00.dld' FAILED
Question on DFS Balancing
Hi, I am new to the mailing list. I am using Hadoop 0.20.2 with append, r1056497. The question I have is related to balancing. I have a 5-datanode cluster and each node has 2 disks attached to it. The second disk was added when the first disk was reaching its capacity. The scenario I am facing is this: when the new disk was added, Hadoop automatically moved some data over to it, but over time I noticed that data is no longer being written to the second disk. I have also hit an issue on a datanode where the first disk had 100% utilization. How can I overcome this scenario? Is it not Hadoop's job to balance the disk utilization between multiple disks on a single datanode? Thanks Divye Sheth
Node manager or Resource Manager crash
Hi, I am running an application on a 2-node cluster which tries to acquire all the containers available on one of those nodes, and the remaining containers from the other node in the cluster. When I run this application continuously in a loop, either the NM or the RM gets killed at a random point. There is no corresponding message in the log files. One of the times the NM got killed today, the tail of its log looked like this:

2014-03-04 02:42:44,386 DEBUG org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: isredeng:52867 sending out status for 16 containers
2014-03-04 02:42:44,386 DEBUG org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's health-status : true,

And at the time of the NM's crash, the RM's log has the following entries:

2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing isredeng:52867 of type STATUS_UPDATE
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType: NODE_UPDATE
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.ipc.Server: IPC Server Responder: responding to org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from 9.70.137.184:33696 Call#14060 Retry#0 Wrote 40 bytes.
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: nodeUpdate: isredeng:52867 clusterResources: memory:16384, vCores:16
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Node being looked for scheduling isredeng:52867 availableResource: memory:0, vCores:-8
2014-03-04 02:42:40,393 DEBUG org.apache.hadoop.ipc.Server: got #151

Note: the name of the node on which the NM got killed is isredeng. Does anything in the above messages indicate why it was killed? Thanks, Kishore
Meaning of messages in log and debugging
Hello list, I'm currently debugging my Hadoop MR application and I have some general questions about the messages in the log and the debugging process.
- What does "Container killed by the ApplicationMaster. Container killed on request. Exit code is 143" mean? What does 143 stand for?
- I also see the following exception in the log:
Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
What does this mean? It originates from a diagnostics report from a container, and the log4j message level is set to INFO.
- Are there any related links which describe the life cycle of a container?
- Is there a golden rule for debugging a Hadoop MR application?
- My application is very memory-intensive... is there any way to profile the memory consumption of a single container?
Thanks! Best regards Yves
Re: [hadoop] AvroMultipleOutputs org.apache.avro.file.DataFileWriter$AppendWriteException
Outside hadoop: avro-1.7.6. Inside hadoop: avro-mapred-1.7.6-hadoop2.

From: Stanley Shi s...@gopivotal.com
Reply-To: user@hadoop.apache.org
Date: Monday, March 3, 2014 at 8:30 PM
To: user@hadoop.apache.org
Subject: Re: [hadoop] AvroMultipleOutputs org.apache.avro.file.DataFileWriter$AppendWriteException

which avro version are you using when running outside of hadoop? Regards, Stanley Shi

On Mon, Mar 3, 2014 at 11:49 PM, John Pauley john.pau...@threattrack.com wrote:
This is cross posted to the avro-user list (http://mail-archives.apache.org/mod_mbox/avro-user/201402.mbox/%3ccf3612f6.94d2%25john.pau...@threattrack.com%3e).

Hello all, I’m having an issue using AvroMultipleOutputs in a map/reduce job. The issue occurs when using a schema that has a union of null and a fixed (among other complex types), default to null, and it is not null. Please find the full stack trace below and a sample map/reduce job that generates an Avro container file and uses that for the m/r input. Note that I can serialize/deserialize without issue using GenericDatumWriter/GenericDatumReader outside of hadoop… Any insight would be helpful.

Stack trace:
java.lang.Exception: org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.NullPointerException: in com.foo.bar.simple_schema in union null of union in field baz of com.foo.bar.simple_schema
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
Caused by: org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.NullPointerException: in com.foo.bar.simple_schema in union null of union in field baz of com.foo.bar.simple_schema
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:296)
at org.apache.avro.mapreduce.AvroKeyRecordWriter.write(AvroKeyRecordWriter.java:77)
at org.apache.avro.mapreduce.AvroKeyRecordWriter.write(AvroKeyRecordWriter.java:39)
at org.apache.avro.mapreduce.AvroMultipleOutputs.write(AvroMultipleOutputs.java:400)
at org.apache.avro.mapreduce.AvroMultipleOutputs.write(AvroMultipleOutputs.java:378)
at com.tts.ox.mapreduce.example.avro.AvroContainerFileDriver$SampleMapper.map(AvroContainerFileDriver.java:78)
at com.tts.ox.mapreduce.example.avro.AvroContainerFileDriver$SampleMapper.map(AvroContainerFileDriver.java:62)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
Caused by: java.lang.NullPointerException: in com.foo.bar.simple_schema in union null of union in field baz of com.foo.bar.simple_schema
at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:145)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58)
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:290)
... 16 more
Caused by: java.lang.NullPointerException
at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:457)
at org.apache.avro.specific.SpecificData.getSchema(SpecificData.java:189)
at org.apache.avro.reflect.ReflectData.isRecord(ReflectData.java:167)
at org.apache.avro.generic.GenericData.getSchemaName(GenericData.java:608)
at org.apache.avro.specific.SpecificData.getSchemaName(SpecificData.java:265)
at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:597)
at org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:151)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:71)
at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:143)
at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114)
at org.apache.avro.reflect.ReflectDatumWriter.writeField(ReflectDatumWriter.java:175)
at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:143)

Sample m/r job:
package com.tts.ox.mapreduce.example.avro;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import
RE: Need help: fsck FAILs, refuses to clean up corrupt fs
More information from the NameNode log. I don't understand... it is saying that I cannot delete the corrupted file until the NameNode leaves safe mode, but it won't leave safe mode until the file system is no longer corrupt. How do I get there from here? Thanks john 2014-03-04 06:02:51,584 ERROR namenode.NameNode (NamenodeFsck.java:deleteCorruptedFile(446)) - Fsck: error deleting corrupted file /rpdm/tmp/ProjectTemp_461_40/TempFolder_4/data00012_00.dld org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /rpdm/tmp/ProjectTemp_461_40/TempFolder_4/data00012_00.dld. Name node is in safe mode. The reported blocks 169302 needs additional 36 blocks to reach the threshold 1. of total blocks 169337. Safe mode will be turned off automatically at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1063) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:3141) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:3101) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3085) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:697) at org.apache.hadoop.hdfs.server.namenode.NamenodeFsck.deleteCorruptedFile(NamenodeFsck.java:443) at org.apache.hadoop.hdfs.server.namenode.NamenodeFsck.check(NamenodeFsck.java:426) at org.apache.hadoop.hdfs.server.namenode.NamenodeFsck.check(NamenodeFsck.java:289) at org.apache.hadoop.hdfs.server.namenode.NamenodeFsck.check(NamenodeFsck.java:289) at org.apache.hadoop.hdfs.server.namenode.NamenodeFsck.check(NamenodeFsck.java:289) at org.apache.hadoop.hdfs.server.namenode.NamenodeFsck.check(NamenodeFsck.java:289) at org.apache.hadoop.hdfs.server.namenode.NamenodeFsck.check(NamenodeFsck.java:289) at org.apache.hadoop.hdfs.server.namenode.NamenodeFsck.fsck(NamenodeFsck.java:206) at org.apache.hadoop.hdfs.server.namenode.FsckServlet$1.run(FsckServlet.java:67) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.hdfs.server.namenode.FsckServlet.doGet(FsckServlet.java:58) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) From: John Lilley [mailto:john.lil...@redpoint.net] Sent: Tuesday, March 04, 2014 6:08 AM To: user@hadoop.apache.org Subject: Need help: fsck FAILs, refuses to clean up corrupt fs I have a file system with some missing/corrupt blocks. However, running hdfs fsck -delete also fails with errors. How do I get around this? Thanks John [hdfs@metallica yarn]$ hdfs fsck -delete
RE: Need help: fsck FAILs, refuses to clean up corrupt fs
Ah... found the answer. I had to manually leave safe mode to delete the corrupt files. john

From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Tuesday, March 04, 2014 9:33 AM
To: user@hadoop.apache.org
Subject: RE: Need help: fsck FAILs, refuses to clean up corrupt fs

More information from the NameNode log. I don't understand... it is saying that I cannot delete the corrupted file until the NameNode leaves safe mode, but it won't leave safe mode until the file system is no longer corrupt. How do I get there from here? Thanks john
[quoted NameNode log and stack trace snipped]
RE: Need help: fsck FAILs, refuses to clean up corrupt fs
You can force the namenode to leave safe mode: hadoop dfsadmin -safemode leave. Then run the hadoop fsck. Thanks Divye Sheth

On Mar 4, 2014 10:03 PM, John Lilley john.lil...@redpoint.net wrote:
More information from the NameNode log. I don't understand... it is saying that I cannot delete the corrupted file until the NameNode leaves safe mode, but it won't leave safe mode until the file system is no longer corrupt. How do I get there from here? Thanks john
[quoted NameNode log and stack trace snipped]
Re: Hadoop Jobtracker cluster summary of heap size and OOME
join the group

On Fri, Oct 11, 2013 at 10:28 PM, Viswanathan J jayamviswanat...@gmail.com wrote:
Hi, I'm running a 14-node Hadoop cluster with tasktrackers running on all nodes. I have set the jobtracker default memory size in hadoop-env.sh:
HADOOP_HEAPSIZE=1024
I have set the mapred.child.java.opts value in mapred-site.xml as:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
-- Regards, Viswa.J

-- Regards. Vikas S Pabale. +919730198004
Re: Node manager or Resource Manager crash
I remember you asking this question before. Check if your OS' OOM killer is killing it. +Vinod

On Mar 4, 2014, at 6:53 AM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote:
[quoted message snipped]
Re: Not information in Job History UI
That explains a lot. Thanks for the information. I appreciate your help.

On Mon, Mar 3, 2014 at 7:47 PM, Jian He j...@hortonworks.com wrote:
You said "there are no job logs generated on the server that is running the job" — I was quoting your previous sentence to answer your question, "If I were to run a job and I wanted to tail the job log as it was running, where would I find that log?"
1) Set yarn.nodemanager.delete.debug-delay-sec to a larger value and look for the logs in the local dirs specified by yarn.nodemanager.log-dirs. Or
2) enable log aggregation via yarn.log-aggregation-enable. Log aggregation aggregates those NM local logs and uploads them to HDFS once the application is finished. Then you can use the yarn logs command or simply go to the history UI to see the logs.
You can find a good explanation at http://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/
Thanks.

On Mon, Mar 3, 2014 at 4:29 PM, SF Hadoop sfhad...@gmail.com wrote:
Thanks for that info Jian. You said "there are no job logs generated on the server that is running the job", so am I correct in assuming the logs will be in the dir specified by yarn.nodemanager.log-dirs on the datanodes? I am quite confused as to where the logs for each specific part of the ecosystem reside. If I were to run a job and I wanted to tail the job log as it was running, where would I find that log? Thanks for your help.

On Mon, Mar 3, 2014 at 11:46 AM, Jian He j...@hortonworks.com wrote:
Note that the node manager does not keep finished applications and only shows running apps, so its UI won't show the finished apps. Conversely, the job history server UI will only show the finished apps, not the running apps.
bq. there are no job logs generated on the server that is running the job.
By default, the local logs are deleted after the job finishes. You can configure yarn.nodemanager.delete.debug-delay-sec to delay the deletion of the logs. Jian

On Mon, Mar 3, 2014 at 10:45 AM, SF Hadoop sfhad...@gmail.com wrote:
Hadoop 2.2.0. CentOS 6.4. Viewing the UI in various browsers.
I am having a problem where no information is visible in my Job History UI. I run test jobs, they complete without error, but no information ever populates the nodemanager or jobhistory server UI. Also, there are no job logs generated on the server that is running the job.
I have the following settings configured: yarn.nodemanager.local-dirs, yarn.nodemanager.log-dirs, yarn.log.server.url, plus the basic yarn log dir. I get output for the daemons but very little for the job. All I get that refers to the jobhistory server is the following (so it appears to be functioning properly):
2014-02-18 11:43:06,824 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 19888
2014-02-18 11:43:06,824 INFO org.mortbay.log: jetty-6.1.26
2014-02-18 11:43:06,847 INFO org.mortbay.log: Extract jar:file:/usr/lib/hadoop-yarn/hadoop-yarn-common-2.1.0.2.0.5.0-67.jar!/webapps/jobhistory to /tmp/Jetty_server_19888_jobhistoryv7gnnv/webapp
2014-02-18 11:43:07,085 INFO org.mortbay.log: Started SelectChannelConnector@server:19888
2014-02-18 11:43:07,085 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 19888
2014-02-18 11:43:07,477 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
I have a feeling this is a misconfiguration but I cannot figure out what setting is missing or wrong. Other than not being able to see any of the jobs in the UIs, everything appears to be working correctly, so this is quite confusing.
Any help is appreciated.
Re: Meaning of messages in log and debugging
bq. What does "Container killed by the ApplicationMaster. Container killed on request. Exit code is 143" mean? What does 143 stand for?
It's the diagnostic message generated by YARN, which indicates the container was killed on request by MR's ApplicationMaster. 143 is the exit code of the container process; by the usual shell convention it is 128 + 15, i.e. the process was terminated with SIGTERM.
bq. Are there any related links which describe the life cycle of a container?
This is what I found online: http://diggerk.wordpress.com/2013/09/19/lifecycle-of-yarn-resource-manager-containers/. Otherwise, you can have a look at ContainerImpl.java if you want the details.
bq. My application is very memory-intensive... is there any way to profile the memory consumption of a single container?
You can find the metrics in the RM and NM web UIs, or you can access the RESTful APIs programmatically. - Zhijie

On Tue, Mar 4, 2014 at 7:24 AM, Yves Weissig weis...@uni-mainz.de wrote:
[original message quoted in full, snipped]

-- Zhijie Shen Hortonworks Inc. http://hortonworks.com/
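Expanding on the last point: the NodeManager serves per-container information over its REST API at /ws/v1/node/containers (web port 8042 by default). A minimal sketch that dumps the raw JSON — the host name is a placeholder, and the exact fields in the response vary by Hadoop version:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class NmContainerDump {
        public static void main(String[] args) throws Exception {
            // Replace nodemanager-host with a real NM address.
            URL url = new URL("http://nodemanager-host:8042/ws/v1/node/containers");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/json");
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON; parse with a JSON library as needed
            }
            in.close();
        }
    }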
Benchmarking Hive Changes
I’ve been trying to benchmark some of the Hive enhancements in Hadoop 2.0 using the HDP Sandbox. I took one of their example queries and executed it with the tables stored as TEXTFILE, RCFILE, and ORC. I also tried enabling vectorized execution and predicate pushdown.

SELECT s07.description, s07.salary, s08.salary, s08.salary - s07.salary
FROM sample_07 s07 JOIN sample_08 s08 ON (s07.code = s08.code)
WHERE s07.salary < s08.salary
SORT BY s08.salary - s07.salary DESC

Ultimately there was not much difference in performance across any of the executions. Can someone clarify for me whether I need an actual full cluster to see performance improvements, or whether I'm missing something else? I thought at minimum I would see an improvement moving from TEXTFILE to ORC.
Re: Node manager or Resource Manager crash
Yes Vinod, I was asking this question some time back, and I have got back to resolving the issue again. I tried to see whether the OOM killer is killing it, but it is not. I have checked the free swap space on my box while my test is going on, but that doesn't seem to be the issue. Also, I have verified whether the OOM score is going high for any of these processes, because that is when the OOM killer kills them, but they are not going high either. Thanks, Kishore

On Tue, Mar 4, 2014 at 10:51 PM, Vinod Kumar Vavilapalli vino...@apache.org wrote:
I remember you asking this question before. Check if your OS' OOM killer is killing it. +Vinod
[rest of quoted thread snipped]
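For anyone who wants to reproduce the OOM-score check Kishore describes: the kernel exposes the score at /proc/<pid>/oom_score. A small Linux-only polling sketch (the pid argument is whatever jps reports for the NM or RM):

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class OomScoreWatcher {
        public static void main(String[] args) throws Exception {
            String pid = args[0];
            // Poll once a second; a NoSuchFileException here timestamps the
            // moment the process disappeared, which helps correlate with the kill.
            while (true) {
                String score = new String(
                        Files.readAllBytes(Paths.get("/proc/" + pid + "/oom_score"))).trim();
                System.out.println(System.currentTimeMillis() + " oom_score=" + score);
                Thread.sleep(1000);
            }
        }
    }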
Re: Question on DFS Balancing
You're probably looking for https://issues.apache.org/jira/browse/HDFS-1804

On Tue, Mar 4, 2014 at 5:54 AM, divye sheth divs.sh...@gmail.com wrote:
[quoted question snipped]

-- Harsh J
Re: [hadoop] AvroMultipleOutputs org.apache.avro.file.DataFileWriter$AppendWriteException
Which version of hadoop are you using? There's a possibility that the hadoop environment already has an avro*.jar in place, which caused the jar conflict. Regards, Stanley Shi

On Tue, Mar 4, 2014 at 11:25 PM, John Pauley john.pau...@threattrack.com wrote:
Outside hadoop: avro-1.7.6. Inside hadoop: avro-mapred-1.7.6-hadoop2.
[rest of quoted thread, including the stack trace and sample job above, snipped]
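For reference, here is a self-contained sketch of the round trip John says works outside Hadoop: writing a record whose field is a union of null and a fixed, using GenericDatumWriter. The schema is a stand-in reconstructed from the stack trace (com.foo.bar.simple_schema with field baz); the fixed type's name and size are invented. Note that the stack trace in the original post fails inside ReflectDatumWriter/ReflectData rather than GenericDatumWriter, which fits the symptom that the same data writes fine outside the AvroMultipleOutputs path.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class UnionFixedRoundTrip {
        public static void main(String[] args) throws Exception {
            // Stand-in schema: a union of null and a 4-byte fixed, default null.
            String json = "{\"type\":\"record\",\"name\":\"simple_schema\","
                    + "\"namespace\":\"com.foo.bar\",\"fields\":[{\"name\":\"baz\","
                    + "\"type\":[\"null\",{\"type\":\"fixed\",\"name\":\"four\",\"size\":4}],"
                    + "\"default\":null}]}";
            Schema schema = new Schema.Parser().parse(json);
            Schema fixedSchema = schema.getField("baz").schema().getTypes().get(1);

            GenericRecord rec = new GenericData.Record(schema);
            rec.put("baz", new GenericData.Fixed(fixedSchema, new byte[] {1, 2, 3, 4}));

            // GenericDatumWriter resolves the null/fixed union without trouble.
            DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
                    new GenericDatumWriter<GenericRecord>(schema));
            writer.create(schema, new File("simple.avro"));
            writer.append(rec);
            writer.close();
        }
    }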
Re: Question on DFS Balancing
Thanks Harsh. The JIRA is fixed in version 2.1.0, whereas I am using Hadoop 0.20.2 (we are in the process of upgrading). Is there a workaround for the short term to balance the disk utilization? If the patch in the JIRA were applied to the version that I am using, would it break anything? Thanks Divye Sheth

On Wed, Mar 5, 2014 at 11:28 AM, Harsh J ha...@cloudera.com wrote:
You're probably looking for https://issues.apache.org/jira/browse/HDFS-1804
[rest of quoted thread snipped]
Re: Question on DFS Balancing
Hi, that would probably break something if you apply the patch from 2.x to 0.20.x, but it depends. AFAIK, the Balancer had a major refactor in HDFS v2, so you would be better off fixing it yourself based on HDFS-1804.

On Wed, Mar 5, 2014 at 3:47 PM, divye sheth divs.sh...@gmail.com wrote:
Thanks Harsh. The JIRA is fixed in version 2.1.0, whereas I am using Hadoop 0.20.2 (we are in the process of upgrading). Is there a workaround for the short term to balance the disk utilization?
[rest of quoted thread snipped]
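For context, HDFS-1804 (shipped in 2.1.0, as noted above) makes the DataNode's volume-choosing policy pluggable, and on 2.1+ an available-space-aware policy can be selected via the dfs.datanode.fsdataset.volume.choosing.policy property. The toy sketch below illustrates only the idea — round-robin while volumes are roughly balanced, otherwise favor the emptiest disk. It is not the actual Hadoop implementation, and the class and threshold names are invented for illustration; if I recall correctly, the real policy is subtler and biases only a fraction of writes toward the emptier volumes rather than all of them.

    import java.io.File;
    import java.util.List;

    public class ToyVolumeChooser {
        private final long imbalanceThresholdBytes;
        private int nextRoundRobin = 0;

        public ToyVolumeChooser(long imbalanceThresholdBytes) {
            this.imbalanceThresholdBytes = imbalanceThresholdBytes;
        }

        // Pick a data directory for the next block replica.
        public File choose(List<File> volumes) {
            File emptiest = volumes.get(0);
            File fullest = volumes.get(0);
            for (File v : volumes) {
                if (v.getUsableSpace() > emptiest.getUsableSpace()) emptiest = v;
                if (v.getUsableSpace() < fullest.getUsableSpace()) fullest = v;
            }
            long spread = emptiest.getUsableSpace() - fullest.getUsableSpace();
            if (spread <= imbalanceThresholdBytes) {
                // Volumes are roughly balanced: plain round-robin.
                return volumes.get(nextRoundRobin++ % volumes.size());
            }
            // Imbalanced: steer new blocks to the disk with the most free space.
            return emptiest;
        }
    }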