Re: is it possible to concatenate output files under many reducers?
yes, that is a general solution for controlling the number of output files. However, what if I need to control the number of outputs dynamically? If an output file is named 'A', 5 of those output files are needed; if an output file is named 'B', 10 are needed. Is that possible under hadoop?

Junyoung Kim (juneng...@gmail.com)

On 05/12/2011 02:17 PM, Harsh J wrote:

Short, blind answer: You could run 10 reducers. Otherwise, you'll have to run another job that picks up a few files each in a mapper and merges them out. But having 60 files shouldn't really be a problem if they are sufficiently large (at least 80% of a block size, perhaps -- you can tune the # of reducers to achieve this).

On Thu, May 12, 2011 at 6:14 AM, Jun Young Kim (juneng...@gmail.com) wrote:

hi, all. I have 60 reducers, which generate output files from output-r-00000 to output-r-00059. Under this situation, I want to control the number of output files. For example, is it possible to concatenate all the output files into 10, from output-r-00001 to output-r-00010? thanks -- Junyoung Kim (juneng...@gmail.com)
is it possible to concatenate output files under many reducers?
hi, all. I have 60 reducers, which generate output files from output-r-00000 to output-r-00059. Under this situation, I want to control the number of output files. For example, is it possible to concatenate all the output files into 10, from output-r-00001 to output-r-00010? thanks -- Junyoung Kim (juneng...@gmail.com)
how am I able to get output file names?
hi, after a job completes, I want to know the output file names, because I used the MultipleOutputs class to generate several output files. Do you know how I can get them? thanks. -- Junyoung Kim (juneng...@gmail.com)
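As a sketch of one way to enumerate them afterwards: on a real cluster you would list the job's output path with FileSystem.listStatus and filter by the named-output prefix. The local-filesystem analogue below (the "A-" prefix convention matches how MultipleOutputs names its files, e.g. A-r-00000; the directory is illustrative) shows the same filtering idea:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class NamedOutputLister {
    // Returns the names of files in outputDir that belong to the given
    // MultipleOutputs named output (e.g. prefix "A" matches "A-r-00000").
    // On HDFS the same startsWith filter would be applied to the Path
    // names returned by FileSystem.listStatus(outputPath).
    static List<String> listNamedOutputs(File outputDir, String prefix) {
        List<String> names = new ArrayList<>();
        File[] files = outputDir.listFiles();
        if (files == null) {
            return names; // not a directory, or an I/O error occurred
        }
        for (File f : files) {
            if (f.isFile() && f.getName().startsWith(prefix + "-")) {
                names.add(f.getName());
            }
        }
        return names;
    }
}
```

This deliberately skips non-file entries and files such as _SUCCESS that do not carry the named-output prefix.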
is a single thread allocated to a single output file ?
hi, is a single thread allocated to each output file when a job writes multiple output files? If there are 10,000 output files, does hadoop try to create a thread for each output file? -- Junyoung Kim (juneng...@gmail.com)
what's the differences between file.blocksize and dfs.blocksize in a job.xml?
hi, I am wondering about the concepts of file.blocksize and dfs.blocksize.

In hdfs-site.xml, I set:

  <property>
    <name>dfs.block.size</name>
    <value>536870912</value>
    <final>true</final>
  </property>

In job.xml, I found file.blocksize = 67108864 and dfs.blocksize = 536870912.

The DFS browser page shows:

  Name            Type  Size      Replication  Block Size  Modification Time  Permission  Owner  Group
  20110309160005  dir                                      2011-03-09 16:51   rwxr-xr-x   test   supergroup
  all0307.ep      file  21.53 GB  2            64 MB       2011-03-09 15:58   rw-r--r--   test   supergroup
  all0307.svc     file  21.53 GB  2            64 MB       2011-03-09 15:13   rw-r--r--   test   supergroup

The total input size of the job is about 44 GB (all0307.ep + all0307.svc). In the map step, the number of splits is 690 (which means each map task took a single block of 64 MB). I thought the split count should be about 88, because a single block is 512 MB and the input files total 44 GB. How could I get the result I want? thanks. -- Junyoung Kim (juneng...@gmail.com)
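The split arithmetic in question can be checked with a simplified model (one split per block per file, ignoring FileInputFormat's min/max split tuning; the byte counts are taken from the listing above):

```java
// Simplified model of FileInputFormat: each file contributes
// ceil(fileSize / blockSize) splits, and a split never spans two files.
public class SplitMath {
    static long splitsFor(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize; // ceiling division
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        long fileSize = (long) (21.53 * gb);   // all0307.ep or all0307.svc
        // Two files of 21.53 GB each:
        long splits64  = 2 * splitsFor(fileSize, 64L * 1024 * 1024);   // 64 MB blocks
        long splits512 = 2 * splitsFor(fileSize, 512L * 1024 * 1024);  // 512 MB blocks
        System.out.println(splits64 + " splits at 64 MB, " + splits512 + " at 512 MB");
    }
}
```

With 64 MB blocks this gives 690 splits and with 512 MB blocks about 88, matching both numbers in the question. Note that the browser listing shows the files were stored with a 64 MB block size: dfs.block.size applies at the time a file is written, so files that already exist keep their original block size regardless of the current setting.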
How to count rows of output files ?
Hi. My hadoop application generates several output files from a single job (for example, A, B, and C are generated as a result). After the job finishes, I want to count each file's rows. Is there any way to count them per file? thanks. -- Junyoung Kim (juneng...@gmail.com)
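One way is a second pass over each output file after the job; a minimal sketch of the counting step is below (on HDFS you would pass a reader wrapped around FileSystem.open(path) for each output file; the helper itself is filesystem-agnostic):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

public class RowCounter {
    // Counts newline-delimited rows from any Reader. For an HDFS file
    // you would call it with new InputStreamReader(fs.open(path)).
    static long countRows(Reader in) throws IOException {
        long rows = 0;
        try (BufferedReader br = new BufferedReader(in)) {
            while (br.readLine() != null) {
                rows++;
            }
        }
        return rows;
    }
}
```

An alternative that avoids the second pass: increment a custom Counter (one per named output) in the reducer as each record is written, so the per-file row counts show up in the job's counters when it completes.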
are these warning messages worth fixing?
hi, during a single hadoop job execution, I got several of these warnings from mappers:

  Another (possibly speculative) attempt already SUCCEEDED

Can this cause errors? -- Junyoung Kim (juneng...@gmail.com)
Re: is there more smarter way to execute a hadoop cluster?
hello, harsh. To use the MultipleOutputs class, I need a Job instance to pass as the first argument when configuring my job:

  MultipleOutputs.addNamedOutput(Job job, String namedOutput,
      Class<? extends OutputFormat> outputFormatClass,
      Class<?> keyClass, Class<?> valueClass)

which adds a named output for the job. As you know, the Job class is deprecated in 0.21.0 for submitting my job to a cluster like runJob() does. What should I do?

Junyoung Kim (juneng...@gmail.com)

On 02/24/2011 04:12 PM, Harsh J wrote:

Hello. On Thu, Feb 24, 2011 at 12:25 PM, Jun Young Kim (juneng...@gmail.com) wrote: "Hi, I executed my cluster this way: call a command in a shell directly."

What are you doing within your testCluster.jar? If you are simply submitting a job, you can use a Driver method and get rid of all these hassles. The JobClient and Job classes both support submitting jobs from the Java API itself. Please read the tutorial on submitting application code via code itself: http://developer.yahoo.com/hadoop/tutorial/module4.html#driver

Notice the last line in the code presented there, which submits a job itself. Using runJob() also prints your progress/counters, etc. The way you've implemented this looks unnecessary when your Jar itself can be made runnable with a Driver!
Re: is there more smarter way to execute a hadoop cluster?
Now I am using the Job.waitForCompletion(boolean) method to submit my job, but my jar cannot open hdfs files. Also, after submitting my job, I couldn't see the job history on the admin pages (jobtracker.jsp), even though my job succeeded. For example, I set the input path as hdfs:/user/juneng/1.input, but got this error:

  Wrong FS: hdfs:/user/juneng/1.input, expected: file:///

Junyoung Kim (juneng...@gmail.com)

On 02/24/2011 06:41 PM, Harsh J wrote:

In the new API, the 'Job' class too has Job.submit() and Job.waitForCompletion(boolean) methods. Please see the API here: http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html
Re: is there more smarter way to execute a hadoop cluster?
hi, I found the reason for my problem. When submitting a job via the shell, conf.get("fs.default.name") is hdfs://localhost. When submitting a job from a java application directly, conf.get("fs.default.name") is file://localhost, so I couldn't read any files from hdfs. I think my java application couldn't read the *-site.xml configurations properly.

Junyoung Kim (juneng...@gmail.com)

On 02/24/2011 06:41 PM, Harsh J wrote:

Hey. On Thu, Feb 24, 2011 at 2:36 PM, Jun Young Kim (juneng...@gmail.com) wrote: "What should I do?"

In the new API, the 'Job' class too has Job.submit() and Job.waitForCompletion(boolean) methods. Please see the API here: http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html
Re: is there more smarter way to execute a hadoop cluster?
Hi, Harsh. I've already tried using the final tag to make the setting unmodifiable, but my result is no different.

core-site.xml:

  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost</value>
      <final>true</final>
    </property>
  </configuration>

The other *-site.xml files are modified by the same rule. thanks.

Junyoung Kim (juneng...@gmail.com)

On 02/25/2011 02:50 PM, Harsh J wrote:

Hi. On Fri, Feb 25, 2011 at 10:17 AM, Jun Young Kim (juneng...@gmail.com) wrote: "hi, I found the reason for my problem. When submitting a job via the shell, conf.get("fs.default.name") is hdfs://localhost. When submitting a job from a java application directly, it is file://localhost, so I couldn't read any files from hdfs. I think my java application couldn't read the *-site.xml configurations properly."

Have a look at this Q: http://wiki.apache.org/hadoop/FAQ#How_do_I_get_my_MapReduce_Java_Program_to_read_the_Cluster.27s_set_configuration_and_not_just_defaults.3F
Re: is there more smarter way to execute a hadoop cluster?
hello, harsh. Do you mean I need to read the xml files and then parse them to set the values in my app?

Junyoung Kim (juneng...@gmail.com)

On 02/25/2011 03:32 PM, Harsh J wrote:

It is best if your application gets the right configuration files on its classpath itself, so that the right values are read (how else would it know your values!).
is there more smarter way to execute a hadoop cluster?
Hi, I executed my cluster this way: call a command in a shell directly.

  String runInCommand = "/opt/hadoop-0.21.0/bin/hadoop jar testCluster.jar example";
  Process proc = Runtime.getRuntime().exec(runInCommand);
  proc.waitFor();
  BufferedReader in = new BufferedReader(new InputStreamReader(proc.getErrorStream()));
  for (String str; (str = in.readLine()) != null;)
      System.out.println(str);
  System.exit(0);

But in the hadoop script, it calls the RunJar class to deploy the testCluster.jar file, doesn't it? Is there a smarter way to execute a hadoop cluster? thanks. -- Junyoung Kim (juneng...@gmail.com)
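If one does have to shell out like this, a sketch with ProcessBuilder avoids a pitfall in the snippet above: calling waitFor() before draining the child's output can deadlock if the child fills its output pipe. The command used here is only illustrative:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ShellRunner {
    // Runs a command, merging stderr into stdout and draining the
    // stream *before* waiting, so the child can never block on a
    // full output pipe (the waitFor()-then-read pitfall).
    static int run(String... command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout
        Process proc = pb.start();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(proc.getInputStream()))) {
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);
            }
        }
        return proc.waitFor();
    }
}
```

That said, the advice in the replies stands: submitting through the Job API from a driver class avoids spawning a shell at all.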
How can I choose a proper block size if the input file sizes are dynamic?
hi, all. I know the dfs.blocksize key can affect hadoop's performance. In my case, I have thousands of directories containing many input files of different sizes (from 10K to 1G). In this case, how can I choose a dfs.blocksize that gives the best performance?

  11/02/22 17:45:49 INFO input.FileInputFormat: Total input paths to process : 15407
  11/02/22 17:45:54 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
  11/02/22 17:45:54 INFO mapreduce.JobSubmitter: number of splits:15411
  11/02/22 17:45:54 INFO mapreduce.JobSubmitter: adding the following namenodes' delegation tokens:null
  11/02/22 17:45:54 INFO mapreduce.Job: Running job: job_201102221737_0002
  11/02/22 17:45:55 INFO mapreduce.Job: map 0% reduce 0%

thanks. -- Junyoung Kim (juneng...@gmail.com)
Re: How can I choose a proper block size if the input file sizes are dynamic?
currently, I have a problem reducing the output of the mappers.

  11/02/23 09:57:45 INFO input.FileInputFormat: Total input paths to process : 4157
  11/02/23 09:57:47 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
  11/02/23 09:57:47 INFO mapreduce.JobSubmitter: number of splits:4309

The input file sizes are very dynamic now, and based on these files, hadoop creates very many splits to map them. Here is the result of my M/R job:

  Kind     Total  Successful  Failed  Killed  Start Time           Finish Time
  Setup    1      1           0       0       22-2-2011 22:10:07   22-2-2011 22:10:08 (1sec)
  Map      4309   4309        0       0       22-2-2011 22:10:11   22-2-2011 22:18:51 (8mins, 40sec)
  Reduce   5      0           4       1       22-2-2011 22:11:00   22-2-2011 22:36:51 (25mins, 50sec)
  Cleanup  1      1           0       0       22-2-2011 22:36:47   22-2-2011 22:37:51 (1mins, 4sec)

In the Reduce step, there are failed/killed tasks. The reason for them is this:
  org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#3
      at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
      at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
      at org.apache.hadoop.mapred.Child.main(Child.java:211)
  Caused by: java.lang.OutOfMemoryError: Java heap space
      at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:58)
      at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:45)
      at org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:104)
      at org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:267)
      at org.apache.hadoop.mapreduce.task.re

yes, it's from the shuffle procedure. I think the problem
Re: how many output files can MultipleOutputs support?
  10:24:44 INFO mapreduce.Job: map 21% reduce 0%
  11/02/22 10:24:54 INFO mapreduce.Job: map 22% reduce 0%

thanks.

Junyoung Kim (juneng...@gmail.com)

On 02/21/2011 10:47 AM, Yifeng Jiang wrote:

We were using 0.20.2 when the issue occurred; we set it to 2048, and the failure was fixed. Now we are using 0.20-append (HBase requires it), and it works well too.

On 2011/02/21 10:35, Jun Young Kim wrote:

hi, yifeng. Could I know which version of hadoop you are using? thanks for your response.

Junyoung Kim (juneng...@gmail.com)

On 02/21/2011 10:28 AM, Yifeng Jiang wrote:

Hi, we have met the same issue. It seems this error occurs when the number of threads connected to the Datanode reaches the maximum number of server threads, defined by dfs.datanode.max.xcievers in hdfs-site.xml. Our solution was to increase it from the default value (256) to a bigger one, such as 2048.

On 2011/02/21 10:17, Jun Young Kim wrote:

hi, in an application I read many files in many directories. Additionally, using the MultipleOutputs class, I try to write thousands of output files in many directories. During reduce processing (reduce task count is 1), almost all my jobs (about 20 jobs running in parallel on average) fail.
Almost all the error types are like:

  java.io.IOException: Bad connect ack with firstBadLink as 10.25.241.101:50010
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

  java.io.EOFException
      at java.io.DataInputStream.readShort(DataInputStream.java:298)
      at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

  org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: Error while doing final merge
      at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:159)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
      at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
      at org.apache.hadoop.mapred.Child.main(Child.java:211)
  Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_869.out
      at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351)
      at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
      at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:182)
      at org.apache.hadoop.mapreduce.task.reduce.MergeMa

currently, I suspect this is caused by hadoop's limitations on the number of output file descriptors. (I am using a linux server to support this job; the server configuration is: $ cat /proc/sys/fs/file-max 327680
how many output files can MultipleOutputs support?
hi, in an application I read many files in many directories. Additionally, using the MultipleOutputs class, I try to write thousands of output files in many directories. During reduce processing (reduce task count is 1), almost all my jobs (about 20 jobs running in parallel on average) fail. Almost all the error types are like:

  java.io.IOException: Bad connect ack with firstBadLink as 10.25.241.101:50010
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

  java.io.EOFException
      at java.io.DataInputStream.readShort(DataInputStream.java:298)
      at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

  org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: Error while doing final merge
      at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:159)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
      at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
      at org.apache.hadoop.mapred.Child.main(Child.java:211)
  Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_869.out
      at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351)
      at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
      at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:182)
      at org.apache.hadoop.mapreduce.task.reduce.MergeMa

currently, I suspect this is caused by hadoop's limitations on the number of output file descriptors. (I am using a linux server to support this job; the server configuration is: $ cat /proc/sys/fs/file-max 327680 -- Junyoung Kim (juneng...@gmail.com)
Re: how many output files can MultipleOutputs support?
hi, yifeng. Could I know which version of hadoop you are using? thanks for your response.

Junyoung Kim (juneng...@gmail.com)

On 02/21/2011 10:28 AM, Yifeng Jiang wrote:

Hi, we have met the same issue. It seems this error occurs when the number of threads connected to the Datanode reaches the maximum number of server threads, defined by dfs.datanode.max.xcievers in hdfs-site.xml. Our solution was to increase it from the default value (256) to a bigger one, such as 2048.

On 2011/02/21 10:17, Jun Young Kim wrote:

hi, in an application I read many files in many directories. Additionally, using the MultipleOutputs class, I try to write thousands of output files in many directories. During reduce processing (reduce task count is 1), almost all my jobs (about 20 jobs running in parallel on average) fail. Almost all the error types are like:

  java.io.IOException: Bad connect ack with firstBadLink as 10.25.241.101:50010
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

  java.io.EOFException
      at java.io.DataInputStream.readShort(DataInputStream.java:298)
      at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

  org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: Error while doing final merge
      at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:159)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
      at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
      at org.apache.hadoop.mapred.Child.main(Child.java:211)
  Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_869.out
      at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351)
      at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
      at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:182)
      at org.apache.hadoop.mapreduce.task.reduce.MergeMa

currently, I suspect this is caused by hadoop's limitations on the number of output file descriptors. (I am using a linux server to support this job; the server configuration is: $ cat /proc/sys/fs/file-max 327680
Re: How to package multiple jars for a Hadoop job
hi, there is a maven plugin for packaging for hadoop; I think it is quite a convenient tool. If you are using maven, add this to your pom.xml:

  <plugin>
    <groupId>com.github.maven-hadoop.plugin</groupId>
    <artifactId>maven-hadoop-plugin</artifactId>
    <version>0.20.1</version>
    <configuration>
      <hadoopHome>your_hadoop_home_dir</hadoopHome>
    </configuration>
  </plugin>

Junyoung Kim (juneng...@gmail.com)

On 02/19/2011 07:23 AM, Eric Sammer wrote:

Mark: You have a few options. You can:

1. Package dependent jars in a lib/ directory of the jar file.
2. Use something like Maven's assembly plugin to build a self-contained jar.

Either way, I'd strongly recommend using something like Maven to build your artifacts so they're reproducible and in line with commonly used tools. Hand packaging files tends to be error prone. This is less of a Hadoop-ism and more of a general Java development issue, though.

On Fri, Feb 18, 2011 at 5:18 PM, Mark Kerzner (markkerz...@gmail.com) wrote:

Hi, I have a script that I use to re-package all the jars (which are output in a dist directory by NetBeans); it structures everything correctly into a single jar for running a MapReduce job. Here it is, but I am not sure if it is the best practice. Besides, it hard-codes my paths. I am sure that there is a better way.

  #!/bin/sh
  # to be run from the project directory
  cd ../dist
  jar -xf MR.jar
  jar -cmf META-INF/MANIFEST.MF /home/mark/MR.jar *
  cd ../bin
  echo Repackaged for Hadoop

Thank you, Mark
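For option 2 in Eric's list, a minimal sketch of the assembly-plugin configuration (the mainClass value is a placeholder for your own driver class):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <!-- built-in descriptor: unpacks all dependencies into one jar -->
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
    <archive>
      <manifest>
        <!-- placeholder: your driver class -->
        <mainClass>com.example.MyDriver</mainClass>
      </manifest>
    </archive>
  </configuration>
</plugin>
```

Running `mvn assembly:single` (or binding the plugin to the package phase) then produces a single *-jar-with-dependencies.jar suitable for `hadoop jar`, with no hand-written repackaging script.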
Re: how many output files can MultipleOutputs support?
now I am using hadoop version 0.20.0. I have one more question about this configuration: before setting dfs.datanode.max.xcievers, I couldn't find it in job.xml. Is this a hidden configuration? Why couldn't I find it in my job.xml? thanks.

Junyoung Kim (juneng...@gmail.com)

On 02/21/2011 10:47 AM, Yifeng Jiang wrote:

We were using 0.20.2 when the issue occurred; we set it to 2048, and the failure was fixed. Now we are using 0.20-append (HBase requires it), and it works well too.

On 2011/02/21 10:35, Jun Young Kim wrote:

hi, yifeng. Could I know which version of hadoop you are using? thanks for your response.

On 02/21/2011 10:28 AM, Yifeng Jiang wrote:

Hi, we have met the same issue. It seems this error occurs when the number of threads connected to the Datanode reaches the maximum number of server threads, defined by dfs.datanode.max.xcievers in hdfs-site.xml. Our solution was to increase it from the default value (256) to a bigger one, such as 2048.

On 2011/02/21 10:17, Jun Young Kim wrote:

hi, in an application I read many files in many directories. Additionally, using the MultipleOutputs class, I try to write thousands of output files in many directories. During reduce processing (reduce task count is 1), almost all my jobs (about 20 jobs running in parallel on average) fail. Almost all the error types are like:

  java.io.IOException: Bad connect ack with firstBadLink as 10.25.241.101:50010
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

  java.io.EOFException
      at java.io.DataInputStream.readShort(DataInputStream.java:298)
      at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

  org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: Error while doing final merge
      at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:159)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
      at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
      at org.apache.hadoop.mapred.Child.main(Child.java:211)
  Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_869.out
      at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351)
      at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
      at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:182)
      at org.apache.hadoop.mapreduce.task.reduce.MergeMa

currently, I suspect this is caused by hadoop's limitations on the number of output file descriptors. (I am using a linux server to support this job; the server configuration is: $ cat /proc/sys/fs/file-max 327680
Re: how many output files can MultipleOutputs support?
Hi, harsh. I thought all the configuration used to run hadoop was listed in the job configuration: even if the user didn't set a property explicitly, hadoop sets it by default, which means all properties should be listed in the job configuration. Isn't that right?

Junyoung Kim (juneng...@gmail.com)

On 02/21/2011 11:40 AM, Harsh J wrote:

Hello. On Mon, Feb 21, 2011 at 8:01 AM, Jun Young Kim (juneng...@gmail.com) wrote: "now I am using hadoop version 0.20.0. I have one more question about this configuration: before setting dfs.datanode.max.xcievers, I couldn't find it in job.xml."

That is because the property does not exist in the hdfs-default.xml file present in hadoop's jars. I don't know the reason behind that (since it is unavailable as a default inside 0.21 either). Also, it is a DN property, not a job-specific one (it can't be changed per job). Setting it in hdfs-site.xml should be sufficient.
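For reference, a minimal hdfs-site.xml fragment for the setting discussed in this thread (2048 is the value Yifeng reported using; tune it to your datanodes):

```xml
<property>
  <!-- Upper bound on concurrent DataNode transfer threads.
       The historical spelling "xcievers" is intentional. -->
  <name>dfs.datanode.max.xcievers</name>
  <value>2048</value>
</property>
```

As noted above, this is a datanode-side property: it must go into hdfs-site.xml on the datanodes (followed by a datanode restart), not into a job's configuration.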
I got errors from hdfs about DataStreamer Exceptions.
hi, all. I got errors from hdfs:

  2011-02-18 11:21:29 [WARN ][DFSOutputStream.java] run()(519) : DataStreamer Exception: java.io.IOException: Unable to create new block.
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:832)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)
  2011-02-18 11:21:29 [WARN ][DFSOutputStream.java] setupPipelineForAppendOrRecovery()(730) : Could not get block locations. Source file /user/test/51/output/ehshop00newsvc-r-0 - Aborting...
  2011-02-18 11:21:29 [WARN ][Child.java] main()(234) : Exception running child : java.io.EOFException
      at java.io.DataInputStream.readShort(DataInputStream.java:298)
      at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)
  2011-02-18 11:21:29 [INFO ][Task.java] taskCleanup()(996) : Running cleanup for the task

I think this next one is not a different error:

  org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: blk_-2325764274016776017_8292 file=/user/test/51/input/kids.txt
      at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:559)
      at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:367)
      at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:514)
      at java.io.DataInputStream.read(DataInputStream.java:83)
      at org.apache.hadoop.util.LineReader.readLine(LineReader.java:138)
      at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)
      at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:465)
      at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
      at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:90)
      at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)

I've checked the file /user/test/51/input/kids.txt, but there is nothing strange in it; the file is healthy. Does anybody know about this error? How can I fix it? thanks. -- Junyoung Kim (juneng...@gmail.com)
Re: I got errors from hdfs about DataStreamer Exceptions.
hi, Harsh. you're always giving a response very quickly. ;) I am using version 0.21.0 now. Before asking about this problem, I had already checked the file system's health: $ hadoop fsck / . . Status: HEALTHY Total size: 24231595038 B Total dirs: 43818 Total files: 41193 (Files currently being written: 2178) Total blocks (validated): 40941 (avg. block size 591866 B) (Total open file blocks (not validated): 224) Minimally replicated blocks: 40941 (100.0 %) Over-replicated blocks: 1 (0.0024425392 %) Under-replicated blocks: 2 (0.0048850784 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 2 Average block replication: 2.1106226 Corrupt blocks: 0 Missing replicas: 4 (0.00462904 %) Number of data-nodes: 8 Number of racks: 1 The filesystem under path '/' is HEALTHY Additionally, I found a slightly different error. Here it is: java.io.IOException: Bad connect ack with firstBadLink as 10.25.241.107:50010 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) Here is my execution environment: average job count: 20, max map capacity: 128, max reduce capacity: 128, avg slots per node: 32, avg input file size per job: 200M ~ 1G. thanks. Junyoung Kim (juneng...@gmail.com) On 02/18/2011 11:43 AM, Harsh J wrote: You may want to check your HDFS health stat via 'fsck' (http://namenode/fsck or `hadoop fsck`). There may be a few corrupt files or bad DNs. Would also be good to know what exact version of Hadoop you're running. On Fri, Feb 18, 2011 at 7:59 AM, Jun Young Kimjuneng...@gmail.com wrote: hi, all. I got errors from hdfs. 2011-02-18 11:21:29[WARN ][DFSOutputStream.java]run()(519) : DataStreamer Exception: java.io.IOException: Unable to create new block. 
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:832) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 2011-02-18 11:21:29[WARN ][DFSOutputStream.java]setupPipelineForAppendOrRecovery()(730) : Could not get block locations. Source file /user/test/51/output/ehshop00newsvc-r-0 - Aborting... 2011-02-18 11:21:29[WARN ][Child.java]main()(234) : Exception running child : java.io.EOFException at java.io.DataInputStream.readShort(DataInputStream.java:298) at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:113) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:881) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 2011-02-18 11:21:29[INFO ][Task.java]taskCleanup()(996) : Runnning cleanup for the task I think this one is also not different error. 
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: blk_-2325764274016776017_8292 file=/user/test/51/input/kids.txt at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:559) at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:367) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:514) at java.io.DataInputStream.read(DataInputStream.java:83) at org.apache.hadoop.util.LineReader.readLine(LineReader.java:138) at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:465) at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80) at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:90) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143) -- I've checked the file '/user/test/51/input/kids.txt ', but, there is not strange ones. this file is healthy. Does anybody know about this error? How could I fix this one? thanks. -- Junyoung Kim (juneng...@gmail.com)
Re: Selecting only few slaves in the cluster
You can use the fair scheduler to restrict a job to only a portion of the nodes you have, by setting max/min map/reduce task counts for the pool the job runs in. Here is the documentation you can reference: http://hadoop.apache.org/mapreduce/docs/r0.21.0/fair_scheduler.html Junyoung Kim (juneng...@gmail.com) On 02/16/2011 06:33 AM, praveen.pe...@nokia.com wrote: Hello all, We have a 100 node hadoop cluster that is used for multiple purposes. I want to run few mapred jobs and I know 4 to 5 slaves should be enough. Is there anyway to restrict my jobs to use only 4 slaves instead of all 100. I noticed that more the number of slaves more overhead there is. Also can I pass in hadoop parameters like mapred.child.java.opts so that the actual child processes gets the specified value for max heap size. I want to set the heap size to 2G instead of going with the default.. Thanks Praveen
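As a sketch, an allocations file along these lines caps a pool at a few slots; the pool name and limits here are made up, and this assumes the 0.21 fair scheduler's per-pool maxMaps/maxReduces support:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Jobs submitted to this pool never use more than 8 map / 8 reduce slots,
       roughly the capacity of 4-5 slaves depending on slots per node. -->
  <pool name="small-jobs">
    <maxMaps>8</maxMaps>
    <maxReduces>8</maxReduces>
  </pool>
</allocations>
```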
Re: Which strategy is proper to run in this environment?
In a similar way, could I set all directories as inputs at once (without combining them into a single directory)? Currently, it's not easy to process them all at one time because the directories are generated at quite different times; but, periodically, we can set many directories as inputs for one Hadoop job. Anyway, I've tested about 11000 directories to get M/R outputs. Total running time: 6h 25m. Most jobs are done in minutes. Junyoung Kim (juneng...@gmail.com) On 02/13/2011 04:33 AM, Ted Dunning wrote: This sounds like it will be very inefficient. There is considerable overhead in starting Hadoop jobs. As you describe it, you will be starting thousands of jobs and paying this penalty many times. Is there a way that you could process all of the directories in one map-reduce job? Can you combine these directories into a single directory with a few large files? On Fri, Feb 11, 2011 at 8:07 PM, Jun Young Kimjuneng...@gmail.com wrote: Hi. I have small clusters (9 nodes) to run a hadoop here. Under this cluster, a hadoop will take thousands of directories sequencely. In a each dir, there is two input files to m/r. Size of input files are from 1m to 5g bytes. In a summary, each hadoop job will take an one of these dirs. To get best performance, which strategy is proper for us? Could u suggest me about it? Which configuration is best? Ps) physical memory size is 12g of each node.
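For what it's worth, many directories can be fed to one job without merging them first; a minimal sketch using the new API (the directory paths and the `job` variable are assumptions):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Collect every directory that is ready at this point in time and add
// each one as an input path of a single job, instead of one job per dir.
String[] readyDirs = { "/data/20110211/dir1", "/data/20110211/dir2" }; // example paths
for (String dir : readyDirs) {
    FileInputFormat.addInputPath(job, new Path(dir)); // "job" is the Job being configured
}
```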
Could I write outputs in multiple directories?
Hi, As I understand it, Hadoop can write multiple files in a directory, but it can't write output files into multiple directories. Is that right? MultipleOutputs is for generating multiple files; FileInputFormat.addInputPaths is for setting several input files at once. How could I write output files into multiple directories depending on the key? for example) A type key - MMdd/A/output B type key - MMdd/B/output C type key - MMdd/C/output thanks. -- Junyoung Kim (juneng...@gmail.com)
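One detail worth noting: with the new-API MultipleOutputs (org.apache.hadoop.mapreduce.lib.output, 0.21+), the write(key, value, baseOutputPath) overload accepts a path containing '/', which places the file in a subdirectory of the job output directory. A hedged sketch from inside a reducer (the date string and type derivation are made up):

```java
// "mos" is a MultipleOutputs<Text, Text> created in the reducer's setup().
// A baseOutputPath containing '/' creates subdirectories under the job
// output directory, e.g. <outdir>/20110214/A/output-r-00000.
String type = deriveType(key);                      // hypothetical helper: "A", "B", or "C"
mos.write(key, value, "20110214/" + type + "/output");
```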
Which strategy is proper to run in this environment?
Hi. I have a small cluster (9 nodes) to run Hadoop here. Under this cluster, Hadoop will take thousands of directories sequentially. In each dir, there are two input files for M/R. Sizes of the input files range from 1M to 5G bytes. In summary, each Hadoop job will take one of these dirs. To get the best performance, which strategy is proper for us? Could you advise me about it? Which configuration is best? Ps) physical memory size is 12G on each node.
Are there any smart ways to give arguments to mappers and reducers from a main job?
Hi, all. In my job, I want to pass some arguments to mappers and reducers from the main job. I googled some references to do that by using Configuration, but it's not working. code) job) Configuration conf = new Configuration(); conf.set("test", "value"); mapper) doMap() extends Mapper... { System.out.println(context.getConfiguration().get("test")); // -- this printed out null } How could I do that to make it work? -- Junyoung Kim (juneng...@gmail.com)
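One common cause of the null is that the property is set on a different Configuration than the one the Job was created from, or set after the Job was constructed (Job takes a copy of the conf). A minimal sketch of the working order, with a made-up job name:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
conf.set("test", "value");               // set BEFORE creating the Job: Job copies the conf
Job job = new Job(conf, "arg-passing-example");

// Later, inside the mapper or reducer, read it back from the task's context:
// String v = context.getConfiguration().get("test");  // no longer null
```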
Re: why is it invalid to have non-alphabet characters as a result of MultipleOutputs?
OK. thanks for your replies. I decided to use '00' as a delimiter. :( Junyoung Kim (juneng...@gmail.com) On 02/09/2011 01:46 AM, David Rosenstrauch wrote: On 02/08/2011 05:01 AM, Jun Young Kim wrote: Hi, Multipleoutputs supports to have named outputs as a result of a hadoop. but, it has inconvenient restrictions to have it. only, alphabet characters are valid as a named output. A ~ Z a ~ z 0 ~ 9 are only characters we can take. I believe if I can use other chars like '.', '_', it could be more convenient for me. There's already a bug report open for this. https://issues.apache.org/jira/browse/MAPREDUCE-2293 DR
why is it invalid to have non-alphabet characters as a result of MultipleOutputs?
Hi, MultipleOutputs supports named outputs as a result of a Hadoop job, but it has an inconvenient restriction: only alphanumeric characters are valid in a named output (A ~ Z, a ~ z, 0 ~ 9 are the only characters we can use). I believe that if I could use other chars like '.' or '_', it would be more convenient for me. -- Junyoung Kim (juneng...@gmail.com)
Re: Could not add a new data node without rebooting Hadoop system
How about using the following to make the NameNode re-read its node lists for your new network topology? $ hadoop dfsadmin -refreshNodes Junyoung Kim (juneng...@gmail.com) On 02/07/2011 09:16 PM, Harsh J wrote: On Mon, Feb 7, 2011 at 5:16 PM, ahnahneui...@gmail.com wrote: Hello everybody 1. configure conf/slaves and *.xml files on master machine 2. configure conf/master and *.xml files on slave machine 'slaves' and 'masters' file are generally only required in the master machine, and only if you are using the start-* scripts supplied with Hadoop for use with SSH (FAQ has an entry on this) from master. 3. run ${HADOOP}/bin/hadoop datanode But when I ran the commands on the master node, the master node was recognized as a data node. 3. wasn't a valid command in this case. start-dfs.sh When I ran the commands on the data node which I want to add, the data node was not properly added.(The number of total data node didn't show any change) What do the logs say for the DataNode on the slave? Does it start successfully? If fs.default.name is set properly in slave's core-site.xml it should be able to communicate properly if started (and if the version is not mismatched).
problem to use MultipleOutputs on a ver-0.21.0
Hi, I am using Hadoop version 0.21.0 now. As you know, this version supports the MultipleOutputs class to write reducer outputs to several files, but in my case there is nothing in the files (just empty files). here is my code. main class) MultipleOutputs.addNamedOutput(job, FeederConfig.INSERT_OUTPUT_NAME, TextOutputFormat.class, Text.class, Text.class); MultipleOutputs.addNamedOutput(job, FeederConfig.DELETE_OUTPUT_NAME, TextOutputFormat.class, Text.class, Text.class); MultipleOutputs.addNamedOutput(job, FeederConfig.UPDATE_OUTPUT_NAME, TextOutputFormat.class, Text.class, Text.class); MultipleOutputs.addNamedOutput(job, FeederConfig.NOTCHANGE_OUTPUT_NAME, TextOutputFormat.class, Text.class, Text.class); mapper) nothing to do for this job; it just writes keys and values. reducer) ... multipleOutputs.write(getOutputFileName(code), new Text(key), new Text(value)); context.write(new Text(key), new Text(value)); ... private String getOutputFileName(String code) { String retFileName = ""; if (code.equals(EPComparedResult.INSERT.getCode())) { retFileName = FeederConfig.INSERT_OUTPUT_NAME; } else if (code.equals(EPComparedResult.DELETE.getCode())) { retFileName = FeederConfig.DELETE_OUTPUT_NAME; } else if (code.equals(EPComparedResult.UPDATE.getCode())) { retFileName = FeederConfig.UPDATE_OUTPUT_NAME; } else { retFileName = FeederConfig.NOTCHANGE_OUTPUT_NAME; } return retFileName; } ... result) $ hadoop fs -ls output 11/02/07 13:09:13 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=30 11/02/07 13:09:13 WARN conf.Configuration: mapred.task.id is deprecated. 
Instead, use mapreduce.task.attempt.id Found 4 items -rw-r--r-- 2 irteam supergroup 0 2011-01-31 19:59 /user/test/output/DELETE-r-0 -rw-r--r-- 2 irteam supergroup 0 2011-01-31 19:59 /user/test/output/INSERT-r-0 -rw-r--r-- 2 irteam supergroup 0 2011-01-31 18:53 /user/test/output/_SUCCESS -rw-r--r-- 2 irteam supergroup 649622 2011-01-31 18:53 /user/test/output/part-r-0 -- Junyoung Kim (juneng...@gmail.com)
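For what it's worth, the usual cause of empty named outputs like this is a MultipleOutputs instance that is never closed, so its buffered writers are never flushed. A minimal reducer sketch of the pattern; the class name and named-output string are assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CompareReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        // Create one MultipleOutputs per task attempt.
        multipleOutputs = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // "INSERT" must match a name registered via addNamedOutput in the driver.
            multipleOutputs.write("INSERT", key, value);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        multipleOutputs.close(); // without this, the named output files stay empty
    }
}
```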
Re: mapred.child.java.opts not working correctly
When Hadoop runs, it collects configuration information from the $CLASSPATH, so even if you set a value in your configuration files, it can be overwritten by Hadoop. To avoid this problem, you SHOULD mark your configuration with <final>true</final>. For example: <property> <name>mapred.child.java.opts</name> <value>-Xmx1600m</value> <final>true</final> </property> This link also describes the same solution: http://wiki.apache.org/hadoop/FAQ#How_do_I_get_my_MapReduce_Java_Program_to_read_the_Cluster.27s_set_configuration_and_not_just_defaults.3F Junyoung Kim (juneng...@gmail.com) On 02/04/2011 12:00 PM, praveen.pe...@nokia.com wrote: Hello all, I am using Hadoop 0.20.2 along with Whirr on the cloud. I set mapred.child.java.opts to -Xmx1600m but I am seeing all the mapred task process has virtual memory between 480m and 500m. I am wondering if there is any other parameter that is overwriting this property. I am also not sure if this is a Whirr issue or Hadoop but I verified that hadoop-site.xml has this property value correct set. Thanks Praveen
Too small initial heap problem.
Hi, I have a 9-node cluster (1 master, 8 slaves) to run Hadoop. When I executed my job on the master, I got the following errors. 11/01/28 10:58:01 INFO mapred.JobClient: Running job: job_201101271451_0011 11/01/28 10:58:02 INFO mapred.JobClient: map 0% reduce 0% 11/01/28 10:58:08 INFO mapred.JobClient: Task Id : attempt_201101271451_0011_m_41_0, Status : FAILED java.io.IOException: Task process exit with nonzero status of 1. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418) 11/01/28 10:58:08 WARN mapred.JobClient: Error reading task output http://hatest03.server:50060/tasklog?plaintext=true&taskid=attempt_201101271451_0011_m_41_0&filter=stdout 11/01/28 10:58:08 WARN mapred.JobClient: Error reading task output http://hatest03.server:50060/tasklog?plaintext=true&taskid=attempt_201101271451_0011_m_41_0&filter=stderr After going to hatest03.server, I checked the directory named attempt_201101271451_0011_m_41_0. There is an error msg in the stdout file: Error occurred during initialization of VM Too small initial heap My configuration for the heap size is <property> <name>mapred.child.java.opts</name> <value>-Xmx1024</value> </property> and the physical memory size is: $ free -m total used free shared buffers cached Mem: 12001 4711 7290 0 197 4056 -/+ buffers/cache: 457 11544 Swap: 2047 0 2047 How can I fix this problem? -- Junyoung Kim (juneng...@gmail.com)
*site.xml didn't affect it's configuration.
Hi, I've set io.sort.mb to 400 in $HADOOP_HOME/conf/core-site.xml like this: <property> <name>mapreduce.task.io.sort.mb</name> <value>400</value> </property> <property> <name>dfs.block.size</name> <value>536870912</value> </property> but, after running my jar application, I found the following in logs/job_2010*_conf.xml: ... <property> <name>io.sort.mb</name> <value>100</value> </property> <property> <name>fs.s3.block.size</name> <value>67108864</value> </property> ... The other values are also all different from what I set. Why didn't my configuration affect the running environment? -- Junyoung Kim (juneng...@gmail.com)
how to get a core-site.xml info from a java application?
Hi, I am a beginner with Hadoop. Now I want to know a way to get, in my applications, the configuration information which is defined in the *.xml files. for example) $HADOOP_HOME/conf/core-site.xml <property> <name>fs.default.name</name> <value>hdfs://localhost:54310</value> </property> How can I use the fs.default.name information in my application? This is my source code: Configuration conf = new Configuration(); System.out.println(conf.get("fs.default.name")); // prints nothing. How can I do this? -- Junyoung Kim (juneng...@gmail.com)
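For what it's worth, Configuration only picks up core-site.xml if the conf directory is on the application's classpath; otherwise it falls back to the built-in defaults. A minimal sketch that loads the file explicitly; the path shown is just an example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// Either put $HADOOP_HOME/conf on the classpath, or add the resource by hand:
conf.addResource(new Path("/path/to/hadoop/conf/core-site.xml")); // example path
System.out.println(conf.get("fs.default.name")); // e.g. hdfs://localhost:54310
```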
I couldn't find out job histories in a jobtracker page.
Hi, I am a beginner user of Hadoop. Almost all examples for learning Hadoop suggest using a jar to drive the framework (like wordcount.jar); in that case, I can find the job history. But if I execute my application as a plain Java application (not a jar file), I can't find job histories on the jobtracker page. I've also set up two nodes as a Hadoop cluster; however, my Java application seems to use just a single node, not both, to run my sample. So, to track my job history, do I always need to create jar files? -- Junyoung Kim (juneng...@gmail.com)
Re: error compiling hadoop-mapreduce
Maybe your classpath isn't set up correctly. Check that it includes everything needed to resolve those symbols (the compiled hadoop-hdfs test classes, in this case). Junyoung Kim (juneng...@gmail.com) On 01/22/2011 02:08 AM, Edson Ramiro wrote: Hi all, I'm compiling hadoop from git using these instructions [1]. The hadoop-common and hadoop-hdfs are okay, they compile without erros, but when I execute ant mvn-install to compile hadoop-mapreduce I get this error. compile-mapred-test: [javac] /home/lbd/hadoop/hadoop-ramiro/hadoop-mapreduce/build.xml:602: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] Compiling 179 source files to /home/lbd/hadoop/hadoop-ramiro/hadoop-mapreduce/build/test/mapred/classes [javac] /home/lbd/hadoop/hadoop-ramiro/hadoop-mapreduce/src/test/mapred/org/apache/hadoop/mapred/TestMRServerPorts.java:84: cannot find symbol [javac] symbol : variable NAME_NODE_HOST [javac] TestHDFSServerPorts.NAME_NODE_HOST + 0); [javac]^ [javac] /home/lbd/hadoop/hadoop-ramiro/hadoop-mapreduce/src/test/mapred/org/apache/hadoop/mapred/TestMRServerPorts.java:86: cannot find symbol [javac] symbol : variable NAME_NODE_HTTP_HOST [javac] location: class org.apache.hadoop.hdfs.TestHDFSServerPorts [javac] TestHDFSServerPorts.NAME_NODE_HTTP_HOST + 0); [javac]^ ... Is that a bug? This is my build.properties #this is essential resolvers=internal #you can increment this number as you see fit version=0.22.0-alpha-1 project.version=${version} hadoop.version=${version} hadoop-core.version=${version} hadoop-hdfs.version=${version} hadoop-mapred.version=${version} Other question, Is the 0.22.0-alpha-1 the latest version? Thanks in advance, [1] https://github.com/apache/hadoop-mapreduce -- Edson Ramiro Lucas Filho {skype, twitter, gtalk}: erlfilho http://www.inf.ufpr.br/erlf07/
Re: I couldn't find out job histories in a jobtracker page.
my application is quite simple. a) it reads files from a directory. b) it calls a map/reduce function to compare the data of the input files. c) it writes the result of the comparison to an output file. here is my code. .. job class.. Configuration sConf = new Configuration(); GenericOptionsParser sParser = new GenericOptionsParser(sConf, aArgs); Job sJob = null; String[] sOtherArgs = sParser.getRemainingArgs(); sJob = new Job(sConf, "EPComparatorJob"); log.info("sJob = " + sJob); sJob.setJarByClass(EPComparatorJob.class); sJob.setMapOutputKeyClass(Text.class); sJob.setMapOutputValueClass(Text.class); sJob.setOutputKeyClass(Text.class); sJob.setOutputValueClass(Text.class); sJob.setInputFormatClass(TextInputFormat.class); if (sOtherArgs.length != 2) { printUsage(); System.exit(1); } log.info("set input/output paths"); FileInputFormat.setInputPaths(sJob, new Path(sOtherArgs[0])); FileOutputFormat.setOutputPath(sJob, new Path(sOtherArgs[1])); MultipleOutputs.addNamedOutput(sJob, HadoopConfig.INSERT_OUTPUT_NAME, TextOutputFormat.class, Text.class, Text.class); MultipleOutputs.addNamedOutput(sJob, HadoopConfig.DELETE_OUTPUT_NAME, TextOutputFormat.class, Text.class, Text.class); MultipleOutputs.addNamedOutput(sJob, HadoopConfig.UPDATE_OUTPUT_NAME, TextOutputFormat.class, Text.class, Text.class); MultipleOutputs.addNamedOutput(sJob, HadoopConfig.NOTCHANGE_OUTPUT_NAME, TextOutputFormat.class, Text.class, Text.class); log.info("setMapperClass"); sJob.setMapperClass(EPComparatorMapper.class); log.info("setReducerClass"); sJob.setReducerClass(EPComparatorReducer.class); log.info("setNumReduceTasks"); sJob.setNumReduceTasks(REDUCE_MAPTASKS_COUNTS); return (sJob.waitForCompletion(true) == true ? 0 : 1); .. map class.. ... 
protected void map(WritableComparable<Text> aKey, Text aValue, Context aContext) throws IOException, InterruptedException { String info = aValue.toString(); String[] fields = info.split(HadoopConfig.EP_DATA_DELIMETER, 2); // input file name Path file = ((FileSplit)aContext.getInputSplit()).getPath(); String key = fields[0].trim(); String value = fields[1].trim() + HadoopConfig.EP_DATA_DELIMETER + file; aContext.write(new Text(key), new Text(value)); }; ... .. reduce class.. ... protected void reduce(WritableComparable<Text> key, Iterable<Text> values, Context context) throws IOException, InterruptedException { String[] ret = getComparedResult(key, values, context); String code = ret[0]; String keyMapid = ret[1]; String valueInfo = ret[2]; multipleOutputs.write(new Text(code), new Text(keyMapid + HadoopConfig.EP_DATA_DELIMETER + valueInfo), getOutputFileName(code)); } ... I got an email from another Hadoop user about this problem. The point of the email is that we need to deploy an application as a jar to use the job tracker, not run it as a plain Java application, because to run map/reduce functions on the slaves (cluster), we NEED to run Hadoop with a jar. thanks. Junyoung Kim (juneng...@gmail.com) On 01/25/2011 02:30 AM, Aman wrote: Not 100% sure of what your java program does but it looks like in your java application, you are not using Job Tracker in any way. It will help of you can post the nature of your java program Jun Young Kim wrote: Hi, I am a beginner user of a hadoop. almost examples to learn hadoop suggest to use a jar style to use a hadoop framework. (like workcount.jar) in this case, I could find out a job history. but, if I execute my application as a java application (not a jar file). I could't find out job histories in a jobtracker page. and also I've set up two nodes as a hadoop cluster. however, my java application looks like using just a single node, not a two nodes to run my own sample. so. to track my job history, do I need to create jar files always? 
-- Junyoung Kim (juneng...@gmail.com)
have a problem to run a hadoop with a jar.
Hi, I got this error when I executed Hadoop with my jar application. $ hadoop jar test-hdeploy.jar Test Exception in thread main java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:133) at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:301) at org.apache.hadoop.mapred.JobClient.getUGI(JobClient.java:679) at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:429) at org.apache.hadoop.mapred.JobClient.init(JobClient.java:423) at org.apache.hadoop.mapred.JobClient.init(JobClient.java:410) at org.apache.hadoop.mapreduce.Job.init(Job.java:50) at org.apache.hadoop.mapreduce.Job.init(Job.java:54) at com.naver.shopping.feeder.hadoop.EPComparatorJob.run(EPComparatorJob.java:78) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at com.naver.shopping.feeder.hadoop.EPComparatorJob.main(EPComparatorJob.java:54) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Hadoop already has dependencies on the slf4j libraries (slf4j-log4j12-1.4.3.jar, slf4j-api-1.4.3.jar), so my jar file doesn't need to include them. Do you know how I can fix it? -- Junyoung Kim (juneng...@gmail.com)
Re: have a problem to run a hadoop with a jar.
I found the reason: an old library was being used. Hadoop's bundled slf4j is 1.4.x, so I replaced it with the latest version (1.6.1). Now there is no problem executing it. thanks. Junyoung Kim (juneng...@gmail.com) On 01/25/2011 11:56 AM, li ping wrote: It is a NoSuchMethodError error. Perhaps, the jar that you are using does not contain the method. Please double check it. On Tue, Jan 25, 2011 at 10:44 AM, Jun Young Kimjuneng...@gmail.com wrote: Hi, I got this error when I executed a hadoop with a my jar application. $ hadoop jar test-hdeploy.jar Test Exception in thread main java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:133) at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:301) at org.apache.hadoop.mapred.JobClient.getUGI(JobClient.java:679) at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:429) at org.apache.hadoop.mapred.JobClient.init(JobClient.java:423) at org.apache.hadoop.mapred.JobClient.init(JobClient.java:410) at org.apache.hadoop.mapreduce.Job.init(Job.java:50) at org.apache.hadoop.mapreduce.Job.init(Job.java:54) at com.naver.shopping.feeder.hadoop.EPComparatorJob.run(EPComparatorJob.java:78) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at com.naver.shopping.feeder.hadoop.EPComparatorJob.main(EPComparatorJob.java:54) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) a hadoop already has dependecies with 
slf libraries. (slf4j-log4j12-1.4.3.jar, slf4j-api-1.4.3.jar) so my jar file doesn't need to include it. do you know how I can fix it? -- Junyoung Kim (juneng...@gmail.com)
Re: How to replace Jetty-6.1.14 with Jetty 7 in Hadoop?
Hi, this is a slightly different question about Jetty. By default, Jetty writes its logs into the /tmp directory. Do you know how I can change the directory path? thanks - Junyoung Kim (juneng...@gmail.com) On 01/19/2011 07:34 PM, Steve Loughran wrote: On 18/01/11 19:58, Koji Noguchi wrote: Try moving up to v 6.1.25, which should be more straightforward. FYI, when we tried 6.1.25, we got hit by a deadlock. http://jira.codehaus.org/browse/JETTY-1264 Koji Interesting. Given that there is now 6.1.26 out, that would be the one to play with. Thanks for the heads up, I will move my code up to the .26 release, -steve
MultipleOutputs is not working on 0.20.2 properly.
Hi, I am using Hadoop version 0.20.2 on my cluster. To write multiple output files from a reducer, I want to use the MultipleOutputs class. In this class, I need to call addNamedOutput: public static void addNamedOutput(JobConf conf, String namedOutput, Class<? extends OutputFormat> outputFormatClass, Class<?> keyClass, Class<?> valueClass) Adds a named output for the job. Parameters: conf - job conf to add the named output; namedOutput - named output name, it has to be a word, letters and numbers only, cannot be the word 'part' as that is reserved for the default output; outputFormatClass - OutputFormat class; keyClass - key class; valueClass - value class. As you see, this method takes JobConf as its first argument, but JobConf is deprecated in 0.20.2. Additionally, the MultipleOutputs class only exists in org.apache.hadoop.mapred.lib.MultipleOutputs (not in org.apache.hadoop.mapreduce.lib.MultipleOutputs). These are related discussions about this problem: https://issues.apache.org/jira/browse/HADOOP-3149 https://issues.apache.org/jira/browse/MAPREDUCE-370 How can I set multiple outputs on my version? thanks. -- - Junyoung Kim (juneng...@gmail.com)
Re: MultipleOutputs is not working on 0.20.2 properly.
As far as I know, the Cloudera and Riptano maven repositories only support 0.20.x versions. Is there any maven repository for the 0.21.x version of Hadoop? thanks. -- Junyoung Kim (juneng...@gmail.com) On 01/20/2011 07:58 PM, Harsh J wrote: The MAPREDUCE-370 is fixed in 0.21 releases of Hadoop. You can use/upgrade-to that release if it is no trouble. If it is of any help, the deprecated MapReduce API in 0.20.2 has been unmarked as so in the upcoming 0.20.3 (and is back as the stable API, while new API is marked evolving/unstable) and is perfectly okay to use without worrying about any deprecation (it is even supported in 0.21). Otherwise, you can consider switching to Cloudera's Distribution for Hadoop [CDH] (From http://cloudera.com) or other such distributions that have the mentioned patches back-ported to 0.20.x; if you wish to stick to the 0.20.x releases. I know for a fact that the current CDH2 and CDH3 releases have the new API MultipleOutputs support (and some more fixes).
Re: MultipleOutputs is not working on 0.20.2 properly.
Anyway, Cloudera's version (0.20.2-737) is working. ;) -- Junyoung Kim (juneng...@gmail.com) On 01/20/2011 07:58 PM, Harsh J wrote: The MAPREDUCE-370 is fixed in 0.21 releases of Hadoop. You can use/upgrade-to that release if it is no trouble. If it is of any help, the deprecated MapReduce API in 0.20.2 has been unmarked as so in the upcoming 0.20.3 (and is back as the stable API, while new API is marked evolving/unstable) and is perfectly okay to use without worrying about any deprecation (it is even supported in 0.21). Otherwise, you can consider switching to Cloudera's Distribution for Hadoop [CDH] (From http://cloudera.com) or other such distributions that have the mentioned patches back-ported to 0.20.x; if you wish to stick to the 0.20.x releases. I know for a fact that the current CDH2 and CDH3 releases have the new API MultipleOutputs support (and some more fixes).