Re: Can't access filename in mapper?
Thanks for the prompt reply, this worked like a charm!

- Mike

On Wed, Jun 13, 2012 at 10:51 PM, Harsh J ha...@cloudera.com wrote:

Hey Mike,

There is a much easier way to do this. We've answered a very similar question in detail before at http://search-hadoop.com/m/ZOmmJ1PZJqt1 (the question covers the way for the stable/old API, and my response covers the way for the new API). Does this help?

On Thu, Jun 14, 2012 at 8:24 AM, Michael Parker michael.g.par...@gmail.com wrote:

Hi all,

I'm new to Hadoop MR and decided to make a go at using only the new API. I have a series of log files (who doesn't?), where a different date is encoded in each filename. The log files are so few that I'm not using HDFS. In my main method, I accept the input directory containing all the log files as the first command line argument:

    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Path inputDir = new Path(otherArgs[0]);
    ...
    Job job1 = new Job(conf, "job1");
    FileInputFormat.addInputPath(job1, inputDir);

I actually have two jobs chained using a JobControl, but I think that's irrelevant. The problem is that the Mapper of this job cannot get the filename by reading the key mapred.input.file from the Context object, either in the mapper's setup method or through the Context passed to map. Dumping the configuration like so:

    StringWriter writer = new StringWriter();
    Configuration.dumpConfiguration(context.getConfiguration(), writer);
    System.out.println("configuration=" + writer.toString());

reveals that there is a mapred.input.dir key containing the path passed as a command line argument and assigned to inputDir in my main method, but the filename currently being processed within that path is still inaccessible. Any ideas how to get this?

Thanks,
Mike

-- 
Harsh J
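For anyone finding this thread later: a minimal sketch of the new-API approach behind that link — take the filename from the mapper's input split rather than from the configuration. The class and file names here are illustrative, not taken from Mike's code, and it assumes a plain FileInputFormat-based job:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class LogFileMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String fileName;

        @Override
        protected void setup(Context context) {
            // With a FileInputFormat-based job, the split handed to each mapper
            // is a FileSplit that carries the path of the file being read.
            FileSplit split = (FileSplit) context.getInputSplit();
            fileName = split.getPath().getName();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // fileName now holds something like "access-2012-06-13.log" and can
            // be parsed for the date encoded in it.
            context.write(new Text(fileName), value);
        }
    }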
codec compression ratio
When processing 65 billion records with the LZO or Snappy codecs, disk IO is at 100% because the mappers are spilling all the time, while CPU sits at about 40%. Is there a setting that lets me raise the compression ratio for map/reduce intermediate (temp) data with LZO or Snappy, so I can spend more effort on CPU and less on IO? Google didn't give me any ideas...

Thanks,
Marek M.
RE: codec compression ratio
Have you considered deflate or bzip2?

- Tim

From: Marek Miglinski [mmiglin...@seven.com]
Sent: Thursday, June 14, 2012 1:39 AM
To: mapreduce-user@hadoop.apache.org
Subject: codec compression ratio

When processing 65 billion records with the LZO or Snappy codecs, disk IO is at 100% because the mappers are spilling all the time, while CPU sits at about 40%. Is there a setting that lets me raise the compression ratio for map/reduce intermediate (temp) data with LZO or Snappy, so I can spend more effort on CPU and less on IO? Google didn't give me any ideas...

Thanks,
Marek M.
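For the archives, a rough sketch of the knobs involved (Hadoop 1.x property names; the values are illustrative, not tuned for Marek's cluster). There is no per-codec "ratio" setting for LZO or Snappy, but switching the intermediate (map output) codec to deflate and enlarging the sort buffer shifts work from disk toward CPU:

    import org.apache.hadoop.conf.Configuration;

    public class IntermediateCompressionExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Compress intermediate (map output) data with the zlib/deflate codec,
            // which trades more CPU for a better ratio than LZO or Snappy.
            conf.setBoolean("mapred.compress.map.output", true);
            conf.set("mapred.map.output.compression.codec",
                     "org.apache.hadoop.io.compress.DefaultCodec");

            // A larger sort buffer also reduces how often each mapper spills.
            conf.setInt("io.sort.mb", 512);              // illustrative value
            conf.setFloat("io.sort.spill.percent", 0.9f); // illustrative value
        }
    }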
Error reading task output
Hi All,

When I run hadoop jobs, I observe the following errors. I also notice that the data node dies every time the job is initiated. Does anyone know what may be causing this and how to solve it?

==
12/06/14 19:57:17 INFO input.FileInputFormat: Total input paths to process : 1
12/06/14 19:57:17 INFO mapred.JobClient: Running job: job_201206141136_0002
12/06/14 19:57:18 INFO mapred.JobClient: map 0% reduce 0%
12/06/14 19:57:27 INFO mapred.JobClient: Task Id : attempt_201206141136_0002_m_01_0, Status : FAILED
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
12/06/14 19:57:27 WARN mapred.JobClient: Error reading task output http://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_m_01_0&filter=stdout
12/06/14 19:57:27 WARN mapred.JobClient: Error reading task output http://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_m_01_0&filter=stderr
12/06/14 19:57:33 INFO mapred.JobClient: Task Id : attempt_201206141136_0002_r_02_0, Status : FAILED
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
12/06/14 19:57:33 WARN mapred.JobClient: Error reading task output http://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_r_02_0&filter=stdout
12/06/14 19:57:33 WARN mapred.JobClient: Error reading task output http://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_r_02_0&filter=stderr
^Chadoop@ip-10-174-87-251:~/apixio-pipeline/pipeline-trigger$
12/06/14 19:57:27 WARN mapred.JobClient: Error reading task output http://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_m_01_0&filter=stdout

Thank you,
--Shamshad
Re: Error reading task output
Do you ship a lot of dist-cache files, or perhaps have a bad mapred.child.java.opts parameter?

On Fri, Jun 15, 2012 at 1:39 AM, Shamshad Ansari sans...@apixio.com wrote:

Hi All,

When I run hadoop jobs, I observe the following errors. I also notice that the data node dies every time the job is initiated. Does anyone know what may be causing this and how to solve it?

==
12/06/14 19:57:17 INFO input.FileInputFormat: Total input paths to process : 1
12/06/14 19:57:17 INFO mapred.JobClient: Running job: job_201206141136_0002
12/06/14 19:57:18 INFO mapred.JobClient: map 0% reduce 0%
12/06/14 19:57:27 INFO mapred.JobClient: Task Id : attempt_201206141136_0002_m_01_0, Status : FAILED
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
12/06/14 19:57:27 WARN mapred.JobClient: Error reading task output http://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_m_01_0&filter=stdout
12/06/14 19:57:27 WARN mapred.JobClient: Error reading task output http://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_m_01_0&filter=stderr
12/06/14 19:57:33 INFO mapred.JobClient: Task Id : attempt_201206141136_0002_r_02_0, Status : FAILED
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
12/06/14 19:57:33 WARN mapred.JobClient: Error reading task output http://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_r_02_0&filter=stdout
12/06/14 19:57:33 WARN mapred.JobClient: Error reading task output http://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_r_02_0&filter=stderr
^Chadoop@ip-10-174-87-251:~/apixio-pipeline/pipeline-trigger$
12/06/14 19:57:27 WARN mapred.JobClient: Error reading task output http://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_m_01_0&filter=stdout

Thank you,
--Shamshad

-- 
Harsh J
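For later readers hitting the same "Child Error ... exit with nonzero status of 1": one thing Harsh's question points at is a malformed mapred.child.java.opts value, which can kill the child JVM before it ever writes its task logs (hence the "Error reading task output" warnings). A small sketch for checking and setting the property programmatically — the heap size is illustrative only, and this is just one possible cause, not a confirmed fix for Shamshad's cluster:

    import org.apache.hadoop.conf.Configuration;

    public class ChildOptsCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Print what the tasktrackers will pass to each child JVM; a stray
            // quote or an impossible -Xmx here makes every task exit with
            // status 1 before any task log exists.
            System.out.println("mapred.child.java.opts = "
                    + conf.get("mapred.child.java.opts"));

            // A well-formed, conservative value looks like this (illustrative):
            conf.set("mapred.child.java.opts", "-Xmx512m");
        }
    }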
Re: Passing key-value pairs between chained jobs?
Hi Michael,

Your second problem can be solved by using SequenceFileOutputFormat for the first job's output and SequenceFileInputFormat for the second job's input.

On Thu, Jun 14, 2012 at 11:11 PM, Michael Parker michael.g.par...@gmail.com wrote:

Hi all,

One more question. I have two jobs to run serially using a JobControl. The key-value types for the outputs of the reducer of the first job are ActiveDayKey, Text, where ActiveDayKey is a class that implements WritableComparable. And so the key-value types for the inputs to the mapper of the second job are ActiveDayKey, Text. I'm noticing two things.

First, in the output of the reducer from the first job, each ActiveDayKey object is being written as a string using its toString method. Since it's a WritableComparable that already knows how to serialize itself using write(DataOutput), is there any way to exploit that to write it in binary format? Otherwise, do I need to write a subclass of FileOutputFormat?

Second, the second job fails with java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to co.adhoclabs.LogProcessor$ActiveDayKey. I'm assuming this is because by default the key is a LongWritable line offset, and here I want to ignore that and use the ActiveDayKey written on the line itself as the key. Again, since ActiveDayKey knows how to deserialize itself using readFields(DataInput), is there any way to exploit that to read it from the line in binary format? Do I need to write a subclass of FileInputFormat?

Assuming I need to write subclasses of FileOutputFormat and FileInputFormat, what's a good example of this? The terasort example?

Thanks,
Mike
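A minimal sketch of that wiring against the new API (the helper method and intermediateDir path are made up for illustration; the key class parameter stands in for Mike's own ActiveDayKey):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class ChainedJobWiring {

        // keyClass would be ActiveDayKey.class in the setup described above.
        public static void wire(Job job1, Job job2, Path intermediateDir,
                                Class<? extends WritableComparable<?>> keyClass)
                throws IOException {
            // Job 1 writes its (key, Text) pairs in binary, using the key's own
            // write(DataOutput) method instead of its toString() representation.
            job1.setOutputFormatClass(SequenceFileOutputFormat.class);
            job1.setOutputKeyClass(keyClass);
            job1.setOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(job1, intermediateDir);

            // Job 2 reads the same pairs back via readFields(DataInput), so its
            // mapper receives the custom key rather than LongWritable offsets.
            job2.setInputFormatClass(SequenceFileInputFormat.class);
            FileInputFormat.addInputPath(job2, intermediateDir);
        }
    }

With this in place there is no need to subclass FileOutputFormat or FileInputFormat; the SequenceFile formats already delegate to the Writable serialization.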