Re: Can't access filename in mapper?

2012-06-14 Thread Michael Parker
Thanks for the prompt reply, this worked like a charm!

- Mike


On Wed, Jun 13, 2012 at 10:51 PM, Harsh J ha...@cloudera.com wrote:
 Hey Mike,

 There is a much easier way to do this. We've answered a very similar
 question in detail before at: http://search-hadoop.com/m/ZOmmJ1PZJqt1
 (The question shows the way for the stable/old API, and my response shows the
 way for the new API.) Does this help?
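
 For reference, a minimal sketch along the lines of that answer, assuming the
 new (org.apache.hadoop.mapreduce) API with FileInputFormat; the class and
 field names here are illustrative:

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileSplit;

  public class LogFileMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String fileName;

    @Override
    protected void setup(Context context) {
      // With FileInputFormat, each split is a FileSplit carrying the file's path.
      FileSplit split = (FileSplit) context.getInputSplit();
      fileName = split.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
      // fileName (and the date encoded in it) is available to every map() call here.
    }
  }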

 On Thu, Jun 14, 2012 at 8:24 AM, Michael Parker
 michael.g.par...@gmail.com wrote:
 Hi all,

 I'm new to Hadoop MR and decided to make a go at using only the new
 API. I have a series of log files (who doesn't?), where a different
 date is encoded in each filename. The log files are so few that I'm
 not using HDFS. In my main method, I accept the input directory
 containing all the log files as the first command line argument:

  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, 
 args).getRemainingArgs();
  Path inputDir = new Path(otherArgs[0]);
  ...
  Job job1 = new Job(conf, "job1");
  FileInputFormat.addInputPath(job1, inputDir);

 I actually have two jobs chained using a JobControl, but I think
 that's irrelevant. The problem is that the mapper of this job cannot
 get the filename by reading the mapred.input.file key from the Context
 object, either in the mapper's setup method or in the call to map.
 Dumping the configuration like so:

  StringWriter writer = new StringWriter();
  Configuration.dumpConfiguration(context.getConfiguration(), writer);
  System.out.println("configuration=" + writer.toString());

 reveals that there is a mapred.input.dir key containing the path passed
 as a command-line argument and assigned to inputDir in my main method,
 but the filename currently being processed within that path is still
 inaccessible. Any ideas how to get it?

 Thanks,
 Mike



 --
 Harsh J


codec compression ratio

2012-06-14 Thread Marek Miglinski
When processing 65 billion records with the LZO or Snappy codecs, disk IO is at
100% because mappers are spilling all the time, but CPU is at 40%. Is there a
setting where I can raise the compression ratio for map/reduce internal temp data
(for LZO or Snappy), so that I can shift the effort onto the CPU and lower IO?
Google didn't give me any ideas...


Thanks.
Marek M.


RE: codec compression ratio

2012-06-14 Thread Tim Broberg
Have you considered deflate or bzip?

- Tim.
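
For what it's worth, a minimal sketch of the intermediate-compression settings
involved, assuming Hadoop 1.x property names; DefaultCodec is zlib/deflate, and
whether zlib.compress.level is honored depends on the codec and native libraries
in use:

  Configuration conf = new Configuration();
  // Compress intermediate (map) output; separate from final job output compression.
  conf.setBoolean("mapred.compress.map.output", true);
  // Trade CPU for ratio by switching the intermediate codec from LZO/Snappy to
  // zlib/deflate.
  conf.set("mapred.map.output.compression.codec",
           "org.apache.hadoop.io.compress.DefaultCodec");
  // zlib-based codecs read their level from this key (value shown is illustrative).
  conf.set("zlib.compress.level", "BEST_COMPRESSION");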


From: Marek Miglinski [mmiglin...@seven.com]
Sent: Thursday, June 14, 2012 1:39 AM
To: mapreduce-user@hadoop.apache.org
Subject: codec compression ratio

When processing 65 billion records with the LZO or Snappy codecs, disk IO is at
100% because mappers are spilling all the time, but CPU is at 40%. Is there a
setting where I can raise the compression ratio for map/reduce internal temp data
(for LZO or Snappy), so that I can shift the effort onto the CPU and lower IO?
Google didn't give me any ideas...


Thanks.
Marek M.



Error reading task output

2012-06-14 Thread Shamshad Ansari
Hi All,
When I run Hadoop jobs, I observe the following errors. Also, I notice that a
data node dies every time the job is initiated.

Does anyone know what may be causing this and how to solve it?

==

12/06/14 19:57:17 INFO input.FileInputFormat: Total input paths to process : 1
12/06/14 19:57:17 INFO mapred.JobClient: Running job: job_201206141136_0002
12/06/14 19:57:18 INFO mapred.JobClient:  map 0% reduce 0%
12/06/14 19:57:27 INFO mapred.JobClient: Task Id :
attempt_201206141136_0002_m_01_0, Status : FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

12/06/14 19:57:27 WARN mapred.JobClient: Error reading task
outputhttp://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_m_01_0&filter=stdout
12/06/14 19:57:27 WARN mapred.JobClient: Error reading task
outputhttp://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_m_01_0&filter=stderr
12/06/14 19:57:33 INFO mapred.JobClient: Task Id :
attempt_201206141136_0002_r_02_0, Status : FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

12/06/14 19:57:33 WARN mapred.JobClient: Error reading task
outputhttp://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_r_02_0&filter=stdout
12/06/14 19:57:33 WARN mapred.JobClient: Error reading task
outputhttp://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_r_02_0&filter=stderr
^Chadoop@ip-10-174-87-251:~/apixio-pipeline/pipeline-trigger$ 12/06/14
19:57:27 WARN mapred.JobClient: Error reading task
outputhttp:/node1:50060/sklog?plaintext=true&attemptid=attempt_201206141136_0002_m_01_0&filter=stdout

Thank you,
--Shamshad


Re: Error reading task output

2012-06-14 Thread Harsh J
Do you ship a lot of dist-cache files or perhaps have a bad
mapred.child.java.opts parameter?
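
For reference, a minimal sketch of a sane setting, assuming Hadoop 1.x property
names; a malformed value here (a stray quote, an unsupported JVM flag, or a heap
larger than the node can provide) typically makes every child JVM exit with
status 1 before the task starts, which matches the log above. The heap size is
illustrative:

  Configuration conf = new Configuration();
  // Heap for each child (task) JVM; total usage is this value multiplied by the
  // number of concurrent map and reduce slots on the node.
  conf.set("mapred.child.java.opts", "-Xmx512m");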

On Fri, Jun 15, 2012 at 1:39 AM, Shamshad Ansari sans...@apixio.com wrote:
 Hi All,
 When I run Hadoop jobs, I observe the following errors. Also, I notice that a
 data node dies every time the job is initiated.

 Does anyone know what may be causing this and how to solve it?

 ==

 12/06/14 19:57:17 INFO input.FileInputFormat: Total input paths to process : 1
 12/06/14 19:57:17 INFO mapred.JobClient: Running job: job_201206141136_0002
 12/06/14 19:57:18 INFO mapred.JobClient:  map 0% reduce 0%
 12/06/14 19:57:27 INFO mapred.JobClient: Task Id :
 attempt_201206141136_0002_m_01_0, Status : FAILED
 java.lang.Throwable: Child Error
         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
 Caused by: java.io.IOException: Task process exit with nonzero status of 1.
         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

 12/06/14 19:57:27 WARN mapred.JobClient: Error reading task
 outputhttp://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_m_01_0&filter=stdout
 12/06/14 19:57:27 WARN mapred.JobClient: Error reading task
 outputhttp://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_m_01_0&filter=stderr
 12/06/14 19:57:33 INFO mapred.JobClient: Task Id :
 attempt_201206141136_0002_r_02_0, Status : FAILED
 java.lang.Throwable: Child Error
         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
 Caused by: java.io.IOException: Task process exit with nonzero status of 1.
         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

 12/06/14 19:57:33 WARN mapred.JobClient: Error reading task
 outputhttp://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_r_02_0&filter=stdout
 12/06/14 19:57:33 WARN mapred.JobClient: Error reading task
 outputhttp://node1:50060/tasklog?plaintext=true&attemptid=attempt_201206141136_0002_r_02_0&filter=stderr
 ^Chadoop@ip-10-174-87-251:~/apixio-pipeline/pipeline-trigger$ 12/06/14
 19:57:27 WARN mapred.JobClient: Error reading task
 outputhttp:/node1:50060/sklog?plaintext=true&attemptid=attempt_201206141136_0002_m_01_0&filter=stdout

 Thank you,
 --Shamshad




-- 
Harsh J


Re: Passing key-value pairs between chained jobs?

2012-06-14 Thread Kasi Subrahmanyam
Hi Michael,
The problem in your second question can be solved by using
SequenceFileOutputFormat for the first job's output and
SequenceFileInputFormat for the second job's input.
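
A minimal sketch of the wiring, assuming the new API, that ActiveDayKey
implements WritableComparable as described, and that job1's output directory is
fed to job2 (variable names are illustrative):

  // Job 1: write keys/values in binary via their write(DataOutput) methods,
  // instead of text produced by toString().
  job1.setOutputFormatClass(
      org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.class);
  job1.setOutputKeyClass(ActiveDayKey.class);
  job1.setOutputValueClass(Text.class);

  // Job 2: read the same pairs back with their original types, so the mapper
  // receives ActiveDayKey keys rather than LongWritable line offsets.
  job2.setInputFormatClass(
      org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class);
  FileInputFormat.addInputPath(job2, job1OutputDir);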

On Thu, Jun 14, 2012 at 11:11 PM, Michael Parker michael.g.par...@gmail.com
 wrote:

 Hi all,

 One more question. I have two jobs to run serially using a JobControl.
 The key-value types for the outputs of the reducer of the first job
 are <ActiveDayKey, Text>, where ActiveDayKey is a class that
 implements WritableComparable. And so the key-value types for the
 inputs to the mapper of the second job are <ActiveDayKey, Text>. I'm
 noticing two things:

 First, in the output of the reducer from the first job, each
 ActiveDayKey object is being written as a string using its toString
 method. Since it implements WritableComparable and already knows
 how to serialize itself using write(DataOutput), is there any way to
 exploit that to write it in binary format? Otherwise, do I need to
 write a subclass of FileOutputFormat?

 Second, the second job fails with java.lang.ClassCastException:
 org.apache.hadoop.io.LongWritable cannot be cast to
 co.adhoclabs.LogProcessor$ActiveDayKey. I'm assuming this is because
 by default the key type is LongWritable (the line offset), and here I want
 to ignore that offset and use the ActiveDayKey written on the line
 itself as the key. Again, since ActiveDayKey knows how to deserialize
 itself using readFields(DataInput), is there any way to exploit that
 to read it from the line in binary format? Do I need to write a
 subclass of FileInputFormat?

 Assuming I need to write subclasses of FileOutputFormat and
 FileInputFormat classes, what's a good example of this? The terasort
 example?

 Thanks,
 Mike