Re: How to stop a mapper within a map-reduce job when you detect bad input

2010-10-21 Thread ed
Hello,

The MapRunner class looks promising.  I noticed it is in the deprecated
mapred package, but I didn't see an equivalent class in the mapreduce
package.  Is this going to be ported to mapreduce, or is it no longer being
supported?  Thanks!

~Ed
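
In the meantime, here is roughly what I have in mind for the old mapred API,
based on Harsh's suggestion below.  This is an untested sketch; the body of
the while loop is just a placeholder for the real map logic:

import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class SkipBadGzipMapRunner
    implements MapRunnable<LongWritable, Text, Text, Text> {

  public void configure(JobConf job) {
    // pick up any job settings you need here
  }

  public void run(RecordReader<LongWritable, Text> input,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    LongWritable key = input.createKey();
    Text value = input.createValue();
    try {
      // a truncated gzip stream throws EOFException from next(), not from the map code
      while (input.next(key, value)) {
        // per-record work goes here (or delegate to your existing map logic)
        output.collect(new Text(key.toString()), value);
      }
    } catch (EOFException e) {
      reporter.setStatus("Skipping corrupt input: " + e.getMessage());
    }
  }
}

It would then be wired up with conf.setMapRunnerClass(SkipBadGzipMapRunner.class)
in the driver, as Harsh describes.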

On Thu, Oct 21, 2010 at 6:36 AM, Harsh J qwertyman...@gmail.com wrote:

 If it occurs eventually as your record reader reads it, then you may
 use a MapRunner class instead of a Mapper IFace/Subclass. This way,
 you may try/catch over the record reader itself, and call your map
 function only on valid next()s. I think this ought to work.

 You can set it via JobConf.setMapRunnerClass(...).

 Ref: MapRunner API @

 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html

 On Wed, Oct 20, 2010 at 4:14 AM, ed hadoopn...@gmail.com wrote:
  Hello,
 
  I have a simple map-reduce job that reads in zipped files and converts
 them
  to lzo compression.  Some of the files are not properly zipped which
 results
  in Hadoop throwing an java.io.EOFException: Unexpected end of input
 stream
  error and causes the job to fail.  Is there a way to catch this
 exception
  and tell hadoop to just ignore the file and move on?  I think the
 exception
  is being thrown by the class reading in the Gzip file and not my mapper
  class.  Is this correct?  Is there a way to handle this type of error
  gracefully?
 
  Thank you!
 
  ~Ed
 



 --
 Harsh J
 www.harshj.com



Re: How to stop a mapper within a map-reduce job when you detect bad input

2010-10-21 Thread ed
Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs
before) and it doesn't look like MapRunner is deprecated so I'll try
catching the error there and will report back if it's a good solution.
Thanks!

~Ed

On Thu, Oct 21, 2010 at 11:23 AM, ed hadoopn...@gmail.com wrote:

 Hello,

 The MapRunner class looks promising.  I noticed it is in the deprecated
 mapred package, but I didn't see an equivalent class in the mapreduce
 package.  Is this going to be ported to mapreduce, or is it no longer being
 supported?  Thanks!

 ~Ed


 On Thu, Oct 21, 2010 at 6:36 AM, Harsh J qwertyman...@gmail.com wrote:

 If it occurs eventually as your record reader reads it, then you may
 use a MapRunner class instead of a Mapper IFace/Subclass. This way,
 you may try/catch over the record reader itself, and call your map
 function only on valid next()s. I think this ought to work.

 You can set it via JobConf.setMapRunnerClass(...).

 Ref: MapRunner API @

 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html

 On Wed, Oct 20, 2010 at 4:14 AM, ed hadoopn...@gmail.com wrote:
  Hello,
 
  I have a simple map-reduce job that reads in zipped files and converts
 them
  to lzo compression.  Some of the files are not properly zipped which
 results
  in Hadoop throwing an java.io.EOFException: Unexpected end of input
 stream
  error and causes the job to fail.  Is there a way to catch this
 exception
  and tell hadoop to just ignore the file and move on?  I think the
 exception
  is being thrown by the class reading in the Gzip file and not my mapper
  class.  Is this correct?  Is there a way to handle this type of error
  gracefully?
 
  Thank you!
 
  ~Ed
 



 --
 Harsh J
 www.harshj.com





Re: How to stop a mapper within a map-reduce job when you detect bad input

2010-10-21 Thread ed
Sorry to keep spamming this thread.  It looks like the correct way to
implement MapRunnable using the new mapreduce classes (instead of the
deprecated mapred) is to override the run() method of the mapper class.
This is actually nice and convenient since everyone should already be using
the Mapper class (org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT,
VALUEOUT>) for their mappers.

~Ed
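
For reference, the default run() in org.apache.hadoop.mapreduce.Mapper is just
this loop (paraphrasing the source), so overriding it really only means
wrapping the loop in a try/catch:

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}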

On Thu, Oct 21, 2010 at 12:14 PM, ed hadoopn...@gmail.com wrote:

 Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs
 before) and it doesn't look like MapRunner is deprecated so I'll try
 catching the error there and will report back if it's a good solution.
 Thanks!

 ~Ed


 On Thu, Oct 21, 2010 at 11:23 AM, ed hadoopn...@gmail.com wrote:

 Hello,

 The MapRunner class looks promising.  I noticed it is in the deprecated
 mapred package, but I didn't see an equivalent class in the mapreduce
 package.  Is this going to be ported to mapreduce, or is it no longer being
 supported?  Thanks!

 ~Ed


 On Thu, Oct 21, 2010 at 6:36 AM, Harsh J qwertyman...@gmail.com wrote:

 If it occurs eventually as your record reader reads it, then you may
 use a MapRunner class instead of a Mapper IFace/Subclass. This way,
 you may try/catch over the record reader itself, and call your map
 function only on valid next()s. I think this ought to work.

 You can set it via JobConf.setMapRunnerClass(...).

 Ref: MapRunner API @

 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html

 On Wed, Oct 20, 2010 at 4:14 AM, ed hadoopn...@gmail.com wrote:
  Hello,
 
  I have a simple map-reduce job that reads in zipped files and converts
 them
  to lzo compression.  Some of the files are not properly zipped which
 results
  in Hadoop throwing an java.io.EOFException: Unexpected end of input
 stream
  error and causes the job to fail.  Is there a way to catch this
 exception
  and tell hadoop to just ignore the file and move on?  I think the
 exception
  is being thrown by the class reading in the Gzip file and not my mapper
  class.  Is this correct?  Is there a way to handle this type of error
  gracefully?
 
  Thank you!
 
  ~Ed
 



 --
 Harsh J
 www.harshj.com






Re: How to stop a mapper within a map-reduce job when you detect bad input

2010-10-21 Thread ed
Thanks Tom! Didn't see your post before posting =)

On Thu, Oct 21, 2010 at 1:28 PM, ed hadoopn...@gmail.com wrote:

 Sorry to keep spamming this thread.  It looks like the correct way to
 implement MapRunnable using the new mapreduce classes (instead of the
 deprecated mapred) is to override the run() method of the mapper class.
 This is actually nice and convenient since everyone should already be using
 the Mapper class (org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT,
 VALUEOUT>) for their mappers.

 ~Ed


 On Thu, Oct 21, 2010 at 12:14 PM, ed hadoopn...@gmail.com wrote:

 Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs
 before) and it doesn't look like MapRunner is deprecated so I'll try
 catching the error there and will report back if it's a good solution.
 Thanks!

 ~Ed


 On Thu, Oct 21, 2010 at 11:23 AM, ed hadoopn...@gmail.com wrote:

 Hello,

 The MapRunner class looks promising.  I noticed it is in the deprecated
 mapred package, but I didn't see an equivalent class in the mapreduce
 package.  Is this going to be ported to mapreduce, or is it no longer being
 supported?  Thanks!

 ~Ed


 On Thu, Oct 21, 2010 at 6:36 AM, Harsh J qwertyman...@gmail.com wrote:

 If it occurs eventually as your record reader reads it, then you may
 use a MapRunner class instead of a Mapper IFace/Subclass. This way,
 you may try/catch over the record reader itself, and call your map
 function only on valid next()s. I think this ought to work.

 You can set it via JobConf.setMapRunnerClass(...).

 Ref: MapRunner API @

 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html

 On Wed, Oct 20, 2010 at 4:14 AM, ed hadoopn...@gmail.com wrote:
  Hello,
 
  I have a simple map-reduce job that reads in zipped files and converts
 them
  to lzo compression.  Some of the files are not properly zipped which
 results
  in Hadoop throwing an java.io.EOFException: Unexpected end of input
 stream
  error and causes the job to fail.  Is there a way to catch this
 exception
  and tell hadoop to just ignore the file and move on?  I think the
 exception
  is being thrown by the class reading in the Gzip file and not my
 mapper
  class.  Is this correct?  Is there a way to handle this type of error
  gracefully?
 
  Thank you!
 
  ~Ed
 



 --
 Harsh J
 www.harshj.com







Re: How to stop a mapper within a map-reduce job when you detect bad input

2010-10-21 Thread ed
I overrode the run() method in the mapper with a run() method (below) that
catches the EOFException.  The mapper and reducer now complete, but the
output LZO file from the reducer throws an "unexpected end of file" error
when decompressing it, indicating something did not clean up properly.  I
can't think of why this could be happening, as the map() method should only
be called on input that was properly decompressed (anything that can't be
decompressed throws an exception that is being caught).  The reducer then
should not even know that the mapper hit an EOFException in the input gzip
file, and yet the output LZO file still has the unexpected end of file
problem (I'm using the kevinweil LZO libraries).  Is there some call that
needs to be made that will close out the mapper and ensure that the LZO
output from the reducer is formatted properly?  Thank you!

@Override
public void run(Context context) throws IOException, InterruptedException {
  try {
    setup(context);
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  } catch (EOFException e) {
    logError(context, "EOFException: Corrupt gzip file " + mFileName);
  }
}


On Thu, Oct 21, 2010 at 1:29 PM, ed hadoopn...@gmail.com wrote:

 Thanks Tom! Didn't see your post before posting =)


 On Thu, Oct 21, 2010 at 1:28 PM, ed hadoopn...@gmail.com wrote:

 Sorry to keep spamming this thread.  It looks like the correct way to
 implement MapRunnable using the new mapreduce classes (instead of the
 deprecated mapred) is to override the run() method of the mapper class.
 This is actually nice and convenient since everyone should already be using
 the Mapper class (org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT,
 VALUEOUT>) for their mappers.

 ~Ed


 On Thu, Oct 21, 2010 at 12:14 PM, ed hadoopn...@gmail.com wrote:

 Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs
 before) and it doesn't look like MapRunner is deprecated so I'll try
 catching the error there and will report back if it's a good solution.
 Thanks!

 ~Ed


 On Thu, Oct 21, 2010 at 11:23 AM, ed hadoopn...@gmail.com wrote:

 Hello,

 The MapRunner class looks promising.  I noticed it is in the
 deprecated mapred package, but I didn't see an equivalent class in the
 mapreduce package.  Is this going to be ported to mapreduce or is it no
 longer being supported?  Thanks!

 ~Ed


 On Thu, Oct 21, 2010 at 6:36 AM, Harsh J qwertyman...@gmail.com wrote:

 If it occurs eventually as your record reader reads it, then you may
 use a MapRunner class instead of a Mapper IFace/Subclass. This way,
 you may try/catch over the record reader itself, and call your map
 function only on valid next()s. I think this ought to work.

 You can set it via JobConf.setMapRunnerClass(...).

 Ref: MapRunner API @

 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html

 On Wed, Oct 20, 2010 at 4:14 AM, ed hadoopn...@gmail.com wrote:
  Hello,
 
  I have a simple map-reduce job that reads in zipped files and
 converts them
  to lzo compression.  Some of the files are not properly zipped which
 results
  in Hadoop throwing an java.io.EOFException: Unexpected end of input
 stream
  error and causes the job to fail.  Is there a way to catch this
 exception
  and tell hadoop to just ignore the file and move on?  I think the
 exception
  is being thrown by the class reading in the Gzip file and not my
 mapper
  class.  Is this correct?  Is there a way to handle this type of error
  gracefully?
 
  Thank you!
 
  ~Ed
 



 --
 Harsh J
 www.harshj.com








Re: Setting num reduce tasks

2010-10-21 Thread ed
You could also try

job.setNumReduceTasks(yourNumber);

~Ed
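
Depending on which API you're on, it would look something like this in the
driver (20 is just an example value), and either way it has to be set before
the job is submitted:

// new API (org.apache.hadoop.mapreduce.Job)
job.setNumReduceTasks(20);

// old API (org.apache.hadoop.mapred.JobConf)
conf.setNumReduceTasks(20);   // equivalent to conf.setInt("mapred.reduce.tasks", 20)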

On Thu, Oct 21, 2010 at 4:45 PM, Alex Kozlov ale...@cloudera.com wrote:

 Hi Matt, it might be that the parameter does not end up in the final
 configuration for a number of reasons.  Can you check the job config xml in
 jt:/var/log/hadoop/history or in the JT UI and see what the
 mapred.reduce.tasks setting is?  -- Alex K

 On Thu, Oct 21, 2010 at 1:39 PM, Matt Tanquary matt.tanqu...@gmail.com
 wrote:

  I am using the following to set my number of reduce tasks, however when I
  run my job it's always using just 1 reducer.
 
  conf.setInt("mapred.reduce.tasks", 20);
 
  1 reducer will never finish this job. Please help me to understand why
 the
  setting I choose is not used.
 
  Thanks,
  -M@
 



LZO Compression Libraries don't appear to work properly with MultipleOutputs

2010-10-21 Thread ed
Hello everyone,

I am having problems using MultipleOutputs with LZO compression (could be a
bug or something wrong in my own code).

In my driver I set

 MultipleOutputs.addNamedOutput(job, "test", TextOutputFormat.class,
NullWritable.class, Text.class);

In my reducer I have:

 MultipleOutputs<NullWritable, Text> mOutput = new
MultipleOutputs<NullWritable, Text>(context);

 public String generateFileName(Key key){
return custom_file_name;
 }

Then in the reduce() method I have:

 mOutput.write(mNullWritable, mValue, generateFileName(key));

This results in creating LZO files that do not decompress properly (lzop -d
throws the error "lzop: unexpected end of file: outputFile.lzo").

If I switch back to the regular context.write(mNullWritable, mValue);
everything works fine.

Am I forgetting a step needed when using MultipleOutputs, or is this a
bug/non-feature of using LZO compression in Hadoop?

Thank you!


~Ed


Re: How to stop a mapper within a map-reduce job when you detect bad input

2010-10-21 Thread ed
So the overridden run() method was a red herring.  The real problem appears
to be that I use MultipleOutputs (the new mapreduce API version) for my
reducer output.  I posted a different thread since it's not really related
to the original question here.  For everyone who was curious, it turns out
that overriding the run() method and catching the EOFException works
beautifully for processing files that might be corrupt or have errors. Thanks!

~Ed

On Thu, Oct 21, 2010 at 2:07 PM, ed hadoopn...@gmail.com wrote:

 I overwrote the run() method in the mapper with a run() method (below) that
 catches the EOFException.  The mapper and reducer now complete but the
 outputted lzo file from the reducer throws an Unexpected End of File error
 when decompressing it indicating something did not clean up properly.  I
 can't think of why this could be happening as the map() method should only
 be called on input that was properly decompressed (anything that can't be
 decompressed will throw an Exception that is being caught).  The reducer
 then should not even know that the mapper hit an EOFException in the input
 gzip file, and yet the output lzo file still has the unexpected end of file
 problem (I'm using the kevinweil lzo libraries).  Is there some call that
 needs to be made that will close out the mapper and ensure that the lzo
 output from the reducer is formatted properly?  Thank you!

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    try {
      setup(context);
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
      cleanup(context);
    } catch (EOFException e) {
      logError(context, "EOFException: Corrupt gzip file " + mFileName);
    }
  }


 On Thu, Oct 21, 2010 at 1:29 PM, ed hadoopn...@gmail.com wrote:

 Thanks Tom! Didn't see your post before posting =)


 On Thu, Oct 21, 2010 at 1:28 PM, ed hadoopn...@gmail.com wrote:

 Sorry to keep spamming this thread.  It looks like the correct way to
 implement MapRunnable using the new mapreduce classes (instead of the
 deprecated mapred) is to override the run() method of the mapper class.
 This is actually nice and convenient since everyone should already be using
 the Mapper class (org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT,
 VALUEOUT>) for their mappers.

 ~Ed


 On Thu, Oct 21, 2010 at 12:14 PM, ed hadoopn...@gmail.com wrote:

 Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs
 before) and it doesn't look like MapRunner is deprecated so I'll try
 catching the error there and will report back if it's a good solution.
 Thanks!

 ~Ed


 On Thu, Oct 21, 2010 at 11:23 AM, ed hadoopn...@gmail.com wrote:

 Hello,

 The MapRunner class looks promising.  I noticed it is in the
 deprecated mapred package, but I didn't see an equivalent class in the
 mapreduce package.  Is this going to be ported to mapreduce or is it no
 longer being supported?  Thanks!

 ~Ed


 On Thu, Oct 21, 2010 at 6:36 AM, Harsh J qwertyman...@gmail.com wrote:

 If it occurs eventually as your record reader reads it, then you may
 use a MapRunner class instead of a Mapper IFace/Subclass. This way,
 you may try/catch over the record reader itself, and call your map
 function only on valid next()s. I think this ought to work.

 You can set it via JobConf.setMapRunnerClass(...).

 Ref: MapRunner API @

 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html

 On Wed, Oct 20, 2010 at 4:14 AM, ed hadoopn...@gmail.com wrote:
  Hello,
 
  I have a simple map-reduce job that reads in zipped files and
 converts them
  to lzo compression.  Some of the files are not properly zipped which
 results
  in Hadoop throwing an java.io.EOFException: Unexpected end of input
 stream
  error and causes the job to fail.  Is there a way to catch this
 exception
  and tell hadoop to just ignore the file and move on?  I think the
 exception
  is being thrown by the class reading in the Gzip file and not my
 mapper
  class.  Is this correct?  Is there a way to handle this type of
 error
  gracefully?
 
  Thank you!
 
  ~Ed
 



 --
 Harsh J
 www.harshj.com









Re: LZO Compression Libraries don't appear to work properly with MultipleOutputs

2010-10-21 Thread ed
Hi Todd,

I don't have the code in front of me right now, but I was looking over the API
docs and it looks like I forgot to call close() on the MultipleOutputs.  I'll
post back if that fixes the problem.  If not, I'll put together a unit test.
Thanks!

~Ed
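
In case it helps anyone else, this is the shape of reducer I think it should
be: an untested sketch with placeholder key/value types, using the new-API
MultipleOutputs and calling close() in cleanup():

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class LzoMultipleOutputsReducer
    extends Reducer<Text, Text, NullWritable, Text> {

  private MultipleOutputs<NullWritable, Text> mOutput;

  @Override
  protected void setup(Context context) {
    mOutput = new MultipleOutputs<NullWritable, Text>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      mOutput.write(NullWritable.get(), value, generateFileName(key));
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    mOutput.close();   // the call I had been missing
  }

  private String generateFileName(Text key) {
    return "custom_file_name";   // placeholder, as in my earlier post
  }
}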

On Thu, Oct 21, 2010 at 6:31 PM, Todd Lipcon t...@cloudera.com wrote:

 Hi Ed,

 Sounds like this might be a bug, either in MultipleOutputs or in LZO.

 Does it work properly with gzip compression? Which LZO implementation
 are you using? The one from google code or the more up to date one
 from github (either kevinweil's or mine)?

 Any chance you could write a unit test that shows the issue?

 Thanks
 -Todd

 On Thu, Oct 21, 2010 at 2:52 PM, ed hadoopn...@gmail.com wrote:
  Hello everyone,
 
  I am having problems using MultipleOutputs with LZO compression (could be
 a
  bug or something wrong in my own code).
 
  In my driver I set
 
  MultipleOutputs.addNamedOutput(job, "test", TextOutputFormat.class,
  NullWritable.class, Text.class);
 
  In my reducer I have:
 
  MultipleOutputs<NullWritable, Text> mOutput = new
  MultipleOutputs<NullWritable, Text>(context);
 
  public String generateFileName(Key key){
 return custom_file_name;
  }
 
  Then in the reduce() method I have:
 
  mOutput.write(mNullWritable, mValue, generateFileName(key));
 
  This results in creating LZO files that do not decompress properly (lzop -d
  throws the error "lzop: unexpected end of file: outputFile.lzo")
 
  If I switch back to the regular context.write(mNullWritable, mValue);
  everything works fine.
 
  Am I forgetting a step needed when using MultipleOutputs or is this a
  bug/non-feature of using LZO compression in Hadoop.
 
  Thank you!
 
 
  ~Ed
 



 --
 Todd Lipcon
 Software Engineer, Cloudera



Re: Upgrading Hadoop from CDH3b3 to CDH3

2010-10-20 Thread ed
I don't think there is a stable CDH3 yet, although we've been using CDH3B2
and it has been pretty stable for us (at least I don't see it available on
their website, and they JUST announced CDH3B3 last week at Hadoop World).

~Ed


On Wed, Oct 20, 2010 at 5:57 AM, Abhinay Mehta abhinay.me...@gmail.com wrote:

 Hi all,

 We currently have Cloudera's Hadoop beta 3 installed on our cluster, we
 would like to upgrade to the latest stable release CDH3.
 Is there documentation or recommended steps on how to do this?

 We found some docs on how to upgrade from CDH2 and CDHb2 to CDHb3 here:

 https://docs.cloudera.com/display/DOC/Hadoop+Upgrade+from+CDH2+or+CDH3b2+to+CDH3b3
 Are the same steps recommended to upgrade to CDH3?

 I'm hoping it's a lot easier to upgrade from beta3 to the latest stable
 version than that document states?

 Thank you.
 Abhinay Mehta



Re: Reduce function

2010-10-19 Thread ed
Keys are partitioned among the reducers using a partition function which is
specified in the aptly named Partitioner class.  By default, Hadoop will
hash the key (and probably mod the hash by the number of reducers) to
determine which reducer to send your key to (I say probably because I
haven't looked at the actual code).  What this means for you is that if you
set a custom bit in the key field, keys with different bits are not
guaranteed to go to the same reducer even if the rest of the key is the same.
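
(For what it's worth, the default HashPartitioner really is just a
hash-and-mod; its getPartition() is essentially the following.)

public int getPartition(K key, V value, int numReduceTasks) {
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}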

For example

 Key1 = (DataX+BitA) -- Reducer1
 Key2 = (DataX+BitB) -- Reducer2

What you want is for any key with the same Data to go to the same reducer
regardless of the bit value.  To do this you need to write your own
partitioner class and set your job to use that class using

 job.setPartitionerClass(MyCustomPartitioner.class)

Your custom partitioner will need to break apart your key and only hash on
the DataX part of it.

The partitioner class is really easy to override and will look something
like this:

 public class MyCustomPartitioner extends Partitioner<Key, Value> {
  @Override
  public int getPartition(Key key, Value value, int numPartitions){
   // split your key so that the bit flag is removed, then hash the
   // remaining DataX part and mod it by numPartitions
   String data = stripBitFlag(key);  // stripBitFlag is an illustrative helper
   return (data.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
 }

Of course Key and Value would be whatever Key and Value class you're using.

Hope that helps.

~Ed



On Mon, Oct 18, 2010 at 8:58 PM, Brad Tofel b...@archive.org wrote:

 Whoops, just re-read your message, and see you may be asking about
 targeting a reduce callback function, not a reduce task..

 If that's the case, I'm not sure I understand what your bit/tag is for,
 and what you're trying to do with it. Can you provide a concrete example
 (not necessarily code) of some keys which need to group together?

 Is there a way to embed the bit within the value, so keys are always
 common?

 If you really need to fake out the system so different keys arrive in the
 same reduce, you might be able to do it with a combination of:

 org.apache.hadoop.mapreduce.Job

 .setSortComparatorClass()
 .setGroupingComparatorClass()
 .setPartitionerClass()

 Brad


 On 10/18/2010 05:41 PM, Brad Tofel wrote:

 The Partitioner implementation used with your job should define which
 reduce target receives a given map output key.

 I don't know if an existing Partitioner implementation exists which meets
 your needs, but it's not a very complex interface to develop, if nothing
 existing works for you.

 Brad

 On 10/18/2010 04:43 PM, Shi Yu wrote:

 How many tags you have? If you have several number of tags, you'd better
 create a Vector class to hold those tags. And define sum function to
 increment the values of tags. Then the value class should be your new Vector
 class. That's better and more decent than the Textpair approach.

 Shi

 On 2010-10-18 5:19, Matthew John wrote:

 Hi all,

 I had a small doubt regarding the reduce module. What I understand is
 that
 after the shuffle / sort phase , all the records with the same key value
 goes into a reduce function. If thats the case, what is the attribute of
 the
 Writable key which ensures that all the keys go to the same reduce ?

 I am working on a reduce side Join where I need to tag all the keys with
 a
 bit which might vary but still want all those records to go into same
 reduce. In Hadoop the Definitive Guide, pg. 235 they are using  TextPair
 for
 the key. But I dont understand how the keys with different tag
 information
 goes into the same reduce.

 Matthew








Re: io.sort.mb maximum limit

2010-10-19 Thread ed
Hi Donovan,

This is sort of tangential to your question but we tried upping our
io.sort.mb to a really high value and it actually resulted in slower
performance (I think we bumped it up to 1400MB and this was slower than
leaving it at 256 on a machine with 32GB of RAM).  I'm not entirely sure why
this was the case.  It could have been a garbage collection issue or some
other secondary effect that was slowing things down.

Keep in mind that Hadoop will always spill map outputs to disk no matter how
large your sort buffer is: if a reducer fails, the data needs to exist on
disk somewhere for the next reduce attempt, so it might be counterproductive
to try to eliminate spills entirely.
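
If you still want to experiment with it, it's just a per-job setting; for
example (the values here are only illustrative):

conf.setInt("io.sort.mb", 256);      // map-side sort buffer, in MB
conf.setInt("io.sort.factor", 25);   // number of spill streams merged at once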

~Ed


On Tue, Oct 19, 2010 at 8:02 AM, Donovan Hide donovanh...@gmail.com wrote:

 Hi,
 is there a reason why the io.sort.mb setting is hard-coded to the
 maximum of 2047MB?

 MapTask.java 789-791

   if ((sortmb & 0x7FF) != sortmb) {
     throw new IOException("Invalid \"io.sort.mb\": " + sortmb);
   }

 Given that the EC2 High-Memory Quadruple Extra Large Instance has
 68.4GB of memory and 8 cores, it would make sense to be able to set
 the io.sort.mb to close to 8GB. I have map task that outputs
 144,586,867 records of average size 12 bytes, and a greater than
 2047MB sort buffer would allow me to prevent the inevitable spills. I
 know I can reduce the size of the map inputs to solve the problem, but
 2047MB seems a bit arbitrary given the spec of EC2 instances.

 Cheers,
 Donovan.



How to stop a mapper within a map-reduce job when you detect bad input

2010-10-19 Thread ed
Hello,

I have a simple map-reduce job that reads in zipped files and converts them
to LZO compression.  Some of the files are not properly zipped, which results
in Hadoop throwing a "java.io.EOFException: Unexpected end of input stream"
error and causes the job to fail.  Is there a way to catch this exception
and tell Hadoop to just ignore the file and move on?  I think the exception
is being thrown by the class reading in the gzip file and not my mapper
class.  Is this correct?  Is there a way to handle this type of error
gracefully?

Thank you!

~Ed


Re: Set number Reducer per machines.

2010-10-06 Thread ed
Ah yes,

It looks like both the mapper and reducer are using a map structure which
will be created on the heap.  All the values from the reducer are being
inserted into the map structure.  If you have lots of values for a single
key then you're going to run out of heap memory really fast.  Do you have a
rough estimate for the number of values per key?  We had this problem when
we first started using map-reduce (we'd create large arrays in the reducer
to hold data to sort).  Turns out this is generally a very bad idea (it's
particularly bad when the number of values per key is not bounded since
sometimes your algorithm will work and other times you'll get out of
memory errors).  In our case we redesigned our algorithm to not require
holding lots of values in memory by taking advantage of Hadoop's sorting
capability and secondary sorting capability.
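
The wiring for that looks roughly like the following; the three classes are
illustrative names for the comparators/partitioner you would write yourself:

job.setPartitionerClass(NaturalKeyPartitioner.class);       // partition on the natural key only
job.setSortComparatorClass(CompositeKeyComparator.class);   // sort by natural key, then by the value part
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);  // group reduce() calls by natural key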

My guess is you won't be able to use the cloud9 mapper and reducer unless
your data changes so that the number of unique values per key is much
lower.  It's also possible that you're running out of heap space in the
mapper as you create the map there.  How many items are in the terms
array?

String[] terms = text.split("\\s+");

Sorry that's probably not much help to you.

~Ed

On Wed, Oct 6, 2010 at 8:04 AM, Pramy Bhats pramybh...@googlemail.com wrote:

 Hi Ed,
 I was using the following file for mapreduce job.

 Cloud9/src/dist/edu/umd/cloud9/example/cooccur/ComputeCooccurrenceMatrixStripes.java
 thanks,
 --Pramod

 On Tue, Oct 5, 2010 at 10:51 PM, ed hadoopn...@gmail.com wrote:

  What are the exact files you are using for the mapper and reducer from
 the
  cloud9 package?
 
  On Tue, Oct 5, 2010 at 2:15 PM, Pramy Bhats pramybh...@googlemail.com
  wrote:
 
   Hi Ed,
  
   I was trying to benchmark some application code available online.
   http://github.com/lintool/Cloud9
  
   For the program computing concurrentmatrix strips. However, the code
  itself
   is problematic because it throws heap-space error for even very small
  data
   sets.
  
   thanks,
   --Pramod
  
  
  
   On Tue, Oct 5, 2010 at 5:50 PM, ed hadoopn...@gmail.com wrote:
  
Hi Pramod,
   
How much memory does each node in your cluster have?
   
What type of processors do those nodes have? (dual core, quad core,
  dual
quad core? etc..)
   
In what step are you seeing the heap space error (mapper or reducer?)
   
It's quite possible that you're mapper or reducer code could be
  improved
   to
reduce heap space usage.
   
~Ed
   
On Tue, Oct 5, 2010 at 10:05 AM, Marcos Medrado Rubinelli 
marc...@buscape-inc.com wrote:
   
 You can set the mapred.tasktracker.map.tasks.maximum and
 mapred.tasktracker.reduce.tasks.maximum properties in your
mapred-site.xml
 file, but you may also want to check your current
   mapred.child.java.opts
and
 mapred.child.ulimit values to make sure they aren't overriding the
  4GB
you
 set globally.

 Cheers,
 Marcos

  Hi,

 I am trying to run a job on my hadoop cluster, where I get
   consistently
 get
 heap space error.

 I increased the heap-space to 4 GB in hadoop-env.sh and reboot the
 cluster.
 However, I still get the heap space error.


 One of things, I want to try is to reduce the number of map /
 reduce
 process
 per machine. Currently each machine can have 2 maps and 2 reduce
   process
 running.


 I want to configure the hadoop to run 1 map and 1 reduce per
 machine
   to
 give
 more heap space per process.

 How can I configure the number of maps and number of reducer per
  node
   ?


 thanks in advance,
 -- Pramod



   
  
 



Re: Read/Writing into HDFS

2010-09-30 Thread ed
I haven't tried it out yet, but you can theoretically mount HDFS as a
standard file system in Linux using FUSE:

http://wiki.apache.org/hadoop/MountableHDFS

If you're using Cloudera's distro of Hadoop it should come with fuse
prepackaged for you:

https://wiki.cloudera.com/display/DOC/Mountable+HDFS

~Ed
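
If you'd rather do it programmatically than mount anything, the FileSystem
API is the usual route.  A rough sketch (the namenode address is a
placeholder), which works from any machine that can reach the cluster:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode-host:9000");  // your namenode
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path(args[0]));  // an HDFS path
    IOUtils.copyBytes(in, System.out, 4096, true);      // copy to stdout, then close
  }
}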

On Thu, Sep 30, 2010 at 7:59 AM, Adarsh Sharma adarsh.sha...@orkash.com wrote:

 Dear all,
 I have set up a Hadoop cluster of 10 nodes.
 I want to know that how we can read/write file from HDFS (simple).
 Yes I know there are commands, i read the whole HDFS commands.
 bin/hadoop -copyFromLocal tells that the file should be in localfilesystem.

 But I want to know that how we can read these files from  the cluster.
 What are the different ways to read files from HDFS.
 *Can a extra node ( other than the cluster nodes )  read file from the
 cluster.
 If yes , how?

 *Thanks in Advance*
 *



Re: How to config Map only job to read .gz input files and output result in .lzo

2010-09-28 Thread ed
I've had luck doing the following in main (assuming LZO is set up properly;
I'm using Hadoop 0.20.2):

 FileOutputFormat.setCompressOutput(job, true);
 FileOutputFormat.setOutputCompressorClass(job,
     com.hadoop.compression.lzo.LzopCodec.class);

Make sure kevin weil's jar file is accessible when building your jar, and is
available on the cluster.
You should see Lzo being loaded each time you run a job at the beginning

Something like:

INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library

(you should see both lines to make sure hadoop sees your jar and native
library)

Hope that works!

~Ed

On Tue, Sep 28, 2010 at 3:06 PM, Steve Kuo kuosen...@gmail.com wrote:

 We have TB worth of XML data in .gz format where each file is about 20 MB.
 This dataset is not expected to change.  My goal is to write a map-only job
 to read in one .gz file at a time and output the result in .lzo format.
 Since there are a large number of .gz files, the map parallelism is
 expected
 to be maximized.  I am using Kevin Weil's LZO distribution and there does
 not seem to be a LzoTextOutputFormat.  When I got lzo to work before, I set
 InputFormatClass to LzoTextInputFormat.class and map's output got lzo
 compressed automatically.  What does one configure for LZO output.

 Current Job configuration code listed below does not work.  XmlInputFormat
 is my custom input format to read XML files.

job.setInputFormatClass(XmlInputFormat.class);
job.setMapperClass(XmlAnalyzer.XmlAnalyzerMapper.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

 String mapredOutputCompress = conf.get("mapred.output.compress");
 if ("true".equals(mapredOutputCompress))
// this reads input and write output in lzo format
job.setInputFormatClass(LzoTextInputFormat.class);



Re: Proper blocksize and io.sort.mb setting when using compressed LZO files

2010-09-27 Thread ed
Ah okay,

I did not see the fs.inmemory.size.mb setting in any of the default config
files located here:

http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html
http://hadoop.apache.org/common/docs/r0.20.2/core-default.html
http://hadoop.apache.org/common/docs/r0.20.2/hdfs-default.html

Should this be something that needs to be added?

Thank you for the help!

~Ed

On Mon, Sep 27, 2010 at 11:18 AM, Ted Yu yuzhih...@gmail.com wrote:

 The setting should be fs.inmemory.size.mb

 On Mon, Sep 27, 2010 at 7:15 AM, pig hadoopn...@gmail.com wrote:

  HI Sriguru,
 
  Thank you for the tips.  Just to clarify a few things.
 
  Our machines have 32 GB of RAM.
 
  I'm planning on setting each machine to run 12 mappers and 2 reducers
 with
  the heap size set to 2048MB so total memory usage for the heap at 28GB.
 
  If this is the case should io.sort.mb be set to 70% of 2048MB (so ~1400
  MB)?
 
  Also, I did not see a fs.inmemorysize.mb setting in any of the hadoop
  configuration files.  Is that the correct setting I should be looking
 for?
  Should this also be set to 70% of the heap size or does it need to share
  with the io.sort.mb setting.
 
  I assume if I'm bumping up io.sort.mb that much I also need to increase
  io.sort.factor from the default of 10.  Is there a recommended relation
  between these two?
 
  Thank you for your help!
 
  ~Ed
 
  On Sun, Sep 26, 2010 at 3:05 AM, Srigurunath Chakravarthi 
  srig...@yahoo-inc.com wrote:
 
   Ed,
Tuning io.sort.mb will be certainly worthwhile if you have enough RAM
 to
   allow for a higher Java heap per map task without risking swapping.
  
Similarly, you can decrease spills on the reduce side using
   fs.inmemorysize.mb.
  
   You can use the following thumb rules for tuning those two:
  
   - Set these to ~70% of Java heap size. Pick heap sizes to utilize ~80%
  RAM
   across all processes (maps, reducers, TT, DN, other)
   - Set it small enough to avoid swap activity, but
   - Set it large enough to minimize disk spills.
   - Ensure that io.sort.factor is set large enough to allow full use of
   buffer space.
   - Balance space for output records (default 95%) & record meta-data
 (5%).
   Use io.sort.spill.percent and io.sort.record.percent
  
Your mileage may vary. We've seen job exec time improvements worth
 1-3%
   via spill-avoidance for miscellaneous applications.
  
Your other option of running a map per 32MB or 64MB of input should
 give
   you better performance if your map task execution time is significant
  (i.e.,
   much larger than a few seconds) compared to the overhead of launching
 map
   tasks and reading input.
  
   Regards,
   Sriguru
  
   -Original Message-
   From: pig [mailto:hadoopn...@gmail.com]
   Sent: Saturday, September 25, 2010 2:36 AM
   To: common-user@hadoop.apache.org
   Subject: Proper blocksize and io.sort.mb setting when using compressed
   LZO files
   
   Hello,
   
   We just recently switched to using lzo compressed file input for our
   hadoop
   cluster using Kevin Weil's lzo library.  The files are pretty uniform
   in
   size at around 200MB compressed.  Our block size is 256MB.
   Decompressed the
   average LZO input file is around 1.0GB.  I noticed lots of our jobs
 are
   now
   spilling lots of data to disk.  We have almost 3x more spilled records
   than
   map input records for example.  I'm guessing this is because each
   mapper is
   getting a 200 MB lzo file which decompresses into 1GB of data per
   mapper.
   
   Would you recommend solving this by reducing the block size to 64MB,
 or
   even
   32MB and then using the LZO indexer so that a single 200MB lzo file is
   actually split among 3 or 4 mappers?  Would it be better to play with
   the
   io.sort.mb value?  Or, would it be best to play with both? Right now
   the
   io.sort.mb value is the default 200MB. Have other lzo users had to
   adjust
   their block size to compensate for the expansion of the data after
   decompression?
   
   Thank you for any help!
   
   ~Ed
  
 



Re: Reducer-side join example

2010-04-06 Thread Ed Kohlwey
Hi,
Your question has an academic sound, so I'll give it an academic answer ;).
Unfortunately, there are not really any good generalized (ie. cross join a
large matrix with a large matrix) methods for doing joins in map-reduce. The
fundamental reason for this is that in the general case you're comparing
everything to everything, and so for each pair of possible rows, you must
actually generate each pair of rows. This means every node ships all its
data to every other node, no matter what (in the general case). I bring this
up not because you're looking to optimize cross joining, but because it
demonstrates the point that you will exploit the characteristics of your
data no matter what strategy you choose, and each will have domain-specific
flaws and advantages.

The typical strategy for a reduce side join is to use hadoop's sorting
functionality to group rows by their keys, such that the entire data set for
a particular key will be resident on a single reducer. The key insight is
that you're thinking about the join as a sorting problem. Yes this means you
risk producing data sets that fill your reducers, but thats a trade-off that
you accept to reduce the complexity of the original problem.

If the existing join framework in hadoop (whose javadocs are quite thorough)
is inadequate, you shouldn't be afraid to invent, implement, and test join
strategies that are specific to your domain.


On Tue, Apr 6, 2010 at 11:01 AM, M B machac...@gmail.com wrote:

 Thanks, I appreciate the example - what happens if File A and B have many
 more columns (all different data types)?  The logic doesn't seem to work in
 that case - unless we set up the values in the Map function to include the
 file name (maybe the output value is a HashMap or something, which might
 work).

 Also, I was asking to see a reduce-side join as we have other things going
 on in the Mapper and I'm not sure if we can tweak it's output (we send
 output to multiple places).  Does anyone have an example using the
 contrib/DataJoin or something similar?

 thanks

 On Mon, Apr 5, 2010 at 7:03 PM, He Chen airb...@gmail.com wrote:

  For the Map function:
  Input key: default
  input value: File A and File B lines
 
  output key: A, B, C, ... (first column of the final result)
  output value: 12, 24, Car, 13, Van, SUV...
 
  Reduce function:
  take the Map output and do:
  for each key
  {   if the value of a key is an integer
         then save it to array1;
      else save it to array2
  }
  for ith element in array1
    for jth element in array2
      output(key, array1[i] + "\t" + array2[j]);
  done
 
  Hope this helps.
 
 
  On Mon, Apr 5, 2010 at 4:10 PM, M B machac...@gmail.com wrote:
 
   Hi, I need a good java example to get me started with some joining we
  need
   to do, any examples would be appreciated.
  
   File A:
   Field1  Field2
   A       12
   B       13
   C       22
   A       24
  
   File B:
    Field1  Field2   Field3
    A       Car      ...
    B       Truck    ...
    B       SUV      ...
    B       Van      ...
  
   So, we need to first join File A and B on Field1 (say both are string
   fields).  The result would just be:
   A   12   Car   ...
   A   24   Car   ...
   B   13   Truck   ...
   B   13   SUV   ...
B   13   Van   ...
   and so on - with all the fields from both files returning.
  
   Once we have that, we sometimes need to then transform it so we have a
   single record per key (Field1):
   A (12,Car) (24,Car)
   B (13,Truck) (13,SUV) (13,Van)
   --however it looks, basically tuples for each key (we'll modify this
  later
   to return a conatenated set of fields from B, etc)
  
   At other times, instead of transforming to a single row, we just need
 to
   modify rows based on values.  So if B.Field2 equals Van, we need to
 set
   Output.Field2 = whatever then output to file ...
  
   Are there any good examples of this in native java (we can't use
   pig/hive/etc)?
  
   thanks.
  
 
 
 
  --
  Best Wishes!
 
 
  --
  Chen He
   PhD. student of CSE Dept.
  Holland Computing Center
  University of Nebraska-Lincoln
  Lincoln NE 68588
 



Re: question on shuffle and sort

2010-03-30 Thread Ed Mazur
On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote:
  Did all key-value pairs of the map output, which have the same key, will
 be sent to the same reducer tasknode?

Yes, this is at the core of the MapReduce model. There is one call to
the user reduce function per unique map output key. This grouping is
achieved by sorting which means you see keys in increasing order.

Ed


Re: Strange behavior regarding stout,stderr,syslog

2010-03-14 Thread Ed Mazur
On Sat, Mar 13, 2010 at 10:57 PM, patektek wrote:
 Hello,
 I am using hadoop-0.20.1. Something very strange is happening with the log
 files (stdout, stderr, syslog).
 Basically, no log files are created for most of the tasks (in
 HOME_HADOOP/logs/userlogs).  However, when I check the history for each
 individual task in the website interface I can see that all  the  data is
 available for each individual task.  Furthermore, the few log files
 available (for a few tasks) seem to have the log files from multiple tasks
 mingled together.

 Any hint?


Do you have JVM reuse enabled? With it enabled, I've observed that all
tasks associated with a particular JVM go to the same log.

Ed
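
It's controlled by mapred.job.reuse.jvm.num.tasks (1 by default, meaning no
reuse; -1 means unlimited reuse), so setting it per job is one way to check
whether that's what you're seeing:

conf.setInt("mapred.job.reuse.jvm.num.tasks", 1);   // 1 = one task per JVM, -1 = unlimited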


Re: sort done parallel or after copy ?

2010-03-05 Thread Ed Mazur
Hi Prasen,

The data that reduce tasks receive during shuffle (copy) has already
been sorted by map tasks, so they just have to be merged.

This merge happens in parallel with the shuffle. When a reduce task's
in-memory buffer of sorted map output files reaches a certain
threshold, they are merged and written to disk. If the number of
on-disk files created through this process exceeds 2n-1 where
n=io.sort.factor (10 by default), n of these files get merged so that
there are n remaining. When the shuffle ends, there can be anywhere
between 0 to 2n-1 files on disk to be merged still. These get merged
down to (at most) n files and a final merge goes directly into the
user reduce function.

Ed

On Fri, Mar 5, 2010 at 12:36 AM, prasenjit mukherjee
prasen@gmail.com wrote:
 if I understand correctly reduce has 3 stages : copy,sort,reduce. Copy
 happens  parallely with mappers  still running. Reduce has to wait
 till all the mappers are done.

 For sorting we could have 2 options :
 1) Entire sorting happens after copy ( in a single shot ) OR
 2) It could happen along with copy where each block is sorted and
 later merged ( via merge-sort  )

 How is it being currently done in hadoop's latest version ?

 -Thanks,
 Prasen



Re: Writing a simple sort application for Hadoop

2010-02-28 Thread Ed Mazur
Hi Abhishek,

If you use input lines as your output keys in map, Hadoop internals
will do the work for you and the keys will appear in sorted order in
your reduce (you can use IdentityReducer). This needs a slight
adjustment if your input lines aren't unique.
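
A sketch of the map side of that, assuming plain text input (the reducer can
be the identity):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineSortMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // the line becomes the key; the framework's sort/merge does the rest
    context.write(line, NullWritable.get());
  }
}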

If you have R reducers, this will create R sorted files. If you want a
single sorted file, you can merge the R files or use 1 reducer.
Another way is to use TotalOrderPartitioner which will ensure all keys
in reduce N come after all keys in reduce N-1.

Owen O'Malley and Arun C. Murthy's paper [1] about using Hadoop to win
a sorting competition might be of interest to you.

Ed

[1] http://sortbenchmark.org/Yahoo2009.pdf

On Sun, Feb 28, 2010 at 1:53 PM,  aa...@buffalo.edu wrote:
 Hello,
      I am trying to write a simple sorting application for hadoop. This is 
 what
 I have thought till now. Suppose I have 100 lines of data and 10 mappers, 
 each of
 the 10 mappers will sort the data given to it. But I am unable to figure out 
 is
 how to join these outputs to one big sorted array. In other words what should 
 be
 the code to be written in the reduce ?


 Best Regards from Buffalo

 Abhishek Agrawal

 SUNY- Buffalo
 (716-435-7122)






Re: How are intermediate key/value pairs materialized between map and reduce?

2010-02-24 Thread Ed Mazur
As you noticed, your map tasks are spilling three times as many
records as they are outputting. In general, if the map output buffer
is large enough to hold all records in memory, these values will be
equal. If there isn't enough room, as was the case with your job, the
buffer makes additional intermediate spills.

To fix this, you can try tuning the per-job configurables io.sort.mb
and io.sort.record.percent. Look at the counters of a few map tasks to
get an idea of how much data (io.sort.mb) and how many records
(io.sort.record.percent) they produce.
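
As a rough sanity check with the numbers below: ~300GB of map output across
40 map tasks is on the order of 7-8GB per task, against an io.sort.mb buffer
of 100MB, so each task spills dozens of times and those spills then need
multi-pass merging, which is where the extra FILE_BYTES_WRITTEN comes from.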

Ed

On Wed, Feb 24, 2010 at 2:45 AM, Tim Kiefer tim-kie...@gmx.de wrote:
 Sure,
 I see:
 Map input eecords: 10,000
 Map output records: 600,000
 Map output bytes: 307,216,800,000  (each record is about 500kb - that fits
 the application and is to be expected)

 Map spilled records: 1,802,965 (ahhh... now that you ask for it - here there
 also is a factor of 3 between output and spilled).

 So - question now is: why are three times as many records spilled than
 actually produced by the mappers?

 In my map function, I do not perform any additional file writing besides the
 context.write() for the intermediate records.

 Thanks, Tim

 On 24.02.2010 05:28, Amogh Vasekar wrote:

 Hi,
 Can you let us know what is the value for :
 Map input records
 Map spilled records
 Map output bytes
 Is there any side effect file written?

 Thanks,
 Amogh


 On 2/23/10 8:57 PM, Tim Kiefer tim-kie...@gmx.de wrote:

 No... 900GB is in the map column. Reduce adds another ~70GB of
 FILE_BYTES_WRITTEN and the total column consequently shows ~970GB.

 On 23.02.2010 16:11, Ed Mazur wrote:


 Hi Tim,

 I'm guessing a lot of these writes are happening on the reduce side.
 On the JT web interface, there are three columns: map, reduce,
 overall. Is the 900GB figure from the overall column? The value in the
 map column will probably be closer to what you were expecting. There
 are writes on the reduce side too during the shuffle and multi-pass
 merge.

 Ed

 2010/2/23 Tim Kiefer tim-kie...@gmx.de:



 Hi Gang,

 thanks for your reply.

 To clarify: I look at the statistics through the job tracker. In the
 webinterface for my job I have columns for map, reduce and total. What I
 was refering to is map - i.e. I see FILE_BYTES_WRITTEN = 3 * Map
 Output Bytes in the map column.

 About the replication factor: I would expect the exact same thing -
 changing to 6 has no influence on FILE_BYTES_WRITTEN.

 About the sorting: I have io.sort.mb = 100 and io.sort.factor = 10.
 Furthermore, I have 40 mappers and map output data is ~300GB. I can't
 see how that ends up in a factor 3?

 - tim

 On 23.02.2010 14:39, Gang Luo wrote:



 Hi Tim,
 the intermediate data is materialized to local file system. Before it
 is available for reducers, mappers will sort them. If the buffer
 (io.sort.mb) is too small for the intermediate data, multi-phase sorting
 happen, which means you read and write the same bit more than one time.

 Besides, are you looking at the statistics per mapper through the job
 tracker, or just the information output when a job finish? If you look at
 the information given out at the end of the job, note that this is an
 overall statistics which include sorting at reduce side. It also include 
 the
 amount of data written to HDFS (I am not 100% sure).

 And, the FILE-BYTES_WRITTEN has nothing to do with the replication
 factor. I think if you change the factor to 6, FILE_BYTES_WRITTEN is still
 the same.

  -Gang


 Hi there,

 can anybody help me out on a (most likely) simple unclarity.

 I am wondering how intermediate key/value pairs are materialized. I
 have a job where the map phase produces 600,000 records and map output 
 bytes
 is ~300GB. What I thought (up to now) is that these 600,000 records, i.e.,
 300GB, are materialized locally by the mappers and that later on reducers
 pull these records (based on the key).
 What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter
 is as high as ~900GB.

 So - where does the factor 3 come from between Map output bytes and
 FILE_BYTES_WRITTEN??? I thought about the replication factor of 3 in the
 file system - but that should be HDFS only?!

 Thanks
 - tim










Re: How are intermediate key/value pairs materialized between map and reduce?

2010-02-23 Thread Ed Mazur
Hi Tim,

I'm guessing a lot of these writes are happening on the reduce side.
On the JT web interface, there are three columns: map, reduce,
overall. Is the 900GB figure from the overall column? The value in the
map column will probably be closer to what you were expecting. There
are writes on the reduce side too during the shuffle and multi-pass
merge.

Ed

2010/2/23 Tim Kiefer tim-kie...@gmx.de:
 Hi Gang,

 thanks for your reply.

 To clarify: I look at the statistics through the job tracker. In the
 webinterface for my job I have columns for map, reduce and total. What I
 was refering to is map - i.e. I see FILE_BYTES_WRITTEN = 3 * Map
 Output Bytes in the map column.

 About the replication factor: I would expect the exact same thing -
 changing to 6 has no influence on FILE_BYTES_WRITTEN.

 About the sorting: I have io.sort.mb = 100 and io.sort.factor = 10.
 Furthermore, I have 40 mappers and map output data is ~300GB. I can't
 see how that ends up in a factor 3?

 - tim

 On 23.02.2010 14:39, Gang Luo wrote:
 Hi Tim,
 the intermediate data is materialized to local file system. Before it is 
 available for reducers, mappers will sort them. If the buffer (io.sort.mb) 
 is too small for the intermediate data, multi-phase sorting happen, which 
 means you read and write the same bit more than one time.

 Besides, are you looking at the statistics per mapper through the job 
 tracker, or just the information output when a job finish? If you look at 
 the information given out at the end of the job, note that this is an 
 overall statistics which include sorting at reduce side. It also include the 
 amount of data written to HDFS (I am not 100% sure).

 And, the FILE-BYTES_WRITTEN has nothing to do with the replication factor. I 
 think if you change the factor to 6, FILE_BYTES_WRITTEN is still the same.

  -Gang


 - Original Message -
 From: Tim Kiefer tim-kie...@gmx.de
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: 2010/2/23 (Tue) 6:44:28 AM
 Subject: How are intermediate key/value pairs materialized between map and 
 reduce?

 Hi there,

 can anybody help me out on a (most likely) simple unclarity.

 I am wondering how intermediate key/value pairs are materialized. I have a 
 job where the map phase produces 600,000 records and map output bytes is 
 ~300GB. What I thought (up to now) is that these 600,000 records, i.e., 
 300GB, are materialized locally by the mappers and that later on reducers 
 pull these records (based on the key).
 What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter is as 
 high as ~900GB.

 So - where does the factor 3 come from between Map output bytes and 
 FILE_BYTES_WRITTEN??? I thought about the replication factor of 3 in the 
 file system - but that should be HDFS only?!

 Thanks
 - tim







Re: hadoop under cygwin issue

2010-02-03 Thread Ed Mazur
Brian,

It looks like you're confusing your local file system with HDFS. HDFS
sits on top of your file system and is where data for (non-standalone)
Hadoop jobs comes from. You can poll it with "fs -ls ...", so do
something like "hadoop fs -lsr /" to see everything in HDFS. This will
probably shed some light on why your first attempt failed.
/user/brian/input should be a directory with several xml files.

Ed

On Wed, Feb 3, 2010 at 5:17 PM, Brian Wolf brw...@gmail.com wrote:
 Alex Kozlov wrote:

 Live Nodes http://localhost:50070/dfshealth.jsp#LiveNodes     :       0

 You datanode is dead.  Look at the logs in the $HADOOP_HOME/logs directory
 (or where your logs are) and check the errors.

 Alex K

 On Mon, Feb 1, 2010 at 1:59 PM, Brian Wolf brw...@gmail.com wrote:





 Thanks for your help, Alex,

 I managed to get past that problem, now I have this problem:

 However, when I try to run this example as stated on the quickstart webpage:

 bin/hadoop jar hadoop-*-examples.jar grep input  output 'dfs[a-z.]+'

 I get this error;
 =
 java.io.IOException:       Not a file:
 hdfs://localhost:9000/user/brian/input/conf
 =
 so it seems to default to my home directory looking for input it
 apparently  needs an absolute filepath, however, when I  run that way:

 $ bin/hadoop jar hadoop-*-examples.jar grep /usr/local/hadoop-0.19.2/input
  output 'dfs[a-z.]+'

 ==
 org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
 hdfs://localhost:9000/usr/local/hadoop-0.19.2/input
 ==
 It still isn't happy although this part - /usr/local/hadoop-0.19.2/input
  -  does exist

 Aaron,

 Thanks or your help. I  carefully went through the steps again a couple
 times , and ran

 after this
 bin/hadoop namenode -format

 (by the way, it asks if I want to reformat, I've tried it both ways)


 then


 bin/start-dfs.sh

 and

 bin/start-all.sh


 and then
 bin/hadoop fs -put conf input

 now the return for this seemed cryptic:


 put: Target input/conf is a directory

 (??)

  and when I tried

 bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

 It says something about 0 nodes

 (from log file)

 2010-02-01 13:26:29,874 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
 ugi=brian,None,Administrators,Users    ip=/127.0.0.1    cmd=create

  src=/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar
  dst=null    perm=brian:supergroup:rw-r--r--
 2010-02-01 13:26:30,045 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 3 on 9000, call

 addBlock(/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar,
 DFSClient_725490811) from 127.0.0.1:3003: error: java.io.IOException:
 File
 /cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar
 could
 only be replicated to 0 nodes, instead of 1
 java.io.IOException: File
 /cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar
 could
 only be replicated to 0 nodes, instead of 1
  at

 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1287)
  at

 org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)




 To maybe rule out something regarding ports or ssh , when I run netstat:

  TCP    127.0.0.1:9000         0.0.0.0:0              LISTENING
  TCP    127.0.0.1:9001         0.0.0.0:0              LISTENING


 and when I browse to http://localhost:50070/


    Cluster Summary

 * * * 21 files and directories, 0 blocks = 21 total. Heap Size is 8.01 MB
 /
 992.31 MB (0%)
 *
 Configured Capacity     :       0 KB
 DFS Used        :       0 KB
 Non DFS Used    :       0 KB
 DFS Remaining   :       0 KB
 DFS Used%       :       100 %
 DFS Remaining%  :       0 %
 Live Nodes http://localhost:50070/dfshealth.jsp#LiveNodes     :       0
 Dead Nodes http://localhost:50070/dfshealth.jsp#DeadNodes     :       0


 so I'm a bit still in the dark, I guess.

 Thanks
 Brian




 Aaron Kimball wrote:



 Brian, it looks like you missed a step in the instructions. You'll need
 to
 format the hdfs filesystem instance before starting the NameNode server:

 You need to run:

 $ bin/hadoop namenode -format

 .. then you can do bin/start-dfs.sh
 Hope this helps,
 - Aaron


 On Sat, Jan 30, 2010 at 12:27 AM, Brian Wolf brw...@gmail.com wrote:





 Hi,

 I am trying to run Hadoop 0.19.2 under cygwin as per directions on the
 hadoop quickstart web page.

 I know sshd is running and I can ssh localhost without a password.

 This is from my hadoop-site.xml

 <configuration>
 <property>

Re: Failed to install Hadoop on WinXP

2010-01-27 Thread Ed Mazur
I tried running 0.20.0 on XP too a few weeks ago and got stuck at the same
spot. No problems with standalone mode. Any insight would be
appreciated, thanks.

Ed

On Wed, Jan 27, 2010 at 11:41 AM, Yura Taras yura.ta...@gmail.com wrote:
 Hi all
 I'm trying to deploy pseudo-distributed cluster on my devbox which
 runs under WinXP. I did following steps:
 1. Installed cygwin with ssh, configured ssh
 2. Downloaded hadoop and extracted it, set JAVA_HOME and HADOOP_HOME
 env vars (I made a symlink to java home, so it doesn't contain spaces)
 3. Adjusted conf/hadoop-env.sh to point to correct JAVA_HOME
 4. Adjusted conf files to following values:
   * core-site.xml:
 <configuration>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/hdfs/hadoop</value>
      <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:</value>
    </property>
 </configuration>

  * hdfs-site.xml:
 <configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
 </configuration>

  * mapred-site.xml:
 <configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:</value>
  </property>
 </configuration>

 5. Next I execute the following line: $ bin/hadoop namenode -format &&
 bin/start-all.sh && bin/hadoop fs -put conf input && bin/hadoop jar
 hadoop-0.20.1-examples.jar grep input output 'dfs[a-z.]'

 I receive the following exception:
 localhost: starting tasktracker, logging to
 /home/ytaras/hadoop/bin/../logs/hadoop-ytaras-tasktracker-bueno.out
 10/01/27 18:23:55 INFO mapred.FileInputFormat: Total input paths to process : 
 13
 10/01/27 18:23:56 INFO mapred.JobClient: Running job: job_201001271823_0001
 10/01/27 18:23:57 INFO mapred.JobClient:  map 0% reduce 0%
 10/01/27 18:24:09 INFO mapred.JobClient: Task Id :
 attempt_201001271823_0001_m_14_0, Status : FAILED
 java.io.FileNotFoundException: File
 D:/hdfs/hadoop/mapred/local/taskTracker/jobcache/job_201001271823_0001/attempt_201001271823_0001_m_14_0/work/tmp
 does not exist.
        at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
        at 
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at 
 org.apache.hadoop.mapred.TaskRunner.setupWorkDir(TaskRunner.java:519)
        at org.apache.hadoop.mapred.Child.main(Child.java:155)

 !! SKIP - the above exception repeats a few times !!

 10/01/27 18:24:51 INFO mapred.JobClient: Job complete: job_201001271823_0001
 10/01/27 18:24:51 INFO mapred.JobClient: Counters: 0
 java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.hadoop.examples.Grep.run(Grep.java:69)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.examples.Grep.main(Grep.java:93)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at 
 org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)



 Am I doing something wrong (don't say just 'use Linux' :-) )?
 Thanks



Re: do all mappers finish before reducer starts

2010-01-26 Thread Ed Mazur
You're right that the user reduce function cannot be applied until all
maps have completed. The values being reported about job completion
are a bit misleading in this sense. The reduce percentage you're
seeing actually encompasses three parts:

1. Fetching map output data
2. Merging map output data
3. Applying the user reduce function

Only the third part has the constraint of waiting for all maps; the
other two can be done in parallel, hence the reduce percentage
increasing before map completes. 0-33% reduce corresponds to step 1,
33-67% to step 2, and 67-100% to step 3. There is overlap between
parts 1 and 2 as the reduce memory buffer fills up, merges, and spills
to disk. There is also overlap between parts 2 and 3 because the final
merge is fed directly into the user reduce function to minimize the
amount of data written to disk.
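
If you actually want the reducers to hold off until (nearly) all maps have
finished, the point at which the shuffle starts is tunable. A minimal sketch,
assuming the old JobConf API and the mapred.reduce.slowstart.completed.maps
property (check your version's defaults; 0.05 is typical):

  import org.apache.hadoop.mapred.JobConf;

  public class SlowstartExample {
    public static void main(String[] args) {
      JobConf conf = new JobConf();
      // Fraction of maps that must complete before reducers begin fetching
      // map output. Raising it toward 1.0 delays part 1 above until almost
      // all maps are done.
      conf.setFloat("mapred.reduce.slowstart.completed.maps", 1.0f);
      // ... set mapper/reducer classes, input/output paths, then submit as usual.
    }
  }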

Ed

On Tue, Jan 26, 2010 at 5:27 PM, adeelmahmood adeelmahm...@gmail.com wrote:

 I just have a conceptual question. My understanding is that all the mappers
 have to complete their job before the reducers can start working, because
 mappers don't know about each other; we need the values for a given key from
 all the different mappers, so we have to wait until all mappers have
 collectively given the system every possible value for a key, which can then
 be passed on to the reducer.
 But when I ran these jobs, almost every time the reducers started working
 before the mappers were all done, so it would say map 60% reduce 30%. How
 does this work? Does it find all possible values for a single key from all
 mappers, pass that on to the reducer, and then work on the other keys?
 Any help is appreciated.
 --
 View this message in context: 
 http://old.nabble.com/do-all-mappers-finish-before-reducer-starts-tp27330927p27330927.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Help on processing large amount of videos on hadoop

2009-12-22 Thread Ed Kohlwey
Hi Huazhong,
Sounds like an interesting application. Here are a few tips.

1. If the frames are not independent, you should find a way to key them
according to their order before dumping them into Hadoop so that they can be
sorted as part of your MapReduce task. BTW, the video won't appear split
while it's in HDFS; HDFS does use a block-splitting scheme for replication
and (sort of) job distribution, but this isn't mandatory, and there are lots
of facilities to customize this behavior.
2. Is the audio needed? If not, it may make sense to preprocess the data into
<key, image> pairs, where the key is something like what I mentioned above and
the image is a custom Writable that handles the data in a common format like
GIF, JPG, PNG, whatever (see the sketch after this list).
3. You'll need some in-depth knowledge of your video codec for this. While I'm
not an expert on video codecs, I think many of them compress by specifying key
frames that carry complete data and then representing subsequent frames as
differences from a key frame. You can use a custom InputFormat and InputSplit
to split on frame boundaries, but you will need to be an expert on your codec
to do so. It might be easier to use a framework to pre-transcode the data into
whatever format you will use for your MapReduce jobs.
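
For point 2, a minimal sketch of such a Writable, assuming the frame has
already been decoded or transcoded to an encoded image byte array (the class
and field names here are made up for illustration):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Writable;

  // Value type for a single frame; the ordering key (frame number) would
  // typically travel separately, e.g. as a LongWritable.
  public class ImageWritable implements Writable {
    private byte[] imageBytes = new byte[0];  // encoded image data (e.g. PNG or JPEG)

    public void set(byte[] bytes) { imageBytes = bytes; }
    public byte[] get() { return imageBytes; }

    public void write(DataOutput out) throws IOException {
      out.writeInt(imageBytes.length);
      out.write(imageBytes);
    }

    public void readFields(DataInput in) throws IOException {
      int length = in.readInt();
      imageBytes = new byte[length];
      in.readFully(imageBytes);
    }
  }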

On Thu, Dec 17, 2009 at 2:25 PM, Huazhong Ning n...@akiira.com wrote:

 Hi,

 I set up a Hadoop platform and I am going to use it to process a large
 number of videos (each about 500M-1G in size). But I have run into some hard issues:
 1. The frames in each video are not independent, so we may have problems if
 we split the video into blocks and distribute them in HDFS.
 2. The video is compressed, but we want the input to the map class to be video
 frames. In other words, we need to put the codec somewhere.
 3. Our codec (third-party source code) takes a video file name as input. Can
 we get the file name?

 Any suggestions and comments are welcome. Thanks a lot.

 Ning



Re: Can hadoop 0.20.1 programs runs on Amazon Elastic Mapreduce?

2009-12-16 Thread Ed Kohlwey
Last time I checked, EMR only runs 0.18.3. You can use EC2 though, which
winds up being cheaper anyway.

On Wed, Dec 16, 2009 at 8:51 PM, 松柳 lamfeeli...@gmail.com wrote:

 Hi all, I'm wondering whether Amazon has started to support the newest stable
 version of Hadoop, or whether we can still only use 0.18.3?

 Song Liu



Re: multiple file input

2009-12-08 Thread Ed Kohlwey
One important thing to note is that, with cross products, you'll almost
always get better performance if you can fit both files on a single node's
disk rather than distributing the files.
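
For example, when the smaller side (FileA in the question below) fits
comfortably on each node, one common pattern is to ship it via the
DistributedCache and do the cross product inside map(). A rough sketch against
the old mapred API (class and file handling here are assumptions, and this is
not the attachment mentioned below):

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class CrossProductMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private final List<String> fileALines = new ArrayList<String>();

    public void configure(JobConf job) {
      try {
        // FileA was added in the driver with DistributedCache.addCacheFile(...)
        Path[] cached = DistributedCache.getLocalCacheFiles(job);
        BufferedReader reader =
            new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
          fileALines.add(line);
        }
        reader.close();
      } catch (IOException e) {
        throw new RuntimeException("Could not read cached FileA", e);
      }
    }

    public void map(LongWritable offset, Text fileBLine,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Emit every (FileA line, FileB line) pair; replace the collect() with
      // whatever comparison you actually need.
      for (String aLine : fileALines) {
        output.collect(new Text(aLine), fileBLine);
      }
    }
  }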

On Tue, Dec 8, 2009 at 9:18 AM, laser08150815 la...@laserxyz.de wrote:



 pmg wrote:
 
   I am evaluating Hadoop for a problem that does a Cartesian product of input
   from one file with 600K records (FileA) with another set of files (FileB1,
   FileB2, FileB3) with 2 million lines in total.
 
   Each line from FileA gets compared with every line from FileB1, FileB2,
   etc. FileB1, FileB2, etc. are in a different input directory.
 
  So
 
  Two input directories
 
   1. input1 directory with a single file of 600K records - FileA
   2. input2 directory segmented into different files with 2 million records -
   FileB1, FileB2 etc.
 
  How can I have a map that reads a line from a FileA in directory input1
  and compares the line with each line from input2?
 
   What is the best way forward? I have seen plenty of examples that map
   each record from a single input file and reduce it into an output.
 
  thanks
 


 I had a similar problem and solved it by writing a custom InputFormat (see
  attachment). You should improve the methods ACrossBInputSplit.getLength,
  ACrossBRecordReader.getPos, and ACrossBRecordReader.getProgress.
 --
 View this message in context:
 http://old.nabble.com/multiple-file-input-tp24095358p26694569.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: RE: Using Hadoop in non-typical large scale user-driven environment

2009-12-02 Thread Ed Kohlwey
As far as replication goes, you should look at a project called Pastry.
Apparently some people have used Hadoop MapReduce on top of it. You will
need to be clever, however, about how you do your MapReduce, because you
probably won't want the job to eat all of the users' CPU time.

On Dec 2, 2009 5:11 PM, Habermaas, William william.haberm...@fatwire.com
wrote:

Hadoop isn't going to like losing its datanodes when people shut down their
computers.
More importantly, when the datanodes are running, your users will be
impacted by data replication. Unlike SETI, Hadoop doesn't know when the
user's screensaver is running, so it will start doing things whenever it
feels like it.

Can someone else comment on whether HOD (hadoop-on-demand) would fit this
scenario?
Bill

-Original Message- From: Maciej Trebacz [mailto:
maciej.treb...@gmail.com] Sent: Wednesday,...


Re: New graphic interface for Hadoop - Contains: FileManager, Daemon Admin, Quick Stream Job Setup, etc

2009-11-18 Thread Ed Kohlwey
The tool looks interesting. You should consider providing the source for it.
Is it written in a language that can run on platforms besides Windows?

On Nov 17, 2009 10:40 AM, Cubic cubicdes...@gmail.com wrote:

Hi list.
This tool is a graphic interface for Hadoop.
It may improve your productivity quite a bit, especially if you
work intensively with files inside HDFS.

Note:
On my computer it is functional, but it hasn't yet been tested on
other computers.

Download:
Download link: http://www.dnabaser.com/hadoop-gui-shell/index.html

Please feel free to send feedback.
:)


Re: About Distribute Cache

2009-11-15 Thread Ed Kohlwey
Hi,
What you can fit in the distributed cache generally depends on the available
disk space on your nodes. With most clusters, 300 MB will not be a problem,
but it depends on the cluster and the workload you're processing.
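
For reference, a minimal driver-side sketch of putting a file into the cache
with the old mapred API (the HDFS path is just a placeholder):

  import java.net.URI;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.mapred.JobConf;

  public class CacheSetupExample {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf();
      // The file must already be in HDFS; it is copied to each node's local
      // disk once per job, so a ~300 MB file costs that much local disk per node.
      DistributedCache.addCacheFile(new URI("/user/you/lookup-data.dat"), conf);
      // In a mapper's configure(), retrieve the local copies with:
      //   Path[] local = DistributedCache.getLocalCacheFiles(conf);
    }
  }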

On Sat, Nov 14, 2009 at 10:34 PM, 于凤东 fengdon...@gmail.com wrote:

 I have a 300MB file that I want to put in the distributed cache, but I want to
 know: is that a large file for the distributed cache? And normally, what size
 of files do we put into the DC?