Re: Task fails: starts over with first input key?

2010-12-14 Thread Keith Wiley
Hmmm, I'll take that under advisement.  So, even if I manually avoided redoing
earlier work (by keeping a log of which input key/values have been processed
and short-circuiting map() when a key/value has already been processed),
you're saying those previously completed key/values would not be passed on to
the reducer if I skipped them the second time the task was attempted?  Is that
correct?
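(To be concrete, by "keeping a log" I mean something like the sketch below: a
per-key marker file in HDFS, written after each successful map() call and
checked at the top of the next attempt.  The paths and helper names are my own
invention, not Hadoop API.  And if I follow your answer, the flaw is that
output collected by the failed attempt is discarded, so the skipped keys would
simply never reach the reducer.)

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of the manual short-circuit: one marker file in HDFS per
// completed input key.  All paths and names here are hypothetical.
public class ResumableMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private FileSystem fs;
  private Path doneDir;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
      doneDir = new Path(job.get("myjob.done.dir", "/myjob/done-keys"));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(Text key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    Path marker = new Path(doneDir, key.toString());
    if (fs.exists(marker)) {
      return;  // already processed by an earlier attempt: skip the hour of work
    }
    output.collect(key, process(value));
    fs.createNewFile(marker);  // record success for any later attempt
  }

  private Text process(Text value) {
    return value;  // stand-in for the real hour-long computation
  }
}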

Man, I'm trying to figure out the best design here.

My mapper can take up to an hour to process a single input key/value.  If a
mapper fails on the second input, I really can't afford to recalculate the
first input all over again even though it succeeded the first time.  The job
basically never finishes at that rate of inefficiency.  Reprocessing any data
even twice is basically unacceptable, much less four times, which is the
number of times a task is attempted before Hadoop gives up and lets the
reducer work with what it's got.  (I've tried setMaxMapAttempts(), but it has
no effect; tasks are always attempted four times regardless.)
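(For reference, here is what I tried, in case I'm simply holding it wrong --
this is the old mapred API, and as I understand it the property form should be
equivalent to the setter:)

import org.apache.hadoop.mapred.JobConf;

public class AttemptConfig {
  public static void capAttempts(JobConf job) {
    job.setMaxMapAttempts(1);                 // the setter that had no effect for me
    job.setInt("mapred.map.max.attempts", 1); // the property that setter writes
  }
}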

I wish there were a less burdensome version of SkipBadRecords.  I don't want
it to perform a binary search for the bad record, reprocessing data over and
over again.  I want it to simply skip failed calls to map() and move on to the
next input key/value.  I want the mapper to iterate through its list of
inputs, skipping any that fail, and send all the successfully processed data
to the reducer, all in a single nonredundant pass.  Is there any way to make
Hadoop do that?
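(For what it's worth, the SkipBadRecords javadoc suggests that setting the
acceptable skip window to Long.MAX_VALUE tells the framework not to bother
narrowing the range down to the single bad record -- something like the sketch
below, though I haven't verified that it avoids the reprocessing:)

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipConfig {
  public static void enableSkipping(JobConf job) {
    // Begin skipping on the first failed attempt rather than the default.
    SkipBadRecords.setAttemptsToStartSkipping(job, 1);
    // Per the javadoc, Long.MAX_VALUE tells the framework it need not
    // narrow the skipped range down to the single bad record.
    SkipBadRecords.setMapperMaxSkipRecords(job, Long.MAX_VALUE);
  }
}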

Thanks.

Cheers!

On Dec 13, 2010, at 21:46 , Eric Sammer wrote:

 What you are seeing is correct and the intended behavior. The unit of work
 in an MR job is the task. If something causes the task to fail, it starts
 again. Any output from the failed task attempt is thrown away. The reducers
 will not see the output of the failed map tasks at all. There is no way
 (within Hadoop proper) to teach a task to be stateful, nor should you, as
 you lose a lot of flexibility with respect to features like speculative
 execution and the ability to deal with machine failures (unless you
 maintained task state in HDFS or another external system). It's just not
 worth it.
 
 On Mon, Dec 13, 2010 at 7:51 PM, Keith Wiley kwi...@keithwiley.com wrote:
 
 I think I am seeing a behavior in which if a mapper task fails (crashes) on
 one input key/value, the entire task is rescheduled and rerun, starting over
 again from the first input key/value even if all of the inputs preceding the
 troublesome input were processed successfully.
 
 Am I correct about this or am I seeing something that isn't there?
 
 If I am correct, what happens to the outputs of the successful duplicate
 map() calls?  Which output key/value is the one that is sent to shuffle (and
 a reducer): Is it the result of the first attempt on the input in question
 or the result of the last attempt?
 
 Is there any way to prevent it from recalculating those duplicate inputs
 other than something manual on the side like keeping a job-log of the map
 attempts and scanning the log at the beginning of each map() call?
 
 Thanks.



Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

Luminous beings are we, not this crude matter.
  -- Yoda






Re: Task fails: starts over with first input key?

2010-12-14 Thread Keith Wiley

On Dec 13, 2010, at 17:58 , li ping wrote:

 I think the *org.apache.hadoop.mapred.SkipBadRecords* class is what you are
 looking for.


Yes, I considered that at one point.  I don't like how it insists on
iteratively retrying the records.  I wish it would simply skip the failed
records and move on: run through the list of input records in sequence,
skipping the bad ones, sending the good ones to the reducer, and making no
further attempts at processing.

I'll read up on it again.  Perhaps I missed something.

Thanks.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered.
  -- Keith Wiley






Re: Task fails: starts over with first input key?

2010-12-14 Thread Keith Wiley
On Dec 14, 2010, at 09:30 , Harsh J wrote:

 Hi,
 
 On Tue, Dec 14, 2010 at 10:43 PM, Keith Wiley kwi...@keithwiley.com wrote:
 I wish there were a less burdensome version of SkipBadRecords.  I don't want
 it to perform a binary search for the bad record, reprocessing data over and
 over again.  I want it to simply skip failed calls to map() and move on to
 the next input key/value.  I want the mapper to iterate through its list of
 inputs, skipping any that fail, and send all the successfully processed data
 to the reducer, all in a single nonredundant pass.  Is there any way to make
 Hadoop do that?
 
 You could do this in your application Mapper code: catch bad
 records [a try-fail-continue kind of thing] and push them to a
 different output file rather than to the default collector that goes to
 the Reducer [MultipleOutputs, etc. help here], for reprocessing or
 inspection later. Is it not that simple?
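(In sketch form, I take the suggestion to be roughly the following, using the
old-API MultipleOutputs.  The named output "bad" would need a matching
MultipleOutputs.addNamedOutput() call at job-setup time, and the blanket catch
is illustrative:)

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class TolerantMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private MultipleOutputs mos;

  public void configure(JobConf job) {
    mos = new MultipleOutputs(job);
  }

  public void map(Text key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    try {
      output.collect(key, process(value));  // good records go to the reducer
    } catch (Exception e) {
      // Bad records go to a side file for later inspection instead of
      // failing the task attempt.
      mos.getCollector("bad", reporter).collect(key, value);
    }
  }

  public void close() throws IOException {
    mos.close();
  }

  private Text process(Text value) {
    return value;  // stand-in for the real per-record work
  }
}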


I'm not sure I understand, but if you are suggesting that I detect the
troublesome records through simple try/catch exception handlers (as sketched
above), then I'm afraid that won't work.  My code is already as resilient as I
can possibly make it from that point of view.  The task failures are occurring
in C++ code which is run via JNI from the mappers.  Despite copious use of
exception handlers both in Java and in C++, it is inevitable -- as per the
nature of C++ or any other natively compiled code -- that some kinds of errors
will simply be untrappable.  I have been unsuccessful in trapping some of the
errors I am facing.  The job tracker reports task failures with standard
failure status codes (134 and 139 in my case, i.e. SIGABRT and SIGSEGV).  It's
obvious that the native code is simply crashing in some fashion, but I can't
figure out how to get Hadoop to handle the situation gracefully.
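(The only workaround I can think of is to fence the native code off in a child
process, so that a segfault kills the child rather than the task's JVM and
becomes an ordinary exit code I can test -- a rough sketch, with the worker
binary purely hypothetical:)

import java.io.IOException;

// Sketch: run the crash-prone native code as a child process so that a
// SIGABRT (exit 134) or SIGSEGV (exit 139) becomes a checkable exit code
// instead of killing the task JVM.  "native_worker" is hypothetical.
public class NativeIsolation {
  public static boolean processRecord(String inputPath, String outputPath)
      throws IOException, InterruptedException {
    Process p = new ProcessBuilder("./native_worker", inputPath, outputPath)
        .start();
    int exit = p.waitFor();
    // Nonzero exit (128 + signal number for crashes): skip this record
    // rather than letting the whole task attempt die.
    return exit == 0;
  }
}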


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also.
  -- Mark Twain






Task fails: starts over with first input key?

2010-12-13 Thread Keith Wiley
I think I am seeing a behavior in which if a mapper task fails (crashes) on one 
input key/value, the entire task is rescheduled and rerun, starting over again 
from the first input key/value even if all of the inputs preceding the 
troublesome input were processed successfully.

Am I correct about this or am I seeing something that isn't there?

If I am correct, what happens to the outputs of the successful duplicate map() 
calls?  Which output key/value is the one that is sent to shuffle (and a 
reducer): Is it the result of the first attempt on the input in question or the 
result of the last attempt?

Is there any way to prevent it from recalculating those duplicate inputs other 
than something manual on the side like keeping a job-log of the map attempts 
and scanning the log at the beginning of each map() call?

Thanks.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me.
  -- Abe (Grandpa) Simpson






Re: Task fails: starts over with first input key?

2010-12-13 Thread li ping
I think the *org.apache.hadoop.mapred.SkipBadRecords* class is what you are
looking for.



On Tue, Dec 14, 2010 at 8:51 AM, Keith Wiley kwi...@keithwiley.com wrote:

 I think I am seeing a behavior in which if a mapper task fails (crashes) on
 one input key/value, the entire task is rescheduled and rerun, starting over
 again from the first input key/value even if all of the inputs preceding the
 troublesome input were processed successfully.

 Am I correct about this or am I seeing something that isn't there?

 If I am correct, what happens to the outputs of the successful duplicate
 map() calls?  Which output key/value is the one that is sent to shuffle (and
 a reducer): Is it the result of the first attempt on the input in question
 or the result of the last attempt?

 Is there any way to prevent it from recalculating those duplicate inputs
 other than something manual on the side like keeping a job-log of the map
 attempts and scanning the log at the beginning of each map() call?

 Thanks.


 
 Keith Wiley   kwi...@keithwiley.com
 www.keithwiley.com

 I used to be with it, but then they changed what it was.  Now, what I'm
 with
 isn't it, and what's it seems weird and scary to me.
  -- Abe (Grandpa) Simpson

 






-- 
-李平


Re: Task fails: starts over with first input key?

2010-12-13 Thread Eric Sammer
What you are seeing is correct and the intended behavior. The unit of work
in an MR job is the task. If something causes the task to fail, it starts
again. Any output from the failed task attempt is thrown away. The reducers
will not see the output of the failed map tasks at all. There is no way
(within Hadoop proper) to teach a task to be stateful, nor should you, as
you lose a lot of flexibility with respect to features like speculative
execution and the ability to deal with machine failures (unless you
maintained task state in HDFS or another external system). It's just not
worth it.

On Mon, Dec 13, 2010 at 7:51 PM, Keith Wiley kwi...@keithwiley.com wrote:

 I think I am seeing a behavior in which if a mapper task fails (crashes) on
 one input key/value, the entire task is rescheduled and rerun, starting over
 again from the first input key/value even if all of the inputs preceding the
 troublesome input were processed successfully.

 Am I correct about this or am I seeing something that isn't there?

 If I am correct, what happens to the outputs of the successful duplicate
 map() calls?  Which output key/value is the one that is sent to shuffle (and
 a reducer): Is it the result of the first attempt on the input in question
 or the result of the last attempt?

 Is there any way to prevent it from recalculating those duplicate inputs
 other than something manual on the side like keeping a job-log of the map
 attempts and scanning the log at the beginning of each map() call?

 Thanks.


 
 Keith Wiley   kwi...@keithwiley.com
 www.keithwiley.com

 I used to be with it, but then they changed what it was.  Now, what I'm
 with
 isn't it, and what's it seems weird and scary to me.
  -- Abe (Grandpa) Simpson

 






-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com