The problem could be that you are crunching more data than can be
processed within the expiry interval setting.

In Hadoop you need to tell the task tracker that you are still doing
work, which is done by setting the status or incrementing a counter on
the Reporter object.

http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/

"In your Java code there is a little trick to help the job be “aware”
within the cluster of tasks that are not dead but just working hard.
During execution of a task there is no built in reporting that the job
is running as expected if it is not writing out.  So this means that
if your tasks are taking up a lot of time doing work it is possible
the cluster will see that task as failed (based on the
mapred.task.tracker.expiry.interval setting).

Have no fear there is a way to tell cluster that your task is doing
just fine.  You have 2 ways todo this you can either report the status
or increment a counter.  Both of these will cause the task tracker to
properly know the task is ok and this will get seen by the jobtracker
in turn.  Both of these options are explained in the JavaDoc
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html";
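As a rough sketch (my wording, not from the blog post, and the class and
counter names below are made up): the quoted tip points at the old mapred
API's Reporter, but judging from the stack trace below you are on the newer
mapreduce API, where the equivalent calls live on the mapper's Context.
Something along these lines keeps the task tracker happy:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative only; the input/output types would match your actual job.
    public class KeepAliveMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... expensive per-row work here ...

            // Either call tells the framework the task is still making progress:
            context.setStatus("processing " + key);
            context.getCounter("keepalive", "rows-processed").increment(1);
        }
    }

If you are on the old API instead, the equivalent calls are
reporter.setStatus(...) and reporter.incrCounter(...), as described in the
Reporter JavaDoc above.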

Hope this helps

On Fri, May 7, 2010 at 4:47 AM, gabriele renzi <rff....@gmail.com> wrote:
> Hi everyone,
>
> I am trying to develop a mapreduce job that does a simple
> selection+filter on the rows in our store.
> Of course it is mostly based on the WordCount example :)
>
>
> Sadly, while the app seems to run fine on a test keyspace with little
> data, when run against a larger test index (but still on a single node) I
> reliably see this error in the logs:
>
> 10/05/06 16:37:58 WARN mapred.LocalJobRunner: job_local_0001
> java.lang.RuntimeException: TimedOutException()
>        at 
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:165)
>        at 
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:215)
>        at 
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:97)
>        at 
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>        at 
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>        at 
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:91)
>        at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
>        at 
> org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>        at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
> Caused by: TimedOutException()
>        at 
> org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11015)
>        at 
> org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:623)
>        at 
> org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:597)
>        at 
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:142)
>        ... 11 more
>
> and after that the job seems to finish "normally" but no results are produced.
>
> FWIW this is on 0.6.0 (we didn't move to 0.6.1 yet because, well, if
> it ain't broke don't fix it).
>
> The single node has a data directory of about 127GB in two column
> families, of which the one used in the mapred job is about 100GB.
> The Cassandra server is run with 6GB of heap on a box with 8GB
> available and no swap enabled. Read/write latencies from cfstats are:
>
>        Read Latency: 0.8535837762577986 ms.
>        Write Latency: 0.028849603764075547 ms.
>
> Row cache is not enabled, and the key cache percentage is the default. Load on
> the machine is basically zero when the job is not running.
>
> As my code is 99% that of the wordcount contrib, I should note that
> in 0.6.1's contrib (and trunk) there is a RING_DELAY constant that we
> can supposedly change, but it's apparently not used anywhere; and as I
> said, running on a single node this should not be an issue anyway.
>
> Does anyone have suggestions, or has anyone seen this error before? On the
> other hand, have people run this kind of job in similar conditions
> flawlessly, so I can consider it just my problem?
>
>
> Thanks in advance for any help.
>



-- 
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/
