I was able to work around this problem by modifying the ColumnFamilyRecordReader class from the org.apache.cassandra.hadoop package. Since the errors were TimedOutExceptions, I added sleep-and-retry logic around this call in the RowIterator.maybeInit() method:

    rows = client.get_range_slices(keyspace, new ColumnParent(cfName),
                                   predicate, keyRange, ConsistencyLevel.ONE);

And it works. :) Check out http://pastebin.com/FxV4Gw5U for the modified maybeInit() function. Please note that I also made a slight modification to ConfigHelper to pass in the sleep time and the maximum retry count; ConfigHelper.getRetryCount(conf) and ConfigHelper.getSleepTime(conf) are not part of the original ConfigHelper.
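For anyone who doesn't want to dig through the pastebin, the heart of the change is just a bounded retry loop around that call. Roughly like this - a sketch rather than the exact patch, with the variable names taken from the surrounding 0.6.6 maybeInit() code:

    int attempt = 0;
    while (true)
    {
        try
        {
            rows = client.get_range_slices(keyspace, new ColumnParent(cfName),
                                           predicate, keyRange, ConsistencyLevel.ONE);
            break; // success, stop retrying
        }
        catch (TimedOutException e)
        {
            // Give the cluster a breather, then retry up to the configured limit.
            if (++attempt >= ConfigHelper.getRetryCount(conf))
                throw new RuntimeException(e);
            try
            {
                Thread.sleep(ConfigHelper.getSleepTime(conf));
            }
            catch (InterruptedException ie)
            {
                throw new RuntimeException(ie);
            }
        }
    }

The ConfigHelper additions themselves are just thin wrappers over the Hadoop Configuration, along these lines (the property names and default values shown here are illustrative, not from the stock class):

    public static int getRetryCount(Configuration conf)
    {
        return conf.getInt("cassandra.input.retry.count", 5);
    }

    public static long getSleepTime(Configuration conf)
    {
        return conf.getLong("cassandra.input.sleep.ms", 30000L);
    }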
Hope this helps anyone facing similar problems.

Regards,
Jairam

On 14 January 2011 11:59, Jairam Chandar <jairam.chan...@imagini.net> wrote:

> The Cassandra logs strangely show no errors at the time of failure.
> Changing the RpcTimeoutInMillis seemed to help. Though it slowed the job
> down considerably, it seems to be finishing with the timeout value raised
> to 1 min. Unfortunately, I cannot be sure it will continue to work if the
> data grows further. Hopefully we will be upgrading to the recently
> released final version of 0.7.0.
>
> Thanks for all the help and suggestions.
>
> Warm regards,
> Jairam Chandar
>
> On 13/01/2011 14:47, "Jeremy Hanna" <jeremy.hanna1...@gmail.com> wrote:
>
> >On Jan 12, 2011, at 12:40 PM, Jairam Chandar wrote:
> >
> >> Hi folks,
> >>
> >> We have a Cassandra 0.6.6 cluster running in production. We want to
> >> run Hadoop (version 0.20.2) jobs over this cluster in order to
> >> generate reports. I modified the word_count example in the contrib
> >> folder of the Cassandra distribution. While the program runs fine for
> >> small datasets (on the order of 100-200 MB) on a small cluster
> >> (2 machines), it starts to give errors when run on a bigger cluster
> >> (5 machines) with a much larger dataset (400 GB). Here is the error
> >> that we get:
> >>
> >> java.lang.RuntimeException: TimedOutException()
> >>   at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:186)
> >>   at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:236)
> >>   at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:104)
> >>   at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
> >>   at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
> >>   at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:98)
> >>   at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
> >>   at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
> >>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> >>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> >>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> >>   at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >> Caused by: TimedOutException()
> >>   at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11094)
> >>   at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:628)
> >>   at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:602)
> >>   at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:164)
> >>   ... 11 more
> >
> >I wonder if messing with RpcTimeoutInMillis in storage-conf.xml would
> >help.
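For reference, that is the setting we ended up raising. In storage-conf.xml it looks something like this; the value shown is the one minute we used, not a recommendation:

    <RpcTimeoutInMillis>60000</RpcTimeoutInMillis>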
> >
> >> I came across this page on the Cassandra wiki -
> >> http://wiki.apache.org/cassandra/HadoopSupport - and tried modifying
> >> the ulimit and changing batch sizes. These did not help: though the
> >> number of successful map tasks increased, the job eventually fails
> >> since the total number of map tasks is huge.
> >>
> >> Any idea what could be causing this? The program we are running is a
> >> very slight modification of the word_count example with respect to
> >> reading from Cassandra, the only changes being the specific keyspace,
> >> column family, and columns. The rest of the reading code is the same
> >> as in the word_count example in the Cassandra 0.6.6 source.
> >>
> >> Thanks and regards,
> >> Jairam Chandar
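P.S. For anyone wanting to try the batch-size tuning mentioned in the quoted thread (it was not enough on its own for us, but it is cheap to test), the batch size is set from the job configuration, along these lines - treat the setter shown here as illustrative and check the ConfigHelper your version ships with:

    // Illustrative only: request fewer rows per get_range_slices call so
    // each Thrift request has a better chance of finishing within the
    // RPC timeout.
    ConfigHelper.setRangeBatchSize(job.getConfiguration(), 256);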