Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "HadoopSupport" page has been changed by jeremyhanna. The comment on this change is: Adding some more troubleshooting info in a separate section.. http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=26&rev2=27 -------------------------------------------------- * [[#Pig|Pig]] * [[#Hive|Hive]] * [[#ClusterConfig|Cluster Configuration]] + * [[#Troubleshooting|Troubleshooting]] * [[#Support|Support]] <<Anchor(Overview)>> @@ -37, +38 @@ ==== Hadoop Streaming ==== As of 0.7, there is support for [[http://hadoop.apache.org/common/docs/r0.20.0/streaming.html|Hadoop Streaming]]. For examples on how to use Streaming with Cassandra, see the contrib section of the Cassandra source. The relevant tickets are [[https://issues.apache.org/jira/browse/CASSANDRA-1368|CASSANDRA-1368]] and [[https://issues.apache.org/jira/browse/CASSANDRA-1497|CASSANDRA-1497]]. - - ==== Some troubleshooting ==== - Releases before 0.6.2/0.7 are affected by a small resource leak that may cause jobs to fail (connections are not released properly, causing a resource leak). Depending on your local setup you may hit this issue, and workaround it by raising the limit of open file descriptors for the process (e.g. in linux/bash using `ulimit -n 32000`). The error will be reported on the hadoop job side as a thrift !TimedOutException. - - If you are testing the integration against a single node and you obtain some failures, this may be normal: you are probably overloading the single machine, which may again result in timeout errors. You can workaround it by reducing the number of concurrent tasks - - {{{ - Configuration conf = job.getConfiguration(); - conf.setInt("mapred.tasktracker.map.tasks.maximum",1); - }}} - Also, you may reduce the size in rows of the batch you are reading from cassandra - - {{{ - ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000); - }}} [[#Top|Top]] <<Anchor(Pig)>> @@ -93, +79 @@ [[#Top|Top]] + <<Anchor(Troubleshooting)>> + + == Troubleshooting == + If you are running into timeout exceptions, you might need to tweak one or both of these settings: + * '''cassandra.range.batch.size''' - the default is 4096, but you may need to lower this depending on your data. This is either specified in your hadoop configuration or using `org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`. + * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 it's `RpcTimeoutInMillis` in `storage-conf.xml`). The rpc timeout is not for timing out from the client but between nodes. This can be increased to reduce chances of timing out. + + Releases before 0.6.2/0.7 are affected by a small resource leak that may cause jobs to fail (connections are not released properly, causing a resource leak). Depending on your local setup you may hit this issue, and workaround it by raising the limit of open file descriptors for the process (e.g. in linux/bash using `ulimit -n 32000`). The error will be reported on the hadoop job side as a thrift !TimedOutException. + + If you are testing the integration against a single node and you obtain some failures, this may be normal: you are probably overloading the single machine, which may again result in timeout errors. 
+ Releases before 0.6.2/0.7 are affected by a small resource leak: connections are not released properly, which may cause jobs to fail. Depending on your local setup, you may hit this issue; you can work around it by raising the limit of open file descriptors for the process (e.g. in linux/bash, `ulimit -n 32000`). The error is reported on the Hadoop job side as a thrift !TimedOutException.
+ 
+ If you are testing the integration against a single node and you see failures, this may be normal: you are probably overloading the single machine, which may again result in timeout errors. You can work around it by reducing the number of concurrent tasks:
+ 
+ {{{
+ Configuration conf = job.getConfiguration();
+ conf.setInt("mapred.tasktracker.map.tasks.maximum", 1);
+ }}}
+ You can also reduce the size, in rows, of the batch you read from Cassandra:
+ 
+ {{{
+ ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
+ }}}
+ 
+ [[#Top|Top]]
+ 
  <<Anchor(Support)>>

  == Support ==