Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "HadoopSupport" page has been changed by jeremyhanna. The comment on this change is: Adding some more troubleshooting info in a separate section.. http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=26&rev2=27 -------------------------------------------------- * [[#Pig|Pig]] * [[#Hive|Hive]] * [[#ClusterConfig|Cluster Configuration]] + * [[#Troubleshooting|Troubleshooting]] * [[#Support|Support]] <<Anchor(Overview)>> @@ -37, +38 @@ ==== Hadoop Streaming ==== As of 0.7, there is support for [[http://hadoop.apache.org/common/docs/r0.20.0/streaming.html|Hadoop Streaming]]. For examples on how to use Streaming with Cassandra, see the contrib section of the Cassandra source. The relevant tickets are [[https://issues.apache.org/jira/browse/CASSANDRA-1368|CASSANDRA-1368]] and [[https://issues.apache.org/jira/browse/CASSANDRA-1497|CASSANDRA-1497]]. - - ==== Some troubleshooting ==== - Releases before 0.6.2/0.7 are affected by a small resource leak that may cause jobs to fail (connections are not released properly, causing a resource leak). Depending on your local setup you may hit this issue, and workaround it by raising the limit of open file descriptors for the process (e.g. in linux/bash using `ulimit -n 32000`). The error will be reported on the hadoop job side as a thrift !TimedOutException. - - If you are testing the integration against a single node and you obtain some failures, this may be normal: you are probably overloading the single machine, which may again result in timeout errors. You can workaround it by reducing the number of concurrent tasks - - {{{ - Configuration conf = job.getConfiguration(); - conf.setInt("mapred.tasktracker.map.tasks.maximum",1); - }}} - Also, you may reduce the size in rows of the batch you are reading from cassandra - - {{{ - ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000); - }}} [[#Top|Top]] <<Anchor(Pig)>> @@ -93, +79 @@ [[#Top|Top]] + <<Anchor(Troubleshooting)>> + + == Troubleshooting == + If you are running into timeout exceptions, you might need to tweak one or both of these settings: + * '''cassandra.range.batch.size''' - the default is 4096, but you may need to lower this depending on your data. This is either specified in your hadoop configuration or using `org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`. + * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 it's `RpcTimeoutInMillis` in `storage-conf.xml`). The rpc timeout is not for timing out from the client but between nodes. This can be increased to reduce chances of timing out. + + Releases before 0.6.2/0.7 are affected by a small resource leak that may cause jobs to fail (connections are not released properly, causing a resource leak). Depending on your local setup you may hit this issue, and workaround it by raising the limit of open file descriptors for the process (e.g. in linux/bash using `ulimit -n 32000`). The error will be reported on the hadoop job side as a thrift !TimedOutException. + + If you are testing the integration against a single node and you obtain some failures, this may be normal: you are probably overloading the single machine, which may again result in timeout errors. 
+ Releases before 0.6.2/0.7 are affected by a small resource leak: connections are not released properly, which may cause jobs to fail. Depending on your local setup, you may hit this issue; you can work around it by raising the limit of open file descriptors for the process (e.g. in linux/bash, `ulimit -n 32000`). The error is reported on the Hadoop job side as a thrift !TimedOutException.
+ 
+ If you are testing the integration against a single node and you see failures, this may be normal: you are probably overloading the single machine, which may again result in timeout errors. You can work around it by reducing the number of concurrent tasks:
+ 
+ {{{
+ Configuration conf = job.getConfiguration();
+ conf.setInt("mapred.tasktracker.map.tasks.maximum", 1);
+ }}}
+ You can also reduce the size, in rows, of the batch you read from Cassandra:
+ 
+ {{{
+ ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
+ }}}
+ 
+ [[#Top|Top]]
+ 
  <<Anchor(Support)>>

  == Support ==