[Cassandra Wiki] Update of "HadoopSupport" by GabrieleRenzi

Apache Wiki Tue, 25 May 2010 10:24:11 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The "HadoopSupport" page has been changed by GabrieleRenzi.
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=7&rev2=8

--------------------------------------------------

  
  Cassandra's splits are location-aware (this is the nature of the Hadoop 
InputSplit design). Cassandra gives Hadoop a list of locations with each split 
of data, and Hadoop tries to schedule tasks on instances near that data, which 
in practice means you should run Hadoop instances on each of your Cassandra 
machines.
  
+ Releases before 0.6.2/0.7 are affected by a small resource leak that may 
cause jobs to fail: connections are not released properly. Depending on your 
local setup you may hit this issue; you can work around it by raising the limit 
of open file descriptors for the process (e.g. on Linux/bash using 
`ulimit -n 32000`). 
+ The error is reported on the Hadoop job side as a Thrift 
TimedOutException.
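
To judge whether the file descriptor limit is the likely culprit, it can help to check from inside the JVM how many descriptors are open and what the limit is. The following is a minimal sketch, not part of Cassandra or Hadoop: the class name `FdLimitCheck` and the helper `usage` are made up for illustration, and it assumes a Sun/Oracle/OpenJDK-style JVM on a Unix-like system, where `com.sun.management.UnixOperatingSystemMXBean` is available.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdLimitCheck {

    /** Hypothetical helper: fraction of the descriptor limit currently in use. */
    public static double usage(long open, long max) {
        return max <= 0 ? 0.0 : (double) open / max;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // The descriptor counters are only exposed by the Unix-specific bean.
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            long open = unix.getOpenFileDescriptorCount();
            long max = unix.getMaxFileDescriptorCount();
            System.out.printf("open=%d max=%d usage=%.1f%%%n",
                    open, max, 100 * usage(open, max));
        } else {
            System.out.println("File descriptor counts are not available on this JVM/platform.");
        }
    }
}
```

If the reported usage climbs steadily towards 100% while a job runs, the leak described above is a plausible explanation, and raising the limit with `ulimit -n` before starting the process is the workaround.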
+ 
+ If you are testing the integration against a single node and you see some 
failures, this may be normal: you are probably overloading the single machine, 
which may again result in timeout errors. You can work around it by reducing 
the number of concurrent tasks:
+ {{{
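+              // Allow at most one map task to run at a time per TaskTracker.
+              // Note: on a real cluster this property is normally read from the
+              // TaskTracker's own configuration, not from the job.
+              Configuration conf = job.getConfiguration(); 
+              conf.setInt("mapred.tasktracker.map.tasks.maximum",1); 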
+ }}}
+ 
+ Also, you may reduce the size, in rows, of the batch you are reading from 
Cassandra:
+ {{{
+              ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
+ }}}
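
The trade-off behind the batch size can be shown with some rough arithmetic: a smaller batch means each request to Cassandra returns fewer rows (and is less likely to time out), at the cost of more requests per split. The sketch below is illustrative only. The class `BatchMath`, the method `requestsPerSplit`, and the split size of 65536 rows are assumptions made for the example, and it ignores the details of how the record reader actually pages through a range.

```java
public class BatchMath {

    /**
     * Hypothetical helper: number of requests needed to read rowsInSplit rows
     * when each request fetches at most batchSize rows.
     */
    public static long requestsPerSplit(long rowsInSplit, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        return (rowsInSplit + batchSize - 1) / batchSize; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 65536; // assumed split size in rows, for illustration only
        for (int batch : new int[] {4096, 1000, 100}) {
            System.out.println("batch=" + batch
                    + " -> requests=" + requestsPerSplit(rows, batch));
        }
    }
}
```

With these assumed numbers, a batch of 1000 rows needs 66 requests per split, against 16 for a batch of 4096 rows.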
+
