Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "HadoopSupport" page has been changed by JonathanEllis: http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=43&rev2=44 == Overview == Cassandra 0.6+ enables certain [[http://hadoop.apache.org/|Hadoop]] functionality against Cassandra's data store. Specifically, support has been added for [[http://hadoop.apache.org/mapreduce/|MapReduce]], [[http://pig.apache.org|Pig]] and [[http://hive.apache.org/|Hive]]. - [[http://datastax.com|DataStax]] has open-sourced a Cassandra based Hadoop distribution called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]]) However this code is no longer going to be maintained by !DataStax. Future !DataStax development of Brisk is now part of a pay-for offering. + [[http://datastax.com|DataStax]] open-sourced a Cassandra based Hadoop distribution called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]]) Brisk is now part of [[http://www.datastax.com/products/enterprise|DataStax Enterprise]] and is no longer maintained as a standalone project. [[#Top|Top]] @@ -35, +35 @@ ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate); }}} As of 0.7, configuration for Hadoop no longer resides in your job's specific storage-conf.xml. See the `README` in the `word_count` and `pig` contrib modules for more details. - ==== Output To Cassandra ==== As of 0.7, there is a basic mechanism included in Cassandra for outputting data to Cassandra. The `contrib/word_count` example in 0.7 contains two reducers - one for outputting data to the filesystem and one to output data to Cassandra (default) using this new mechanism. See that example in the latest release for details. @@ -73, +72 @@ == Oozie == [[http://incubator.apache.org/oozie/|Oozie]], the open-source workflow engine originally from Yahoo!, can be used with Cassandra/Hadoop. Cassandra configuration information needs to go into the oozie action configuration like so: + {{{ <property> <name>cassandra.thrift.address</name> @@ -99, +99 @@ <value>${cassandraRangeBatchSize}</value> </property> }}} - Note that with Oozie you can specify values outright like the partitioner here, or via variable that is typically found in the properties file. - One other item of note is that Oozie assumes that it can detect a filemarker for successful completion of the job. This means that when writing to Cassandra with, for example, Pig, the Pig script will succeed but the Oozie job that called it will fail because filemarkers aren't written to Cassandra. So when you write to Cassandra with Hadoop, specify this property to avoid that check. Oozie will still get completion updates from a callback from the job tracker, but it just won't look for the filemarker. + Note that with Oozie you can specify values outright like the partitioner here, or via variable that is typically found in the properties file. One other item of note is that Oozie assumes that it can detect a filemarker for successful completion of the job. This means that when writing to Cassandra with, for example, Pig, the Pig script will succeed but the Oozie job that called it will fail because filemarkers aren't written to Cassandra. So when you write to Cassandra with Hadoop, specify this property to avoid that check. Oozie will still get completion updates from a callback from the job tracker, but it just won't look for the filemarker. 
[[#Top|Top]]

<<Anchor(ClusterConfig)>>
== Cluster Configuration ==
- 
The simplest way to configure your cluster to run Cassandra with Hadoop is to use Brisk, the open-source packaging of Cassandra with Hadoop. That will start the `JobTracker` and `TaskTracker` processes for you. It also uses CFS, an HDFS-compatible distributed filesystem built on Cassandra that removes the need for Hadoop `NameNode` and `DataNode` processes. For details, see the Brisk [[http://www.datastax.com/docs/0.8/brisk/index|documentation]] and [[http://github.com/riptano/brisk|code]].

Otherwise, if you would like to configure a Cassandra cluster yourself so that Hadoop may operate over its data, it's best to overlay a Hadoop cluster on top of your Cassandra nodes. You'll want to have a separate server for your Hadoop `NameNode`/`JobTracker`. Then install a Hadoop `TaskTracker` on each of your Cassandra nodes. That will allow the `JobTracker` to assign tasks to the Cassandra nodes that contain data for those tasks. Also install a Hadoop `DataNode` on each Cassandra node. Hadoop requires a distributed filesystem for storing dependency jars, static data, and intermediate results.

@@ -136, +134 @@
== Troubleshooting ==
If you are running into timeout exceptions, you might need to tweak one or both of these settings:
+ 
 * '''cassandra.range.batch.size''' - the default is 4096, but you may need to lower this depending on your data. This is either specified in your Hadoop configuration or set using `org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`.
 * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 it's `RpcTimeoutInMillis` in `storage-conf.xml`). The RPC timeout is not for timing out from the client but between nodes. It can be increased to reduce the chances of timing out.

If you still see timeout exceptions with resultant failed jobs and/or blacklisted tasktrackers, there are settings that can give Cassandra more latitude before failing the jobs. An example of usage (in either the job configuration or the tasktracker mapred-site.xml):
+ 
{{{
<property>
  <name>mapred.max.tracker.failures</name>
@@ -161, +161 @@
The settings normally default to 4 each, but some find that too conservative. If you set them too low, you might have blacklisted tasktrackers and failed jobs because of occasional timeout exceptions. If you set them too high, jobs that would otherwise fail quickly take a long time to fail, sacrificing efficiency. Keep in mind that this can just cover up a problem. It may be that you always want these settings to be higher when operating against Cassandra. However, if you run into these exceptions too frequently, there may be a problem with your Cassandra or Hadoop configuration.

If you are seeing inconsistent data coming back, consider the consistency level that you are reading and writing at. The two relevant properties are:
+ 
 * '''cassandra.consistencylevel.read''' - defaults to !ConsistencyLevel.ONE.
 * '''cassandra.consistencylevel.write''' - defaults to !ConsistencyLevel.ONE.
+ 
Also, the Hadoop integration uses range scans underneath, which do not do read repair. However, reading at !ConsistencyLevel.QUORUM will reconcile differences among the nodes read. See the ReadRepair section as well as the !ConsistencyLevel section of the [[http://wiki.apache.org/cassandra/API|API]] page for more details.

[[#Top|Top]]
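
For completeness, here is a hedged, driver-side sketch of the troubleshooting knobs discussed above; it is not from the wiki page. The property names are the ones documented in this section plus Hadoop's standard per-task retry settings, and the class name and values are arbitrary placeholders rather than recommendations.
{{{
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper that applies the troubleshooting settings from this section.
public class CassandraTroubleshootingConfig {
    public static void apply(Configuration conf) {
        // Lower the range batch size if range scans are hitting timeouts (default is 4096).
        conf.set("cassandra.range.batch.size", "1024");

        // Give tasks more latitude before tasktrackers are blacklisted or the job fails.
        conf.setInt("mapred.max.tracker.failures", 20);
        conf.setInt("mapred.map.max.attempts", 8);     // standard Hadoop retry settings,
        conf.setInt("mapred.reduce.max.attempts", 8);  // which also default to 4

        // Read and write at QUORUM if ONE returns inconsistent data
        // (the Hadoop integration's range scans do not trigger read repair).
        conf.set("cassandra.consistencylevel.read", "QUORUM");
        conf.set("cassandra.consistencylevel.write", "QUORUM");
    }
}
}}}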