Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "HadoopSupport" page has been changed by EricGilmore.
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=24&rev2=25

--------------------------------------------------

  <<Anchor(MapReduce)>>
  
  == MapReduce ==
- 
  ==== Input from Cassandra ====
  Cassandra 0.6+ adds support for retrieving data from Cassandra in Hadoop !MapReduce jobs.  This is based on implementations of [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]], [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat]], and [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader]].  For an example of how this works, see the contrib/word_count example in 0.6 or later.  Cassandra rows or row fragments (that is, pairs of a key and a `SortedMap` of columns) are the input to your Map tasks, as specified by a `SlicePredicate` that describes which columns to fetch from each row.
  
@@ -31, +30 @@

             SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(columnName.getBytes()));
             ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
  }}}
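  For context, the input side of the `word_count` job boils down to a mapper that receives each row's key plus its selected columns. The sketch below is a hedged approximation in the style of the 0.6-era example (0.7 changes `byte[]` keys and column names to `ByteBuffer`); the column name `"text"` is a placeholder, and the driver is assumed to have called `job.setInputFormatClass(ColumnFamilyInputFormat.class)` along with the `ConfigHelper` calls above. Treat the example shipped with your release as the reference.
  {{{
  import java.io.IOException;
  import java.util.SortedMap;
  import java.util.StringTokenizer;

  import org.apache.cassandra.db.IColumn;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Sketch of a word_count-style mapper: the framework hands us one row
  // (or row fragment) at a time, as the row key plus the SortedMap of
  // columns selected by the job's SlicePredicate.
  public class TokenizerMapper extends Mapper<byte[], SortedMap<byte[], IColumn>, Text, IntWritable>
  {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(byte[] key, SortedMap<byte[], IColumn> columns, Context context)
              throws IOException, InterruptedException
      {
          // "text" is a hypothetical column name; it must be one of the
          // columns named in the job's SlicePredicate.
          IColumn column = columns.get("text".getBytes());
          if (column == null)
              return;
          StringTokenizer itr = new StringTokenizer(new String(column.value()));
          while (itr.hasMoreTokens())
          {
              word.set(itr.nextToken());
              context.write(word, ONE);
          }
      }
  }
  }}}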
- 
  As of 0.7, Hadoop configuration no longer resides in your job's storage-conf.xml. See the `README` in the `word_count` and `pig` contrib modules for more details.
  
  ==== Output to Cassandra ====
- 
- As of 0.7, there is be a basic mechanism included in Cassandra for outputting 
data to Cassandra.  The `contrib/word_count` example in 0.7 contains two 
reducers - one for outputting data to the filesystem and one to output data to 
Cassandra (default) using this new mechanism.  See that example in the latest 
release for details.
+ As of 0.7, Cassandra includes a basic mechanism for outputting data to Cassandra.  The `contrib/word_count` example in 0.7 contains two reducers: one that outputs data to the filesystem and one (the default) that outputs data to Cassandra using this new mechanism.  See that example in the latest release for details.
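  For a sense of the shape of the Cassandra-bound reducer, here is a hedged sketch loosely following the 0.7 `word_count` reducer: output keys are row keys, and output values are lists of Thrift `Mutation`s. The column name `"count"` is a placeholder, exact Thrift field details can differ between releases, and the driver is assumed to call `job.setOutputFormatClass(ColumnFamilyOutputFormat.class)` and `ConfigHelper.setOutputColumnFamily(...)`.
  {{{
  import java.io.IOException;
  import java.nio.ByteBuffer;
  import java.util.Collections;
  import java.util.List;

  import org.apache.cassandra.thrift.Column;
  import org.apache.cassandra.thrift.ColumnOrSuperColumn;
  import org.apache.cassandra.thrift.Mutation;
  import org.apache.cassandra.utils.ByteBufferUtil;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // Sketch of a reducer that writes each word's total back to Cassandra:
  // the emitted key is the row key, the value a list of mutations for that row.
  public class ReducerToCassandra extends Reducer<Text, IntWritable, ByteBuffer, List<Mutation>>
  {
      public void reduce(Text word, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException
      {
          int sum = 0;
          for (IntWritable val : values)
              sum += val.get();

          // Wrap the count in a Thrift Column, then in a Mutation.
          Column c = new Column();
          c.setName(ByteBufferUtil.bytes("count"));  // hypothetical column name
          c.setValue(ByteBufferUtil.bytes(String.valueOf(sum)));
          c.setTimestamp(System.currentTimeMillis());
          ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
          cosc.setColumn(c);
          Mutation m = new Mutation();
          m.setColumn_or_supercolumn(cosc);

          context.write(ByteBufferUtil.bytes(word.toString()), Collections.singletonList(m));
      }
  }
  }}}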
  
  ==== Hadoop Streaming ====
- 
  As of 0.7, there is support for 
[[http://hadoop.apache.org/common/docs/r0.20.0/streaming.html|Hadoop 
Streaming]].  For examples of how to use Streaming with Cassandra, see the
contrib section of the Cassandra source.  The relevant tickets are 
[[https://issues.apache.org/jira/browse/CASSANDRA-1368|CASSANDRA-1368]] and 
[[https://issues.apache.org/jira/browse/CASSANDRA-1497|CASSANDRA-1497]].
  
  ==== Troubleshooting ====
- 
  Releases before 0.6.2/0.7 are affected by a small resource leak: connections are not released properly, which may cause jobs to fail. Depending on your local setup, you may hit this issue; you can work around it by raising the limit of open file descriptors for the process (e.g., in linux/bash using `ulimit -n 32000`). The error is reported on the Hadoop job side as a thrift !TimedOutException.
  
  If you are testing the integration against a single node and you see some failures, this may be normal: you are probably overloading the single machine, which can again result in timeout errors. You can work around this by reducing the number of concurrent tasks
@@ -57, +52 @@

  {{{
               ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
  }}}
- 
  [[#Top|Top]]
  
  <<Anchor(Pig)>>
@@ -66, +60 @@

  Cassandra 0.6+ also adds support for [[http://pig.apache.org|Pig]] with its 
own implementation of 
[[http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/LoadFunc.html|LoadFunc]].
  This allows Pig queries to be run against data stored in Cassandra.  For an 
example of this, see the `contrib/pig` example in 0.6 or later.
  
  When running Pig with Cassandra + Hadoop on a cluster, be sure to follow the `README` notes in the `<cassandra_src>/contrib/pig` directory, the [[#ClusterConfig|Cluster Configuration]] section on this page, and these additional notes:
+ 
   * Set the `HADOOP_HOME` environment variable to `<hadoop_dir>`, e.g. 
`/opt/hadoop` or `/etc/hadoop`
   * Set the `PIG_CONF` environment variable to `<hadoop_dir>/conf`
   * Set the `JAVA_HOME` environment variable
@@ -82, +77 @@

  <<Anchor(ClusterConfig)>>
  
  == Cluster Configuration ==
- 
  If you would like to configure a Cassandra cluster so that Hadoop may operate over its data, it's best to overlay a Hadoop cluster on your Cassandra nodes.  You'll want a separate server for your Hadoop namenode/`JobTracker`, and then a Hadoop `TaskTracker` installed on each of your Cassandra nodes.  That allows the `JobTracker` to assign tasks to the Cassandra nodes that contain data for those tasks.  At least one node in your cluster will also need to be a datanode, because Hadoop uses HDFS to store information your jobs need, like jar dependencies and static data (for example, stop words for a word count) - this is the distributed cache.  It's a very small amount of data, but the Hadoop cluster needs it to run properly.
  
  The nice thing about having a `TaskTracker` on every node is that you get data locality and your analytics engine scales with your data. You also never need to shuttle your data around once you've performed analytics on it - you simply output to Cassandra, and you can access that data with high random-read performance.
  
  One configuration note on getting the task trackers to be able to perform queries over Cassandra: you'll want to update the `HADOOP_CLASSPATH` in your `<hadoop>/conf/hadoop-env.sh` to include the Cassandra libraries.  For example, add something like this to the `hadoop-env.sh` on each of your task trackers:
+ 
  {{{
  export HADOOP_CLASSPATH=/opt/cassandra/lib/*:$HADOOP_CLASSPATH
  }}}
- 
  ==== Virtual Datacenter ====
  One thing that many have asked about is whether Cassandra with Hadoop will be usable from a random access perspective. For example, you may need to use Cassandra for serving web latency requests, and you may also need to run analytics over your data. In Cassandra 0.7+ there is the !NetworkTopologyStrategy, which allows you to customize your cluster's replication strategy by datacenter. You can use this to create a 'virtual datacenter' that separates the nodes serving data with high random-read performance from the nodes meant for analytics. You need a snitch configured with your topology; then, for the datacenters defined there (either explicitly or implicitly), you indicate how many replicas you would like in each one. You would install task trackers on the nodes in the analytics section and make sure that a replica is written to that 'datacenter' in your !NetworkTopologyStrategy configuration. The practical upshot of this is that your analytics nodes always have current data, and your high random-read performance nodes always serve data with predictable performance.
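  As a concrete illustration, a keyspace with one replica pinned to an analytics 'datacenter' might be defined over the 0.7 Thrift API roughly as below. This is a hedged sketch, not the definitive recipe: the datacenter names `live` and `analytics` are placeholders that must match the datacenters your snitch defines, the keyspace name is made up, and `client` is assumed to be an already-opened `Cassandra.Client` Thrift connection.
  {{{
  import java.util.Collections;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.cassandra.thrift.Cassandra;
  import org.apache.cassandra.thrift.CfDef;
  import org.apache.cassandra.thrift.KsDef;

  public class VirtualDatacenterSetup
  {
      // Two replicas in the latency-serving 'datacenter', one in the
      // 'datacenter' running the task trackers, so analytics always
      // sees current data.
      public static void createKeyspace(Cassandra.Client client) throws Exception
      {
          Map<String, String> strategyOptions = new HashMap<String, String>();
          strategyOptions.put("live", "2");       // placeholder datacenter name
          strategyOptions.put("analytics", "1");  // placeholder datacenter name

          KsDef ksDef = new KsDef("MyKeyspace",
                                  "org.apache.cassandra.locator.NetworkTopologyStrategy",
                                  3, // total replicas (2 + 1); field details vary by release
                                  Collections.<CfDef>emptyList());
          ksDef.setStrategy_options(strategyOptions);
          client.system_add_keyspace(ksDef);
      }
  }
  }}}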
  
@@ -100, +94 @@

  [[#Top|Top]]
  
  <<Anchor(Support)>>
+ 
  == Support ==
  Sometimes configuration and integration can get tricky. To get support for this functionality, start with the `contrib` examples in the source download of Cassandra, and make sure you are following the instructions in the `README` file for that example. You can search the Cassandra user mailing list or post there, as it is very active. You can also ask for help in the #cassandra IRC channel on freenode. Other channels that might be of use are #hadoop, #hadoop-pig, and #hive; those projects' mailing lists are also very active.
  
