Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "HadoopSupport" page has been changed by JonathanEllis: http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=43&rev2=44 == Overview == Cassandra 0.6+ enables certain [[http://hadoop.apache.org/|Hadoop]] functionality against Cassandra's data store. Specifically, support has been added for [[http://hadoop.apache.org/mapreduce/|MapReduce]], [[http://pig.apache.org|Pig]] and [[http://hive.apache.org/|Hive]]. - [[http://datastax.com|DataStax]] has open-sourced a Cassandra based Hadoop distribution called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]]) However this code is no longer going to be maintained by !DataStax. Future !DataStax development of Brisk is now part of a pay-for offering. + [[http://datastax.com|DataStax]] open-sourced a Cassandra based Hadoop distribution called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]]) Brisk is now part of [[http://www.datastax.com/products/enterprise|DataStax Enterprise]] and is no longer maintained as a standalone project. [[#Top|Top]] @@ -35, +35 @@ ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate); }}} As of 0.7, configuration for Hadoop no longer resides in your job's specific storage-conf.xml. See the `README` in the `word_count` and `pig` contrib modules for more details. - ==== Output To Cassandra ==== As of 0.7, there is a basic mechanism included in Cassandra for outputting data to Cassandra. The `contrib/word_count` example in 0.7 contains two reducers - one for outputting data to the filesystem and one to output data to Cassandra (default) using this new mechanism. See that example in the latest release for details. @@ -73, +72 @@ == Oozie == [[http://incubator.apache.org/oozie/|Oozie]], the open-source workflow engine originally from Yahoo!, can be used with Cassandra/Hadoop. Cassandra configuration information needs to go into the oozie action configuration like so: + {{{ <property> <name>cassandra.thrift.address</name> @@ -99, +99 @@ <value>${cassandraRangeBatchSize}</value> </property> }}} - Note that with Oozie you can specify values outright like the partitioner here, or via variable that is typically found in the properties file. - One other item of note is that Oozie assumes that it can detect a filemarker for successful completion of the job. This means that when writing to Cassandra with, for example, Pig, the Pig script will succeed but the Oozie job that called it will fail because filemarkers aren't written to Cassandra. So when you write to Cassandra with Hadoop, specify this property to avoid that check. Oozie will still get completion updates from a callback from the job tracker, but it just won't look for the filemarker. + Note that with Oozie you can specify values outright like the partitioner here, or via variable that is typically found in the properties file. One other item of note is that Oozie assumes that it can detect a filemarker for successful completion of the job. This means that when writing to Cassandra with, for example, Pig, the Pig script will succeed but the Oozie job that called it will fail because filemarkers aren't written to Cassandra. So when you write to Cassandra with Hadoop, specify this property to avoid that check. Oozie will still get completion updates from a callback from the job tracker, but it just won't look for the filemarker. 
[[#Top|Top]]

<<Anchor(ClusterConfig)>>
== Cluster Configuration ==
- 
The simplest way to configure your cluster to run Cassandra with Hadoop is to use Brisk, the open-source packaging of Cassandra with Hadoop. That will start the `JobTracker` and `TaskTracker` processes for you. It also uses CFS, an HDFS-compatible distributed filesystem built on Cassandra that removes the need for Hadoop `NameNode` and `DataNode` processes. For details, see the Brisk [[http://www.datastax.com/docs/0.8/brisk/index|documentation]] and [[http://github.com/riptano/brisk|code]].

Otherwise, if you would like to configure a Cassandra cluster yourself so that Hadoop may operate over its data, it's best to overlay a Hadoop cluster on top of your Cassandra nodes. You'll want to have a separate server for your Hadoop `NameNode`/`JobTracker`. Then install a Hadoop `TaskTracker` on each of your Cassandra nodes. That will allow the `JobTracker` to assign tasks to the Cassandra nodes that contain data for those tasks. Also install a Hadoop `DataNode` on each Cassandra node. Hadoop requires a distributed filesystem for storing dependency jars, static data, and intermediate results.

@@ -136, +134 @@
== Troubleshooting ==
If you are running into timeout exceptions, you might need to tweak one or both of these settings:
+ 
 * '''cassandra.range.batch.size''' - the default is 4096, but you may need to lower this depending on your data. This is either specified in your Hadoop configuration or set using `org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`.
 * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 it's `RpcTimeoutInMillis` in `storage-conf.xml`). The RPC timeout is not for timing out from the client but between nodes. It can be increased to reduce the chances of timing out.

If you still see timeout exceptions with resultant failed jobs and/or blacklisted tasktrackers, there are settings that can give Cassandra more latitude before failing the jobs. An example of usage (in either the job configuration or the tasktracker mapred-site.xml):
+ 
{{{
<property>
  <name>mapred.max.tracker.failures</name>
@@ -161, +161 @@
The settings normally default to 4 each, but some find that too conservative. If you set them too low, you might have blacklisted tasktrackers and failed jobs because of occasional timeout exceptions. If you set them too high, jobs that would otherwise fail quickly take a long time to fail, sacrificing efficiency. Keep in mind that this can just cover up a problem. It may be that you always want these settings to be higher when operating against Cassandra. However, if you run into these exceptions too frequently, there may be a problem with your Cassandra or Hadoop configuration.

If you are seeing inconsistent data coming back, consider the consistency level that you are reading and writing at. The two relevant properties are:
+ 
 * '''cassandra.consistencylevel.read''' - defaults to !ConsistencyLevel.ONE.
 * '''cassandra.consistencylevel.write''' - defaults to !ConsistencyLevel.ONE.
+ 
Also, the Hadoop integration uses range scans underneath, which do not do read repair. However, reading at !ConsistencyLevel.QUORUM will reconcile differences among the nodes read. See the ReadRepair section as well as the !ConsistencyLevel section of the [[http://wiki.apache.org/cassandra/API|API]] page for more details.

[[#Top|Top]]
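
For completeness, here is a hedged, driver-side sketch of the troubleshooting knobs discussed above; it is not from the wiki page. The property names are the ones documented in this section plus Hadoop's standard per-task retry settings, and the class name and values are arbitrary placeholders rather than recommendations.
{{{
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper that applies the troubleshooting settings from this section.
public class CassandraTroubleshootingConfig {
    public static void apply(Configuration conf) {
        // Lower the range batch size if range scans are hitting timeouts (default is 4096).
        conf.set("cassandra.range.batch.size", "1024");

        // Give tasks more latitude before tasktrackers are blacklisted or the job fails.
        conf.setInt("mapred.max.tracker.failures", 20);
        conf.setInt("mapred.map.max.attempts", 8);     // standard Hadoop retry settings,
        conf.setInt("mapred.reduce.max.attempts", 8);  // which also default to 4

        // Read and write at QUORUM if ONE returns inconsistent data
        // (the Hadoop integration's range scans do not trigger read repair).
        conf.set("cassandra.consistencylevel.read", "QUORUM");
        conf.set("cassandra.consistencylevel.write", "QUORUM");
    }
}
}}}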