Author: jbellis
Date: Sat Mar 27 20:33:03 2010
New Revision: 928264

URL: http://svn.apache.org/viewvc?rev=928264&view=rev
Log:
add pig loadfunc to contrib.  patch by Stu Hood; reviewed by jbellis for CASSANDRA-910
Added:
    cassandra/branches/cassandra-0.6/contrib/pig/
    cassandra/branches/cassandra-0.6/contrib/pig/README.txt   (with props)
    cassandra/branches/cassandra-0.6/contrib/pig/bin/
    cassandra/branches/cassandra-0.6/contrib/pig/bin/pig_cassandra
    cassandra/branches/cassandra-0.6/contrib/pig/build.xml   (with props)
    cassandra/branches/cassandra-0.6/contrib/pig/src/
    cassandra/branches/cassandra-0.6/contrib/pig/src/java/
    cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/
    cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/
    cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/
    cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/
    cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/
    cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java   (with props)
    cassandra/branches/cassandra-0.6/contrib/pig/storage-conf.xml   (with props)
Modified:
    cassandra/branches/cassandra-0.6/CHANGES.txt

Modified: cassandra/branches/cassandra-0.6/CHANGES.txt
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.6/CHANGES.txt?rev=928264&r1=928263&r2=928264&view=diff
==============================================================================
--- cassandra/branches/cassandra-0.6/CHANGES.txt (original)
+++ cassandra/branches/cassandra-0.6/CHANGES.txt Sat Mar 27 20:33:03 2010
@@ -13,6 +13,7 @@
    to top level supercolumns" (CASSANDRA-834)
  * Streaming destination nodes do not update their JMX status (CASSANDRA-916)
  * Fix internal RPC timeout calculation (CASSANDRA-911)
+ * Added Pig loadfunc to contrib/pig (CASSANDRA-910)


 0.6.0-beta3

Added: cassandra/branches/cassandra-0.6/contrib/pig/README.txt
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.6/contrib/pig/README.txt?rev=928264&view=auto
==============================================================================
--- cassandra/branches/cassandra-0.6/contrib/pig/README.txt (added)
+++ cassandra/branches/cassandra-0.6/contrib/pig/README.txt Sat Mar 27 20:33:03 2010
@@ -0,0 +1,29 @@
+A Pig LoadFunc that reads all columns from a given ColumnFamily.
+
+Setup:
+
+First build and start a Cassandra server with the default
+configuration* and set the PIG_HOME and JAVA_HOME environment
+variables to the location of a Pig >= 0.7.0-dev install and your Java
+install. If you would like to run using the Hadoop backend, you should
+also set PIG_CONF_DIR to the location of your Hadoop config.
+
+Run:
+
+contrib/pig$ ant
+contrib/pig$ bin/pig_cassandra
+
+Once the 'grunt>' shell has loaded, try a simple program like the
+following, which will determine the top 50 column names:
+
+grunt> rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage();
+grunt> cols = FOREACH rows GENERATE flatten($1);
+grunt> colnames = FOREACH cols GENERATE $0;
+grunt> namegroups = GROUP colnames BY $0;
+grunt> namecounts = FOREACH namegroups GENERATE COUNT($1), group;
+grunt> orderednames = ORDER namecounts BY $0;
+grunt> topnames = LIMIT orderednames 50;
+grunt> dump topnames;
+
+*If you want to point Pig at a real cluster, modify the seed
+address in storage-conf.xml and re-run the build step.

Propchange: cassandra/branches/cassandra-0.6/contrib/pig/README.txt
------------------------------------------------------------------------------
    svn:eol-style = native

Added: cassandra/branches/cassandra-0.6/contrib/pig/bin/pig_cassandra
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.6/contrib/pig/bin/pig_cassandra?rev=928264&view=auto
==============================================================================
--- cassandra/branches/cassandra-0.6/contrib/pig/bin/pig_cassandra (added)
+++ cassandra/branches/cassandra-0.6/contrib/pig/bin/pig_cassandra Sat Mar 27 20:33:03 2010
@@ -0,0 +1,50 @@
+#!/bin/sh
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+cwd=`dirname $0`
+cassandra_home="$cwd/../../../"
+
+# general jars.
+for jar in $cassandra_home/lib/*.jar $cassandra_home/build/lib/jars/*.jar; do
+    CLASSPATH=$CLASSPATH:$jar
+done
+
+# cassandra_loadfunc jar.
+LOADFUNC_JAR=`ls -1 $cwd/../build/*.jar`
+if [ ! -e $LOADFUNC_JAR ]; then
+    echo "Unable to locate cassandra_loadfunc jar: please run ant." >&2
+    exit 1
+fi
+CLASSPATH=$CLASSPATH:$LOADFUNC_JAR
+
+if [ "x$PIG_HOME" = "x" ]; then
+    echo "PIG_HOME not set: requires Pig >= 0.7.0-dev" >&2
+    exit 1
+fi
+
+# pig jar.
+PIG_JAR=$PIG_HOME/pig.jar
+if [ ! -e $PIG_JAR ]; then
+    echo "Unable to locate Pig jar" >&2
+    exit 1
+fi
+CLASSPATH=$CLASSPATH:$PIG_JAR
+
+export PIG_CLASSPATH=$PIG_CLASSPATH:$CLASSPATH
+export PIG_OPTS=$PIG_OPTS" -Dudf.import.list=org.apache.cassandra.hadoop.pig"
+cat "$cwd/../build/bootstrap.pig" - | $PIG_HOME/bin/pig $*

Added: cassandra/branches/cassandra-0.6/contrib/pig/build.xml
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.6/contrib/pig/build.xml?rev=928264&view=auto
==============================================================================
--- cassandra/branches/cassandra-0.6/contrib/pig/build.xml (added)
+++ cassandra/branches/cassandra-0.6/contrib/pig/build.xml Sat Mar 27 20:33:03 2010
@@ -0,0 +1,74 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements.  See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership.  The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License.  You may obtain a copy of the License at
+ ~
+ ~    http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied.  See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+<project basedir="." default="jar" name="cassandra_loadfunc">
+    <!-- stores the environment for locating PIG_HOME -->
+    <property environment="env" />
+    <property name="cassandra.dir" value="../.." />
+    <property name="cassandra.lib" value="" />
+    <property name="cassandra.classes" value="${cassandra.dir}/build/classes" />
+    <property name="build.src" value="${basedir}/src" />
+    <property name="build.lib" value="${basedir}/lib" />
+    <property name="build.out" value="${basedir}/build" />
+    <property name="build.classes" value="${build.out}/classes" />
+    <property name="final.name" value="cassandra_loadfunc" />
+
+    <path id="pig.classpath">
+        <pathelement location="${env.PIG_HOME}/pig.jar" />
+        <fileset dir="${cassandra.dir}/lib">
+            <include name="libthrift*.jar" />
+        </fileset>
+        <fileset dir="${cassandra.dir}/build/lib/jars">
+            <include name="google-collections*.jar" />
+        </fileset>
+    </path>
+
+    <path id="classpath">
+        <path refid="pig.classpath" />
+        <pathelement location="${cassandra.classes}" />
+    </path>
+
+    <target name="init">
+        <mkdir dir="${build.classes}" />
+    </target>
+
+    <target depends="init" name="build">
+        <fail unless="env.PIG_HOME" message="Please set PIG_HOME to the location of a Pig >= 0.7.0-dev install." />
+        <javac destdir="${build.classes}">
+            <src path="${build.src}" />
+            <classpath refid="classpath" />
+        </javac>
+        <!-- Build a line of jar registrations for use in the pig startup script -->
+        <pathconvert pathsep="; register " property="register.line" refid="pig.classpath" />
+        <echo message="register ${register.line};${line.separator}" file="${build.out}/bootstrap.pig" />
+    </target>
+
+    <target name="jar" depends="build">
+        <mkdir dir="${build.classes}/META-INF" />
+        <jar jarfile="${build.out}/${final.name}.jar">
+            <fileset dir="${build.classes}" />
+            <fileset dir="${cassandra.classes}" />
+            <fileset file="${basedir}/storage-conf.xml" />
+        </jar>
+    </target>
+
+    <target name="clean">
+        <delete dir="${build.out}" />
+    </target>
+</project>

Propchange: cassandra/branches/cassandra-0.6/contrib/pig/build.xml
------------------------------------------------------------------------------
    svn:eol-style = native

Added: cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java?rev=928264&view=auto
==============================================================================
--- cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java (added)
+++ cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java Sat Mar 27 20:33:03 2010
@@ -0,0 +1,145 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership. The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+package org.apache.cassandra.hadoop.pig;
+
+import java.io.IOException;
+import java.util.*;
+
+import org.apache.cassandra.db.Column;
+import org.apache.cassandra.db.IColumn;
+import org.apache.cassandra.db.SuperColumn;
+import org.apache.cassandra.hadoop.*;
+import org.apache.cassandra.thrift.SlicePredicate;
+import org.apache.cassandra.thrift.SliceRange;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.mapreduce.InputFormat;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.RecordReader;
+
+import org.apache.pig.LoadFunc;
+import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
+import org.apache.pig.data.DefaultDataBag;
+import org.apache.pig.data.DataByteArray;
+import org.apache.pig.data.Tuple;
+import org.apache.pig.data.TupleFactory;
+
+/**
+ * A LoadFunc wrapping ColumnFamilyInputFormat.
+ *
+ * A row from a standard CF will be returned as nested tuples: (key, ((name1, val1), (name2, val2))).
+ */
+public class CassandraStorage extends LoadFunc
+{
+    private final static byte[] BOUND = new byte[0];
+    private final static int LIMIT = 1024;
+
+    private Configuration conf;
+    private RecordReader reader;
+
+    @Override
+    public Tuple getNext() throws IOException
+    {
+        try
+        {
+            // load the next pair
+            if (!reader.nextKeyValue())
+                return null;
+            String key = (String)reader.getCurrentKey();
+            SortedMap<byte[],IColumn> cf = (SortedMap<byte[],IColumn>)reader.getCurrentValue();
+            assert key != null && cf != null;
+
+            // and wrap it in a tuple
+            Tuple tuple = TupleFactory.getInstance().newTuple(2);
+            ArrayList<Tuple> columns = new ArrayList<Tuple>();
+            tuple.set(0, new DataByteArray(key));
+            for (Map.Entry<byte[], IColumn> entry : cf.entrySet())
+                columns.add(columnToTuple(entry.getKey(), entry.getValue()));
+            tuple.set(1, new DefaultDataBag(columns));
+            return tuple;
+        }
+        catch (InterruptedException e)
+        {
+            throw new IOException(e.getMessage());
+        }
+    }
+
+    private Tuple columnToTuple(byte[] name, IColumn col) throws IOException
+    {
+        Tuple pair = TupleFactory.getInstance().newTuple(2);
+        pair.set(0, new DataByteArray(name));
+        if (col instanceof Column)
+        {
+            // standard
+            pair.set(1, new DataByteArray(col.value()));
+            return pair;
+        }
+
+        // super
+        ArrayList<Tuple> subcols = new ArrayList<Tuple>();
+        for (IColumn subcol : ((SuperColumn)col).getSubColumns())
+            subcols.add(columnToTuple(subcol.name(), subcol));
+        pair.set(1, new DefaultDataBag(subcols));
+        return pair;
+    }
+
+    @Override
+    public InputFormat getInputFormat()
+    {
+        ColumnFamilyInputFormat inputFormat = new ColumnFamilyInputFormat();
+        return inputFormat;
+    }
+
+    @Override
+    public void prepareToRead(RecordReader reader, PigSplit split)
+    {
+        this.reader = reader;
+    }
+
+    @Override
+    public void setLocation(String location, Job job) throws IOException
+    {
+        // parse uri into keyspace and columnfamily
+        String ksname, cfname;
+        try
+        {
+            if (!location.startsWith("cassandra://"))
+                throw new Exception("Bad scheme.");
+            String[] parts = location.split("/+");
+            ksname = parts[1];
+            cfname = parts[2];
+        }
+        catch (Exception e)
+        {
+            throw new IOException("Expected 'cassandra://<keyspace>/<columnfamily>': " + e.getMessage());
+        }
+
+        // and configure
+        SliceRange range = new SliceRange(BOUND, BOUND, false, LIMIT);
+        SlicePredicate predicate = new SlicePredicate().setSlice_range(range);
+        conf = job.getConfiguration();
+        ConfigHelper.setSlicePredicate(conf, predicate);
+        ConfigHelper.setColumnFamily(conf, ksname, cfname);
+    }
+
+    @Override
+    public String relativeToAbsolutePath(String location, Path curDir) throws IOException
+    {
+        return location;
+    }
+}

Propchange: cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java
------------------------------------------------------------------------------
    svn:eol-style = native

Added: cassandra/branches/cassandra-0.6/contrib/pig/storage-conf.xml
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.6/contrib/pig/storage-conf.xml?rev=928264&view=auto
==============================================================================
--- cassandra/branches/cassandra-0.6/contrib/pig/storage-conf.xml (added)
+++ cassandra/branches/cassandra-0.6/contrib/pig/storage-conf.xml Sat Mar 27 20:33:03 2010
@@ -0,0 +1,369 @@
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements.  See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership.  The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License.  You may obtain a copy of the License at
+ ~
+ ~    http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied.  See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+-->
+<Storage>
+  <!--======================================================================-->
+  <!-- Basic Configuration                                                  -->
+  <!--======================================================================-->
+
+  <!--
+   ~ The name of this cluster.  This is mainly used to prevent machines in
+   ~ one logical cluster from joining another.
+  -->
+  <ClusterName>Test Cluster</ClusterName>
+
+  <!--
+   ~ Turn on to make new [non-seed] nodes automatically migrate the right data
+   ~ to themselves.  (If no InitialToken is specified, they will pick one
+   ~ such that they will get half the range of the most-loaded node.)
+   ~ If a node starts up without bootstrapping, it will mark itself bootstrapped
+   ~ so that you can't subsequently accidentally bootstrap a node with
+   ~ data on it.  (You can reset this by wiping your data and commitlog
+   ~ directories.)
+   ~
+   ~ Off by default so that new clusters and upgraders from 0.4 don't
+   ~ bootstrap immediately.  You should turn this on when you start adding
+   ~ new nodes to a cluster that already has data on it.  (If you are upgrading
+   ~ from 0.4, start your cluster with it off once before changing it to true.
+   ~ Otherwise, no data will be lost but you will incur a lot of unnecessary
+   ~ I/O before your cluster starts up.)
+  -->
+  <AutoBootstrap>false</AutoBootstrap>
+
+  <!--
+   ~ Keyspaces and ColumnFamilies:
+   ~ A ColumnFamily is the Cassandra concept closest to a relational
+   ~ table.  Keyspaces are separate groups of ColumnFamilies.  Except in
+   ~ very unusual circumstances you will have one Keyspace per application.
+
+   ~ There is an implicit keyspace named 'system' for Cassandra internals.
+  -->
+  <Keyspaces>
+    <Keyspace Name="Keyspace1">
+      <!--
+       ~ ColumnFamily definitions have one required attribute (Name)
+       ~ and several optional ones.
+       ~
+       ~ The CompareWith attribute tells Cassandra how to sort the columns
+       ~ for slicing operations.  The default is BytesType, which is a
+       ~ straightforward lexical comparison of the bytes in each column.
+       ~ Other options are AsciiType, UTF8Type, LexicalUUIDType, TimeUUIDType,
+       ~ and LongType.  You can also specify the fully-qualified class
+       ~ name to a class of your choice extending
+       ~ org.apache.cassandra.db.marshal.AbstractType.
+       ~
+       ~ SuperColumns have a similar CompareSubcolumnsWith attribute.
+       ~
+       ~ BytesType: Simple sort by byte value.  No validation is performed.
+       ~ AsciiType: Like BytesType, but validates that the input can be
+       ~            parsed as US-ASCII.
+       ~ UTF8Type: A string encoded as UTF8
+       ~ LongType: A 64bit long
+       ~ LexicalUUIDType: A 128bit UUID, compared lexically (by byte value)
+       ~ TimeUUIDType: a 128bit version 1 UUID, compared by timestamp
+       ~
+       ~ (To get the closest approximation to 0.3-style supercolumns, you
+       ~ would use CompareWith=UTF8Type CompareSubcolumnsWith=LongType.)
+       ~
+       ~ An optional `Comment` attribute may be used to attach additional
+       ~ human-readable information about the column family to its definition.
+       ~
+       ~ The optional KeysCachedFraction attribute specifies
+       ~ The fraction of keys per sstable whose locations we keep in
+       ~ memory in "mostly LRU" order.  (JUST the key locations, NOT any
+       ~ column values.)  The amount of memory used by the default setting of
+       ~ 0.01 is comparable to the amount used by the internal per-sstable key
+       ~ index.  Consider increasing this if you have fewer, wider rows.
+       ~ Set to 0 to disable entirely.
+       ~
+       ~ The optional RowsCached attribute specifies the number of rows
+       ~ whose entire contents we cache in memory, either as a fixed number
+       ~ of rows or as a percent of rows in the ColumnFamily.
+       ~ Do not use this on ColumnFamilies with large rows, or
+       ~ ColumnFamilies with high write:read ratios.  As with key caching,
+       ~ valid values are from 0 to 1.  The default 0 disables it entirely.
+      -->
+      <ColumnFamily CompareWith="BytesType"
+                    Name="Standard1"
+                    RowsCached="10%"
+                    KeysCachedFraction="0"/>
+      <ColumnFamily CompareWith="UTF8Type" Name="Standard2"/>
+      <ColumnFamily CompareWith="TimeUUIDType" Name="StandardByUUID1"/>
+      <ColumnFamily ColumnType="Super"
+                    CompareWith="UTF8Type"
+                    CompareSubcolumnsWith="UTF8Type"
+                    Name="Super1"
+                    RowsCached="1000"
+                    KeysCachedFraction="0"
+                    Comment="A column family with supercolumns, whose column and subcolumn names are UTF8 strings"/>

+      <!--
+       ~ Strategy: Setting this to the class that implements
+       ~ IReplicaPlacementStrategy will change the way the node picker works.
+       ~ Out of the box, Cassandra provides
+       ~ org.apache.cassandra.locator.RackUnawareStrategy and
+       ~ org.apache.cassandra.locator.RackAwareStrategy (place one replica in
+       ~ a different datacenter, and the others on different racks in the same
+       ~ one.)
+      -->
+      <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
+
+      <!-- Number of replicas of the data -->
+      <ReplicationFactor>1</ReplicationFactor>
+
+      <!--
+       ~ EndPointSnitch: Setting this to the class that implements
+       ~ AbstractEndpointSnitch, which lets Cassandra know enough
+       ~ about your network topology to route requests efficiently.
+       ~ Out of the box, Cassandra provides org.apache.cassandra.locator.EndPointSnitch,
+       ~ and PropertyFileEndPointSnitch is available in contrib/.
+      -->
+      <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
+    </Keyspace>
+  </Keyspaces>
+
+  <!--
+   ~ Authenticator: any IAuthenticator may be used, including your own as long
+   ~ as it is on the classpath.  Out of the box, Cassandra provides
+   ~ org.apache.cassandra.auth.AllowAllAuthenticator and
+   ~ org.apache.cassandra.auth.SimpleAuthenticator
+   ~ (SimpleAuthenticator uses access.properties and passwd.properties by
+   ~ default).
+   ~
+   ~ If you don't specify an authenticator, AllowAllAuthenticator is used.
+  -->
+  <Authenticator>org.apache.cassandra.auth.AllowAllAuthenticator</Authenticator>
+
+  <!--
+   ~ Partitioner: any IPartitioner may be used, including your own as long
+   ~ as it is on the classpath.  Out of the box, Cassandra provides
+   ~ org.apache.cassandra.dht.RandomPartitioner,
+   ~ org.apache.cassandra.dht.OrderPreservingPartitioner, and
+   ~ org.apache.cassandra.dht.CollatingOrderPreservingPartitioner.
+   ~ (CollatingOPP collates according to EN,US rules, not naive byte
+   ~ ordering.  Use this as an example if you need locale-aware collation.)
+   ~ Range queries require using an order-preserving partitioner.
+   ~
+   ~ Achtung!  Changing this parameter requires wiping your data
+   ~ directories, since the partitioner can modify the sstable on-disk
+   ~ format.
+  -->
+  <Partitioner>org.apache.cassandra.dht.RandomPartitioner</Partitioner>
+
+  <!--
+   ~ If you are using an order-preserving partitioner and you know your key
+   ~ distribution, you can specify the token for this node to use.  (Keys
+   ~ are sent to the node with the "closest" token, so distributing your
+   ~ tokens equally along the key distribution space will spread keys
+   ~ evenly across your cluster.)  This setting is only checked the first
+   ~ time a node is started.
+
+   ~ This can also be useful with RandomPartitioner to force equal spacing
+   ~ of tokens around the hash space, especially for clusters with a small
+   ~ number of nodes.
+ --> + <InitialToken></InitialToken> + + <!-- + ~ Directories: Specify where Cassandra should store different data on + ~ disk. Keep the data disks and the CommitLog disks separate for best + ~ performance + --> + <CommitLogDirectory>/var/lib/cassandra/commitlog</CommitLogDirectory> + <DataFileDirectories> + <DataFileDirectory>/var/lib/cassandra/data</DataFileDirectory> + </DataFileDirectories> + <CalloutLocation>/var/lib/cassandra/callouts</CalloutLocation> + <StagingFileDirectory>/var/lib/cassandra/staging</StagingFileDirectory> + + + <!-- + ~ Addresses of hosts that are deemed contact points. Cassandra nodes + ~ use this list of hosts to find each other and learn the topology of + ~ the ring. You must change this if you are running multiple nodes! + --> + <Seeds> + <Seed>127.0.0.1</Seed> + </Seeds> + + + <!-- Miscellaneous --> + + <!-- Time to wait for a reply from other nodes before failing the command --> + <RpcTimeoutInMillis>5000</RpcTimeoutInMillis> + <!-- Size to allow commitlog to grow to before creating a new segment --> + <CommitLogRotationThresholdInMB>128</CommitLogRotationThresholdInMB> + + + <!-- Local hosts and ports --> + + <!-- + ~ Address to bind to and tell other nodes to connect to. You _must_ + ~ change this if you want multiple nodes to be able to communicate! + ~ + ~ Leaving it blank leaves it up to InetAddress.getLocalHost(). This + ~ will always do the Right Thing *if* the node is properly configured + ~ (hostname, name resolution, etc), and the Right Thing is to use the + ~ address associated with the hostname (it might not be). + --> + <ListenAddress>127.0.0.2</ListenAddress> + <!-- internal communications port --> + <StoragePort>7000</StoragePort> + + <!-- + ~ The address to bind the Thrift RPC service to. Unlike ListenAddress + ~ above, you *can* specify 0.0.0.0 here if you want Thrift to listen on + ~ all interfaces. + ~ + ~ Leaving this blank has the same effect it does for ListenAddress, + ~ (i.e. 
it will be based on the configured hostname of the node). + --> + <ThriftAddress>127.0.0.2</ThriftAddress> + <!-- Thrift RPC port (the port clients connect to). --> + <ThriftPort>9160</ThriftPort> + <!-- + ~ Whether or not to use a framed transport for Thrift. If this option + ~ is set to true then you must also use a framed transport on the + ~ client-side, (framed and non-framed transports are not compatible). + --> + <ThriftFramedTransport>false</ThriftFramedTransport> + + + <!--======================================================================--> + <!-- Memory, Disk, and Performance --> + <!--======================================================================--> + + <!-- + ~ Access mode. mmapped i/o is substantially faster, but only practical on + ~ a 64bit machine (which notably does not include EC2 "small" instances) + ~ or relatively small datasets. "auto", the safe choice, will enable + ~ mmapping on a 64bit JVM. Other values are "mmap", "mmap_index_only" + ~ (which may allow you to get part of the benefits of mmap on a 32bit + ~ machine by mmapping only index files) and "standard". + ~ (The buffer size settings that follow only apply to standard, + ~ non-mmapped i/o.) + --> + <DiskAccessMode>auto</DiskAccessMode> + + <!-- + ~ Buffer size to use when performing contiguous column slices. Increase + ~ this to the size of the column slices you typically perform. + ~ (Name-based queries are performed with a buffer size of + ~ ColumnIndexSizeInKB.) + --> + <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB> + + <!-- + ~ Buffer size to use when flushing memtables to disk. (Only one + ~ memtable is ever flushed at a time.) Increase (decrease) the index + ~ buffer size relative to the data buffer if you have few (many) + ~ columns per key. Bigger is only better _if_ your memtables get large + ~ enough to use the space. (Check in your data directory after your + ~ app has been running long enough.) 
--> + <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB> + <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB> + + <!-- + ~ Add column indexes to a row after its contents reach this size. + ~ Increase if your column values are large, or if you have a very large + ~ number of columns. The competing causes are, Cassandra has to + ~ deserialize this much of the row to read a single column, so you want + ~ it to be small - at least if you do many partial-row reads - but all + ~ the index data is read for each access, so you don't want to generate + ~ that wastefully either. + --> + <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB> + + <!-- + ~ Flush memtable after this much data has been inserted, including + ~ overwritten data. There is one memtable per column family, and + ~ this threshold is based solely on the amount of data stored, not + ~ actual heap memory usage (there is some overhead in indexing the + ~ columns). + --> + <MemtableThroughputInMB>64</MemtableThroughputInMB> + <!-- + ~ Throughput setting for Binary Memtables. Typically these are + ~ used for bulk load so you want them to be larger. + --> + <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB> + <!-- + ~ The maximum number of columns in millions to store in memory per + ~ ColumnFamily before flushing to disk. This is also a per-memtable + ~ setting. Use with MemtableThroughputInMB to tune memory usage. + --> + <MemtableOperationsInMillions>0.1</MemtableOperationsInMillions> + <!-- + ~ The maximum time to leave a dirty memtable unflushed. + ~ (While any affected columnfamilies have unflushed data from a + ~ commit log segment, that segment cannot be deleted.) + ~ This needs to be large enough that it won't cause a flush storm + ~ of all your memtables flushing at once because none has hit + ~ the size or count thresholds yet. For production, a larger + ~ value such as 1440 is recommended. 
+  -->
+  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
+
+  <!--
+   ~ Unlike most systems, in Cassandra writes are faster than reads, so
+   ~ you can afford more of those in parallel.  A good rule of thumb is 2
+   ~ concurrent reads per processor core.  Increase ConcurrentWrites to
+   ~ the number of clients writing at once if you enable CommitLogSync +
+   ~ CommitLogSyncDelay. -->
+  <ConcurrentReads>8</ConcurrentReads>
+  <ConcurrentWrites>32</ConcurrentWrites>
+
+  <!--
+   ~ CommitLogSync may be either "periodic" or "batch."  When in batch
+   ~ mode, Cassandra won't ack writes until the commit log has been
+   ~ fsynced to disk.  It will wait up to CommitLogSyncBatchWindowInMS
+   ~ milliseconds for other writes, before performing the sync.
+
+   ~ This is less necessary in Cassandra than in traditional databases
+   ~ since replication reduces the odds of losing data from a failure
+   ~ after writing the log entry but before it actually reaches the disk.
+   ~ So the other option is "periodic," where writes may be acked immediately
+   ~ and the CommitLog is simply synced every CommitLogSyncPeriodInMS
+   ~ milliseconds.
+  -->
+  <CommitLogSync>periodic</CommitLogSync>
+  <!--
+   ~ Interval at which to perform syncs of the CommitLog in periodic mode.
+   ~ Usually the default of 10000ms is fine; increase it if your i/o
+   ~ load is such that syncs are taking excessively long times.
+  -->
+  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
+  <!--
+   ~ Delay (in milliseconds) during which additional commit log entries
+   ~ may be written before fsync in batch mode.  This will increase
+   ~ latency slightly, but can vastly improve throughput where there are
+   ~ many writers.  Set to zero to disable (each entry will be synced
+   ~ individually).  Reasonable values range from a minimal 0.1 to 10 or
+   ~ even more if throughput matters more than latency.
+  -->
+  <!-- <CommitLogSyncBatchWindowInMS>1</CommitLogSyncBatchWindowInMS> -->
+
+  <!--
+   ~ Time to wait before garbage-collecting deletion markers.  Set this to
+   ~ a large enough value that you are confident that the deletion marker
+   ~ will be propagated to all replicas by the time this many seconds has
+   ~ elapsed, even in the face of hardware failures.  The default value is
+   ~ ten days.
+  -->
+  <GCGraceSeconds>864000</GCGraceSeconds>
+</Storage>

Propchange: cassandra/branches/cassandra-0.6/contrib/pig/storage-conf.xml
------------------------------------------------------------------------------
    svn:eol-style = native
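
For readers following the patch, CassandraStorage.setLocation accepts locations of the form 'cassandra://<keyspace>/<columnfamily>' (splitting the URI on runs of slashes, as in the LOAD statement in the README). The same parsing convention can be sketched in shell; this is a minimal illustration only, and the parse_location helper is hypothetical, not part of the patch:

```shell
#!/bin/sh

# Hypothetical helper mirroring CassandraStorage.setLocation's URI handling:
# the location must start with "cassandra://", and the remainder splits on
# slashes into a keyspace name and a columnfamily name.
parse_location() {
    location="$1"
    case "$location" in
        cassandra://*) ;;
        *) echo "Expected 'cassandra://<keyspace>/<columnfamily>'" >&2; return 1 ;;
    esac
    rest="${location#cassandra://}"       # strip the scheme
    keyspace="${rest%%/*}"                # text before the first slash
    columnfamily="${rest#*/}"             # text after the first slash
    echo "$keyspace $columnfamily"
}

parse_location "cassandra://Keyspace1/Standard1"
```

Running this against the README's example location prints "Keyspace1 Standard1"; anything without the cassandra:// scheme fails, matching the IOException raised in the Java code.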