Author: jbellis
Date: Sat Mar 27 20:33:03 2010
New Revision: 928264

URL: http://svn.apache.org/viewvc?rev=928264&view=rev
Log:
add pig loadfunc to contrib.  patch by Stu Hood; reviewed by jbellis for CASSANDRA-910

Added:
    cassandra/branches/cassandra-0.6/contrib/pig/
    cassandra/branches/cassandra-0.6/contrib/pig/README.txt   (with props)
    cassandra/branches/cassandra-0.6/contrib/pig/bin/
    cassandra/branches/cassandra-0.6/contrib/pig/bin/pig_cassandra
    cassandra/branches/cassandra-0.6/contrib/pig/build.xml   (with props)
    cassandra/branches/cassandra-0.6/contrib/pig/src/
    cassandra/branches/cassandra-0.6/contrib/pig/src/java/
    cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/
    cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/
    cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/
    cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/
    cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/
    cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java   (with props)
    cassandra/branches/cassandra-0.6/contrib/pig/storage-conf.xml   (with props)
Modified:
    cassandra/branches/cassandra-0.6/CHANGES.txt

Modified: cassandra/branches/cassandra-0.6/CHANGES.txt
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.6/CHANGES.txt?rev=928264&r1=928263&r2=928264&view=diff
==============================================================================
--- cassandra/branches/cassandra-0.6/CHANGES.txt (original)
+++ cassandra/branches/cassandra-0.6/CHANGES.txt Sat Mar 27 20:33:03 2010
@@ -13,6 +13,7 @@
    to top level supercolumns" (CASSANDRA-834)
  * Streaming destination nodes do not update their JMX status (CASSANDRA-916)
  * Fix internal RPC timeout calculation (CASSANDRA-911)
+ * Added Pig loadfunc to contrib/pig (CASSANDRA-910)
 
 
 0.6.0-beta3

Added: cassandra/branches/cassandra-0.6/contrib/pig/README.txt
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.6/contrib/pig/README.txt?rev=928264&view=auto
==============================================================================
--- cassandra/branches/cassandra-0.6/contrib/pig/README.txt (added)
+++ cassandra/branches/cassandra-0.6/contrib/pig/README.txt Sat Mar 27 20:33:03 2010
@@ -0,0 +1,29 @@
+A Pig LoadFunc that reads all columns from a given ColumnFamily.
+
+Setup:
+
+First build and start a Cassandra server with the default
+configuration* and set the PIG_HOME and JAVA_HOME environment
+variables to the location of a Pig >= 0.7.0-dev install and your Java
+install. If you would like to run using the Hadoop backend, you should
+also set PIG_CONF_DIR to the location of your Hadoop config.
+
+Run:
+
+contrib/pig$ ant
+contrib/pig$ bin/pig_cassandra
+
+Once the 'grunt>' shell has loaded, try a simple program like the
+following, which will determine the top 50 column names:
+
+grunt> rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage();
+grunt> cols = FOREACH rows GENERATE flatten($1);
+grunt> colnames = FOREACH cols GENERATE $0;
+grunt> namegroups = GROUP colnames BY $0;
+grunt> namecounts = FOREACH namegroups GENERATE COUNT($1), group;
+grunt> orderednames = ORDER namecounts BY $0;
+grunt> topnames = LIMIT orderednames 50;
+grunt> dump topnames;
+
+*If you want to point Pig at a real cluster, modify the seed
+address in storage-conf.xml and re-run the build step.

Propchange: cassandra/branches/cassandra-0.6/contrib/pig/README.txt
------------------------------------------------------------------------------
    svn:eol-style = native
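
The Pig script in the README groups column names, counts each group, orders by count, and keeps 50. As a rough plain-Java analogue of that pipeline (class name and sample data are made up for illustration; note the script as written ORDERs ascending by count):

```java
import java.util.*;
import java.util.stream.*;

public class TopNames {
    /** GROUP BY name, COUNT, ORDER BY count, LIMIT n -- a plain-Java analogue of the Pig script. */
    static List<Map.Entry<String, Long>> topNames(List<String> colnames, int n) {
        // GROUP colnames BY $0; FOREACH namegroups GENERATE COUNT($1), group
        Map<String, Long> namecounts = colnames.stream()
            .collect(Collectors.groupingBy(name -> name, Collectors.counting()));
        // ORDER namecounts BY $0 (the count, ascending); LIMIT n
        return namecounts.entrySet().stream()
            .sorted(Map.Entry.comparingByValue())
            .limit(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical column names gathered from rows ($0 in the script)
        List<String> colnames = List.of("name", "age", "name", "email", "name", "age");
        topNames(colnames, 50).forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```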

Added: cassandra/branches/cassandra-0.6/contrib/pig/bin/pig_cassandra
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.6/contrib/pig/bin/pig_cassandra?rev=928264&view=auto
==============================================================================
--- cassandra/branches/cassandra-0.6/contrib/pig/bin/pig_cassandra (added)
+++ cassandra/branches/cassandra-0.6/contrib/pig/bin/pig_cassandra Sat Mar 27 20:33:03 2010
@@ -0,0 +1,50 @@
+#!/bin/sh
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+cwd=`dirname $0`
+cassandra_home="$cwd/../../../"
+
+# general jars.
+for jar in $cassandra_home/lib/*.jar $cassandra_home/build/lib/jars/*.jar; do
+    CLASSPATH=$CLASSPATH:$jar
+done
+
+# cassandra_loadfunc jar.
+LOADFUNC_JAR=`ls -1 $cwd/../build/*.jar`
+if [ ! -e $LOADFUNC_JAR ]; then
+    echo "Unable to locate cassandra_loadfunc jar: please run ant." >&2
+    exit 1
+fi
+CLASSPATH=$CLASSPATH:$LOADFUNC_JAR
+
+if [ "x$PIG_HOME" = "x" ]; then
+    echo "PIG_HOME not set: requires Pig >= 0.7.0-dev" >&2
+    exit 1
+fi
+
+# pig jar.
+PIG_JAR=$PIG_HOME/pig.jar
+if [ ! -e $PIG_JAR ]; then
+    echo "Unable to locate Pig jar" >&2
+    exit 1
+fi
+CLASSPATH=$CLASSPATH:$PIG_JAR
+
+export PIG_CLASSPATH=$PIG_CLASSPATH:$CLASSPATH
+export PIG_OPTS=$PIG_OPTS" -Dudf.import.list=org.apache.cassandra.hadoop.pig"
+cat "$cwd/../build/bootstrap.pig" - | $PIG_HOME/bin/pig $*

Added: cassandra/branches/cassandra-0.6/contrib/pig/build.xml
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.6/contrib/pig/build.xml?rev=928264&view=auto
==============================================================================
--- cassandra/branches/cassandra-0.6/contrib/pig/build.xml (added)
+++ cassandra/branches/cassandra-0.6/contrib/pig/build.xml Sat Mar 27 20:33:03 2010
@@ -0,0 +1,74 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements.  See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership.  The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License.  You may obtain a copy of the License at
+ ~
+ ~    http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied.  See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+<project basedir="." default="jar" name="cassandra_loadfunc">
+    <!-- stores the environment for locating PIG_HOME -->
+    <property environment="env" />
+    <property name="cassandra.dir" value="../.." />
+    <property name="cassandra.lib" value="" />
+    <property name="cassandra.classes" value="${cassandra.dir}/build/classes" />
+    <property name="build.src" value="${basedir}/src" />
+    <property name="build.lib" value="${basedir}/lib" />
+    <property name="build.out" value="${basedir}/build" />
+    <property name="build.classes" value="${build.out}/classes" />
+    <property name="final.name" value="cassandra_loadfunc" />
+
+    <path id="pig.classpath">
+        <pathelement location="${env.PIG_HOME}/pig.jar" />
+        <fileset dir="${cassandra.dir}/lib">
+            <include name="libthrift*.jar" />
+        </fileset>
+        <fileset dir="${cassandra.dir}/build/lib/jars">
+            <include name="google-collections*.jar" />
+        </fileset>
+    </path>
+
+    <path id="classpath">
+        <path refid="pig.classpath" />
+        <pathelement location="${cassandra.classes}" />
+    </path>
+
+    <target name="init">
+        <mkdir dir="${build.classes}" />
+    </target>
+
+    <target depends="init" name="build">
+        <fail unless="env.PIG_HOME" message="Please set PIG_HOME to the location of a Pig >= 0.7.0-dev install." />
+        <javac destdir="${build.classes}">
+            <src path="${build.src}" />
+            <classpath refid="classpath" />
+        </javac>
+        <!-- Build a line of jar registrations for use in the pig startup script -->
+        <pathconvert pathsep="; register " property="register.line" refid="pig.classpath" />
+        <echo message="register ${register.line};${line.separator}" file="${build.out}/bootstrap.pig" />
+    </target>
+
+    <target name="jar" depends="build">
+        <mkdir dir="${build.classes}/META-INF" />
+        <jar jarfile="${build.out}/${final.name}.jar">
+           <fileset dir="${build.classes}" />
+           <fileset dir="${cassandra.classes}" />
+           <fileset file="${basedir}/storage-conf.xml" />
+        </jar>
+    </target>
+
+    <target name="clean">
+        <delete dir="${build.out}" />
+    </target>
+</project>

Propchange: cassandra/branches/cassandra-0.6/contrib/pig/build.xml
------------------------------------------------------------------------------
    svn:eol-style = native
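
The pathconvert/echo pair in build.xml turns the pig.classpath entries into one line of Pig `register` statements: pathconvert joins the paths with "; register ", and echo prepends the leading keyword and appends the trailing ";". A minimal Java sketch of the same string assembly (class name and jar paths are made up):

```java
import java.util.List;

public class RegisterLine {
    /** Mirror of pathconvert(pathsep="; register ") plus the echo wrapper in build.xml. */
    static String registerLine(List<String> jars) {
        return "register " + String.join("; register ", jars) + ";";
    }

    public static void main(String[] args) {
        // Hypothetical jars standing in for the pig.classpath entries
        System.out.println(registerLine(List.of("/opt/pig/pig.jar", "lib/libthrift.jar")));
    }
}
```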

Added: cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java?rev=928264&view=auto
==============================================================================
--- cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java (added)
+++ cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java Sat Mar 27 20:33:03 2010
@@ -0,0 +1,145 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership. The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+package org.apache.cassandra.hadoop.pig;
+
+import java.io.IOException;
+import java.util.*;
+
+import org.apache.cassandra.db.Column;
+import org.apache.cassandra.db.IColumn;
+import org.apache.cassandra.db.SuperColumn;
+import org.apache.cassandra.hadoop.*;
+import org.apache.cassandra.thrift.SlicePredicate;
+import org.apache.cassandra.thrift.SliceRange;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.mapreduce.InputFormat;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.RecordReader;
+
+import org.apache.pig.LoadFunc;
+import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
+import org.apache.pig.data.DefaultDataBag;
+import org.apache.pig.data.DataByteArray;
+import org.apache.pig.data.Tuple;
+import org.apache.pig.data.TupleFactory;
+
+/**
+ * A LoadFunc wrapping ColumnFamilyInputFormat.
+ *
+ * A row from a standard CF will be returned as nested tuples: (key, ((name1, val1), (name2, val2))).
+ */
+public class CassandraStorage extends LoadFunc
+{
+    private final static byte[] BOUND = new byte[0];
+    private final static int LIMIT = 1024;
+
+    private Configuration conf;
+    private RecordReader reader;
+
+    @Override
+    public Tuple getNext() throws IOException
+    {
+        try
+        {
+            // load the next pair
+            if (!reader.nextKeyValue())
+                return null;
+            String key = (String)reader.getCurrentKey();
+            SortedMap<byte[],IColumn> cf = (SortedMap<byte[],IColumn>)reader.getCurrentValue();
+            assert key != null && cf != null;
+            
+            // and wrap it in a tuple
+            Tuple tuple = TupleFactory.getInstance().newTuple(2);
+            ArrayList<Tuple> columns = new ArrayList<Tuple>();
+            tuple.set(0, new DataByteArray(key));
+            for (Map.Entry<byte[], IColumn> entry : cf.entrySet())
+                columns.add(columnToTuple(entry.getKey(), entry.getValue()));
+            tuple.set(1, new DefaultDataBag(columns));
+            return tuple;
+        }
+        catch (InterruptedException e)
+        {
+            throw new IOException(e.getMessage());
+        }
+    }
+
+    private Tuple columnToTuple(byte[] name, IColumn col) throws IOException
+    {
+        Tuple pair = TupleFactory.getInstance().newTuple(2);
+        pair.set(0, new DataByteArray(name));
+        if (col instanceof Column)
+        {
+            // standard
+            pair.set(1, new DataByteArray(col.value()));
+            return pair;
+        }
+
+        // super
+        ArrayList<Tuple> subcols = new ArrayList<Tuple>();
+        for (IColumn subcol : ((SuperColumn)col).getSubColumns())
+            subcols.add(columnToTuple(subcol.name(), subcol));
+        pair.set(1, new DefaultDataBag(subcols));
+        return pair;
+    }
+
+    @Override
+    public InputFormat getInputFormat()
+    {
+        ColumnFamilyInputFormat inputFormat = new ColumnFamilyInputFormat();
+        return inputFormat;
+    }
+
+    @Override
+    public void prepareToRead(RecordReader reader, PigSplit split)
+    {
+        this.reader = reader;
+    }
+
+    @Override
+    public void setLocation(String location, Job job) throws IOException
+    {
+        // parse uri into keyspace and columnfamily
+        String ksname, cfname;
+        try
+        {
+            if (!location.startsWith("cassandra://"))
+                throw new Exception("Bad scheme.");
+            String[] parts = location.split("/+");
+            ksname = parts[1];
+            cfname = parts[2];
+        }
+        catch (Exception e)
+        {
+            throw new IOException("Expected 'cassandra://<keyspace>/<columnfamily>': " + e.getMessage());
+        }
+
+        // and configure
+        SliceRange range = new SliceRange(BOUND, BOUND, false, LIMIT);
+        SlicePredicate predicate = new SlicePredicate().setSlice_range(range);
+        conf = job.getConfiguration();
+        ConfigHelper.setSlicePredicate(conf, predicate);
+        ConfigHelper.setColumnFamily(conf, ksname, cfname);
+    }
+
+    @Override
+    public String relativeToAbsolutePath(String location, Path curDir) throws IOException
+    {
+        return location;
+    }
+}

Propchange: cassandra/branches/cassandra-0.6/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java
------------------------------------------------------------------------------
    svn:eol-style = native
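
setLocation above accepts 'cassandra://<keyspace>/<columnfamily>' URIs and relies on split("/+") collapsing the double slash. A self-contained sketch of that parsing step (class name is illustrative, not part of the patch):

```java
public class LocationParse {
    /** Parse "cassandra://<keyspace>/<columnfamily>" into {keyspace, columnfamily}. */
    static String[] parse(String location) {
        if (!location.startsWith("cassandra://"))
            throw new IllegalArgumentException("Expected 'cassandra://<keyspace>/<columnfamily>'");
        // split on runs of '/': ["cassandra:", keyspace, columnfamily]
        String[] parts = location.split("/+");
        return new String[] { parts[1], parts[2] };
    }

    public static void main(String[] args) {
        String[] ksAndCf = parse("cassandra://Keyspace1/Standard1");
        System.out.println(ksAndCf[0] + " / " + ksAndCf[1]);
    }
}
```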

Added: cassandra/branches/cassandra-0.6/contrib/pig/storage-conf.xml
URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.6/contrib/pig/storage-conf.xml?rev=928264&view=auto
==============================================================================
--- cassandra/branches/cassandra-0.6/contrib/pig/storage-conf.xml (added)
+++ cassandra/branches/cassandra-0.6/contrib/pig/storage-conf.xml Sat Mar 27 20:33:03 2010
@@ -0,0 +1,369 @@
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements.  See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership.  The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License.  You may obtain a copy of the License at
+ ~
+ ~    http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied.  See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+-->
+<Storage>
+  <!--======================================================================-->
+  <!-- Basic Configuration                                                  -->
+  <!--======================================================================-->
+
+  <!-- 
+   ~ The name of this cluster.  This is mainly used to prevent machines in
+   ~ one logical cluster from joining another.
+  -->
+  <ClusterName>Test Cluster</ClusterName>
+
+  <!--
+   ~ Turn on to make new [non-seed] nodes automatically migrate the right data 
+   ~ to themselves.  (If no InitialToken is specified, they will pick one 
+   ~ such that they will get half the range of the most-loaded node.)
+   ~ If a node starts up without bootstrapping, it will mark itself bootstrapped
+   ~ so that you can't subsequently accidentally bootstrap a node with
+   ~ data on it.  (You can reset this by wiping your data and commitlog
+   ~ directories.)
+   ~
+   ~ Off by default so that new clusters and upgraders from 0.4 don't
+   ~ bootstrap immediately.  You should turn this on when you start adding
+   ~ new nodes to a cluster that already has data on it.  (If you are upgrading
+   ~ from 0.4, start your cluster with it off once before changing it to true.
+   ~ Otherwise, no data will be lost but you will incur a lot of unnecessary
+   ~ I/O before your cluster starts up.)
+  -->
+  <AutoBootstrap>false</AutoBootstrap>
+
+  <!--
+   ~ Keyspaces and ColumnFamilies:
+   ~ A ColumnFamily is the Cassandra concept closest to a relational
+   ~ table.  Keyspaces are separate groups of ColumnFamilies.  Except in
+   ~ very unusual circumstances you will have one Keyspace per application.
+
+   ~ There is an implicit keyspace named 'system' for Cassandra internals.
+  -->
+  <Keyspaces>
+    <Keyspace Name="Keyspace1">
+      <!--
+       ~ ColumnFamily definitions have one required attribute (Name)
+       ~ and several optional ones.
+       ~
+       ~ The CompareWith attribute tells Cassandra how to sort the columns
+       ~ for slicing operations.  The default is BytesType, which is a
+       ~ straightforward lexical comparison of the bytes in each column.
+       ~ Other options are AsciiType, UTF8Type, LexicalUUIDType, TimeUUIDType,
+       ~ and LongType.  You can also specify the fully-qualified class
+       ~ name to a class of your choice extending
+       ~ org.apache.cassandra.db.marshal.AbstractType.
+       ~ 
+       ~ SuperColumns have a similar CompareSubcolumnsWith attribute.
+       ~ 
+       ~ BytesType: Simple sort by byte value.  No validation is performed.
+       ~ AsciiType: Like BytesType, but validates that the input can be 
+       ~            parsed as US-ASCII.
+       ~ UTF8Type: A string encoded as UTF8
+       ~ LongType: A 64bit long
+       ~ LexicalUUIDType: A 128bit UUID, compared lexically (by byte value)
+       ~ TimeUUIDType: a 128bit version 1 UUID, compared by timestamp
+       ~
+       ~ (To get the closest approximation to 0.3-style supercolumns, you
+       ~ would use CompareWith=UTF8Type CompareSubcolumnsWith=LongType.)
+       ~
+       ~ An optional `Comment` attribute may be used to attach additional
+       ~ human-readable information about the column family to its definition.
+       ~ 
+       ~ The optional KeysCachedFraction attribute specifies
+       ~ the fraction of keys per sstable whose locations we keep in
+       ~ memory in "mostly LRU" order.  (JUST the key locations, NOT any
+       ~ column values.) The amount of memory used by the default setting of 
+       ~ 0.01 is comparable to the amount used by the internal per-sstable key
+       ~ index. Consider increasing this if you have fewer, wider rows.
+       ~ Set to 0 to disable entirely.
+       ~
+       ~ The optional RowsCached attribute specifies the number of rows
+       ~ whose entire contents we cache in memory, either as a fixed number
+       ~ of rows or as a percent of rows in the ColumnFamily.  
+       ~ Do not use this on ColumnFamilies with large rows, or
+       ~ ColumnFamilies with high write:read ratios.  As with key caching,
+       ~ valid values are from 0 to 1.  The default 0 disables it entirely.
+      -->
+      <ColumnFamily CompareWith="BytesType" 
+                    Name="Standard1" 
+                    RowsCached="10%"
+                    KeysCachedFraction="0"/>
+      <ColumnFamily CompareWith="UTF8Type" Name="Standard2"/>
+      <ColumnFamily CompareWith="TimeUUIDType" Name="StandardByUUID1"/>
+      <ColumnFamily ColumnType="Super"
+                    CompareWith="UTF8Type"
+                    CompareSubcolumnsWith="UTF8Type"
+                    Name="Super1"
+                    RowsCached="1000"
+                    KeysCachedFraction="0"
+                    Comment="A column family with supercolumns, whose column and subcolumn names are UTF8 strings"/>
+
+      <!--
+       ~ Strategy: Setting this to the class that implements
+       ~ IReplicaPlacementStrategy will change the way the node picker works.
+       ~ Out of the box, Cassandra provides
+       ~ org.apache.cassandra.locator.RackUnawareStrategy and
+       ~ org.apache.cassandra.locator.RackAwareStrategy (place one replica in
+       ~ a different datacenter, and the others on different racks in the same
+       ~ one.)
+      -->
+      <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
+
+      <!-- Number of replicas of the data -->
+      <ReplicationFactor>1</ReplicationFactor>
+
+      <!--
+       ~ EndPointSnitch: Setting this to the class that implements
+       ~ AbstractEndpointSnitch, which lets Cassandra know enough
+       ~ about your network topology to route requests efficiently.
+       ~ Out of the box, Cassandra provides org.apache.cassandra.locator.EndPointSnitch,
+       ~ and PropertyFileEndPointSnitch is available in contrib/.
+      -->
+      <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
+    </Keyspace>
+  </Keyspaces>
+
+  <!--
+   ~ Authenticator: any IAuthenticator may be used, including your own as long
+   ~ as it is on the classpath.  Out of the box, Cassandra provides
+   ~ org.apache.cassandra.auth.AllowAllAuthenticator and,
+   ~ org.apache.cassandra.auth.SimpleAuthenticator 
+   ~ (SimpleAuthenticator uses access.properties and passwd.properties by
+   ~ default).
+   ~
+   ~ If you don't specify an authenticator, AllowAllAuthenticator is used.
+  -->
+  <Authenticator>org.apache.cassandra.auth.AllowAllAuthenticator</Authenticator>
+
+  <!--
+   ~ Partitioner: any IPartitioner may be used, including your own as long
+   ~ as it is on the classpath.  Out of the box, Cassandra provides
+   ~ org.apache.cassandra.dht.RandomPartitioner,
+   ~ org.apache.cassandra.dht.OrderPreservingPartitioner, and
+   ~ org.apache.cassandra.dht.CollatingOrderPreservingPartitioner.
+   ~ (CollatingOPP collates according to EN,US rules, not naive byte
+   ~ ordering.  Use this as an example if you need locale-aware collation.)
+   ~ Range queries require using an order-preserving partitioner.
+   ~
+   ~ Achtung!  Changing this parameter requires wiping your data
+   ~ directories, since the partitioner can modify the sstable on-disk
+   ~ format.
+  -->
+  <Partitioner>org.apache.cassandra.dht.RandomPartitioner</Partitioner>
+
+  <!--
+   ~ If you are using an order-preserving partitioner and you know your key
+   ~ distribution, you can specify the token for this node to use. (Keys
+   ~ are sent to the node with the "closest" token, so distributing your
+   ~ tokens equally along the key distribution space will spread keys
+   ~ evenly across your cluster.)  This setting is only checked the first
+   ~ time a node is started. 
+
+   ~ This can also be useful with RandomPartitioner to force equal spacing
+   ~ of tokens around the hash space, especially for clusters with a small
+   ~ number of nodes.
+  -->
+  <InitialToken></InitialToken>
+
+  <!--
+   ~ Directories: Specify where Cassandra should store different data on
+   ~ disk.  Keep the data disks and the CommitLog disks separate for best
+   ~ performance
+  -->
+  <CommitLogDirectory>/var/lib/cassandra/commitlog</CommitLogDirectory>
+  <DataFileDirectories>
+      <DataFileDirectory>/var/lib/cassandra/data</DataFileDirectory>
+  </DataFileDirectories>
+  <CalloutLocation>/var/lib/cassandra/callouts</CalloutLocation>
+  <StagingFileDirectory>/var/lib/cassandra/staging</StagingFileDirectory>
+
+
+  <!--
+   ~ Addresses of hosts that are deemed contact points. Cassandra nodes
+   ~ use this list of hosts to find each other and learn the topology of
+   ~ the ring. You must change this if you are running multiple nodes!
+  -->
+  <Seeds>
+      <Seed>127.0.0.1</Seed>
+  </Seeds>
+
+
+  <!-- Miscellaneous -->
+
+  <!-- Time to wait for a reply from other nodes before failing the command -->
+  <RpcTimeoutInMillis>5000</RpcTimeoutInMillis>
+  <!-- Size to allow commitlog to grow to before creating a new segment -->
+  <CommitLogRotationThresholdInMB>128</CommitLogRotationThresholdInMB>
+
+
+  <!-- Local hosts and ports -->
+
+  <!-- 
+   ~ Address to bind to and tell other nodes to connect to.  You _must_
+   ~ change this if you want multiple nodes to be able to communicate!  
+   ~
+   ~ Leaving it blank leaves it up to InetAddress.getLocalHost(). This
+   ~ will always do the Right Thing *if* the node is properly configured
+   ~ (hostname, name resolution, etc), and the Right Thing is to use the
+   ~ address associated with the hostname (it might not be).
+  -->
+  <ListenAddress>127.0.0.2</ListenAddress>
+  <!-- internal communications port -->
+  <StoragePort>7000</StoragePort>
+
+  <!--
+   ~ The address to bind the Thrift RPC service to. Unlike ListenAddress
+   ~ above, you *can* specify 0.0.0.0 here if you want Thrift to listen on
+   ~ all interfaces.
+   ~
+   ~ Leaving this blank has the same effect it does for ListenAddress,
+   ~ (i.e. it will be based on the configured hostname of the node).
+  -->
+  <ThriftAddress>127.0.0.2</ThriftAddress>
+  <!-- Thrift RPC port (the port clients connect to). -->
+  <ThriftPort>9160</ThriftPort>
+  <!-- 
+   ~ Whether or not to use a framed transport for Thrift. If this option
+   ~ is set to true then you must also use a framed transport on the 
+   ~ client-side, (framed and non-framed transports are not compatible).
+  -->
+  <ThriftFramedTransport>false</ThriftFramedTransport>
+
+
+  <!--======================================================================-->
+  <!-- Memory, Disk, and Performance                                        -->
+  <!--======================================================================-->
+
+  <!--
+   ~ Access mode.  mmapped i/o is substantially faster, but only practical on
+   ~ a 64bit machine (which notably does not include EC2 "small" instances)
+   ~ or relatively small datasets.  "auto", the safe choice, will enable
+   ~ mmapping on a 64bit JVM.  Other values are "mmap", "mmap_index_only"
+   ~ (which may allow you to get part of the benefits of mmap on a 32bit
+   ~ machine by mmapping only index files) and "standard".
+   ~ (The buffer size settings that follow only apply to standard,
+   ~ non-mmapped i/o.)
+   -->
+  <DiskAccessMode>auto</DiskAccessMode>
+
+  <!--
+   ~ Buffer size to use when performing contiguous column slices. Increase
+   ~ this to the size of the column slices you typically perform. 
+   ~ (Name-based queries are performed with a buffer size of 
+   ~ ColumnIndexSizeInKB.)
+  -->
+  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
+
+  <!--
+   ~ Buffer size to use when flushing memtables to disk. (Only one 
+   ~ memtable is ever flushed at a time.) Increase (decrease) the index
+   ~ buffer size relative to the data buffer if you have few (many) 
+   ~ columns per key.  Bigger is only better _if_ your memtables get large
+   ~ enough to use the space. (Check in your data directory after your
+   ~ app has been running long enough.) -->
+  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
+  <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
+
+  <!--
+   ~ Add column indexes to a row after its contents reach this size.
+   ~ Increase if your column values are large, or if you have a very large
+   ~ number of columns.  The competing causes are, Cassandra has to
+   ~ deserialize this much of the row to read a single column, so you want
+   ~ it to be small - at least if you do many partial-row reads - but all
+   ~ the index data is read for each access, so you don't want to generate
+   ~ that wastefully either.
+  -->
+  <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
+
+  <!--
+   ~ Flush memtable after this much data has been inserted, including
+   ~ overwritten data.  There is one memtable per column family, and 
+   ~ this threshold is based solely on the amount of data stored, not
+   ~ actual heap memory usage (there is some overhead in indexing the
+   ~ columns).
+  -->
+  <MemtableThroughputInMB>64</MemtableThroughputInMB>
+  <!--
+   ~ Throughput setting for Binary Memtables.  Typically these are
+   ~ used for bulk load so you want them to be larger.
+  -->
+  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
+  <!--
+   ~ The maximum number of columns in millions to store in memory per
+   ~ ColumnFamily before flushing to disk.  This is also a per-memtable
+   ~ setting.  Use with MemtableThroughputInMB to tune memory usage.
+  -->
+  <MemtableOperationsInMillions>0.1</MemtableOperationsInMillions>
+  <!--
+   ~ The maximum time to leave a dirty memtable unflushed.
+   ~ (While any affected columnfamilies have unflushed data from a
+   ~ commit log segment, that segment cannot be deleted.)
+   ~ This needs to be large enough that it won't cause a flush storm
+   ~ of all your memtables flushing at once because none has hit
+   ~ the size or count thresholds yet.  For production, a larger
+   ~ value such as 1440 is recommended.
+  -->
+  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
+
+  <!--
+   ~ Unlike most systems, in Cassandra writes are faster than reads, so
+   ~ you can afford more of those in parallel.  A good rule of thumb is 2
+   ~ concurrent reads per processor core.  Increase ConcurrentWrites to
+   ~ the number of clients writing at once if you enable CommitLogSync +
+   ~ CommitLogSyncDelay. -->
+  <ConcurrentReads>8</ConcurrentReads>
+  <ConcurrentWrites>32</ConcurrentWrites>
+
+  <!--
+   ~ CommitLogSync may be either "periodic" or "batch."  When in batch
+   ~ mode, Cassandra won't ack writes until the commit log has been
+   ~ fsynced to disk.  It will wait up to CommitLogSyncBatchWindowInMS
+   ~ milliseconds for other writes, before performing the sync.
+
+   ~ This is less necessary in Cassandra than in traditional databases
+   ~ since replication reduces the odds of losing data from a failure
+   ~ after writing the log entry but before it actually reaches the disk.
+   ~ So the other option is "timed," where writes may be acked immediately
+   ~ and the CommitLog is simply synced every CommitLogSyncPeriodInMS
+   ~ milliseconds.
+  -->
+  <CommitLogSync>periodic</CommitLogSync>
+  <!--
+   ~ Interval at which to perform syncs of the CommitLog in periodic mode.
+   ~ Usually the default of 10000ms is fine; increase it if your i/o
+   ~ load is such that syncs are taking excessively long times.
+  -->
+  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
+  <!--
+   ~ Delay (in milliseconds) during which additional commit log entries
+   ~ may be written before fsync in batch mode.  This will increase
+   ~ latency slightly, but can vastly improve throughput where there are
+   ~ many writers.  Set to zero to disable (each entry will be synced
+   ~ individually).  Reasonable values range from a minimal 0.1 to 10 or
+   ~ even more if throughput matters more than latency.
+  -->
+  <!-- <CommitLogSyncBatchWindowInMS>1</CommitLogSyncBatchWindowInMS> --> 
+
+  <!--
+   ~ Time to wait before garbage-collecting deletion markers.  Set this to
+   ~ a large enough value that you are confident that the deletion marker
+   ~ will be propagated to all replicas by the time this many seconds has
+   ~ elapsed, even in the face of hardware failures.  The default value is
+   ~ ten days.
+  -->
+  <GCGraceSeconds>864000</GCGraceSeconds>
+</Storage>

Propchange: cassandra/branches/cassandra-0.6/contrib/pig/storage-conf.xml
------------------------------------------------------------------------------
    svn:eol-style = native

