I need to update the wiki with better Pig info. I did put some information in the Getting Started docs of Pygmalion, but it would be good to transfer that to Cassandra's wiki and add to it. FWIW: https://github.com/jeromatron/pygmalion/wiki/Getting-Started
Thanks for the rundown William!

On Jun 8, 2011, at 4:11 PM, William Oberman wrote:

> I decided to try out hadoop/pig + cassandra. I had my ups and downs to get
> the script I wanted to run to work. I'm sure everyone who tries will have
> their own experiences/problems, but mine were:
>
> -Everything I need to know was in
> http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html and
> http://wiki.apache.org/cassandra/HadoopSupport
>
> -Java is really picky about hostnames. I'm in EC2, and rather than rely on
> DNS, I basically have all of my machines share an /etc/hosts file. But, the
> command line "hostname" wasn't returning the same thing as in /etc/hosts,
> which caused all kinds of weird hadoop issues at first. (I had hostname as
> "foo" and /etc/hosts had "foo.prod").
>
> -I forgot I had iptables on. It's always easier to not have firewalls to
> start (this is true when configuring anything of course)
>
> -Use the same version of everything everywhere. And for hadoop/pig, I was
> having issues until I used the combination of hadoop-0.20.2 + pig-0.8.1.
>
> -For hadoop's mapred-site.xml you HAVE to supply a port (hostname:port), and
> there isn't a standard, and it seems arbitrary. I used 8021, based on notes
> in a case somewhere from hadoop (I think trying to standardize).
>
> It took me awhile to figure the syntax of Pig Latin out, but I finally
> managed to get a script that does a count of all columns in a column family:
> rows = LOAD 'cassandra://keyspace/columnfamily' USING CassandraStorage();
> filter_rows = FILTER rows BY $1 is not null;
> counts = FOREACH filter_rows GENERATE COUNT($1);
> counts_in_bag = GROUP counts ALL;
> sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
> dump sum_of_bag;
>
> I'm trying to see the impact of running hadoop on the same servers as
> cassandra now. And yes, I've seen the note in the wiki about the clever
> partitioning of cassandra nodes to allow for "web latency" nodes + "hadoop
> processing" nodes :-)
>
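
On the hostname point in William's notes (the "hostname" command returning "foo" while /etc/hosts had "foo.prod"), a quick check run on each node might catch the mismatch before Hadoop does. This is only a rough sketch in Python, not part of Hadoop, Pig, or Cassandra. The function names hostname_in_hosts and check_local_hostname are made up for illustration, and it assumes a conventionally formatted hosts file (IP address first, then names and aliases).

```python
import socket


def hostname_in_hosts(hostname, hosts_text):
    """Return True if hostname appears as a name or alias in hosts-file text."""
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        # The first field is the IP address; the rest are names and aliases.
        names = entry.split()[1:]
        if hostname in names:
            return True
    return False


def check_local_hostname(hosts_path="/etc/hosts"):
    """Compare this machine's hostname against the shared hosts file.

    hosts_path is an assumption; point it at wherever the shared file lives.
    """
    with open(hosts_path) as f:
        hosts_text = f.read()
    name = socket.gethostname()
    return name, hostname_in_hosts(name, hosts_text)
```

Calling check_local_hostname() on a node returns the local hostname and whether it is listed. With the setup described above, a node whose hostname is "foo" would come back as not listed, because only "foo.prod" is in the file.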