Re: hadoop/pig notes

Jeremy Hanna Wed, 08 Jun 2011 15:40:03 -0700

I need to update the wiki with better pig info.  I did put some information in 
the getting started docs of pygmalion, but it would be good to transfer that to 
cassandra's wiki and add to it.
fwiw - https://github.com/jeromatron/pygmalion/wiki/Getting-Started


Thanks for the rundown William!


On Jun 8, 2011, at 4:11 PM, William Oberman wrote:

> I decided to try out hadoop/pig + cassandra.  I had my ups and downs getting 
> the script I wanted to run working.  I'm sure everyone who tries will have 
> their own experiences/problems, but mine were:
> 
> -Everything I needed to know was in 
> http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html and 
> http://wiki.apache.org/cassandra/HadoopSupport
> 
> -Java is really picky about hostnames.  I'm in EC2, and rather than rely on 
> DNS, I basically have all of my machines share an /etc/hosts file.  But, the 
> command line "hostname" wasn't returning the same thing as in /etc/hosts, 
> which caused all kinds of weird hadoop issues at first.  (I had hostname as 
> "foo" and /etc/hosts had "foo.prod").
> 
> -I forgot I had iptables on.  It's always easier to not have firewalls to 
> start (this is true when configuring anything of course)
> 
> -Use the same version of everything everywhere.  And for hadoop/pig, I was 
> having issues until I used the combination of hadoop-0.20.2 + pig-0.8.1.
> 
> -For hadoop's mapred-site.xml you HAVE to supply a port (hostname:port).  
> There isn't a standard one, so the choice seems arbitrary.  I used 8021, based 
> on notes in a case somewhere from hadoop (I think they were trying to 
> standardize).
> 
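
For anyone hitting the same hostname mismatch described above, here is a
minimal sketch of a check that could be run on each node before starting
hadoop. It is only an illustration: the function names are made up for this
example, and it assumes a plain /etc/hosts layout (an IP address followed by
one or more names).

```python
import socket


def names_in_hosts_file(hosts_text):
    """Return every hostname and alias found in /etc/hosts-style text."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        names.update(fields[1:])  # fields[0] is the IP address
    return names


def hostname_is_listed(hostname, hosts_text):
    """True only if `hostname` appears verbatim as a name or alias."""
    return hostname in names_in_hosts_file(hosts_text)


def check_local_host(hosts_path="/etc/hosts"):
    """Compare what the OS reports as the hostname with the hosts file."""
    with open(hosts_path) as f:
        hosts_text = f.read()
    name = socket.gethostname()
    return name, hostname_is_listed(name, hosts_text)
```

With the setup from the notes above, hostname_is_listed("foo", "10.0.0.5 foo.prod")
returns False, which is the kind of mismatch that caused the trouble
(10.0.0.5 is a placeholder address). Calling check_local_host() on a node
returns the reported hostname and whether /etc/hosts lists it.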
> It took me a while to figure out the syntax of Pig Latin, but I finally 
> managed to get a script that does a count of all columns in a column family:
> rows = LOAD 'cassandra://keyspace/columnfamily' USING CassandraStorage();
> filter_rows = FILTER rows BY $1 is not null;
> counts = FOREACH filter_rows GENERATE COUNT($1);
> counts_in_bag = GROUP counts ALL;
> sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
> dump sum_of_bag;
> 
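
To make the data flow of the script above easier to follow, here is a rough
Python sketch of the same steps on in-memory data. It assumes each row
arrives as (key, columns), where columns is a list of (name, value) pairs or
None, which is my reading of what $1 refers to in the script. The function
name and the sample rows are invented for illustration, and none of this
touches hadoop or cassandra.

```python
def count_all_columns(rows):
    """Mirror the Pig script: total number of columns across all rows."""
    # FILTER rows BY $1 is not null
    filter_rows = [row for row in rows if row[1] is not None]
    # FOREACH filter_rows GENERATE COUNT($1)
    # (len() stands in for COUNT, assuming no null column names)
    counts = [len(row[1]) for row in filter_rows]
    # GROUP counts ALL; FOREACH counts_in_bag GENERATE SUM($1)
    return sum(counts)


# Invented sample data: two rows with columns, one row without any.
sample_rows = [
    ("key1", [("col_a", "1"), ("col_b", "2")]),
    ("key2", None),
    ("key3", [("col_c", "3")]),
]
total = count_all_columns(sample_rows)
```

The per-row count followed by a single sum is the same two-stage shape as
the Pig version: count within each row first, then collapse everything into
one group and add up the counts.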
> I'm trying to see the impact of running hadoop on the same servers as 
> cassandra now.  And yes, I've seen the note in the wiki about the clever 
> partitioning of cassandra nodes to allow for "web latency" nodes + "hadoop 
> processing" nodes :-)
>
