Re: Help with Pig Script

2011-11-17 Thread Jeremy Hanna
If you are only interested in loading one row, why do you need to use Pig?  Is 
it an extremely wide row?

Unless you are using an ordered partitioner, you can't currently limit the rows 
you mapreduce over - you have to mapreduce over the whole column family.  
That will probably change in 1.1.  However, again, if you're only after 1 row, 
why don't you just use a regular Cassandra client and get that row and operate 
on it that way?

I suppose you *could* use pig and filter by the ID or something.  If you *do* 
have an ordered partitioner in your cluster, it's just a matter of specifying 
the key range.
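The tradeoff above can be sketched without a live cluster: with a random partitioner a mapreduce job must visit every row, while a plain client call is a single keyed lookup. Below is a minimal illustration against an in-memory stand-in for a column family; the data and names are hypothetical, and in a real deployment the keyed read would go through a client such as pycassa or Hector rather than a dict.

```python
# In-memory stand-in for a Cassandra column family: row key -> {column: value}.
# Data here is made up purely for illustration.
column_family = {
    "row-1": {"colA": "1", "colB": "2"},
    "row-2": {"colC": "3"},
    "wide-row": {"key-%04d" % i: "v" for i in range(1000)},
}

def get_row(cf, row_key):
    """Single keyed lookup -- the 'regular Cassandra client' approach."""
    return cf[row_key]

def scan_all(cf):
    """What a mapreduce job over the whole column family effectively does:
    it touches every row, wanted or not."""
    for key, columns in cf.items():
        yield key, columns

row = get_row(column_family, "wide-row")
print(len(row))  # 1000 columns, fetched without visiting row-1 or row-2
```

The point is that the keyed read costs one lookup regardless of how many other rows exist, which is why a plain client beats a full-CF mapreduce when only one row is wanted.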

On Nov 17, 2011, at 11:16 AM, Aaron Griffith wrote:

 I am trying to do the following with a Pig script and am having trouble 
 finding the correct syntax.
 
 - I want to use the LOAD function to load a single key/value row into a Pig 
 object.
 - The contents of that row are then flattened into a list of keys.
 - I then want to use that list of keys in another load function to select the 
 key/value pairs from another column family.
 
 The only way I can get this to work is by using a generic load function, then 
 applying filters to get at the data I want, then joining the two Pig objects 
 together to filter the second column family.
 
 I want to avoid having to pull the entire column families into Pig; it is way 
 too much data.
 
 Any suggestions?
 
 Thanks!
 



Re: Help with Pig Script

2011-11-17 Thread Aaron Griffith
Jeremy Hanna jeremy.hanna1234 at gmail.com writes:

 

It is a very wide row, with nested keys to another column family.  Pig makes it 
easy to convert it into a list of keys.

It also makes it easy to write out the results into Hadoop.

I then want to take that list of keys and go get rows from whatever column 
family they are for.

Thanks for your response.
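The workflow described above - flatten one wide row's column names into a list of keys, then dereference those keys against a second column family - can be sketched in plain Python against in-memory dicts. The column family names and data are hypothetical; in Pig the same shape would be a FLATTEN over the row's column bag followed by a second load and join.

```python
# Hypothetical in-memory column families: row key -> {column name: value}.
index_cf = {
    # One very wide row whose column names are row keys into detail_cf.
    "master": {"user-1": "", "user-2": "", "user-3": ""},
}
detail_cf = {
    "user-1": {"name": "alice"},
    "user-2": {"name": "bob"},
    "user-3": {"name": "carol"},
}

# Step 1: load the single wide row and flatten its columns into a list of keys.
keys = sorted(index_cf["master"].keys())

# Step 2: use those keys to select rows from the second column family,
# instead of pulling detail_cf in full and joining.
selected = {k: detail_cf[k] for k in keys if k in detail_cf}

print(keys)                # ['user-1', 'user-2', 'user-3']
print(selected["user-1"])  # {'name': 'alice'}
```

The second step is the part that is hard to express in Pig without an ordered partitioner: there is no way to push the key list down into the load, so the generic load-then-filter workaround pulls the whole second column family.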




Re: Help with Pig Script

2011-11-17 Thread Jeremy Hanna

On Nov 17, 2011, at 1:44 PM, Aaron Griffith wrote:


Okay.  Makes sense.  There is work being done to support wide rows with 
mapreduce - https://issues.apache.org/jira/browse/CASSANDRA-3264 - which is now 
being worked on as part of transposition - 
https://issues.apache.org/jira/browse/CASSANDRA-2474.  Transposition would make 
it so each wide row would turn into several transposed rows: (key, column, 
value) combinations.

I think the easiest way to do what you're trying to do is to use a client to 
page through the row and get the whole thing; then you can copy that up to HDFS 
or whatever else you want to do with it.
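The paging suggested here is the standard slice loop: fetch a batch of columns starting after the last column seen, and repeat until a short batch comes back. Below is a minimal sketch of that loop against an in-memory sorted row; in a real client this maps to repeated slice reads (pycassa, for example, wraps the pattern in ColumnFamily.xget), and the names and batch size here are illustrative.

```python
from bisect import bisect_right

# Hypothetical wide row: column name -> value, with names kept sorted
# the way Cassandra stores columns within a row.
wide_row = {"col-%05d" % i: i for i in range(2500)}
sorted_names = sorted(wide_row)

def get_slice(row, names, start_after, count):
    """Return up to `count` (name, value) pairs after `start_after`.
    Stand-in for a client's paged slice read."""
    begin = bisect_right(names, start_after) if start_after else 0
    return [(n, row[n]) for n in names[begin:begin + count]]

def page_row(row, names, batch=1000):
    """Page through the whole row one slice at a time."""
    last = None
    while True:
        batch_cols = get_slice(row, names, last, batch)
        for name, value in batch_cols:
            yield name, value
        if len(batch_cols) < batch:
            break  # a short batch means the row is exhausted
        last = batch_cols[-1][0]

all_cols = list(page_row(wide_row, sorted_names))
print(len(all_cols))  # 2500: the whole row, fetched 1000 columns at a time
```

Once the row has been paged into memory (or streamed), writing it out to HDFS is a straightforward copy, which matches the suggestion above.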