Re: Pig not reading all cassandra data

2011-02-17 Thread Matt Kennedy
I have a resolution for how I'm dealing with this problem for my particular situation and I'd like to throw it out there to see if you think it should be integrated into the core Cassandra code. Just to repeat, the immediate workaround for this is to set -Dpig.splitCombination=false when you

Re: Pig not reading all cassandra data

2011-02-17 Thread Jonathan Ellis
Thanks a lot for the help on this! From what I can tell that looks like a good solution. Created https://issues.apache.org/jira/browse/CASSANDRA-2184 to make that change.
On Thu, Feb 17, 2011 at 11:52 AM, Matt Kennedy stinkym...@gmail.com wrote: I have a resolution for how I'm dealing with

Re: Pig not reading all cassandra data

2011-02-11 Thread Matt Kennedy
Sorry it has taken me a while to get back to this. I'm still trying to get to the bottom of this to find where the disconnect is between the column family input format code and the Pig optimizer. I suspected that the problem was line 365 of:

Re: Pig not reading all cassandra data

2011-02-04 Thread Matt Kennedy
Found the culprit. There is a new feature in Pig 0.8 that will try to reduce the number of splits used to speed up the whole job. Since the ColumnFamilyInputFormat lists the input size as zero, this feature eliminates all of the splits except for one. The workaround is to disable this
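To illustrate the effect described in this message, here is a minimal sketch of a greedy split combiner. It is a simplified model written for this note, not Pig's actual implementation: the function name `combine_splits` and the parameter `max_combined_size` are hypothetical, and the 128 MB threshold is an arbitrary assumption. It only shows why splits that all report a length of zero can collapse into a single combined split, consistent with the "Number of splits = 1" line in the jobtracker log.

```python
def combine_splits(split_lengths, max_combined_size):
    """Greedily pack consecutive splits into groups.

    A group is closed once its reported size reaches max_combined_size.
    Hypothetical model for illustration only; not Pig's real algorithm.
    """
    groups = []
    current, current_size = [], 0
    for index, length in enumerate(split_lengths):
        current.append(index)
        current_size += length
        if current_size >= max_combined_size:
            groups.append(current)
            current, current_size = [], 0
    if current:
        # Whatever is left over becomes one final group.
        groups.append(current)
    return groups


THRESHOLD = 128 * 1024 * 1024  # assumed threshold, chosen only for the example

# Eight splits that all report a length of zero never reach the threshold,
# so they end up in one group.
zero_length = combine_splits([0] * 8, THRESHOLD)

# Eight splits that report 64 MB each are paired up instead.
real_length = combine_splits([64 * 1024 * 1024] * 8, THRESHOLD)

print(len(zero_length), len(real_length))
```

Under this model, an input format that reports a non-zero length for each split would keep the combiner from folding everything into one group, which is one way to read the behavior reported in this thread.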

Re: Pig not reading all cassandra data

2011-02-04 Thread Jonathan Ellis
On Fri, Feb 4, 2011 at 9:47 PM, Matt Kennedy stinkym...@gmail.com wrote: Found the culprit. There is a new feature in Pig 0.8 that will try to reduce the number of splits used to speed up the whole job. Since the ColumnFamilyInputFormat lists the input size as zero, this feature eliminates

Re: Pig not reading all cassandra data

2011-02-02 Thread Matthew E. Kennedy
I noticed in the jobtracker log that when the pig job kicks off, I get the following info message:
2011-02-02 09:13:07,269 INFO org.apache.hadoop.mapred.JobInProgress: Input size for job job_201101241634_0193 = 0. Number of splits = 1
So I looked at the job.split file that is created for the

Pig not reading all cassandra data

2011-02-01 Thread Matthew E. Kennedy
I'm running Cassandra 0.7 and I'm trying to get Pig integration to work correctly. I'm using Pig 0.8 running against Hadoop 0.20.2, and I've also tried running it against CDH2. I can log into the grunt shell and execute scripts, but when they run, they don't read all of the data from Cassandra.