Thanks a lot for the help on this! From what I can tell, that looks like a good solution. Created https://issues.apache.org/jira/browse/CASSANDRA-2184 to make that change.
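
For anyone following along, here is a toy sketch of why the reported split length matters. It assumes a simple greedy, size-based combiner and is not Pig's actual code from MapRedUtil; the class and method names and the 128 MB threshold are made up for illustration. Splits that report length 0 always fit into the current group, so everything collapses into one map task, while splits that report Long.MAX_VALUE never fit alongside another split.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of size-based split combination. This is NOT Pig's actual code. */
final class SplitCombinationSketch {

    /**
     * Greedily packs split lengths into groups whose total stays at or below
     * maxCombinedSize. Each resulting group stands for one map task.
     */
    static List<List<Long>> combine(List<Long> splitLengths, long maxCombinedSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long length : splitLengths) {
            // Written as a subtraction so Long.MAX_VALUE cannot overflow a sum.
            boolean fits = length <= maxCombinedSize - currentSize;
            if (!current.isEmpty() && !fits) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(length);
            currentSize += length;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    /** Number of map tasks when every split reports the same (hypothetical) length. */
    static int mappersFor(int splitCount, long reportedLength, long maxCombinedSize) {
        return combine(Collections.nCopies(splitCount, reportedLength), maxCombinedSize).size();
    }

    public static void main(String[] args) {
        long threshold = 128L * 1024 * 1024; // made-up combination threshold
        // Four splits reporting length 0 collapse into 1 map task.
        System.out.println("length 0: " + mappersFor(4, 0L, threshold) + " map task(s)");
        // Four splits reporting Long.MAX_VALUE stay as 4 map tasks.
        System.out.println("length Long.MAX_VALUE: " + mappersFor(4, Long.MAX_VALUE, threshold) + " map task(s)");
    }
}
```

Under this assumed model, returning Long.MAX_VALUE from getLength opts the Cassandra splits out of combination without touching the global pig.splitCombination setting.
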

On Thu, Feb 17, 2011 at 11:52 AM, Matt Kennedy <stinkym...@gmail.com> wrote:
> I have a resolution for how I'm dealing with this problem for my particular
> situation and I'd like to throw it out there to see if you think it should
> be integrated into the core Cassandra code.
>
> Just to repeat, the immediate workaround for this is to set
> -Dpig.splitCombination=false when you launch pig.
>
> However, we wanted to keep splitCombination on because it is a useful
> optimization for a lot of our use cases, so I went digging for the least
> intrusive way to keep the split combiner on, but also prevent it from
> combining splits that read from Cassandra. My solution, which you are
> welcome to critique, is to change line 65 of
> http://svn.apache.org/viewvc/cassandra/trunk/src/java/org/apache/cassandra/hadoop/ColumnFamilySplit.java
> such that it returns Long.MAX_VALUE instead of zero.
>
> That effectively turns off split combination in Pig 0.8 when reading from
> Cassandra, but leaves it on for everything else. So far, I can't see any
> negative side effects from it.
>
> Thoughts?
>
>
> On Fri, Feb 11, 2011 at 3:37 PM, Matt Kennedy <stinkym...@gmail.com> wrote:
>>
>> Sorry it has taken me a while to get back to this. I'm still trying to
>> get to the bottom of this to find where the disconnect is between the column
>> family input format code and the Pig optimizer.
>>
>> I suspected that the problem was line 365 of:
>>
>> http://svn.apache.org/viewvc/pig/tags/release-0.8.0/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java?view=markup
>>
>> ...but I changed the ColumnFamilySplit.java file so that it returns -1
>> instead of 0, the result of which is that the Pig job will iterate over the
>> entirety of the cassandra data that it is supposed to, but it does so with
>> only one mapper. It looks like the Pig map combiner isn't using the
>> split.getLength call to determine how the maps get combined as I originally
>> suspected. I'll update when I figure more out.
>>
>> -Matt
>>
>> On Sat, Feb 5, 2011 at 1:01 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>
>>> On Fri, Feb 4, 2011 at 9:47 PM, Matt Kennedy <stinkym...@gmail.com>
>>> wrote:
>>> > Found the culprit. There is a new feature in Pig 0.8 that will try to
>>> > reduce the number of splits used to speed up the whole job. Since the
>>> > ColumnFamilyInputFormat lists the input size as zero, this feature
>>> > eliminates all of the splits except for one.
>>> >
>>> > The workaround is to disable this feature for jobs that use
>>> > CassandraStorage
>>> > by setting -Dpig.splitCombination=false in the pig_cassandra script.
>>> >
>>> > Hope somebody finds this useful, you wouldn't believe how many
>>> > dead-ends I
>>> > ran down trying to figure this out.
>>>
>>> Ouch, thanks for tracking that down.
>>>
>>> What should CFIF be returning differently? Do you mean the
>>> InputSplit.getLength?
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>>
>
>

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com