Re: pig_cassandra problem - "Incompatible field schema" error

Pete Warden Mon, 17 Oct 2011 03:26:40 -0700

JIRA filed, with a messy patch too:
https://issues.apache.org/jira/browse/CASSANDRA-3371


cheers,
           Pete

On Mon, Oct 17, 2011 at 2:27 AM, Pete Warden <p...@jetpac.com> wrote:

> I've dug deeper into this, since this got my script running but still left
> me at sea when dealing with the actual data. It's looking like there may be
> a mismatch between the schema that's being reported by
> CassandraStorage.java, and the data that's actually returned. Here's an
> example:
>
> rows = LOAD 'cassandra://Frap/PhotoVotes' USING CassandraStorage();
> DESCRIBE rows;
> rows: {key: chararray,columns: {(name: chararray,value:
> bytearray,photo_owner: chararray,value_photo_owner: bytearray,pid:
> chararray,value_pid: bytearray,matched_string:
> chararray,value_matched_string: bytearray,src_big: chararray,value_src_big:
> bytearray,time: chararray,value_time: bytearray,vote_type:
> chararray,value_vote_type: bytearray,voter: chararray,value_voter:
> bytearray)}}
> DUMP rows;
> (691831038_1317937188.48955,{(photo_owner,1596090180),(pid,6855155124568798560),(matched_string,),(src_big,),(time,Thu Oct 06 14:39:48 -0700 2011),(vote_type,album_dislike),(voter,691831038)})
>
> getSchema() is reporting the columns as an inner bag of tuples, each of
> which contains 16 values. In fact, getNext() seems to return an inner bag
> containing 7 tuples, each of which contains two values.
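
The mismatch described above can be checked mechanically. The following is a minimal sketch in plain Python (not Pig, and not the CassandraStorage API): it copies the field names from the DESCRIBE output and the inner bag from the DUMP output, then compares the declared tuple arity with what each tuple actually holds. The helper name `arity_mismatches` is hypothetical.

```python
# Field names reported by DESCRIBE for the inner tuple (copied from the thread).
declared_fields = [
    "name", "value",
    "photo_owner", "value_photo_owner",
    "pid", "value_pid",
    "matched_string", "value_matched_string",
    "src_big", "value_src_big",
    "time", "value_time",
    "vote_type", "value_vote_type",
    "voter", "value_voter",
]

# The inner bag printed by DUMP, modelled as a list of Python tuples.
actual_bag = [
    ("photo_owner", "1596090180"),
    ("pid", "6855155124568798560"),
    ("matched_string", ""),
    ("src_big", ""),
    ("time", "Thu Oct 06 14:39:48 -0700 2011"),
    ("vote_type", "album_dislike"),
    ("voter", "691831038"),
]


def arity_mismatches(declared, bag):
    """Return (index, actual_length) for each tuple whose length differs
    from the number of declared fields."""
    return [(i, len(t)) for i, t in enumerate(bag) if len(t) != len(declared)]


mismatches = arity_mismatches(declared_fields, actual_bag)
print(len(declared_fields), len(actual_bag), len(mismatches))  # prints: 16 7 7
```

Every one of the 7 tuples has 2 values where the schema promises 16, which is consistent with the schema describing one wide tuple while the loader emits one (name, value) pair per column.
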
>
> I'll file a JIRA and do my best to create a patch to do the right thing,
> but I wanted to sanity check what I'm seeing here since I'm a Pig newbie. Am
> I missing something? It appears that things got out of sync with this
> change:
>
> http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java?r1=1177083&r2=1177082&pathrev=1177083
>
> While I'm in there, is there a reason for using an inner bag to hold the
> columns? Do we ever have more than one set of the same columns for a given
> key in Cassandra? I'm thinking of tweaking things to look like this for my
> example, since it would make my processing code easier:
> rows: {cassandra_key: chararray, photo_owner: chararray, pid: chararray,
> matched_string: chararray, src_big: chararray, time: chararray, vote_type:
> chararray, voter: chararray}
> The main downside I can see is the possible clash between the cassandra_key
> value and a column with the same name.
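
As a rough illustration of the flattened layout proposed above, and of the name clash it risks, here is a sketch in plain Python. It is only a model of the idea, not a patch to CassandraStorage; `flatten_row` and its `key_field` parameter are hypothetical names.

```python
def flatten_row(key, columns, key_field="cassandra_key"):
    """Turn (key, [(name, value), ...]) into one flat mapping.

    Raises ValueError if a column name collides with the key field
    or with an earlier column of the same name.
    """
    row = {key_field: key}
    for name, value in columns:
        if name in row:
            raise ValueError("column name clashes with existing field: " + name)
        row[name] = value
    return row


flat = flatten_row(
    "691831038_1317937188.48955",
    [("vote_type", "album_dislike"), ("voter", "691831038")],
)
print(flat["cassandra_key"], flat["vote_type"], flat["voter"])
```

Raising on a clash is just one option; a loader could equally prefix column names or make the key field name configurable.
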
>
> cheers,
>            Pete
>
> On Tue, Oct 11, 2011 at 11:59 PM, Pete Warden <p...@jetpac.com> wrote:
>
>> For posterity, I ended up hacking around this by renaming the repeated
>> 'value' alias in CassandraStorage and rebuilding it. Here's the patch:
>>
>> --- src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java.original 2011-10-11 23:42:19.000000000 -0700
>> +++ src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java 2011-10-11 23:44:26.000000000 -0700
>> @@ -357,7 +357,7 @@
>>              validator = validators.get(cdef.getName());
>>              if (validator == null)
>>                  validator = marshallers.get(1);
>> -            valSchema.setName("value");
>> +            valSchema.setName("value_"+new String(cdef.getName()));
>>              valSchema.setType(getPigType(validator));
>>              tupleFields.add(valSchema);
>>          }
>>
>> I'm not suggesting this is a correct fix, but it does allow me to move
>> forward. Another suggestion was to try Pig 0.8.1 instead, but I ran into
>> https://cwiki.apache.org/confluence/display/PIG/FAQ#FAQ-Q%3AWhatshallIdoifIsaw%22FailedtocreateDataStorage%22%3F
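
To see why the rename in the patch above avoids the "Duplicate schema alias" error, the alias list can be modelled in plain Python. This is a sketch under the assumption that the schema starts with a generic (name, value) pair and then adds one alias pair per defined column, as the schemas quoted in this thread suggest; `inner_tuple_aliases` and `duplicate_aliases` are hypothetical helpers, not part of Pig or Cassandra.

```python
from collections import Counter


def inner_tuple_aliases(column_names, unique_values):
    """Build the alias list for the inner tuple.

    unique_values=False mimics the original behaviour (every value field
    is called "value"); True mimics the patched "value_<column>" naming.
    """
    aliases = ["name", "value"]
    for col in column_names:
        aliases.append(col)
        aliases.append("value_" + col if unique_values else "value")
    return aliases


def duplicate_aliases(aliases):
    """Return the sorted list of aliases that occur more than once."""
    return sorted(a for a, n in Counter(aliases).items() if n > 1)


before = inner_tuple_aliases(["time_last_ranked"], unique_values=False)
after = inner_tuple_aliases(["time_last_ranked"], unique_values=True)
print(duplicate_aliases(before), duplicate_aliases(after))  # prints: ['value'] []
```
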
>>
>> On Tue, Oct 11, 2011 at 10:34 PM, Pete Warden <p...@jetpac.com> wrote:
>>
>>> Thanks for all your help Brandon and Jeremy, that got me to the point
>>> where I could load data.
>>>
>>> I'm now hitting a new issue that seems like it could possibly be related.
>>> When I try to access the data like this:
>>>
>>> grunt> rows = LOAD 'cassandra://Frap/FriendsAlreadyRanked' USING
>>> CassandraStorage();
>>> grunt> parts = FOREACH rows GENERATE key,
>>> FromCassandraBag('time_last_ranked', columns);
>>>
>>> I see the following error:
>>>
>>> 2011-10-11 22:23:43,877 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>> ERROR 1108:
>>> <line 4, column 71> Duplicate schema alias: value in "columns"
>>>
>>> At first I thought it might be related to the Pygmalion helper functions,
>>> so I tried to strip it back to basics using this second line instead:
>>>
>>> parts = FOREACH rows GENERATE key,$1;
>>>
>>> and I still get an identical error.
>>>
>>> Any further thoughts on how I can dig into this?
>>>
>>> Thanks again,
>>>                     Pete
>>>
>>> On Tue, Oct 11, 2011 at 3:37 PM, Brandon Williams <dri...@gmail.com> wrote:
>>>
>>>> On Tue, Oct 11, 2011 at 4:24 PM, Pete Warden <p...@petewarden.com>
>>>> wrote:
>>>> > I'm trying to run the most basic example for pig_cassandra, counting the
>>>> > number of rows in a column family, and I'm hitting the following error:
>>>> > 2011-10-11 14:13:32,321 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>>> > ERROR 1031: Incompatable field schema: left is
>>>> > "columns:bag{:tuple(name:bytearray,value:bytearray)}", right is
>>>> > "columns:bag{:tuple(name:chararray,value:bytearray,time_last_ranked:chararray,value:bytearray)}"
>>>>
>>>> After https://issues.apache.org/jira/browse/CASSANDRA-2777 you need to
>>>> remove the 'AS' and everything after it; your schema definition
>>>> conflicts with what was inferred.
>>>>
>>>> -Brandon
>>>>
>>>
>>>
>>
>
