Re: CqlStorage creates wrong schema for Pig

Miguel Angel Martin junquera Wed, 04 Sep 2013 03:08:10 -0700

Oops, sorry for my oversight.

I was checking the code and was surprised that it did not work with that Pig
script...


Now it works fine.

Many thanks, Chad.

Have a nice day.


Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com



2013/9/3 Chad Johnston <cjohns...@megatome.com>

> You're trying to use FromCqlColumn on a tuple that has been flattened. The
> schema still thinks it's {title: chararray}, but the flattened tuple is now
> two values. I don't know how to retrieve the data values in this case.
>
> Your code will work correctly if you do this:
> values3 = FOREACH rows GENERATE FromCqlColumn(title) AS title;
> dump values3;
> describe values3;
>
> (Use FromCqlColumn on the original data, not the flattened data.)
>
> Chad
>
>
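The difference between the failing and the working call can be modelled outside Pig. The following is only a minimal sketch: Python tuples stand in for Pig tuples, and `from_cql_column` is a hypothetical stand-in for the FromCqlColumn UDF, not its actual code. It assumes the UDF expects the original (name, value) pair and rejects anything else, which is what the ClassCastException quoted below suggests.

```python
# Illustrative model only: Python tuples stand in for Pig tuples, and
# from_cql_column is a hypothetical stand-in for the FromCqlColumn UDF.

def from_cql_column(column):
    """Return the value half of a (name, value) pair."""
    if not isinstance(column, tuple):
        # Mirrors the ClassCastException seen when a plain String arrives.
        raise TypeError(f"{type(column).__name__} cannot be cast to tuple")
    _name, value = column
    return value

# One row as DUMP shows it: every column is a (name, value) pair.
row = {"id": ("id", "5"), "age": ("age", 30), "title": ("title", "QA")}

# Works: the function sees the original pair.
title = from_cql_column(row["title"])

# Fails: flattening has already split the pair into two plain fields.
flattened = list(row["title"])
try:
    from_cql_column(flattened[0])
    flatten_error = None
except TypeError as exc:
    flatten_error = str(exc)
```

In this model the first call yields the column value, while the second fails because the flattened field is a plain string rather than a pair.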
> On Mon, Sep 2, 2013 at 8:45 AM, Miguel Angel Martin junquera <
> mianmarjun.mailingl...@gmail.com> wrote:
>
>> Hi
>>
>>
>> 1.
>>
>> Maybe like this?
>>
>> -- Register the UDF
>> REGISTER /path/to/cqlstorageudf-1.0-SNAPSHOT.jar;
>>
>> -- FromCqlColumn will convert chararray, int, long, float, double
>> DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
>>
>> -- Load data as normal
>> data_raw = LOAD 'cql://bookcrossing/books' USING CqlStorage();
>>
>> -- Use the UDF
>> data = FOREACH data_raw GENERATE
>>     FromCqlColumn(isbn) AS ISBN,
>>     FromCqlColumn(bookauthor) AS BookAuthor,
>>     FromCqlColumn(booktitle) AS BookTitle,
>>     FromCqlColumn(publisher) AS Publisher,
>>     FromCqlColumn(yearofpublication) AS YearOfPublication;
>>
>>
>>
>>
>>
>> and 2.
>>
>> With this data in Cassandra 1.2.8 (CQL3) and Pig 0.11.1:
>>
>> CREATE KEYSPACE keyspace1
>>   WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
>>   AND durable_writes = true;
>>
>> use keyspace1;
>>
>> CREATE TABLE test (
>>   id text PRIMARY KEY,
>>   title text,
>>   age int
>> ) WITH COMPACT STORAGE;
>>
>> insert into test (id, title, age) values('1', 'child', 21);
>> insert into test (id, title, age) values('2', 'support', 21);
>> insert into test (id, title, age) values('3', 'manager', 31);
>> insert into test (id, title, age) values('4', 'QA', 41);
>> insert into test (id, title, age) values('5', 'QA', 30);
>> insert into test (id, title, age) values('6', 'QA', 30);
>>
>>
>>
>>
>>
>> and script:
>>
>> register './libs/cqlstorageudf-1.0-SNAPSHOT.jar';
>> DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
>> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
>> dump rows;
>> ILLUSTRATE rows;
>> describe rows;
>> A = FOREACH rows GENERATE FLATTEN(title);
>> dump A;
>> values3 = FOREACH A GENERATE FromCqlColumn(title) AS title;
>> dump values3;
>> describe values3;
>>
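The LOAD location in this script passes the filter as a URL-encoded query parameter: `age%3D30` is the percent-encoded form of `age=30`. As a rough sketch of how such a location string can be assembled, the Python helper below (`build_cql_url` is a hypothetical name, not part of Cassandra or Pig) encodes the parameters with the standard library. Only the three parameters that appear in the script are used.

```python
from urllib.parse import urlencode, parse_qs

def build_cql_url(keyspace, table, **params):
    """Assemble a cql:// location, percent-encoding every parameter value."""
    query = urlencode(params)  # "age=30" becomes "age%3D30"
    return f"cql://{keyspace}/{table}?{query}"

url = build_cql_url(
    "keyspace1", "test",
    page_size=1, split_size=4, where_clause="age=30",
)

# Decoding the query string again recovers the original clause.
decoded = parse_qs(url.split("?", 1)[1])
```

Encoding the clause this way keeps the `=` inside the where clause from being read as a parameter separator.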
>>
>> --
>>
>>
>>
>> I get this error:
>>
>>
>>
>>
>> ....
>>
>> -------------------------------------------------------------
>> | rows     | id:chararray   | age:int   | title:chararray   |
>> -------------------------------------------------------------
>> |          | (id, 5)        | (age, 30) | (title, QA)       |
>> -------------------------------------------------------------
>>
>> rows: {id: chararray,age: int,title: chararray}
>>
>>
>> ...
>>
>> (title,QA)
>> (title,QA)
>> ..
>> 2013-09-02 16:40:52,454 [Thread-11] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0003
>> java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
>>     at com.megatome.pig.piggybank.tuple.ColumnBase.exec(ColumnBase.java:32)
>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>> 2013-09-02 16:40:52,832 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0003
>>
>>
>>
>> 8-|
>>
>> Regards
>>
>> ...
>>
>>
>> Miguel Angel Martín Junquera
>> Analyst Engineer.
>> miguelangel.mar...@brainsins.com
>>
>>
>>
>> 2013/9/2 Miguel Angel Martin junquera <mianmarjun.mailingl...@gmail.com>
>>
>>> hi all:
>>>
>>> More info :
>>>
>>> https://issues.apache.org/jira/browse/CASSANDRA-5941
>>>
>>>
>>>
>>> I tried this (and built Cassandra 1.2.9), but it does not work for me:
>>>
>>> git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
>>> cd cassandra
>>> git checkout cassandra-1.2
>>> patch -p1 < 5867-bug-fix-filter-push-down-1.2-branch.txt
>>> ant
>>>
>>>
>>>
>>> Miguel Angel Martín Junquera
>>> Analyst Engineer.
>>> miguelangel.mar...@brainsins.com
>>>
>>>
>>>
>>> 2013/9/2 Miguel Angel Martin junquera <mianmarjun.mailingl...@gmail.com>
>>>
>>>> Good job!
>>>>
>>>> I had been testing with a UDF that handles only the string schema type;
>>>> this is better and more elaborate work.
>>>>
>>>> Regards
>>>>
>>>>
>>>> Miguel Angel Martín Junquera
>>>> Analyst Engineer.
>>>> miguelangel.mar...@brainsins.com
>>>>
>>>>
>>>>
>>>> 2013/8/31 Chad Johnston <cjohns...@megatome.com>
>>>>
>>>>> I threw together a quick UDF to work around this issue. It just
>>>>> extracts the value portion of the tuple while taking advantage of the
>>>>> CqlStorage generated schema to keep the type correct.
>>>>>
>>>>> You can get it here: https://github.com/iamthechad/cqlstorage-udf
>>>>>
>>>>> I'll see if I can find more useful information and open a defect,
>>>>> since that's what this seems to be.
>>>>>
>>>>> Chad
>>>>>
>>>>>
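What the workaround described here does (take the value half of each pair while keeping the type that the generated schema declares) can be sketched in a few lines. This is only an illustrative model, not the UDF's source: `extract_values` is a hypothetical helper, Python types stand in for Pig types, and the row and schema are the ones from the original report quoted further down the thread.

```python
# Declared schema from DESCRIBE, with Python types standing in for
# Pig's chararray and int.
schema = {
    "isbn": str,
    "bookauthor": str,
    "booktitle": str,
    "publisher": str,
    "yearofpublication": int,
}

# One row as DUMP prints it: a tuple of (name, value) pairs.
row = (
    ("isbn", "0425093387"),
    ("bookauthor", "Georgette Heyer"),
    ("booktitle", "Death in the Stocks"),
    ("publisher", "Berkley Pub Group"),
    ("yearofpublication", 1986),
)

def extract_values(row, schema):
    """Keep only the value of each pair, checking it against the schema."""
    record = {}
    for name, value in row:
        expected = schema[name]
        if not isinstance(value, expected):
            raise TypeError(f"{name}: expected {expected.__name__}")
        record[name] = value
    return record

record = extract_values(row, schema)
```

After extraction the record has exactly the shape the schema promised: one typed value per column, with the column names no longer embedded in the data.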
>>>>> On Fri, Aug 30, 2013 at 2:02 AM, Miguel Angel Martin junquera <
>>>>> mianmarjun.mailingl...@gmail.com> wrote:
>>>>>
>>>>>> I tried this:
>>>>>>
>>>>>> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
>>>>>> dump rows;
>>>>>> ILLUSTRATE rows;
>>>>>> describe rows;
>>>>>>
>>>>>> values2 = FOREACH rows GENERATE TOTUPLE(id) AS (mycolumn:tuple(name,value));
>>>>>> dump values2;
>>>>>> describe values2;
>>>>>>
>>>>>> But I get these results:
>>>>>>
>>>>>>
>>>>>>
>>>>>> -------------------------------------------------------------
>>>>>> | rows     | id:chararray   | age:int   | title:chararray   |
>>>>>> -------------------------------------------------------------
>>>>>> |          | (id, 6)        | (age, 30) | (title, QA)       |
>>>>>> -------------------------------------------------------------
>>>>>>
>>>>>> rows: {id: chararray,age: int,title: chararray}
>>>>>> 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt
>>>>>> - ERROR 1031: Incompatable field schema: left is
>>>>>> "tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
>>>>>> "org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> or
>>>>>>
>>>>>>
>>>>>>
>>>>>> ....
>>>>>>
>>>>>> values2 = FOREACH rows GENERATE TOTUPLE(id);
>>>>>> dump values2;
>>>>>> describe values2;
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> and  the results are:
>>>>>>
>>>>>>
>>>>>> ...
>>>>>> (((id,6)))
>>>>>> (((id,5)))
>>>>>> values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
>>>>>>
>>>>>>
>>>>>>
>>>>>> Aggg!!!!!
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Miguel Angel Martín Junquera
>>>>>> Analyst Engineer.
>>>>>> miguelangel.mar...@brainsins.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2013/8/26 Miguel Angel Martin junquera <
>>>>>> mianmarjun.mailingl...@gmail.com>
>>>>>>
>>>>>>> Hi Chad,
>>>>>>>
>>>>>>> I have this issue too.
>>>>>>>
>>>>>>> I sent a mail to the pig-user list, but I still cannot resolve this
>>>>>>> and cannot access the column values.
>>>>>>> In that mail I describe some things I tried without success, along
>>>>>>> with information about this issue.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://mail-archives.apache.org/mod_mbox/pig-user/201308.mbox/%3ccajeg_hq9s2po3_xytzx5xki4j1mao8q26jydg2wndy_kyiv...@mail.gmail.com%3E
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I hope someone replies with a comment, idea, or solution for this
>>>>>>> issue or bug.
>>>>>>>
>>>>>>>
>>>>>>> I have reviewed the CqlStorage class in the Cassandra 1.2.8 code, but I
>>>>>>> have not configured an environment to debug and trace this issue.
>>>>>>>
>>>>>>> I only found some comments like this one, which I do not fully understand:
>>>>>>>
>>>>>>>
>>>>>>> /**
>>>>>>>  * A LoadStoreFunc for retrieving data from and storing data to Cassandra
>>>>>>>  *
>>>>>>>  * A row from a standard CF will be returned as nested tuples:
>>>>>>>  * (((key1, value1), (key2, value2)), ((name1, val1), (name2, val2))).
>>>>>>>  */
>>>>>>>
>>>>>>>
>>>>>>> If you find an idea or a solution, please post it.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2013/8/23 Chad Johnston <cjohns...@megatome.com>
>>>>>>>
>>>>>>>> (I'm using Cassandra 1.2.8 and Pig 0.11.1)
>>>>>>>>
>>>>>>>> I'm loading some simple data from Cassandra into Pig using
>>>>>>>> CqlStorage. The CqlStorage loader defines a Pig schema based on the
>>>>>>>> Cassandra schema, but it seems to be wrong.
>>>>>>>>
>>>>>>>> If I do:
>>>>>>>>
>>>>>>>> data = LOAD 'cql://bookdata/books' USING CqlStorage();
>>>>>>>> DESCRIBE data;
>>>>>>>>
>>>>>>>> I get this:
>>>>>>>>
>>>>>>>> data: {isbn: chararray,bookauthor: chararray,booktitle:
>>>>>>>> chararray,publisher: chararray,yearofpublication: int}
>>>>>>>>
>>>>>>>> However, if I DUMP data, I get results like these:
>>>>>>>>
>>>>>>>> ((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in
>>>>>>>> the Stocks),(publisher,Berkley Pub Group),(yearofpublication,1986))
>>>>>>>>
>>>>>>>> Clearly the results from Cassandra are key/value pairs, as would be
>>>>>>>> expected. I don't know why the schema generated by CqlStorage() would
>>>>>>>> be so different.
>>>>>>>>
>>>>>>>> This is really causing me problems trying to access the column
>>>>>>>> values. I tried a naive approach of FLATTENing each tuple, then trying
>>>>>>>> to access the values that way:
>>>>>>>>
>>>>>>>> flattened = FOREACH data GENERATE
>>>>>>>>   FLATTEN(isbn),
>>>>>>>>   FLATTEN(booktitle),
>>>>>>>>   ...
>>>>>>>> values = FOREACH flattened GENERATE
>>>>>>>>   $1 AS ISBN,
>>>>>>>>   $3 AS BookTitle,
>>>>>>>>   ...
>>>>>>>>
>>>>>>>> As soon as I try to access field $5, Pig complains about the index
>>>>>>>> being out of bounds.
>>>>>>>>
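The positional arithmetic behind this flattening attempt can be modelled with plain Python tuples. This is only a sketch under an assumption the thread does not confirm: that the declared schema still counts one field per column after FLATTEN, so an index that exists in the data can lie outside what the schema allows. Three columns from the sample row are used for brevity.

```python
# Illustrative model only: Python tuples stand in for Pig tuples.
row = (
    ("isbn", "0425093387"),
    ("booktitle", "Death in the Stocks"),
    ("publisher", "Berkley Pub Group"),
)

# Flattening every column splices both halves of each pair into the row.
flattened = [field for pair in row for field in pair]

# The values therefore sit at the odd positions ($1, $3, $5, ...).
values = flattened[1::2]

# Assumption: the declared schema still counts one field per column.
declared_width = len(row)
actual_width = len(flattened)

# Position 5 exists in the data but lies beyond the declared width.
index = 5
in_data = index < actual_width
in_schema = index < declared_width
```

Under that assumption, an index such as $5 is valid in the flattened data yet rejected against the narrower declared schema, which would be consistent with the out-of-bounds complaint.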
>>>>>>>> Is there a way to solve the schema/reality mismatch? Am I doing
>>>>>>>> something wrong, or have I stumbled across a defect?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Chad
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
