Hi Yong, Hi Mridul,
I've changed everything to chararray:
A = LOAD 'peoples.txt' USING PigStorage(';') AS (name : chararray, pets_ids : chararray);
B = foreach A GENERATE name, STRSPLIT(pets_ids, ',') AS pets_ids_separated;
DUMP B;
DESCRIBE B;
C = FOREACH B GENERATE name, FLATTEN(TOBAG(pets_ids_separated)) AS (id);
DUMP C;
DESCRIBE C;
D = LOAD 'pets.txt' USING PigStorage(';') AS (id : chararray, type : chararray, race: chararray);
DUMP D;
DESCRIBE D;
reqd_op = JOIN C BY id, D BY id PARALLEL 5;
DUMP reqd_op;
But I still have the error:
2011-05-10 15:59:42,472 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1107: Cannot merge join keys, incompatible types
Details at logfile: /local/tmp/test/pig_expand/pig_1305028757759.log
DUMP C is
(tom,(1234,4567,6))
(anna,(27894))
DESCRIBE C gives: C: {name: chararray,id: (null)}
How can a JOIN operation work with a tuple of chararray on the left-hand side
and a chararray on the right-hand side?
BR
Vincent
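
A possible explanation, offered as a sketch rather than a confirmed fix: the
DUMP C output above suggests that TOBAG wraps the whole tuple returned by
STRSPLIT as a single bag element, so FLATTEN yields that tuple instead of one
row per ID. Assuming the single-argument TOKENIZE built-in is available and
treats the comma as a word separator (as the Pig built-in function
documentation describes), it returns a bag directly and could replace the
STRSPLIT/TOBAG pair. Relation and field names are taken from the script above;
the explicit chararray cast is an assumption, added to keep both join keys the
same type.

```pig
-- Sketch only: assumes TOKENIZE splits pets_ids on commas and returns a bag
-- of single-field tuples, so FLATTEN emits one row per pet ID.
A = LOAD 'peoples.txt' USING PigStorage(';') AS (name : chararray, pets_ids : chararray);
C = FOREACH A GENERATE name, FLATTEN(TOKENIZE(pets_ids)) AS (id : chararray);
DESCRIBE C;  -- check that id is now reported as chararray, not as a tuple
D = LOAD 'pets.txt' USING PigStorage(';') AS (id : chararray, type : chararray, race : chararray);
reqd_op = JOIN C BY id, D BY id PARALLEL 5;
DUMP reqd_op;
```

DESCRIBE C is the quick check here: the JOIN is only worth attempting once id
shows up as a chararray. Note that TOKENIZE also splits on spaces, so this
sketch assumes the IDs themselves contain none.
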
On Tue, May 10, 2011 at 3:43 PM, 勇胡 <[email protected]> wrote:
> You can see that the types of the join keys are different: one is chararray,
> the other is int. You have to change them to the same type.
>
> Yong
>
> 2011/5/10 Vincent <[email protected]>
>
> > Following your advice, I wrote the following:
> >
> > A = LOAD 'peoples.txt' USING PigStorage(';') AS (name : chararray, pets_ids : chararray);
> >
> > B = foreach A GENERATE name, STRSPLIT(pets_ids, ',') AS pets_ids_separated;
> > DUMP B;
> > DESCRIBE B;
> >
> > C = FOREACH B GENERATE name, FLATTEN(TOBAG(pets_ids_separated)) AS user_pet_id;
> > DUMP C;
> > DESCRIBE C;
> >
> > D = LOAD 'pets.txt' USING PigStorage(';') AS (id : int, type : chararray, race: chararray);
> >
> >
> > reqd_op = JOIN C BY user_pet_id, D BY id PARALLEL 5;
> > DUMP reqd_op;
> >
> > But I have the following error:
> > 2011-05-10 15:30:04,036 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> > ERROR 1107: Cannot merge join keys, incompatible types
> > Details at logfile: /local/tmp/test/pig_expand/pig_1305026987213.log
> >
> > Any idea what is going wrong here?
> >
> > Best Regards
> >
> > Vincent
> >
> >
> >
> > On Tue, May 10, 2011 at 3:04 PM, Mridul Muralidharan
> > <[email protected]> wrote:
> >
> > >
> > > I am not sure I follow your query related to PARALLEL.
> > > The value for parallel is a static value.
> > >
> > > I was using $MY_PARALLEL as a placeholder to specify what sort of
> > > parallelism you need.
> > >
> > > Typically you will have a default value in the script
> > >
> > > %default MY_PARALLEL '10'
> > >
> > > And override it, when required, using command line pig -param
> > > MY_PARALLEL=50 ...
> > >
> > >
> > >
> > > Regards,
> > > Mridul
> > >
> > >
> > > On Tuesday 10 May 2011 04:26 PM, Vincent wrote:
> > >
> > >> Thanks Mridul for your quick answer!
> > >>
> > >> According to the documentation, PARALLEL sets the number of reduce
> > >> tasks. So how can I make it take a UDF instead? Is there any example
> > >> of such functions in the SVN/pig0.8 package?
> > >>
> > >> Best Regards
> > >>
> > >> Vincent
> > >>
> > >> On Tue, May 10, 2011 at 2:02 PM, Mridul Muralidharan
> > >> <[email protected] <mailto:[email protected]>> wrote:
> > >>
> > >>
> > >> The easy option would be to write your own UDF, which can handle corner
> > >> cases, etc.
> > >> But assuming your data strictly follows the format you described,
> > >> something like this might help (illustrative only!):
> > >>
> > >> pets = load 'pets.txt' USING PigStorage(';') AS (pet_id:chararray,
> > >> pet_type:chararray, pet_name:chararray);
> > >>
> > >> people = load 'peoples.txt' USING PigStorage(';') AS
> > >> (user:chararray, ids:chararray);
> > >> people_t = FOREACH people GENERATE user, STRSPLIT(ids, ',');
> > >> -- STRSPLIT returns a tuple, not a bag, so convert to bag and flatten it.
> > >> people_reqd = FOREACH people_t GENERATE user, FLATTEN(TOBAG($1)) as
> > >> (user_pet_id);
> > >>
> > >>
> > >> reqd_op = JOIN people_reqd BY user_pet_id, pets BY pet_id PARALLEL
> > >> $MY_PARALLEL;
> > >>
> > >>
> > >> reqd_op should contain what you need ...
> > >>
> > >>
> > >>
> > >> Regards,
> > >> Mridul
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Tuesday 10 May 2011 03:00 PM, Vincent wrote:
> > >>
> > >> Hello dear Pig users,
> > >>
> > >> I am loading a file with the following format:
> > >>
> > >> $ cat peoples.txt
> > >> tom;1234,4567,6
> > >> anna;27894
> > >>
> > >> The first field is a name; the second field is a comma-separated list
> > >> of an unknown number of pet IDs.
> > >>
> > >> I would like to JOIN this file with another one:
> > >>
> > >> $ cat pets.txt
> > >> 1234;dog;cocker
> > >> 4567;mouse;usa
> > >> 6;cat;persian
> > >> 27894;cat;manx
> > >>
> > >> The fields are the pet's ID, type, and race.
> > >>
> > >> The goal is to get the following result:
> > >>
> > >> 1234;dog;cocker;tom
> > >> 4567;mouse;usa;tom
> > >> 6;cat;persian;tom
> > >> 27894;cat;manx;anna
> > >>
> > >> The problem is that I don't know how to convert a tuple of fields
> > >> into lines, i.e. how to put the file peoples.txt into the following
> > >> intermediate format:
> > >>
> > >> tom,1234
> > >> tom,4567
> > >> tom,6
> > >> anna,27894
> > >>
> > >> Thanks in advance for your help!
> > >>
> > >>
> > >> Vincent Hervieux
> > >>
> > >>
> > >>
> > >>
> > >
> >
>