Following your advice, I wrote the following:

A = LOAD 'peoples.txt' USING PigStorage(';') AS (name : chararray, pets_ids : chararray);

B = foreach A GENERATE name, STRSPLIT(pets_ids, ',') AS pets_ids_separated;
DUMP B;
DESCRIBE B;

C = FOREACH B GENERATE name, FLATTEN(TOBAG(pets_ids_separated)) AS user_pet_id;
DUMP C;
DESCRIBE C;

D = LOAD 'pets.txt' USING PigStorage(';') AS (id : int, type : chararray, race: chararray);

reqd_op = JOIN C BY user_pet_id, D BY id PARALLEL 5;
DUMP reqd_op;

But I get the following error:
2011-05-10 15:30:04,036 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1107: Cannot merge join keys, incompatible types
Details at logfile: /local/tmp/test/pig_expand/pig_1305026987213.log

Any idea what is going wrong here?

Best Regards

Vincent



On Tue, May 10, 2011 at 3:04 PM, Mridul Muralidharan wrote:

>
> I am not sure I follow your query related to PARALLEL.
> The value for parallel is a static value.
>
> I was using $MY_PARALLEL as a placeholder to specify what sort of
> parallelism you need.
>
> Typically you will have a default value in the script
>
> %default MY_PARALLEL '10'
>
> And override it when required from the command line: pig -param
> MY_PARALLEL=50 ...
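>
> A minimal sketch of that pattern, assuming a hypothetical script name
> (join_pets.pig) and a GROUP statement chosen only for illustration:
>
> ```pig
> -- join_pets.pig (hypothetical file name)
> -- Default value, used when no -param is given on the command line.
> %default MY_PARALLEL '10'
>
> people = LOAD 'peoples.txt' USING PigStorage(';') AS (user:chararray, ids:chararray);
> -- The parameter below is substituted textually before the script is parsed.
> by_user = GROUP people BY user PARALLEL $MY_PARALLEL;
> DUMP by_user;
> ```
>
> Running pig -param MY_PARALLEL=50 join_pets.pig should then use 50 in
> place of the default 10.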
>
>
>
> Regards,
> Mridul
>
>
> On Tuesday 10 May 2011 04:26 PM, Vincent wrote:
>
>> Thanks Mridul for your quick answer!
>>
>> According to the documentation, PARALLEL sets the number of reduce
>> tasks. So how can I make it take a UDF instead? Is there any example
>> of such functions in the SVN/pig0.8 package?
>>
>> Best Regards
>>
>> Vincent
>>
>> On Tue, May 10, 2011 at 2:02 PM, Mridul Muralidharan wrote:
>>
>>
>>    An easy option would be to write your own UDF, which can catch corner
>>    cases, etc.
>>    But assuming your data strictly follows what you mentioned,
>>    something like this might help (illustrative only!):
>>
>>    pets = load 'pets.txt'  USING PigStorage(';') AS (pet_id:chararray,
>>    pet_type:chararray, pet_name:chararray);
>>
>>    people = load 'peoples.txt'  USING PigStorage(';') AS
>>    (user:chararray, ids:chararray);
>>    people_t = FOREACH people GENERATE user, STRSPLIT(ids, ',');
>>    -- STRSPLIT returns a tuple, not a bag, so convert to a bag and
>>    -- flatten it.
>>    people_reqd = FOREACH people_t GENERATE user, FLATTEN(TOBAG($1)) as
>>    (user_pet_id);
>>
>>
>>    reqd_op = JOIN people_reqd BY user_pet_id, pets BY pet_id PARALLEL
>>    $MY_PARALLEL;
>>
>>
>>    reqd_op should contain what you need ...
>>
>>
>>
>>    Regards,
>>    Mridul
>>
>>
>>
>>
>>
>>    On Tuesday 10 May 2011 03:00 PM, Vincent wrote:
>>
>>        Hello dear Pig users,
>>
>>        I am loading a file with the following format:
>>
>>        $ cat peoples.txt
>>        tom;1234,4567,6
>>        anna;27894
>>
>>        The first field is a name; the second field is a concatenation of
>>        an unknown number of pet IDs.
>>
>>        I would like to JOIN this file with another one:
>>
>>        $ cat pets.txt
>>        1234;dog;cocker
>>        4567;mouse;usa
>>        6;cat;persian
>>        27894;cat;manx
>>
>>        The fields are the pet's ID, the pet's type, and the pet's race.
>>        I would like to get the following result:
>>
>>        1234;dog;cocker;tom
>>        4567;mouse;usa;tom
>>        6;cat;persian;tom
>>        27894;cat;manx;anna
>>
>>        The problem is that I don't know how to convert a tuple of fields
>>        into lines, i.e. to put the file peoples.txt into the following
>>        intermediate format:
>>
>>        tom,1234
>>        tom,4567
>>        tom,6
>>        anna,27894
>>
>>        Thanks in advance for your help!
>>
>>
>>             Vincent Hervieux
>>
>>
>>
>>
>
