Re: About Multiple Join in Pig

mingda li Wed, 02 Nov 2016 11:46:27 -0700

Dear all,

I now want to import a UDF into Pig. Has anyone done this before?
Specifically, I want to use Google Guava's murmur3_32 hash function from
Pig. Could anyone point me to some useful materials or suggestions?


Bests,
Mingda

On Wed, Nov 2, 2016 at 2:11 AM, mingda li <[email protected]> wrote:

> Yeah, I see. Thanks for your reply.
>
> Bests,
> Mingda
>
> On Tue, Nov 1, 2016 at 9:20 PM, Daniel Dai <[email protected]> wrote:
>
>> Yes, you need to dump/store xxx_OrderRes to kick off the job. You will
>> see two MapReduce jobs corresponding to the first and second join.
>>
>> Thanks,
>> Daniel
>>
>>
>>
>> On 11/1/16, 10:52 AM, "mingda li" <[email protected]> wrote:
>>
>> >Dear Dai,
>> >
>> >Thanks for your reply.
>> >What I want to do is compare two different join orders. The queries
>> >are as follows:
>> >
>> >Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk;
>> >Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number),
>> >    catalog_returns BY (cr_item_sk, cr_order_number);
>> >Dump or Store Bad_OrderRes;
>> >
>> >Good_OrderIn = JOIN catalog_returns BY (cr_item_sk, cr_order_number),
>> >    catalog_sales BY (cs_item_sk, cs_order_number);
>> >Good_OrderRes = JOIN Good_OrderIn BY cs_item_sk, inventory BY inv_item_sk;
>> >Dump or Store Good_OrderRes;
>> >
>> >Since Pig executes queries lazily, I think I can only find out the
>> >MapReduce job time by dumping or storing the result. Is that right? If
>> >so, I need to count the time to dump or store the result as part of
>> >the time for each join order.
>> >
>> >Bests,
>> >Mingda
>> >
>> >
>> >
>> >On Tue, Nov 1, 2016 at 10:39 AM, Daniel Dai <[email protected]> wrote:
>> >
>> >> Hi, Mingda,
>> >>
>> >> Pig does not do join reordering and will execute the query the way it
>> >> is written. Note that you can join multiple relations in one join
>> >> statement.
>> >>
>> >> Do you want the execution time for each join in your statement?
>> >> Assuming you are using a regular join and running with MapReduce, every
>> >> join statement will be a separate MapReduce job, and the join runtime
>> >> is the runtime of its MapReduce job.
>> >>
>> >> Thanks,
>> >> Daniel
>> >>
>> >>
>> >>
>> >> On 10/31/16, 8:21 PM, "mingda li" <[email protected]> wrote:
>> >>
>> >> >Dear all,
>> >> >
>> >> >I am working on optimization for multiple joins. I am not sure whether
>> >> >Pig can decide the join order in its optimization layer. Does anyone
>> >> >know? Or does Pig just execute the query the way it is written?
>> >> >
>> >> >Also, I want to do a multi-way join on different keys. Can the
>> >> >following query work?
>> >> >
>> >> >Res =
>> >> >JOIN
>> >> >(JOIN catalog_sales BY cs_item_sk, inventory BY  inv_item_sk) BY
>> >> >(cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk,
>> >> >cr_order_number);
>> >> >
>> >> >BTW, each time I run the query, it finishes in one second. Is there a
>> >> >way to see the execution time? I have set pig.udf.profile=true. Where
>> >> >can I find the time?
>> >> >
>> >> >Bests,
>> >> >Mingda
>> >>
>>
>
>
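
A minimal sketch of why the two join orders quoted above can differ in cost, written in Python rather than Pig Latin so that it runs standalone. It is not how Pig executes joins internally. The rows and the inv_row column are hypothetical toy data; only the relation names and key columns come from the thread. Since Pig runs the joins in the order written, the output of the first join is the intermediate relation that the second MapReduce job has to read, so its size is a rough proxy for the extra work.

```python
from collections import defaultdict


def hash_join(left, right, left_keys, right_keys):
    """Inner equi-join of two lists of dicts on the given key columns."""
    # Build an in-memory index over the right-hand relation.
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in right_keys)].append(row)
    # Probe the index with each left-hand row and merge matching rows.
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in left_keys), []):
            merged = dict(row)
            merged.update(match)
            joined.append(merged)
    return joined


# Hypothetical toy rows: 8 inventory rows, 3 sales rows, 1 return row.
inventory = [{"inv_item_sk": item, "inv_row": n}
             for item in (1, 2) for n in range(4)]
catalog_sales = [
    {"cs_item_sk": 1, "cs_order_number": 10},
    {"cs_item_sk": 1, "cs_order_number": 11},
    {"cs_item_sk": 2, "cs_order_number": 12},
]
catalog_returns = [{"cr_item_sk": 1, "cr_order_number": 10}]

sales_keys = ["cs_item_sk", "cs_order_number"]
returns_keys = ["cr_item_sk", "cr_order_number"]

# "Bad" order: inventory joined with catalog_sales first.
bad_in = hash_join(inventory, catalog_sales, ["inv_item_sk"], ["cs_item_sk"])
bad_res = hash_join(bad_in, catalog_returns, sales_keys, returns_keys)

# "Good" order: catalog_returns joined with catalog_sales first.
good_in = hash_join(catalog_returns, catalog_sales, returns_keys, sales_keys)
good_res = hash_join(good_in, inventory, ["cs_item_sk"], ["inv_item_sk"])

print("intermediate rows:", len(bad_in), "vs", len(good_in))
print("final rows:", len(bad_res), "vs", len(good_res))
```

With this toy data the inventory-first order builds 12 intermediate rows and the returns-first order builds 1, while both orders produce the same 4 final rows. Actual runtimes still have to be measured on the real MapReduce jobs, as described in the thread.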
