Re: About Multiple Join in Pig

Daniel Dai Wed, 02 Nov 2016 14:05:37 -0700

I see datafu has a patch for the UDF: 
https://issues.apache.org/jira/browse/DATAFU-47
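
A minimal sketch of how such a UDF might be called from a Pig script,
assuming a DataFu release that already includes the DATAFU-47 work. The jar
path, the alias MurmurH32, the input file, and the schema are placeholders.
The class name datafu.pig.hash.Hasher and the 'murmur3-32' argument are
assumptions that should be checked against the javadoc of the DataFu version
in use.

```pig
-- Hypothetical jar location; point this at the DataFu jar you actually have.
REGISTER /path/to/datafu.jar;

-- Assumed class and algorithm name; verify against your DataFu version.
DEFINE MurmurH32 datafu.pig.hash.Hasher('murmur3-32');

-- Placeholder input with a single chararray column.
items = LOAD 'items.txt' AS (item_id:chararray);

-- Apply the UDF like any built-in function.
hashed = FOREACH items GENERATE item_id, MurmurH32(item_id) AS item_hash;
DUMP hashed;
```

The same REGISTER/DEFINE pattern should apply to a hand-written UDF jar that
wraps Guava directly; in that case the Guava jar would need to be registered
as well.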

On 11/2/16, 11:45 AM, "mingda li" <[email protected]> wrote:

>Dear all,
>
>Hi, I now want to use a UDF in a Pig script. Has anyone done this before?
>I want to use Google Guava's murmur3_32 in Pig. Could anyone point me to
>some useful material or suggestions?
>
>Bests,
>Mingda
>
>On Wed, Nov 2, 2016 at 2:11 AM, mingda li <[email protected]> wrote:
>
>> Yeah, I see. Thanks for your reply.
>>
>> Bests,
>> Mingda
>>
>> On Tue, Nov 1, 2016 at 9:20 PM, Daniel Dai <[email protected]> wrote:
>>
>>> Yes, you need to dump/store xxx_OrderRes to kick off the job. You will
>>> see two MapReduce jobs corresponding to the first and second join.
>>>
>>> Thanks,
>>> Daniel
>>>
>>>
>>>
>>> On 11/1/16, 10:52 AM, "mingda li" <[email protected]> wrote:
>>>
>>> >Dear Dai,
>>> >
>>> >Thanks for your reply.
>>> >What I want to do is compare two different join orders. The query
>>> >is as follows:
>>> >
>>> >Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk;
>>> >Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number),
>>> >    catalog_returns BY (cr_item_sk, cr_order_number);
>>> >Dump or Store Bad_OrderRes;
>>> >
>>> >Good_OrderIn = JOIN catalog_returns BY (cr_item_sk, cr_order_number),
>>> >    catalog_sales BY (cs_item_sk, cs_order_number);
>>> >Good_OrderRes = JOIN Good_OrderIn BY cs_item_sk, inventory BY inv_item_sk;
>>> >Dump or Store Good_OrderRes;
>>> >
>>> >Since Pig executes the query lazily, I think I can only learn the
>>> >MapReduce job time by dumping or storing the result. Is that right? If
>>> >so, I need to measure the time to Dump or Store the result as the time
>>> >for each join order.
>>> >
>>> >Bests,
>>> >Mingda
>>> >
>>> >
>>> >
>>> >On Tue, Nov 1, 2016 at 10:39 AM, Daniel Dai <[email protected]> wrote:
>>> >
>>> >> Hi, Mingda,
>>> >>
>>> >> Pig does not do join reordering and will execute the query the way it
>>> >> is written. Note you can join multiple relations in one join statement.
>>> >>
>>> >> Do you want the execution time for each join in your statement?
>>> >> Assuming you are using a regular join and running on MapReduce, every
>>> >> join statement becomes a separate MapReduce job, and the join runtime
>>> >> is the runtime of its MapReduce job.
>>> >>
>>> >> Thanks,
>>> >> Daniel
>>> >>
>>> >>
>>> >>
>>> >> On 10/31/16, 8:21 PM, "mingda li" <[email protected]> wrote:
>>> >>
>>> >> >Dear all,
>>> >> >
>>> >> >I am working on optimizing multi-way joins. I am not sure whether Pig
>>> >> >can decide the join order in its optimization layer. Does anyone know?
>>> >> >Or does Pig just execute the query the way it is written?
>>> >> >
>>> >> >I also want to do a multi-way join on different keys. Can the
>>> >> >following query work?
>>> >> >
>>> >> >Res =
>>> >> >JOIN
>>> >> >(JOIN catalog_sales BY cs_item_sk, inventory BY  inv_item_sk) BY
>>> >> >(cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk,
>>> >> >cr_order_number);
>>> >> >
>>> >> >BTW, each time I run the query, it finishes in one second. Is there a
>>> >> >way to see the execution time? I have set pig.udf.profile=true. Where
>>> >> >can I find the time?
>>> >> >
>>> >> >Bests,
>>> >> >Mingda
>>> >>
>>>
>>
>>
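
Two points from the thread, that several relations can be joined in one JOIN
statement and that nothing runs until a DUMP or STORE is reached, can be
sketched as follows. The load paths, the output path, and the reduced schemas
are placeholders; only the join keys named in the thread are declared. Because
inventory has no order number, this sketch joins all three relations on the
item key and applies the order-number match as a filter, which is an
assumption about how the query could be restated, not something proposed in
the thread.

```pig
-- Placeholder paths and schemas; only the join keys from the thread appear.
catalog_sales   = LOAD 'catalog_sales'   AS (cs_item_sk:long, cs_order_number:long);
catalog_returns = LOAD 'catalog_returns' AS (cr_item_sk:long, cr_order_number:long);
inventory       = LOAD 'inventory'       AS (inv_item_sk:long);

-- One JOIN statement over three relations; each BY clause uses one key.
joined = JOIN catalog_sales BY cs_item_sk,
              catalog_returns BY cr_item_sk,
              inventory BY inv_item_sk;

-- The second key is applied afterwards, since inventory cannot take part in it.
Res = FILTER joined BY cs_order_number == cr_order_number;

-- Execution starts here; without a STORE or DUMP the script only builds a plan.
STORE Res INTO 'multiway_join_out';
```

Whether this single-statement form is faster than either two-step order
depends on the data, so it would need to be timed the same way as the other
variants.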
