I see datafu has a patch for the UDF: https://issues.apache.org/jira/browse/DATAFU-47
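
In case it helps, here is a rough sketch of how such a hash UDF might be called from a Pig script. It assumes the patch is available in your DataFu build as a class named datafu.pig.hash.Hasher that takes an algorithm name such as 'murmur3-32'; please check the DataFu documentation for your version. The jar path, the input path, and the alias Murmur3_32 are placeholders, not taken from the thread.

```pig
-- Assumption: placeholder path to a DataFu jar that includes the hash UDF.
REGISTER /path/to/datafu.jar;

-- Assumption: class name and algorithm string as I understand the DataFu docs;
-- verify them against the version you actually deploy.
DEFINE Murmur3_32 datafu.pig.hash.Hasher('murmur3-32');

-- Hypothetical input: the key is loaded as chararray because the UDF hashes strings.
sales = LOAD 'catalog_sales' AS (cs_item_sk:chararray, cs_order_number:chararray);

-- Apply the hash to one column and keep the original key next to it.
hashed = FOREACH sales GENERATE cs_item_sk, Murmur3_32(cs_item_sk) AS item_hash;

-- Nothing runs until a DUMP or STORE is issued.
DUMP hashed;
```

If the class is not present in your DataFu version, the alternative is a small Java EvalFunc that wraps Guava's murmur3_32 and is registered the same way.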
On 11/2/16, 11:45 AM, "mingda li" <limingda1...@gmail.com> wrote:

>Dear all,
>
>Hi, now I wants to import a UDF function to pig command. Has anyone ever
>done so? I want to import google's guava/murmur3_32 to pig. Could anyone
>give some useful materials or suggestion?
>
>Bests,
>Mingda
>
>On Wed, Nov 2, 2016 at 2:11 AM, mingda li <limingda1...@gmail.com> wrote:
>
>> Yeah, I see. Thanks for your reply.
>>
>> Bests,
>> Mingda
>>
>> On Tue, Nov 1, 2016 at 9:20 PM, Daniel Dai <da...@hortonworks.com> wrote:
>>
>>> Yes, you need to dump/store xxx_OrderRes to kick off the job. You will
>>> see two MapReduce jobs corresponding to the first and second join.
>>>
>>> Thanks,
>>> Daniel
>>>
>>> On 11/1/16, 10:52 AM, "mingda li" <limingda1...@gmail.com> wrote:
>>>
>>> >Dear Dai,
>>> >
>>> >Thanks for your reply.
>>> >What I want to do is to compare the two different order of join. The
>>> >query is as following:
>>> >
>>> >Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk;
>>> >Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number),
>>> >    catalog_returns BY (cr_item_sk, cr_order_number);
>>> >Dump or Store Bad_OrderRes;
>>> >
>>> >Good_OrderIn = JOIN catalog_returns BY (cr_item_sk, cr_order_number),
>>> >    catalog_sales BY (cs_item_sk, cs_order_number);
>>> >Good_OrderRes = JOIN Good_OrderIn BY cs_item_sk, inventory BY inv_item_sk;
>>> >Dump or Store Good_OrderRes;
>>> >
>>> >Since Pig execute the query lazily, I think only by Dump or Store the
>>> >result, I can know the time of MapReduce Job, is it right? If it is, then I
>>> >need to count the time to Dump or Store the result as the time for the
>>> >different orders' join.
>>> >
>>> >Bests,
>>> >Mingda
>>> >
>>> >On Tue, Nov 1, 2016 at 10:39 AM, Daniel Dai <da...@hortonworks.com> wrote:
>>> >
>>> >> Hi, Mingda,
>>> >>
>>> >> Pig does not do join reordering and will execute the query as the way it
>>> >> is written. Note you can join multiple relations in one join statement.
>>> >>
>>> >> Do you want execution time for each join in your statement? I assume you
>>> >> are using regular join and running with MapReduce, every join statement
>>> >> will be a separate MapReduce job and the join runtime is the runtime for
>>> >> its MapReduce job.
>>> >>
>>> >> Thanks,
>>> >> Daniel
>>> >>
>>> >> On 10/31/16, 8:21 PM, "mingda li" <limingda1...@gmail.com> wrote:
>>> >>
>>> >> >Dear all,
>>> >> >
>>> >> >I am doing optimization for multiple join. I am not sure if Pig can decide
>>> >> >the join order in optimization layer. Does anyone know about this? Or Pig
>>> >> >just execute the query as the way it is written.
>>> >> >
>>> >> >And, I want to do the multiple way Join on different keys. Can the
>>> >> >following query work?
>>> >> >
>>> >> >Res =
>>> >> >JOIN
>>> >> >(JOIN catalog_sales BY cs_item_sk, inventory BY inv_item_sk) BY
>>> >> >(cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk,
>>> >> >cr_order_number);
>>> >> >
>>> >> >BTW, each time, I run the query, it is finished in one second. Is there a
>>> >> >way to see the execution time? I have set the pig.udf.profile=true. Where
>>> >> >can I find the time?
>>> >> >
>>> >> >Bests,
>>> >> >Mingda