Thanks. I installed DataFu and finished the quick start successfully: http://datafu.incubator.apache.org/docs/quick-start.html
But when I use the murmur hash, it fails, and I do not know why.

grunt> data = LOAD 'hdfs://***.UCLA.EDU:9000/clash/datasets/1.txt' using PigStorage() as (val:int);
grunt> data_out = FOREACH data GENERATE val;
grunt> dat= FOREACH data GENERATE MurmurH32(val);
2016-11-02 18:25:18,424 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve datafu.pig.hash.Hasher using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /home/hadoop-user/pig-branch-0.15/bin/pig_1478136031217.log

The log file is attached.

Bests,
Mingda

On Wed, Nov 2, 2016 at 2:04 PM, Daniel Dai <da...@hortonworks.com> wrote:

> I see datafu has a patch for the UDF:
> https://issues.apache.org/jira/browse/DATAFU-47
>
> On 11/2/16, 11:45 AM, "mingda li" <limingda1...@gmail.com> wrote:
>
> >Dear all,
> >
> >Hi, now I want to import a UDF function into Pig. Has anyone ever
> >done so? I want to import Google's guava/murmur3_32 into Pig. Could anyone
> >suggest some useful materials?
> >
> >Bests,
> >Mingda
> >
> >On Wed, Nov 2, 2016 at 2:11 AM, mingda li <limingda1...@gmail.com> wrote:
> >
> >> Yeah, I see. Thanks for your reply.
> >>
> >> Bests,
> >> Mingda
> >>
> >> On Tue, Nov 1, 2016 at 9:20 PM, Daniel Dai <da...@hortonworks.com> wrote:
> >>
> >>> Yes, you need to dump/store xxx_OrderRes to kick off the job. You will
> >>> see two MapReduce jobs corresponding to the first and second join.
> >>>
> >>> Thanks,
> >>> Daniel
> >>>
> >>> On 11/1/16, 10:52 AM, "mingda li" <limingda1...@gmail.com> wrote:
> >>>
> >>> >Dear Dai,
> >>> >
> >>> >Thanks for your reply.
> >>> >What I want to do is to compare the two different orders of join.
> >>> >The query is as follows:
> >>> >
> >>> >*Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk;*
> >>> >*Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number),
> >>> >catalog_returns BY (cr_item_sk, cr_order_number);*
> >>> >*Dump or Store Bad_OrderRes;*
> >>> >
> >>> >*Good_OrderIn = JOIN catalog_returns BY (cr_item_sk, cr_order_number),
> >>> >catalog_sales BY (cs_item_sk, cs_order_number);*
> >>> >*Good_OrderRes = JOIN Good_OrderIn BY cs_item_sk, inventory BY inv_item_sk;*
> >>> >*Dump or Store Good_OrderRes;*
> >>> >
> >>> >Since Pig executes the query lazily, I think I can only learn the time of
> >>> >the MapReduce job by Dumping or Storing the result, is that right? If so,
> >>> >then I need to count the time to Dump or Store the result as the time for
> >>> >the different join orders.
> >>> >
> >>> >Bests,
> >>> >Mingda
> >>> >
> >>> >On Tue, Nov 1, 2016 at 10:39 AM, Daniel Dai <da...@hortonworks.com> wrote:
> >>> >
> >>> >> Hi, Mingda,
> >>> >>
> >>> >> Pig does not do join reordering and will execute the query the way it
> >>> >> is written. Note you can join multiple relations in one join statement.
> >>> >>
> >>> >> Do you want the execution time for each join in your statement? Assuming
> >>> >> you are using regular join and running with MapReduce, every join
> >>> >> statement will be a separate MapReduce job, and the join runtime is the
> >>> >> runtime of its MapReduce job.
> >>> >>
> >>> >> Thanks,
> >>> >> Daniel
> >>> >>
> >>> >> On 10/31/16, 8:21 PM, "mingda li" <limingda1...@gmail.com> wrote:
> >>> >>
> >>> >> >Dear all,
> >>> >> >
> >>> >> >I am doing optimization for multiple joins. I am not sure if Pig can
> >>> >> >decide the join order in the optimization layer. Does anyone know about
> >>> >> >this? Or does Pig just execute the query the way it is written?
> >>> >> >
> >>> >> >Also, I want to do a multi-way join on different keys. Can the
> >>> >> >following query work?
> >>> >> >
> >>> >> >Res =
> >>> >> >JOIN
> >>> >> >(JOIN catalog_sales BY cs_item_sk, inventory BY inv_item_sk) BY
> >>> >> >(cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk,
> >>> >> >cr_order_number);
> >>> >> >
> >>> >> >BTW, each time I run the query, it finishes in one second. Is there a
> >>> >> >way to see the execution time? I have set pig.udf.profile=true. Where
> >>> >> >can I find the time?
> >>> >> >
> >>> >> >Bests,
> >>> >> >Mingda