Re: About Multiple Join in Pig

Debabrata Pani Wed, 02 Nov 2016 23:27:32 -0700

Just to be doubly sure, can you share the error from the log file
mentioned in the output?
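
In the meantime: ERROR 1070 in the quoted output means Pig could not see the
class datafu.pig.hash.Hasher, so it may be worth checking that the jar you
point at exists and really contains that class. Below is a minimal Python
sketch of such a check. It is not part of Pig or DataFu; the helper name
jar_has_class and the script name check_jar.py are made up for illustration,
and the entry path is simply the class name from the error with dots turned
into slashes.

```python
import sys
import zipfile

# Class named in ERROR 1070, written as the entry path it would have in a jar.
HASHER_ENTRY = "datafu/pig/hash/Hasher.class"


def jar_has_class(jar_path, entry=HASHER_ENTRY):
    """Return True if the jar (a zip archive) contains the given class entry."""
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


if __name__ == "__main__":
    # Hypothetical usage: python check_jar.py /path/to/first.jar /path/to/second.jar
    for path in sys.argv[1:]:
        try:
            status = "found" if jar_has_class(path) else "missing"
        except (OSError, zipfile.BadZipFile) as exc:
            # Covers a wrong path, a directory, or a file that is not a jar.
            status = "unreadable ({})".format(exc)
        print("{}: {} {}".format(path, HASHER_ENTRY, status))
```

It might help to run this against both the path passed to
-Dpig.additional.jars and the one passed to REGISTER. As quoted below they
differ (pig-branch-0.lib/ versus pig-branch-0.15/lib/), which may just be a
typo in the mail.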


On Nov 3, 2016 10:12, "mingda li" <[email protected]> wrote:

> My steps are as follows. I start Pig with:
>
> pig -Dpig.additional.jars=/home/hadoop-user/pig-branch-0.lib/datafu-pig-incubating-1.3.1.jar
>
> Then I enter:
>
>
> REGISTER /home/hadoop-user/pig-branch-0.15/lib/datafu-pig-incubating-1.3.1.jar
>
> data = LOAD 'hdfs://SCAI01.CS.UCLA.EDU:9000/clash/datasets/1.txt' using
> PigStorage() as (val:int);
>
> define MurmurH32   datafu.pig.hash.Hasher('murmur3-32');
>
> dat= FOREACH data GENERATE MurmurH32(val);
>
> On Wed, Nov 2, 2016 at 9:35 PM, mingda li <[email protected]> wrote:
>
> > Thanks, Debabrata, but I actually do register the jar each time before I
> > run the commands (I forgot to tell you). I use:
> > REGISTER /home/hadoop-user/pig-branch-0.15/lib/datafu-pig-incubating-1.3.1.jar
> > But that does not help.
> >
> > Could there be any other reason?
> >
> > Thanks
> >
> > On Wed, Nov 2, 2016 at 8:03 PM, Debabrata Pani <[email protected]>
> > wrote:
> >
> >> It says that pig could not find the class Hasher. Start grunt with
> >> -Dpig.additional.jars (before other pig arguments) or do a "register" of
> >> individual jars before typing in your scripts.
> >>
> >> Regards,
> >> Debabrata
> >>
> >> On Nov 3, 2016 07:09, "mingda li" <[email protected]> wrote:
> >>
> >> > Thanks. I installed DataFu and finished the quick start successfully:
> >> > http://datafu.incubator.apache.org/docs/quick-start.html
> >> >
> >> > But when I use the murmur hash, it fails, and I do not know why.
> >> >
> >> > grunt> data = LOAD 'hdfs://***.UCLA.EDU:9000/clash/datasets/1.txt' using PigStorage() as (val:int);
> >> >
> >> > grunt> data_out = FOREACH data GENERATE val;
> >> >
> >> > grunt> dat= FOREACH data GENERATE MurmurH32(val);
> >> >
> >> > 2016-11-02 18:25:18,424 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> >> > ERROR 1070: Could not resolve datafu.pig.hash.Hasher using imports: [,
> >> > java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
> >> >
> >> > Details at logfile: /home/hadoop-user/pig-branch-0.15/bin/pig_1478136031217.log
> >> >
> >> >
> >> > The log file is attached.
> >> >
> >> >
> >> > Bests,
> >> >
> >> > Mingda
> >> >
> >> >
> >> > On Wed, Nov 2, 2016 at 2:04 PM, Daniel Dai <[email protected]> wrote:
> >> >
> >> >> I see DataFu has a patch for the UDF:
> >> >> https://issues.apache.org/jira/browse/DATAFU-47
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On 11/2/16, 11:45 AM, "mingda li" <[email protected]> wrote:
> >> >>
> >> >> >Dear all,
> >> >> >
> >> >> >I now want to import a UDF into Pig. Has anyone done this before? I
> >> >> >want to use Google Guava's murmur3_32 in Pig. Could anyone point me
> >> >> >to useful materials or suggestions?
> >> >> >
> >> >> >Bests,
> >> >> >Mingda
> >> >> >
> >> >> >On Wed, Nov 2, 2016 at 2:11 AM, mingda li <[email protected]> wrote:
> >> >> >
> >> >> >> Yeah, I see. Thanks for your reply.
> >> >> >>
> >> >> >> Bests,
> >> >> >> Mingda
> >> >> >>
> >> >> >> On Tue, Nov 1, 2016 at 9:20 PM, Daniel Dai <[email protected]> wrote:
> >> >> >>
> >> >> >>> Yes, you need to dump/store xxx_OrderRes to kick off the job. You will
> >> >> >>> see two MapReduce jobs corresponding to the first and second join.
> >> >> >>>
> >> >> >>> Thanks,
> >> >> >>> Daniel
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> On 11/1/16, 10:52 AM, "mingda li" <[email protected]> wrote:
> >> >> >>>
> >> >> >>> >Dear Dai,
> >> >> >>> >
> >> >> >>> >Thanks for your reply.
> >> >> >>> >What I want to do is compare two different join orders. The
> >> >> >>> >queries are as follows:
> >> >> >>> >
> >> >> >>> >Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk;
> >> >> >>> >Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number),
> >> >> >>> >    catalog_returns BY (cr_item_sk, cr_order_number);
> >> >> >>> >Dump or Store Bad_OrderRes;
> >> >> >>> >
> >> >> >>> >Good_OrderIn = JOIN catalog_returns BY (cr_item_sk, cr_order_number),
> >> >> >>> >    catalog_sales BY (cs_item_sk, cs_order_number);
> >> >> >>> >Good_OrderRes = JOIN Good_OrderIn BY cs_item_sk, inventory BY inv_item_sk;
> >> >> >>> >Dump or Store Good_OrderRes;
> >> >> >>> >
> >> >> >>> >Since Pig executes the query lazily, I think I can only learn the
> >> >> >>> >MapReduce job time by dumping or storing the result. Is that right?
> >> >> >>> >If so, I need to count the time to dump or store the result as the
> >> >> >>> >time for each join order.
> >> >> >>> >
> >> >> >>> >Bests,
> >> >> >>> >Mingda
> >> >> >>> >
> >> >> >>> >
> >> >> >>> >
> >> >> >>> >On Tue, Nov 1, 2016 at 10:39 AM, Daniel Dai <[email protected]> wrote:
> >> >> >>> >
> >> >> >>> >> Hi, Mingda,
> >> >> >>> >>
> >> >> >>> >> Pig does not do join reordering and will execute the query the way
> >> >> >>> >> it is written. Note that you can join multiple relations in one join
> >> >> >>> >> statement.
> >> >> >>> >>
> >> >> >>> >> Do you want the execution time for each join in your statement?
> >> >> >>> >> Assuming you are using a regular join and running with MapReduce,
> >> >> >>> >> every join statement will be a separate MapReduce job, and the join
> >> >> >>> >> runtime is the runtime of its MapReduce job.
> >> >> >>> >>
> >> >> >>> >> Thanks,
> >> >> >>> >> Daniel
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >> On 10/31/16, 8:21 PM, "mingda li" <[email protected]> wrote:
> >> >> >>> >>
> >> >> >>> >> >Dear all,
> >> >> >>> >> >
> >> >> >>> >> >I am working on optimizing multi-way joins. I am not sure whether
> >> >> >>> >> >Pig can decide the join order in its optimization layer. Does anyone
> >> >> >>> >> >know? Or does Pig just execute the query the way it is written?
> >> >> >>> >> >
> >> >> >>> >> >Also, I want to do a multi-way join on different keys. Can the
> >> >> >>> >> >following query work?
> >> >> >>> >> >
> >> >> >>> >> >Res =
> >> >> >>> >> >JOIN
> >> >> >>> >> >(JOIN catalog_sales BY cs_item_sk, inventory BY inv_item_sk) BY
> >> >> >>> >> >(cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk,
> >> >> >>> >> >cr_order_number);
> >> >> >>> >> >
> >> >> >>> >> >BTW, each time I run the query, it finishes in one second. Is there
> >> >> >>> >> >a way to see the execution time? I have set pig.udf.profile=true.
> >> >> >>> >> >Where can I find the time?
> >> >> >>> >> >
> >> >> >>> >> >Bests,
> >> >> >>> >> >Mingda
> >> >> >>> >>
> >> >> >>>
> >> >> >>
> >> >> >>
> >> >>
> >> >
> >> >
> >>
> >
> >
>
