You can check if your current jar has the class by running jar -tvf /home/hadoop-user/pig-branch-0.15/lib/datafu-pig-incubating-1.3.1.jar | grep Hasher
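The check can be scripted as a quick preflight before registering the jar. A minimal sketch, assuming the jar path quoted in this thread; note a stock datafu-pig-incubating-1.3.1 jar will show no Hasher entry, since DATAFU-47 had not been merged into any release at the time:

```shell
# Preflight: does the jar actually contain the Hasher class?
# Path is the one from this thread; adjust for your installation.
JAR=/home/hadoop-user/pig-branch-0.15/lib/datafu-pig-incubating-1.3.1.jar

if jar -tf "$JAR" 2>/dev/null | grep -q 'datafu/pig/hash/Hasher'; then
    echo "Hasher found - safe to REGISTER this jar"
else
    echo "Hasher missing - rebuild DataFu with the DATAFU-47 patch applied"
fi
```

If the class is missing, apply the DATAFU-47 patch to a DataFu source checkout, rebuild, and point Pig at the rebuilt jar instead.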
Did you compile DataFu after applying the patch from https://issues.apache.org/jira/browse/DATAFU-47 ? Only then will the class be in the jar, since that patch has not been committed and is not part of any DataFu release yet.

On Thu, Nov 3, 2016 at 6:11 PM, mingda li <limingda1...@gmail.com> wrote:
> Does anyone have an idea about the problem? I still cannot solve it.
>
> On Wed, Nov 2, 2016 at 11:33 PM, mingda li <limingda1...@gmail.com> wrote:
>
> > Yes, the log file's content is as follows:
> >
> > Pig Stack Trace
> > ---------------
> > ERROR 1070: Could not resolve datafu.pig.hash.Hasher using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
> >
> > Failed to parse: Pig script failed to parse:
> > <line 3, column 27> Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve datafu.pig.hash.Hasher using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
> >         at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:199)
> >         at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1707)
> >         at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1680)
> >         at org.apache.pig.PigServer.registerQuery(PigServer.java:623)
> >         at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1082)
> >         at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:505)
> >         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
> >         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> >         at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
> >         at org.apache.pig.Main.run(Main.java:565)
> >         at org.apache.pig.Main.main(Main.java:177)
> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >         at java.lang.reflect.Method.invoke(Method.java:606)
> >         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> > Caused by:
> > <line 3, column 27> Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve datafu.pig.hash.Hasher using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
> >         at org.apache.pig.parser.LogicalPlanBuilder.buildUDF(LogicalPlanBuilder.java:1572)
> >         at org.apache.pig.parser.LogicalPlanGenerator.func_eval(LogicalPlanGenerator.java:9403)
> >         at org.apache.pig.parser.LogicalPlanGenerator.projectable_expr(LogicalPlanGenerator.java:11082)
> >         at org.apache.pig.parser.LogicalPlanGenerator.var_expr(LogicalPlanGenerator.java:10841)
> >         at org.apache.pig.parser.LogicalPlanGenerator.expr(LogicalPlanGenerator.java:10190)
> >         at org.apache.pig.parser.LogicalPlanGenerator.flatten_generated_item(LogicalPlanGenerator.java:7519)
> >         at org.apache.pig.parser.LogicalPlanGenerator.generate_clause(LogicalPlanGenerator.java:17621)
> >         at org.apache.pig.parser.LogicalPlanGenerator.foreach_plan(LogicalPlanGenerator.java:16013)
> >         at org.apache.pig.parser.LogicalPlanGenerator.foreach_clause(LogicalPlanGenerator.java:15880)
> >         at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1933)
> >         at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
> >         at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
> >         at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
> >         at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191)
> >         ... 15 more
> > Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve datafu.pig.hash.Hasher using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
> >         at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:677)
> >         at org.apache.pig.impl.PigContext.getClassForAlias(PigContext.java:793)
> >         at org.apache.pig.parser.LogicalPlanBuilder.buildUDF(LogicalPlanBuilder.java:1569)
> >         ... 28 more
> > =========================
> >
> > On Wed, Nov 2, 2016 at 11:27 PM, Debabrata Pani <android.p...@gmail.com> wrote:
> >
> >> Just to be doubly sure, can you share the error inside the log file mentioned in the output?
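For reference, the two registration routes discussed in this thread look like the following. A sketch only: the jar path is the one quoted in the thread, and per the advice below, `-D` options must precede other `pig` arguments.

```shell
# Route 1: put the DataFu jar on Pig's classpath at launch time.
# -Dpig.additional.jars must come before any other pig arguments.
DATAFU_JAR=/home/hadoop-user/pig-branch-0.15/lib/datafu-pig-incubating-1.3.1.jar
LAUNCH_CMD="pig -Dpig.additional.jars=$DATAFU_JAR"
echo "$LAUNCH_CMD"

# Route 2: REGISTER the jar from inside grunt before defining the UDF:
#   REGISTER /home/hadoop-user/pig-branch-0.15/lib/datafu-pig-incubating-1.3.1.jar;
#   define MurmurH32 datafu.pig.hash.Hasher('murmur3-32');
```

Either route only helps if the class is actually inside the jar, which is the crux of this thread.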
> >>
> >> On Nov 3, 2016 10:12, "mingda li" <limingda1...@gmail.com> wrote:
> >>
> >> > My query is as follows. I run
> >> >
> >> > pig -Dpig.additional.jars=/home/hadoop-user/pig-branch-0.15/lib/datafu-pig-incubating-1.3.1.jar
> >> >
> >> > to open Pig. Then I input:
> >> >
> >> > REGISTER /home/hadoop-user/pig-branch-0.15/lib/datafu-pig-incubating-1.3.1.jar
> >> >
> >> > data = LOAD 'hdfs://SCAI01.CS.UCLA.EDU:9000/clash/datasets/1.txt' using PigStorage() as (val:int);
> >> >
> >> > define MurmurH32 datafu.pig.hash.Hasher('murmur3-32');
> >> >
> >> > dat = FOREACH data GENERATE MurmurH32(val);
> >> >
> >> > On Wed, Nov 2, 2016 at 9:35 PM, mingda li <limingda1...@gmail.com> wrote:
> >> >
> >> > > Hmm, thanks Debabrata, but I actually do a REGISTER each time (I forgot to tell you) before I run the commands:
> >> > > REGISTER /home/hadoop-user/pig-branch-0.15/lib/datafu-pig-incubating-1.3.1.jar
> >> > > But that does not help. Any other reason?
> >> > >
> >> > > Thanks
> >> > >
> >> > > On Wed, Nov 2, 2016 at 8:03 PM, Debabrata Pani <android.p...@gmail.com> wrote:
> >> > >
> >> > >> It says that Pig could not find the class Hasher. Start grunt with -Dpig.additional.jars (before other pig arguments) or do a "register" of the individual jars before typing in your scripts.
> >> > >>
> >> > >> Regards,
> >> > >> Debabrata
> >> > >>
> >> > >> On Nov 3, 2016 07:09, "mingda li" <limingda1...@gmail.com> wrote:
> >> > >>
> >> > >> > Thanks. I have installed DataFu and finished the quickstart successfully: http://datafu.incubator.apache.org/docs/quick-start.html
> >> > >> >
> >> > >> > But when I use the murmur hash, it fails, and I do not know why.
> >> > >> >
> >> > >> > grunt> data = LOAD 'hdfs://***.UCLA.EDU:9000/clash/datasets/1.txt' using PigStorage() as (val:int);
> >> > >> > grunt> data_out = FOREACH data GENERATE val;
> >> > >> > grunt> dat = FOREACH data GENERATE MurmurH32(val);
> >> > >> >
> >> > >> > 2016-11-02 18:25:18,424 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve datafu.pig.hash.Hasher using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
> >> > >> >
> >> > >> > Details at logfile: /home/hadoop-user/pig-branch-0.15/bin/pig_1478136031217.log
> >> > >> >
> >> > >> > The log file is in the attachment.
> >> > >> >
> >> > >> > Bests,
> >> > >> > Mingda
> >> > >> >
> >> > >> > On Wed, Nov 2, 2016 at 2:04 PM, Daniel Dai <da...@hortonworks.com> wrote:
> >> > >> >
> >> > >> >> I see DataFu has a patch for the UDF: https://issues.apache.org/jira/browse/DATAFU-47
> >> > >> >>
> >> > >> >> On 11/2/16, 11:45 AM, "mingda li" <limingda1...@gmail.com> wrote:
> >> > >> >>
> >> > >> >> > Dear all,
> >> > >> >> >
> >> > >> >> > Hi, I now want to import a UDF into a Pig command. Has anyone ever done so? I want to import Google's Guava murmur3_32 into Pig. Could anyone give some useful materials or suggestions?
> >> > >> >> >
> >> > >> >> > Bests,
> >> > >> >> > Mingda
> >> > >> >> >
> >> > >> >> > On Wed, Nov 2, 2016 at 2:11 AM, mingda li <limingda1...@gmail.com> wrote:
> >> > >> >> >
> >> > >> >> >> Yeah, I see. Thanks for your reply.
> >> > >> >> >>
> >> > >> >> >> Bests,
> >> > >> >> >> Mingda
> >> > >> >> >>
> >> > >> >> >> On Tue, Nov 1, 2016 at 9:20 PM, Daniel Dai <da...@hortonworks.com> wrote:
> >> > >> >> >>
> >> > >> >> >>> Yes, you need to dump/store xxx_OrderRes to kick off the job. You will see two MapReduce jobs corresponding to the first and second join.
> >> > >> >> >>>
> >> > >> >> >>> Thanks,
> >> > >> >> >>> Daniel
> >> > >> >> >>>
> >> > >> >> >>> On 11/1/16, 10:52 AM, "mingda li" <limingda1...@gmail.com> wrote:
> >> > >> >> >>>
> >> > >> >> >>> > Dear Dai,
> >> > >> >> >>> >
> >> > >> >> >>> > Thanks for your reply. What I want to do is to compare two different join orders. The queries are as follows:
> >> > >> >> >>> >
> >> > >> >> >>> > Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk;
> >> > >> >> >>> > Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk, cr_order_number);
> >> > >> >> >>> > Dump or Store Bad_OrderRes;
> >> > >> >> >>> >
> >> > >> >> >>> > Good_OrderIn = JOIN catalog_returns BY (cr_item_sk, cr_order_number), catalog_sales BY (cs_item_sk, cs_order_number);
> >> > >> >> >>> > Good_OrderRes = JOIN Good_OrderIn BY cs_item_sk, inventory BY inv_item_sk;
> >> > >> >> >>> > Dump or Store Good_OrderRes;
> >> > >> >> >>> >
> >> > >> >> >>> > Since Pig executes queries lazily, I think only by Dumping or Storing the result can I know the time of the MapReduce job, is that right? If so, then I need to count the time to Dump or Store the result as the time for each join order.
> >> > >> >> >>> >
> >> > >> >> >>> > Bests,
> >> > >> >> >>> > Mingda
> >> > >> >> >>> >
> >> > >> >> >>> > On Tue, Nov 1, 2016 at 10:39 AM, Daniel Dai <da...@hortonworks.com> wrote:
> >> > >> >> >>> >
> >> > >> >> >>> >> Hi, Mingda,
> >> > >> >> >>> >>
> >> > >> >> >>> >> Pig does not do join reordering and will execute the query the way it is written. Note that you can join multiple relations in one join statement.
> >> > >> >> >>> >>
> >> > >> >> >>> >> Do you want the execution time for each join in your statement? Assuming you are using a regular join and running on MapReduce, every join statement will be a separate MapReduce job, and the join runtime is the runtime of its MapReduce job.
> >> > >> >> >>> >>
> >> > >> >> >>> >> Thanks,
> >> > >> >> >>> >> Daniel
> >> > >> >> >>> >>
> >> > >> >> >>> >> On 10/31/16, 8:21 PM, "mingda li" <limingda1...@gmail.com> wrote:
> >> > >> >> >>> >>
> >> > >> >> >>> >> > Dear all,
> >> > >> >> >>> >> >
> >> > >> >> >>> >> > I am doing optimization for multiple joins. I am not sure whether Pig can decide the join order in its optimization layer. Does anyone know about this? Or does Pig just execute the query the way it is written?
> >> > >> >> >>> >> >
> >> > >> >> >>> >> > Also, I want to do a multi-way join on different keys. Can the following query work?
> >> > >> >> >>> >> >
> >> > >> >> >>> >> > Res =
> >> > >> >> >>> >> > JOIN
> >> > >> >> >>> >> > (JOIN catalog_sales BY cs_item_sk, inventory BY inv_item_sk) BY (cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk, cr_order_number);
> >> > >> >> >>> >> >
> >> > >> >> >>> >> > BTW, each time I run the query, it finishes in one second. Is there a way to see the execution time? I have set pig.udf.profile=true. Where can I find the time?
> >> > >> >> >>> >> >
> >> > >> >> >>> >> > Bests,
> >> > >> >> >>> >> > Mingda
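As Daniel notes in the thread, Pig executes joins in the order written, one MapReduce job per JOIN statement, and a JOIN cannot be nested inside another expression: each step needs its own alias. A sketch of the multi-key join from the question, untested, with relation and field names as given in the thread:

```pig
-- Each JOIN gets its own alias; Pig runs one MapReduce job per statement.
sales_inv = JOIN catalog_sales BY cs_item_sk, inventory BY inv_item_sk;
res       = JOIN sales_inv BY (catalog_sales::cs_item_sk, catalog_sales::cs_order_number),
                 catalog_returns BY (cr_item_sk, cr_order_number);

-- Nothing executes until a STORE or DUMP forces the plan to run; per-job
-- wall-clock time then shows up in the job stats Pig prints at the end
-- and in the MapReduce job history.
STORE res INTO 'join_out';
```

This is also why the unexecuted queries "finish in one second": without a STORE/DUMP, only the logical plan is built.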