Re: About Multiple Join in Pig

Debabrata Pani Wed, 02 Nov 2016 20:04:05 -0700

The error says that Pig could not find the class datafu.pig.hash.Hasher. Start grunt with
-Dpig.additional.jars (before the other pig arguments), or "register" the
individual jars before typing in your script.
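
A minimal sketch of the "register" route, assuming the DataFu jar sits at the placeholder path below (the path and file name are hypothetical, not taken from your setup). The alias matches the one in your script, and 'murmur3-32' is the algorithm name as I recall it from the DataFu Hasher documentation. The chararray cast is also an assumption, on the basis that the UDF expects string input.

```pig
-- Hypothetical jar location; substitute the path of your own DataFu build.
REGISTER /path/to/datafu-pig.jar;

-- Bind the alias used in the script to the DataFu Hasher UDF.
DEFINE MurmurH32 datafu.pig.hash.Hasher('murmur3-32');

-- Assumption: Hasher takes a chararray, so cast the int column before hashing.
dat = FOREACH data GENERATE MurmurH32((chararray) val);
```

The command-line route would be along the lines of "pig -Dpig.additional.jars=/path/to/datafu-pig.jar", with the same placeholder path.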


Regards,
Debabrata

On Nov 3, 2016 07:09, "mingda li" <[email protected]> wrote:

> Thanks. I have installed DataFu and finished the quick start
> successfully: http://datafu.incubator.apache.org/docs/quick-start.html
>
> But when I use the murmur hash, it fails. I do not know why.
>
> grunt>  data = LOAD 'hdfs://***.UCLA.EDU:9000/clash/datasets/1.txt' using
> PigStorage() as (val:int);
>
> grunt> data_out = FOREACH data GENERATE val;
>
> grunt> dat= FOREACH data GENERATE MurmurH32(val);
>
> 2016-11-02 18:25:18,424 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1070: Could not resolve datafu.pig.hash.Hasher using imports: [,
> java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
>
> Details at logfile: /home/hadoop-user/pig-branch-0.15/bin/pig_1478136031217.log
>
>
> The log file is attached.
>
>
> Bests,
>
> Mingda
>
>
> On Wed, Nov 2, 2016 at 2:04 PM, Daniel Dai <[email protected]> wrote:
>
>> I see datafu has a patch for the UDF:
>> https://issues.apache.org/jira/browse/DATAFU-47
>>
>>
>>
>>
>> On 11/2/16, 11:45 AM, "mingda li" <[email protected]> wrote:
>>
>> >Dear all,
>> >
>> >Hi, I now want to import a UDF into Pig. Has anyone ever
>> >done so? I want to import Google's Guava murmur3_32 into Pig. Could anyone
>> >point me to some useful materials or suggestions?
>> >
>> >Bests,
>> >Mingda
>> >
>> >On Wed, Nov 2, 2016 at 2:11 AM, mingda li <[email protected]> wrote:
>> >
>> >> Yeah, I see. Thanks for your reply.
>> >>
>> >> Bests,
>> >> Mingda
>> >>
>> >> On Tue, Nov 1, 2016 at 9:20 PM, Daniel Dai <[email protected]> wrote:
>> >>
>> >>> Yes, you need to dump/store xxx_OrderRes to kick off the job. You will
>> >>> see two MapReduce jobs corresponding to the first and second join.
>> >>>
>> >>> Thanks,
>> >>> Daniel
>> >>>
>> >>>
>> >>>
>> >>> On 11/1/16, 10:52 AM, "mingda li" <[email protected]> wrote:
>> >>>
>> >>> >Dear Dai,
>> >>> >
>> >>> >Thanks for your reply.
>> >>> >What I want to do is compare two different join orders. The
>> >>> >queries are as follows:
>> >>> >
>> >>> >Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk;
>> >>> >Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number),
>> >>> >    catalog_returns BY (cr_item_sk, cr_order_number);
>> >>> >Dump or Store Bad_OrderRes;
>> >>> >
>> >>> >Good_OrderIn = JOIN catalog_returns BY (cr_item_sk, cr_order_number),
>> >>> >    catalog_sales BY (cs_item_sk, cs_order_number);
>> >>> >Good_OrderRes = JOIN Good_OrderIn BY cs_item_sk, inventory BY inv_item_sk;
>> >>> >Dump or Store Good_OrderRes;
>> >>> >
>> >>> >Since Pig executes the query lazily, I think I can only learn the
>> >>> >running time of the MapReduce jobs by dumping or storing the result. Is
>> >>> >that right? If so, I need to count the time to dump or store the result
>> >>> >as the time for each join order.
>> >>> >
>> >>> >Bests,
>> >>> >Mingda
>> >>> >
>> >>> >
>> >>> >
>> >>> >On Tue, Nov 1, 2016 at 10:39 AM, Daniel Dai <[email protected]> wrote:
>> >>> >
>> >>> >> Hi, Mingda,
>> >>> >>
>> >>> >> Pig does not do join reordering and will execute the query as
>> >>> >> written. Note that you can join multiple relations in one join statement.
>> >>> >>
>> >>> >> Do you want the execution time for each join in your statement? Assuming
>> >>> >> you are using regular joins and running on MapReduce, every join statement
>> >>> >> will be a separate MapReduce job, and the join runtime is the runtime of
>> >>> >> its MapReduce job.
>> >>> >>
>> >>> >> Thanks,
>> >>> >> Daniel
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> On 10/31/16, 8:21 PM, "mingda li" <[email protected]> wrote:
>> >>> >>
>> >>> >> >Dear all,
>> >>> >> >
>> >>> >> >I am working on optimizing multi-way joins. I am not sure whether Pig can
>> >>> >> >decide the join order in its optimization layer. Does anyone know? Or does
>> >>> >> >Pig just execute the query as written?
>> >>> >> >
>> >>> >> >Also, I want to do a multi-way join on different keys. Can the
>> >>> >> >following query work?
>> >>> >> >
>> >>> >> >Res =
>> >>> >> >JOIN
>> >>> >> >(JOIN catalog_sales BY cs_item_sk, inventory BY  inv_item_sk) BY
>> >>> >> >(cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk,
>> >>> >> >cr_order_number);
>> >>> >> >
>> >>> >> >BTW, each time I run the query, it finishes in one second. Is there a
>> >>> >> >way to see the execution time? I have set pig.udf.profile=true. Where
>> >>> >> >can I find the time?
>> >>> >> >
>> >>> >> >Bests,
>> >>> >> >Mingda
>> >>> >>
>> >>>
>> >>
>> >>
>>
>
>
