Re: HCatalog scans all partition even after mentioning date filter

Thejas Nair Wed, 25 Apr 2012 12:53:22 -0700

cc'ing dev@pig as this is a pig issue.

Aniket, what you saw is not related to PIG-2339.


In your example query, the logical plan will look like this -

          Load (A)
             |
           Split
             |
      ---------------
      |             |
  Filter(B1)   Filter(B2) ...

Because of the split operator introduced between the filter conditions and the load, the filters do not get pushed into the load function.

A simple way to fix this in Pig would be to not share the load across the filter operators. Another option is to push the condition (B1 or B2 or B3) into the Load operator and retain the rest of the current plan (the split and the filters following the split).
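
At the script level, the second option can be approximated by hand: filter on the disjunction of the conditions right after the load, and keep the individual filters and the union as they are. This is only a sketch. The table name 'mytable' is hypothetical, the partition column d and its values are borrowed from the original report as stand-ins for cond1..cond3, and the loader class name assumes HCatalog 0.4. Whether the combined filter actually reaches the loader depends on the Pig version and on the conditions using only partition columns.

```pig
-- 'mytable' is a hypothetical HCatalog table partitioned by d.
A  = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();

-- Disjunction of all branch conditions, placed directly after the load
-- so that it is a candidate for partition-filter pushdown.
A1 = FILTER A BY (d == '20111224') OR (d == '20111225') OR (d == '20120415');

-- The original per-branch filters now read from the pre-filtered relation.
B1 = FILTER A1 BY d == '20111224';
B2 = FILTER A1 BY d == '20111225';
B3 = FILTER A1 BY d == '20120415';

B  = UNION B1, B2, B3;
```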

You can of course achieve the same effect by having a separate load statement as input for each of the filters.
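
A minimal sketch of that workaround, under the same assumptions as the report: a hypothetical HCatalog table 'mytable' partitioned by d, example partition values in place of cond1..cond3, and the HCatalog 0.4 loader class name. Each filter has its own load, so no split operator sits between a filter and its loader.

```pig
-- One load per filter; 'mytable' is a hypothetical table name.
A1 = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();
B1 = FILTER A1 BY d == '20111224';

A2 = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();
B2 = FILTER A2 BY d == '20111225';

A3 = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();
B3 = FILTER A3 BY d == '20120415';

B  = UNION B1, B2, B3;
```

The cost is a more verbose script; the benefit is that each filter directly follows its own load and can be considered for pushdown on its own.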

I agree that we should make it possible to ask Pig to throw a warning/error if the query is going to result in a full table scan on a partitioned table.


Thanks,
Thejas




On 4/24/12 7:56 PM, Aniket Mokashi wrote:

Sorry Thejas, I didn't look into the jira properly earlier.
EMR pig-0.9.1 already has that patch for PIG-2339 and hence I did not
hit that issue earlier (and I patched datanucleus). filter-union was a
workaround I was using to avoid some of the thrift timeout problems
earlier. The Thrift API times out on the client side after 20 sec by
default (I found the config to change this later), and hence I used
a = load 'table'; b1 = filter a by cond1; b2 = filter a by cond2; ...;
b = union b1, b2, ...; expecting these filters to be pushed separately
to the loader. But that doesn't work in pig. (I can open a jira, but I
haven't done enough investigation at the code level.) Thoughts?

Thanks,
Aniket

On Tue, Apr 24, 2012 at 7:00 PM, Thejas Nair wrote:

    The issue was not specific to filter-union:
    https://issues.apache.org/jira/browse/PIG-2339
    The fix was to run PushUpFilter before PartitionFilterOptimizer.

    As this is not an hcat issue, it should not matter if you have an
    older hcat version. FYI, this bug was not there in pig 0.8.x.
    Was it pig 0.9.0 or 0.9.1 that you used?

    Thanks,
    Thejas



    On 4/24/12 5:21 PM, Aniket Mokashi wrote:

        Hi Thejas,

        Can you point me to the jira that fixes the filter-union problem
        (in pig)? I haven't tried hcat-0.4 yet, good to know about that
        issue. I will keep a watcher.

        Thanks,
        Aniket

        On Tue, Apr 24, 2012 at 4:51 PM, Thejas Nair wrote:

            Hi Aniket,
            Are you using pig 0.9 or 0.9.1?
            If yes, can you try with pig 0.9.2?
            Wondering if you are also hitting the issue that Thomas
            mentioned.

            Thanks,
            Thejas




            On 4/23/12 7:39 PM, Aniket Mokashi wrote:

                Something similar I have noticed is -

                A = load ...
                B1 = filter A by cond1;
                B2 = filter A by cond2;
                B3 = filter A by cond3;

                B = union B1, B2, B3; does not push projection.

                Is that expected?

                Ideally, we should have a "strict" mode under hcatalog
                that, when turned on, will avoid executing pig queries
                on the full (partitioned) table.

                Thanks,
                Aniket

                On Mon, Apr 23, 2012 at 7:32 PM, Rajesh Balamohan wrote:

                    Hi Alan,

                    Thanks for the quick response.

                    I am using HCatalog 0.4.

                    With a simple PIG script it works great. HCatalog
                    beautifully scans only the relevant information.
                    However, a full scan happens only when we have a
                    couple of additional joins and when we change the
                    INNER JOIN order (we also use "using skewed").

                    Though we have looked into the debug logs, we saw
                    the number of records scanned in the JobTracker's
                    counters themselves. Without pruning, the m/r job
                    was pretty much scanning the entire set of rows.

                    I am not sure if there is a corner case wherein the
                    "skewed" join is trying to override the filtering.

                    ~Rajesh.B



                    On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates wrote:

                        What version of HCatalog are you using?  How do
                        you know it is scanning all the partitions? Does
                        it say so in the logs, or are you getting all
                        the records back?

                        And yes, HCat is supposed to do partition
                        pruning so that it only scans the required
                        partitions.

                        Alan.

                        On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:

         > Hi All,
         >
         > I have a hcatalog table "partitioned by (d string)".
         >
         > I have a couple of days' worth of data, and when I run "show partitions" it provides the correct data.
         >
         > d=20111215
         > d=20111216
         > d=20111217
         > d=20111218
         > d=20111219
         > d=20111220
         > d=20111221
         > d=20111222
         > d=20111223
         > d=20111224
         > d=20111225
         > d=20120415
         >
         > However, when I run PIG with "filter a by d == '20120415'", it ends up scanning all data.
         >
         > Is this a known bug/enhancement in HCatalog? Ideally, shouldn't it scan only the d=20120415 directory?
         >
         > Any pointers would be of great help.
         >
         >
         > --
         > ~Rajesh.B




                    --
                    ~Rajesh.B




                --
        "...:::Aniket:::... Quetzalco@tl"





        --
        "...:::Aniket:::... Quetzalco@tl"





--
"...:::Aniket:::... Quetzalco@tl"
