Re: HCatalog scans all partition even after mentioning date filter

Thejas Nair Wed, 25 Apr 2012 14:05:22 -0700

Yes, please create one.
Thanks,
Thejas

On 4/25/12 1:47 PM, Aniket Mokashi wrote:

Hi Dmitriy and Thejas,


Should I open a jira for the same?

Thanks,
Aniket


On Wed, Apr 25, 2012 at 1:45 PM, Dmitriy Ryaboy wrote:

    Yeah I think we just need to get projection pushdown to work through
    Split operators.

    D

    On Wed, Apr 25, 2012 at 12:52 PM, Thejas Nair wrote:
     > cc'ing dev@pig as this is a pig issue.
     >
     > Aniket, what you saw is not related to PIG-2339.
     >
     > In your example query, the logical plan will look like this -
     >
     > Load (A)
     > |
     > Split
     >  |
     > ---------------------------
     > |             |
     > Filter(B1)   Filter(B2) ...
     >
     > Because of the split operator introduced between the filter conditions
     > and the load, the filters do not get pushed into the load function.
     >
     > A simple way to fix this in pig would be to not share the load across
     > the filter operators. Another option is to push the condition (B1 or B2
     > or B3) into the Load operator and retain the rest of the current plan
     > (the split and the filters following the split).
     >
     > You can of course achieve the same effect by having a separate load
     > statement as input for each of the filters.
     >
     > I agree that we should make it possible to ask pig to throw a
     > warning/error if the query is going to result in a full table scan on a
     > partitioned table.
     >
     > Thanks,
     > Thejas
     >
     >
     >
     >
     > On 4/24/12 7:56 PM, Aniket Mokashi wrote:
     >>
     >> Sorry Thejas, I didn't look into the jira properly earlier.
     >> EMR pig-0.9.1 already has the patch for PIG-2339, and hence I did not
     >> hit that issue earlier (and I patched datanucleus). Filter-union was a
     >> workaround I was using to avoid some of the thrift timeout problems
     >> earlier. Thrift APIs time out on the client side in 20 sec by default
     >> (I found the config to change this later), and hence I used
     >>
     >>     a = load 'table'; b1 = filter a by cond1; b2 = filter a by cond2; ...
     >>     b = union b1, b2, ...;
     >>
     >> expecting these filters to be pushed separately to the loader. But that
     >> doesn't work in pig. (I can open a jira, but I haven't done enough
     >> investigation at the code level.) Thoughts?
     >>
     >> Thanks,
     >> Aniket
     >>
     >> On Tue, Apr 24, 2012 at 7:00 PM, Thejas Nair wrote:
     >>
     >>    The issue was not specific to filter-union:
     >>    https://issues.apache.org/jira/browse/PIG-2339
     >>    The fix was to run PushUpFilter before PartitionFilterOptimizer.
     >>
     >>    As this is not an hcat issue, it should not matter if you have an
     >>    older hcat version. FYI, this bug was not there in pig 0.8.x.
     >>    Was it pig 0.9.0 or 0.9.1 that you used?
     >>
     >>    Thanks,
     >>    Thejas
     >>
     >>
     >>
     >>    On 4/24/12 5:21 PM, Aniket Mokashi wrote:
     >>
     >>        Hi Thejas,
     >>
     >>        Can you point me to the jira that fixes the filter-union problem
     >>        (in pig)? I haven't tried hcat-0.4 yet; good to know about that
     >>        issue. I will add myself as a watcher.
     >>
     >>        Thanks,
     >>        Aniket
     >>
     >>        On Tue, Apr 24, 2012 at 4:51 PM, Thejas Nair wrote:
     >>
     >>            Hi Aniket,
     >>            Are you using pig 0.9 or 0.9.1?
     >>            If yes, can you try with pig 0.9.2?
     >>            Wondering if you are also hitting the issue that Thomas
     >>            mentioned.
     >>
     >>            Thanks,
     >>            Thejas
     >>
     >>
     >>
     >>
     >>            On 4/23/12 7:39 PM, Aniket Mokashi wrote:
     >>
     >>                Something similar I have noticed is -
     >>
     >>                A = load ...
     >>                B1 = filter A by cond1;
     >>                B2 = filter A by cond2;
     >>                B3 = filter A by cond3;
     >>
     >>                B = union B1, B2, B3; does not push projection.
     >>
     >>                Is that expected?
     >>
     >>                Ideally, we should have a "strict" mode under hcatalog
     >>                that, when turned on, will avoid executing pig queries
     >>                on the full (partitioned) table.
     >>
     >>                Thanks,
     >>                Aniket
     >>
     >>                On Mon, Apr 23, 2012 at 7:32 PM, Rajesh Balamohan wrote:
     >>
     >>                    Hi Alan,
     >>
     >>                    Thanks for the quick response.
     >>
     >>                    I am using HCatalog 0.4.
     >>
     >>                    With a simple PIG script it works great. HCatalog
     >>                    beautifully scans only the relevant information.
     >>                    However, a full scan happens only when we have a
     >>                    couple of additional joins and when we change the
     >>                    INNER JOIN order (we also use "using skewed").
     >>
     >>                    Though we have looked into the debug logs, we saw
     >>                    the number of records scanned from the JobTracker's
     >>                    counters itself. Without pruning, the m/r job was
     >>                    pretty much scanning the entire set of rows.
     >>
     >>                    I am not sure if there is a corner case wherein the
     >>                    "skewed" join is trying to override the filtering.
     >>
     >>                    ~Rajesh.B
     >>
     >>
     >>
     >>                    On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates wrote:
     >>
     >>                        What version of HCatalog are you using? How do
     >>                        you know it is scanning all the partitions? Does
     >>                        it say so in the logs, or are you getting all
     >>                        the records back?
     >>
     >>                        And yes, HCat is supposed to do partition
     >>                        pruning so that it only scans the required
     >>                        partitions.
     >>
     >>                        Alan.
     >>
     >>                        On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
     >>
     >> > Hi All,
     >> >
     >> > I have a hcatalog table "partitioned by (d string)".
     >> >
     >> > I have a couple of days' worth of data, and when I run "show
     >> > partitions" it provides the correct data.
     >> >
     >> > d=20111215
     >> > d=20111216
     >> > d=20111217
     >> > d=20111218
     >> > d=20111219
     >> > d=20111220
     >> > d=20111221
     >> > d=20111222
     >> > d=20111223
     >> > d=20111224
     >> > d=20111225
     >> > d=20120415
     >> >
     >> > However, when I run PIG with "filter a by d == '20120415'",
     >> > it ends up scanning all data.
     >> >
     >> > Is this a known bug/enhancement in HCatalog? Ideally,
     >> > shouldn't it scan only the d=20120415 directory?
     >> >
     >> > Any pointers would be of great help.
     >> >
     >> >
     >> > --
     >> > ~Rajesh.B
     >>
     >>
     >>
     >>
     >>                    --
     >>                    ~Rajesh.B
     >>
     >>
     >>
     >>
     >>                --
     >> "...:::Aniket:::... Quetzalco@tl"
     >>
     >>
     >>
     >>
     >>
     >>        --
     >> "...:::Aniket:::... Quetzalco@tl"
     >>
     >>
     >>
     >>
     >>
     >> --
     >> "...:::Aniket:::... Quetzalco@tl"
     >
     >




--
"...:::Aniket:::... Quetzalco@tl"
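
The workaround Thejas describes in this thread (a separate load statement as
input for each filter, so that no Split operator sits between a load and its
filter) might look roughly like the sketch below. This is only an illustration
under stated assumptions: the table name 'mytable' is hypothetical, the
partition column d and its values come from the original report, and the
loader class name assumes the HCatalog 0.4 package layout. Whether partition
pruning actually kicks in should be confirmed from the job counters, as
Rajesh did.

```pig
-- Sketch only: 'mytable' is a hypothetical table name.
-- Each filter gets its own load, so the plan contains no shared
-- Load -> Split -> Filter fan-out for these aliases.
A1 = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();
B1 = FILTER A1 BY d == '20120415';

A2 = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();
B2 = FILTER A2 BY d == '20111225';

-- Combine the separately filtered inputs, as in the filter-union pattern.
B = UNION B1, B2;
```

The trade-off is a more verbose script with the load repeated once per
filter.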
