Re: HCatalog scans all partition even after mentioning date filter

Dmitriy Ryaboy Wed, 25 Apr 2012 13:45:45 -0700

Yeah I think we just need to get projection pushdown to work through
Split operators.


D

On Wed, Apr 25, 2012 at 12:52 PM, Thejas Nair wrote:
> cc'ing dev@pig as this is a pig issue.
>
> Aniket, what you saw is not related to PIG-2339.
>
> In your example query, the logical plan will look like this -
>
>         Load (A)
>             |
>           Split
>             |
>     -----------------
>     |               |
> Filter(B1)     Filter(B2) ...
>
> Because of the split operator introduced between the filter conditions and
> load, the filter does not get pushed into the load function.
>
> A simple way to fix this in pig would be to not share the load across the
> filter operators. Another option is to push the condition (B1 or B2 or B3)
> into the Load operator and retain the rest of the current plan (split and
> filters following the split).
>
> You can of course achieve the same effect by having a separate load
> statement as input for each of the filters.
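>
> A minimal sketch of that workaround, assuming a hypothetical HCatalog table
> named 'mytable' partitioned by d, loaded with the HCatLoader class from
> HCatalog 0.4 (the two dates are only illustrative partition values):
>
> ```pig
> -- One load per filter, so no Split operator sits between a filter and its load.
> -- 'mytable' is a hypothetical table name, not one from the original report.
> A1 = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();
> B1 = FILTER A1 BY d == '20120415';
>
> A2 = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();
> B2 = FILTER A2 BY d == '20111225';
>
> B = UNION B1, B2;
> ```
>
> Each filter then follows its own load directly, so it should be eligible
> for pushdown and only the matching partitions should be read.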
>
> I agree that we should make it possible to ask pig to throw a warning/error
> if the query is going to result in a full table scan on a partitioned table.
>
> Thanks,
> Thejas
>
>
>
>
> On 4/24/12 7:56 PM, Aniket Mokashi wrote:
>>
>> Sorry Thejas, I didn't look into the jira properly earlier.
>> EMR pig-0.9.1 already has the patch for PIG-2339, and hence I did not
>> hit that issue earlier (and I patched datanucleus). filter-union was a
>> workaround I was using to avoid some of the thrift timeout problems
>> earlier. Thrift APIs time out on the client side after 20 sec by default
>> (I found the config to change this later), and hence I used
>>
>>   a = load 'table';
>>   b1 = filter a by cond1;
>>   b2 = filter a by cond2;
>>   ...
>>   b = union b1, b2, ...;
>>
>> expecting these filters to be pushed separately to the loader. But that
>> doesn't work in pig. (I can open a jira, but I haven't done enough
>> investigation at the code level.) Thoughts?
>>
>> Thanks,
>> Aniket
>>
>> On Tue, Apr 24, 2012 at 7:00 PM, Thejas Nair wrote:
>>
>>    The issue was not specific to filter-union
>>    - https://issues.apache.org/jira/browse/PIG-2339.
>>    The fix was to run PushUpFilter before PartitionFilterOptimizer.
>>
>>    As this is not an hcat issue, it should not matter if you have an
>>    older hcat version. FYI, this bug was not there in pig 0.8.x.
>>    Was it pig 0.9.0 or 0.9.1 that you used?
>>
>>    Thanks,
>>    Thejas
>>
>>
>>
>>    On 4/24/12 5:21 PM, Aniket Mokashi wrote:
>>
>>        Hi Thejas,
>>
>>        Can you point me to the jira that fixes the filter-union problem
>>        (in pig)? I haven't tried hcat-0.4 yet; good to know about that
>>        issue. I will add myself as a watcher.
>>
>>        Thanks,
>>        Aniket
>>
>>        On Tue, Apr 24, 2012 at 4:51 PM, Thejas Nair wrote:
>>
>>            Hi Aniket,
>>            Are you using pig 0.9 or 0.9.1?
>>            If yes, can you try with pig 0.9.2?
>>            Wondering if you are also hitting the issue that Thomas
>>            mentioned.
>>
>>            Thanks,
>>            Thejas
>>
>>
>>
>>
>>            On 4/23/12 7:39 PM, Aniket Mokashi wrote:
>>
>>                Something similar I have noticed is -
>>
>>                A = load ...
>>                B1 = filter A by cond1;
>>                B2 = filter A by cond2;
>>                B3 = filter A by cond3;
>>
>>                B = union B1, B2, B3;
>>
>>                This does not push projection.
>>
>>                Is that expected?
>>
>>                Ideally, we should have a "strict" mode under hcatalog
>>                that, when turned on, will avoid executing pig queries on
>>                the full (partitioned) table.
>>
>>                Thanks,
>>                Aniket
>>
>>                On Mon, Apr 23, 2012 at 7:32 PM, Rajesh Balamohan wrote:
>>
>>                    Hi Alan,
>>
>>                    Thanks for the quick response.
>>
>>                    I am using HCatalog 0.4.
>>
>>                    With a simple PIG script it works great. HCatalog
>>                    beautifully scans only the relevant information.
>>                    However, a full scan happens only when we have a couple
>>                    of additional joins and when we change the INNER JOIN
>>                    order (we also use "using skewed").
>>
>>                    Though we have looked into the debug logs, we saw the
>>                    number of records scanned from the JobTracker's counters
>>                    itself. Without pruning, the m/r job was pretty much
>>                    scanning the entire set of rows.
>>
>>                    I am not sure if there is a corner case wherein the
>>                    "skewed" join is trying to override the filtering.
>>
>>                    ~Rajesh.B
>>
>>
>>
>>                    On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates wrote:
>>
>>                        What version of HCatalog are you using?  How do you
>>                        know it is scanning all the partitions? Does it say
>>                        so in the logs, or are you getting all the records
>>                        back?
>>
>>                        And yes, HCat is supposed to do partition pruning so
>>                        that it only scans the required partitions.
>>
>>                        Alan.
>>
>>                        On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
>>
>>         > Hi All,
>>         >
>>         > I have a hcatalog table "partitioned by (d string)".
>>         >
>>         > I have a couple of days' worth of data, and when I run "show
>>         > partitions" it provides the correct data.
>>         >
>>         > d=20111215
>>         > d=20111216
>>         > d=20111217
>>         > d=20111218
>>         > d=20111219
>>         > d=20111220
>>         > d=20111221
>>         > d=20111222
>>         > d=20111223
>>         > d=20111224
>>         > d=20111225
>>         > d=20120415
>>         >
>>         > However, when I run PIG with "filter a by d == '20120415'",
>>         > it ends up scanning all data.
>>         >
>>         > Is this a known bug/enhancement in HCatalog? Ideally,
>>         > shouldn't it scan only the d=20120415 directory?
>>         >
>>         > Any pointers would be of great help.
>>         >
>>         >
>>         > --
>>         > ~Rajesh.B
>>
>>
>>
>>
>>                    --
>>                    ~Rajesh.B
>>
>>
>>
>>
>>                --
>>        "...:::Aniket:::... Quetzalco@tl"
>>
>>
>>
>>
>>
>>        --
>>        "...:::Aniket:::... Quetzalco@tl"
>>
>>
>>
>>
>>
>> --
>> "...:::Aniket:::... Quetzalco@tl"
>
>
