Yes, please create one.
Thanks,
Thejas
On 4/25/12 1:47 PM, Aniket Mokashi wrote:
Hi Dmitriy and Thejas,
Should I open a jira for the same?
Thanks,
Aniket
On Wed, Apr 25, 2012 at 1:45 PM, Dmitriy Ryaboy <[email protected]> wrote:
Yeah I think we just need to get projection pushdown to work through
Split operators.
D
On Wed, Apr 25, 2012 at 12:52 PM, Thejas Nair <[email protected]> wrote:
> cc'ing dev@pig as this is a pig issue.
>
> Aniket, what you saw is not related to PIG-2339.
>
> In your example query, the logical plan will look like this:
>
>              Load (A)
>                 |
>               Split
>                 |
>     ---------------------------
>     |                         |
> Filter(B1)               Filter(B2) ...
>
> Because of the split operator introduced between the filter conditions and
> load, the filter does not get pushed into the load function.
>
> A simple way to fix this in pig would be to not share the load across the
> filter operators. Another option is to push the condition (B1 or B2 or B3)
> into the Load operator and retain the rest of the current plan (the split
> and the filters following it).
>
> You can of course achieve the same effect by having a separate load
> statement as input for each of the filters.
>
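> The two approaches above can be sketched in Pig Latin as follows. This is
> only a minimal sketch: the table name 'mytable', the relation names, and
> the choice of three values for the partition column d are assumptions for
> illustration, and the loader is assumed to be HCatalog 0.4's
> org.apache.hcatalog.pig.HCatLoader. Whether the partition filter actually
> reaches the loader should be checked with EXPLAIN on the pig version in use.
>
> ```pig
> -- Variant 1: one load per filter, so no Split sits between Load and Filter.
> A1 = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();
> A2 = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();
> A3 = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();
> B1 = FILTER A1 BY d == '20111224';
> B2 = FILTER A2 BY d == '20111225';
> B3 = FILTER A3 BY d == '20120415';
> B  = UNION B1, B2, B3;
>
> -- Variant 2 (use instead of variant 1): a hand-written approximation of
> -- pushing (cond1 or cond2 or cond3) next to the load, keeping the split
> -- and the per-condition filters after it.
> A  = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();
> P  = FILTER A BY (d == '20111224') OR (d == '20111225') OR (d == '20120415');
> C1 = FILTER P BY d == '20111224';
> C2 = FILTER P BY d == '20111225';
> C3 = FILTER P BY d == '20120415';
> C  = UNION C1, C2, C3;
> ```
>
> In variant 2 the combined filter is the only operator directly after the
> load, which is the position a partition filter needs to be in to be
> considered for pushdown.
>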
> I agree that we should make it possible to ask pig to throw a warning/error
> if the query is going to result in a full table scan on a partitioned table.
>
> Thanks,
> Thejas
>
>
>
>
> On 4/24/12 7:56 PM, Aniket Mokashi wrote:
>>
>> Sorry Thejas, I didn't look into the jira properly earlier.
>> EMR pig-0.9.1 already has the patch for PIG-2339, and hence I did not
>> hit that issue earlier (and I patched datanucleus). filter-union was a
>> workaround I was using to avoid some of the thrift timeout problems
>> earlier. Thrift API calls time out on the client side after 20 sec by
>> default (I found the config to change this later), and hence I used
>> a = load 'table'; b1 = filter a by cond1; b2 = filter a by cond2; ...
>> b = union b1, b2, ...;
>> expecting these filters to be pushed separately to the loader. But that
>> doesn't work in pig. (I can open a jira, but I haven't done enough
>> investigation at the code level.) Thoughts?
>>
>> Thanks,
>> Aniket
>>
>> On Tue, Apr 24, 2012 at 7:00 PM, Thejas Nair <[email protected]> wrote:
>>
>> The issue was not specific to filter-union:
>> https://issues.apache.org/jira/browse/PIG-2339
>> The fix was to apply PushUpFilter before PartitionFilterOptimizer.
>>
>> As this is not an hcat issue, it should not matter if you have an
>> older hcat version. FYI, this bug was not there in pig 0.8.x.
>> Was it pig 0.9.0 or 0.9.1 that you used?
>>
>> Thanks,
>> Thejas
>>
>>
>>
>> On 4/24/12 5:21 PM, Aniket Mokashi wrote:
>>
>> Hi Thejas,
>>
>> Can you point me to the jira that fixes the filter-union problem (in pig)?
>> I haven't tried hcat-0.4 yet; good to know about that issue. I will keep
>> a watch on it.
>>
>> Thanks,
>> Aniket
>>
>> On Tue, Apr 24, 2012 at 4:51 PM, Thejas Nair <[email protected]> wrote:
>>
>> Hi Aniket,
>> Are you using pig 0.9 or 0.9.1?
>> If yes, can you try with pig 0.9.2?
>> Wondering if you are also hitting the issue that Thomas mentioned.
>>
>> Thanks,
>> Thejas
>>
>>
>>
>>
>> On 4/23/12 7:39 PM, Aniket Mokashi wrote:
>>
>> Something similar I have noticed is -
>>
>> A = load ...
>> B1 = filter A by cond1;
>> B2 = filter A by cond2;
>> B3 = filter A by cond3;
>>
>> B = union B1, B2, B3;
>>
>> This does not push projection.
>>
>> Is that expected?
>>
>> Ideally, we should have a "strict" mode under hcatalog that, when turned
>> on, will avoid executing pig queries on the full (partitioned) table.
>>
>> Thanks,
>> Aniket
>>
>> On Mon, Apr 23, 2012 at 7:32 PM, Rajesh Balamohan <[email protected]> wrote:
>>
>> Hi Alan,
>>
>> Thanks for the quick response.
>>
>> I am using HCatalog 0.4.
>>
>> With a simple PIG script it works great. HCatalog beautifully scans
>> only the relevant information. However, a full scan happens only when
>> we have a couple of additional joins and when we change the INNER JOIN
>> order (we also use "using skewed").
>>
>> Though we have looked into the debug logs, we saw the number of
>> records scanned from the JobTracker's counters themselves. Without
>> pruning, the m/r job was pretty much scanning the entire set of rows.
>>
>> I am not sure if there is a corner case wherein the "skewed" join is
>> trying to override the filtering.
>>
>> ~Rajesh.B
>>
>>
>>
>> On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates <[email protected]> wrote:
>>
>> What version of HCatalog are you using? How do you know it is
>> scanning all the partitions? Does it say so in the logs, or are
>> you getting all the records back?
>>
>> And yes, HCat is supposed to do partition pruning so that it
>> only scans the required partitions.
>>
>> Alan.
>>
>> On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
>>
>> > Hi All,
>> >
>> > I have a hcatalog table "partitioned by (d string)".
>> >
>> > I have a couple of days' worth of data, and when I run "show
>> > partitions" it provides the correct data.
>> >
>> > d=20111215
>> > d=20111216
>> > d=20111217
>> > d=20111218
>> > d=20111219
>> > d=20111220
>> > d=20111221
>> > d=20111222
>> > d=20111223
>> > d=20111224
>> > d=20111225
>> > d=20120415
>> >
>> > However, when I run PIG with "filter a by d == '20120415'",
>> > it ends up scanning all data.
>> >
>> > Is this a known bug/enhancement in HCatalog? Ideally,
>> > shouldn't it scan only the d=20120415 directory?
>> >
>> > Any pointers would be of great help.
>> >
>> >
>> > --
>> > ~Rajesh.B
>>
>>
>>
>>
>> --
>> ~Rajesh.B
>>
>>
>>
>>
>> --
>> "...:::Aniket:::... Quetzalco@tl"