Yeah I think we just need to get projection pushdown to work through Split operators.
D

On Wed, Apr 25, 2012 at 12:52 PM, Thejas Nair wrote:
> cc'ing dev@pig as this is a pig issue.
>
> Aniket, what you saw is not related to PIG-2339.
>
> In your example query, the logical plan will look like this:
>
>              Load (A)
>                 |
>               Split
>                 |
>     ---------------------------
>     |                         |
> Filter(B1)               Filter(B2) ...
>
> Because of the split operator introduced between the filter conditions and
> load, the filter does not get pushed into the load function.
>
> A simple way to fix this in pig would be to not share the load across the
> filter operators. Another option is to push the condition (B1 or B2 or B3)
> into the Load operator and retain the rest of the current plan (split and
> filters following the split).
>
> You can of course achieve the same effect by having a separate load
> statement as input for each of the filters.
>
> I agree that we should make it possible to ask pig to throw a warning/error
> if the query is going to result in a full table scan on a partitioned table.
>
> Thanks,
> Thejas
>
> On 4/24/12 7:56 PM, Aniket Mokashi wrote:
>>
>> Sorry Thejas, I didn't look into the jira properly earlier.
>> EMR pig-0.9.1 already has that patch for PIG-2339 and hence I did not
>> hit that issue earlier (and I patched datanucleus). filter-union was a
>> workaround I was using to avoid some of the thrift timeout problems
>> earlier. Thrift APIs time out on the client side in 20 sec by default (I
>> found the config to change this later), and hence I used
>> a = load 'table'; b1 = filter a by cond1; b2 = filter a by cond2; ...
>> b = union b1, b2, ...;
>> expecting these filters to be pushed separately to the loader. But that
>> doesn't work in pig. (I can open a jira, but I haven't done enough
>> investigation at the code level.) Thoughts?
>>
>> Thanks,
>> Aniket
>>
>> On Tue, Apr 24, 2012 at 7:00 PM, Thejas Nair wrote:
>>
>> The issue was not specific to filter-union:
>> https://issues.apache.org/jira/browse/PIG-2339
>> The fix was to run PushUpFilter before PartitionFilterOptimizer.
>>
>> As this is not an hcat issue, it should not matter if you have an
>> older hcat version. FYI, this bug was not there in pig 0.8.x.
>> Was it pig 0.9.0 or 0.9.1 that you used?
>>
>> Thanks,
>> Thejas
>>
>> On 4/24/12 5:21 PM, Aniket Mokashi wrote:
>>
>> Hi Thejas,
>>
>> Can you point me to the jira that fixes the filter-union problem (in pig)?
>> I haven't tried hcat-0.4 yet; good to know about that issue. I
>> will keep a watcher.
>>
>> Thanks,
>> Aniket
>>
>> On Tue, Apr 24, 2012 at 4:51 PM, Thejas Nair wrote:
>>
>> Hi Aniket,
>> Are you using pig 0.9 or 0.9.1?
>> If yes, can you try with pig 0.9.2?
>> Wondering if you are also hitting the issue that Thomas
>> mentioned.
>>
>> Thanks,
>> Thejas
>>
>> On 4/23/12 7:39 PM, Aniket Mokashi wrote:
>>
>> Something similar I have noticed is -
>>
>> A = load ...
>> B1 = filter A by cond1;
>> B2 = filter A by cond2;
>> B3 = filter A by cond3;
>>
>> B = union B1, B2, B3; does not push projection.
>>
>> Is that expected?
>>
>> Ideally, we should have a "strict" mode under hcatalog that, when
>> turned on, will avoid executing pig queries on the full
>> (partitioned) table.
>>
>> Thanks,
>> Aniket
>>
>> On Mon, Apr 23, 2012 at 7:32 PM, Rajesh Balamohan wrote:
>>
>> Hi Alan,
>>
>> Thanks for the quick response.
>>
>> I am using HCatalog 0.4.
>>
>> With a simple PIG script it works great. HCatalog beautifully scans
>> only the relevant information. However, a full scan happens only when
>> we have a couple of additional joins and when we change the INNER JOIN
>> order (we also use "using skewed").
>>
>> Though we have looked into the debug logs, we saw the number of
>> records scanned from the JobTracker's counters itself. Without
>> pruning, the m/r job was pretty much scanning the entire set of rows.
>>
>> I am not sure if there is a corner case wherein the "skewed" join is
>> trying to override the filtering.
>>
>> ~Rajesh.B
>>
>> On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates wrote:
>>
>> What version of HCatalog are you using? How do you know it is
>> scanning all the partitions? Does it say so in the logs, or are
>> you getting all the records back?
>>
>> And yes, HCat is supposed to do partition pruning so that it
>> only scans the required partitions.
>>
>> Alan.
>>
>> On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
>>
>> > Hi All,
>> >
>> > I have a hcatalog table "partitioned by (d string)".
>> >
>> > I have a couple of days' worth of data, and when I run "show
>> > partitions" it provides the correct data.
>> >
>> > d=20111215
>> > d=20111216
>> > d=20111217
>> > d=20111218
>> > d=20111219
>> > d=20111220
>> > d=20111221
>> > d=20111222
>> > d=20111223
>> > d=20111224
>> > d=20111225
>> > d=20120415
>> >
>> > However, when I run PIG with "filter a by d == '20120415'",
>> > it ends up scanning all data.
>> >
>> > Is this a known bug/enhancement in HCatalog? Ideally,
>> > shouldn't it scan only the d=20120415 directory?
>> >
>> > Any pointers would be of great help.
>> >
>> > --
>> > ~Rajesh.B
>>
>> --
>> ~Rajesh.B
>>
>> --
>> "...:::Aniket:::... Quetzalco@tl"
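
For reference, the workaround Thejas describes in this thread (a separate load statement as input for each filter) might look roughly like the sketch below. The table name 'mytable' is hypothetical, and the loader class name assumes the org.apache.hcatalog package used by HCatalog 0.4; the partition column d and its values are taken from the original report. This is an illustration of the idea, not a verified fix.

```pig
-- Sketch only: 'mytable' is a hypothetical table name, and the HCatLoader
-- class name is an assumption based on HCatalog 0.4 packaging.

-- One load per filter, so no Split operator sits between a filter and its load.
A1 = load 'mytable' using org.apache.hcatalog.pig.HCatLoader();
B1 = filter A1 by d == '20120415';

A2 = load 'mytable' using org.apache.hcatalog.pig.HCatLoader();
B2 = filter A2 by d == '20111225';

-- Combine the per-partition results, as in the filter-union pattern above.
B = union B1, B2;
```

Because each filter directly follows its own load, the partition filter has a chance to be pushed into the loader for each branch, at the cost of repeating the load statement.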
