That getAll() call destroyed our lazy deserialization optimizations, btw... it's unfortunate that even if my loader constructs optimized tuples, they immediately get turned into object-bloated regular tuples :(.
D

On Sun, Jan 1, 2012 at 12:36 AM, Daniel Dai <da...@hortonworks.com> wrote:

> Hi, Stan,
> I missed one point in my previous mail. We do apply the PushUpFilter rule
> first, so the filter will be pushed in front of the added ForEach in most
> cases. There was also a bug before (see PIG-2339), but the current code
> should be fixed. So even if you use an "as" clause to change the name, the
> partition filter should still apply.
>
> Daniel
>
> On Sat, Dec 31, 2011 at 7:37 PM, Stan Rosenberg <
> srosenb...@proclivitysystems.com> wrote:
>
> > Just to be clear, the concrete syntax had a typo; should have been:
> >
> > A = load 'daily_activity' USING HiveLoader WHERE date_partition >=
> > 20110101 and date_partition <= 20110201;
> >
> > On Sat, Dec 31, 2011 at 10:34 PM, Stan Rosenberg
> > <srosenb...@proclivitysystems.com> wrote:
> > >
> > > A = load 'daily_activity' from HiveLoader where date_partition >=
> > > 20110101 and date_partition <= 20110201;
> > >
> > > stan
> > >
> > > On Sat, Dec 31, 2011 at 9:42 PM, Daniel Dai <da...@hortonworks.com> wrote:
> > >> Hi, Stan,
> > >> A foreach is inserted only if you have "as" in the "load" statement.
> > >> This is to ensure that the loaded data conforms to the "as" clause. At
> > >> some point there was a bug in the implementation; this should be fixed
> > >> in PIG-2346, and the fix will be included in all subsequent releases.
> > >>
> > >> Thanks,
> > >> Daniel
> > >>
> > >> On Fri, Dec 30, 2011 at 9:54 AM, Stan Rosenberg <
> > >> srosenb...@proclivitysystems.com> wrote:
> > >>
> > >>> Howdy All,
> > >>>
> > >>> I am resurrecting my previous message sent to the list on Dec. 7. Let
> > >>> me first summarize. In a nutshell, as far as I can tell,
> > >>> partition-aware loading is broken in pig, and the culprit is PIG-1188,
> > >>> wherein the final decision was to introduce project & cast, i.e.,
> > >>> foreach, after load. There are two problems with that approach.
> > >>> First, as indicated in my original message, 'getPartitionKeys' is
> > >>> never invoked because instead of the expected instruction sequence
> > >>> 'load; filter', PIG-1188 changed it to 'load; foreach; filter'.
> > >>> Second, if a loader already happens to project & cast in order to
> > >>> make the data conform to the schema, then the foreach synthesized
> > >>> by pig is a waste of time.
> > >>>
> > >>> Essentially, we had to undo the patch in 'PIG-1188' in order to get
> > >>> partition filters to work; this enabled us to implement a HiveLoader
> > >>> very much like HCatLoader, which incidentally is also broken for the
> > >>> very same reason.
> > >>> This is obviously a hack and a real solution is needed.
> > >>> If the decision made in PIG-1188 cannot be reconsidered, then I
> > >>> suggest that we revisit the logic which is used to pass partition
> > >>> filters to partition-aware loaders.
> > >>>
> > >>> Many thanks!
> > >>>
> > >>> stan
> > >>>
> > >>> ---------- Forwarded message ----------
> > >>> From: Stan Rosenberg <srosenb...@proclivitysystems.com>
> > >>> Date: Wed, Dec 7, 2011 at 12:24 PM
> > >>> Subject: Partition keys in LoadMetadata is broken in 0.10?
> > >>> To: user@pig.apache.org
> > >>>
> > >>> Hi,
> > >>>
> > >>> I am trying to implement a loader which is partition-aware. As
> > >>> prescribed, my loader implements LoadMetadata; however,
> > >>> getPartitionKeys is never invoked.
> > >>> The script is of this form:
> > >>>
> > >>> X = LOAD 'input' USING MyLoader();
> > >>> X = FILTER X BY partition_col == 'some_string';
> > >>>
> > >>> and the schema returned by MyLoader.getSchema includes the column
> > >>> 'partition_col', which is of type 'chararray'.
> > >>>
> > >>> After debugging pig, I have found what appears to be a bug in the new
> > >>> code (version 0.10 snapshot and also in 0.9.1).
> > >>> MyLoader.getPartitionKeys is never invoked because of the wrongfully
> > >>> inserted 'foreach' after the 'load' and before the 'filter'. The code
> > >>> in TypeCastInserterTransformer.check used to return 'false' if the
> > >>> schemas matched or all fields were of type 'bytearray'; cf. pig
> > >>> version 0.8.1.
> > >>> Effectively, the above script gets transformed into:
> > >>>
> > >>> X = LOAD 'input' USING MyLoader();
> > >>> X = FOREACH X GENERATE ...;
> > >>> X = FILTER X BY partition_col == 'some_string';
> > >>>
> > >>> Subsequently, PartitionFilterPushDownTransformer.check observes that
> > >>> the immediate successor of 'load' is _not_ 'filter', whence
> > >>> getPartitionKeys is never invoked.
> > >>>
> > >>> Any suggestions?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> stan
> > >>>
> > >>> P.S. While in the above case the 'foreach' can be avoided, in general
> > >>> typecasting may need to be performed if the user-provided schema does
> > >>> not match the one returned by the loader.
> > >>> I think the general case needs to be handled correctly, perhaps by
> > >>> ignoring all synthetic operators after the 'load'. (This is just a
> > >>> wild guess.)
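
For anyone skimming the thread later, here is a toy Java sketch of the symptom described in the Dec 7 message and of the "ignore synthetic operators" idea from the P.S. It is not Pig code: the class PartitionPushDownSketch, the Op enum and both check methods are hypothetical stand-ins, and a plan is modeled as nothing more than an ordered list of operator kinds. It only illustrates why a rule that requires 'filter' to be the immediate successor of 'load' stops matching once a foreach is inserted in between. Per Daniel's reply, current Pig applies PushUpFilter first, so the real optimizer may never see this shape.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Toy model of the plan shapes discussed in the thread.
 * None of these types are Pig classes; they are hypothetical stand-ins.
 */
final class PartitionPushDownSketch {

    /** Operator kinds; SYNTHETIC_FOREACH models the project-and-cast step added after a load. */
    enum Op { LOAD, SYNTHETIC_FOREACH, USER_FOREACH, FILTER }

    /** Strict rule as reported: the filter must be the immediate successor of the load. */
    static boolean strictCheck(List<Op> plan) {
        int load = plan.indexOf(Op.LOAD);
        return load >= 0
                && load + 1 < plan.size()
                && plan.get(load + 1) == Op.FILTER;
    }

    /** Relaxed rule from the P.S.: skip synthetic operators after the load, then expect a filter. */
    static boolean relaxedCheck(List<Op> plan) {
        int next = plan.indexOf(Op.LOAD);
        if (next < 0) {
            return false;
        }
        next++;
        // Only operators the optimizer itself inserted are skipped; user-written ones are not.
        while (next < plan.size() && plan.get(next) == Op.SYNTHETIC_FOREACH) {
            next++;
        }
        return next < plan.size() && plan.get(next) == Op.FILTER;
    }

    public static void main(String[] args) {
        List<Op> asWritten = Arrays.asList(Op.LOAD, Op.FILTER);
        List<Op> asRewritten = Arrays.asList(Op.LOAD, Op.SYNTHETIC_FOREACH, Op.FILTER);

        System.out.println("strict,  as written:   " + strictCheck(asWritten));    // true
        System.out.println("strict,  as rewritten: " + strictCheck(asRewritten));  // false
        System.out.println("relaxed, as rewritten: " + relaxedCheck(asRewritten)); // true
    }
}
```

The relaxed check deliberately stops at a user-written foreach, since that operator may rename or drop the partition column and the filter could then no longer be handed to the loader safely.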