Just to be clear, the concrete syntax had a typo; should have been:

A = load 'daily_activity' USING HiveLoader WHERE date_partition >= 20110101 and date_partition <= 20110201;

On Sat, Dec 31, 2011 at 10:34 PM, Stan Rosenberg <srosenb...@proclivitysystems.com> wrote:
>
> A = load 'daily_activity' from HiveLoader where date_partition >=
> 20110101 and date_partition <= 20110201;
>
> stan
>
> On Sat, Dec 31, 2011 at 9:42 PM, Daniel Dai <da...@hortonworks.com> wrote:
>> Hi, Stan,
>> Foreach is inserted only if you have "as" in "load" statement. This is to
>> assure the data loaded conforms with "as" clause. At some point there is a
>> bug in implementation, this should be fixed in PIG-2346 and will be
>> included in all subsequent releases.
>>
>> Thanks,
>> Daniel
>>
>> On Fri, Dec 30, 2011 at 9:54 AM, Stan Rosenberg <
>> srosenb...@proclivitysystems.com> wrote:
>>
>>> Howdy All,
>>>
>>> I am resurrecting my previous message sent to the list on Dec. 7. Let
>>> me first summarize. In a nutshell, as far as I can tell,
>>> partition-aware loading is broken
>>> in pig, and the culprit is PIG-1188 wherein the final decision was to
>>> introduce project & cast, i.e, foreach, after load. There are two
>>> problems with that approach.
>>> First, as indicated in my original message, 'getPartitionKeys' is
>>> never invoked because instead of the expected instruction sequence
>>> 'load; filter', PIG-1188
>>> changed it to 'load; foreach; filter'. Second, if a loader already
>>> happens to project & cast in order to adhere the data to the schema,
>>> then the foreach synthesized
>>> by pig is a waste of time.
>>>
>>> Essentially, we had to undo the patch in 'PIG-1188' in order to get
>>> partition filters to work; this enabled us to implement a HiveLoader
>>> very much like
>>> HCatLoader which incidentally is also broken for the very same reason.
>>> This is obviously a hack and a real solution is needed.
>>> If the decision made in PIG-1188 cannot be re-considered, then I
>>> suggest that we revisit the logic which is used to pass partition
>>> filters to partition-aware loaders.
>>>
>>> Many thanks!
>>>
>>> stan
>>>
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Stan Rosenberg <srosenb...@proclivitysystems.com>
>>> Date: Wed, Dec 7, 2011 at 12:24 PM
>>> Subject: Partition keys in LoadMetadata is broken in 0.10?
>>> To: user@pig.apache.org
>>>
>>>
>>> Hi,
>>>
>>> I am trying to implement a loader which is partition-aware. As
>>> prescribed, my loader implements LoadMetadata, however,
>>> getPartitionKeys is never invoked.
>>> The script is of this form:
>>>
>>> X = LOAD 'input' USING MyLoader();
>>> X = FILTER X BY partition_col == 'some_string';
>>>
>>> and the schema returned by MyLoader.getSchema includes the column
>>> 'partition_col' which is of type 'chararray'.
>>>
>>>
>>> After debugging pig, I have found what appears to be a bug in the new
>>> code (version 0.10 snapshot and also in 0.9.1). The reason
>>> MyLoader.getPartitionKeys is never invoked is due to the wrongfully
>>> inserted
>>> 'foreach' after the 'load' and before the 'filter'. The code in
>>> TypeCastInserterTransformer.check used to return 'false' if the
>>> schemas matched or all fields were of type 'bytearray'; cf. pig
>>> version 0.8.1.
>>> Effectively, the above script gets transformed into:
>>>
>>> X = LOAD 'input' USING MyLoader();
>>> X = FOREACH X GENERATE ...;
>>> X = FILTER X BY partition_col == 'some_string';
>>>
>>> Subsequently, PartitionFilterPushDownTransformer.check observes that
>>> the immediate successor of 'load' is _not_ 'filter', whence
>>> getPartitionKeys is never invoked.
>>>
>>> Any suggestions?
>>>
>>> Thanks,
>>>
>>> stan
>>>
>>> P.S. While in the above case the 'foreach' can be avoided, in general
>>> typecasting may need to be performed if the user-provided schema does
>>> not match the one returned by the loader.
>>> I think the general case needs to be handled correctly, perhaps by
>>> ignoring all synthetic operators after the 'load'. (This is just a
>>> wild guess.)
>>>
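
To make the "ignore synthetic operators after the load" guess from the P.S. concrete, here is a toy model in Python. It is only a sketch under loud assumptions: the plan is a flat list of strings, the marker name SYNTHETIC and both function names are made up for illustration, and none of this is Pig's actual logical-plan or optimizer code. It shows only why a check on the immediate successor of 'load' stops matching once a foreach is inserted, and how skipping operators marked as synthetic would let it match again.

```python
# Toy model only: operators are plain strings, not Pig's logical-plan classes.
# SYNTHETIC is a hypothetical marker for a foreach inserted by the planner.
SYNTHETIC = "foreach(synthetic)"


def filter_directly_after_load(plan):
    """Require 'filter' to be the immediate successor of 'load'."""
    i = plan.index("load")
    return i + 1 < len(plan) and plan[i + 1] == "filter"


def filter_after_load_skipping_synthetic(plan):
    """Same check, but step over any synthetic operators that follow 'load'."""
    i = plan.index("load") + 1
    while i < len(plan) and plan[i] == SYNTHETIC:
        i += 1
    return i < len(plan) and plan[i] == "filter"


# The script as written, and the same script after a synthetic foreach is inserted.
written = ["load", "filter"]
rewritten = ["load", SYNTHETIC, "filter"]

print(filter_directly_after_load(written))
print(filter_directly_after_load(rewritten))
print(filter_after_load_skipping_synthetic(rewritten))
```

A foreach written by the user is deliberately not skipped in this sketch, since it may rename or drop the partition column; deciding when a filter can safely move past a real foreach is a separate question.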