Hi, Stan,
I missed one point in my previous mail. We apply the PushUpFilter rule
first, so in most cases the filter is pushed in front of the added ForEach.
There was also a bug here earlier (see PIG-2339), but it should be fixed in
the current code. So even if you use an "as" clause to change the name, the
partition filter should still apply.
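
To make the ordering concrete, here is a rough toy model in Python. It is
only a sketch: a plan is a plain list of operator names, and the two helper
functions are hypothetical stand-ins, not Pig's actual PushUpFilter or
partition filter code. It also ignores the cases where the real rule
declines to move a filter.

```python
# Toy model only: a plan is a list of operator names, not Pig's LogicalPlan.

def push_up_filter(plan):
    """Move each 'filter' ahead of an immediately preceding 'foreach'.

    Hypothetical simplification: the real rule checks whether the move is
    safe; this sketch always swaps.
    """
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            if plan[i] == "filter" and plan[i - 1] == "foreach":
                plan[i - 1], plan[i] = plan[i], plan[i - 1]
                changed = True
    return plan


def partition_filter_applies(plan):
    """Mimic the check described in this thread: 'filter' must directly
    follow 'load' for the partition filter to be considered."""
    return len(plan) >= 2 and plan[0] == "load" and plan[1] == "filter"


# Plan produced when the load statement has an "as" clause.
original = ["load", "foreach", "filter"]
rewritten = push_up_filter(original)

print(original, partition_filter_applies(original))
print(rewritten, partition_filter_applies(rewritten))
```

In this model the check fails on the original plan and passes once the
filter has been pushed in front of the foreach, which is the effect of
running PushUpFilter first.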

Daniel

On Sat, Dec 31, 2011 at 7:37 PM, Stan Rosenberg <
srosenb...@proclivitysystems.com> wrote:

> Just to be clear, the concrete syntax had a typo; should have been:
>
> A = load 'daily_activity' USING HiveLoader WHERE date_partition >=
> 20110101 and date_partition <= 20110201;
>
> On Sat, Dec 31, 2011 at 10:34 PM, Stan Rosenberg
> <srosenb...@proclivitysystems.com> wrote:
> >
> > A = load 'daily_activity' from HiveLoader where date_partition >=
> > 20110101 and date_partition <= 20110201;
> >
> > stan
> >
> > On Sat, Dec 31, 2011 at 9:42 PM, Daniel Dai <da...@hortonworks.com> wrote:
> >> Hi, Stan,
> >> A foreach is inserted only if you have "as" in the "load" statement.
> >> This is to ensure that the loaded data conforms to the "as" clause. At
> >> one point there was a bug in the implementation; this should be fixed in
> >> PIG-2346 and will be included in all subsequent releases.
> >>
> >> Thanks,
> >> Daniel
> >>
> >> On Fri, Dec 30, 2011 at 9:54 AM, Stan Rosenberg <
> >> srosenb...@proclivitysystems.com> wrote:
> >>
> >>> Howdy All,
> >>>
> >>> I am resurrecting my previous message sent to the list on Dec. 7.  Let
> >>> me first summarize.  In a nutshell, as far as I can tell,
> >>> partition-aware loading is broken in pig, and the culprit is PIG-1188,
> >>> wherein the final decision was to introduce project & cast, i.e.,
> >>> foreach, after load.  There are two problems with that approach.
> >>> First, as indicated in my original message, 'getPartitionKeys' is
> >>> never invoked because instead of the expected instruction sequence
> >>> 'load; filter', PIG-1188 changed it to 'load; foreach; filter'.
> >>> Second, if a loader already happens to project & cast in order to make
> >>> the data conform to the schema, then the foreach synthesized by pig is
> >>> a waste of time.
> >>>
> >>> Essentially, we had to undo the patch in 'PIG-1188' in order to get
> >>> partition filters to work; this enabled us to implement a HiveLoader
> >>> very much like HCatLoader, which incidentally is also broken for the
> >>> very same reason.  This is obviously a hack and a real solution is
> >>> needed.  If the decision made in PIG-1188 cannot be reconsidered, then
> >>> I suggest that we revisit the logic which is used to pass partition
> >>> filters to partition-aware loaders.
> >>>
> >>> Many thanks!
> >>>
> >>> stan
> >>>
> >>>
> >>>
> >>> ---------- Forwarded message ----------
> >>> From: Stan Rosenberg <srosenb...@proclivitysystems.com>
> >>> Date: Wed, Dec 7, 2011 at 12:24 PM
> >>> Subject: Partition keys in LoadMetadata is broken in 0.10?
> >>> To: user@pig.apache.org
> >>>
> >>>
> >>> Hi,
> >>>
> >>> I am trying to implement a loader which is partition-aware.  As
> >>> prescribed, my loader implements LoadMetadata; however,
> >>> getPartitionKeys is never invoked.
> >>> The script is of this form:
> >>>
> >>> X = LOAD 'input' USING MyLoader();
> >>> X = FILTER X BY partition_col == 'some_string';
> >>>
> >>> and the schema returned by MyLoader.getSchema includes the column
> >>> 'partition_col' which is of type 'chararray'.
> >>>
> >>>
> >>> After debugging pig, I have found what appears to be a bug in the new
> >>> code (version 0.10 snapshot and also in 0.9.1).
> >>> MyLoader.getPartitionKeys is never invoked because of the wrongfully
> >>> inserted 'foreach' after the 'load' and before the 'filter'.  The code
> >>> in TypeCastInserterTransformer.check used to return 'false' if the
> >>> schemas matched or all fields were of type 'bytearray'; cf. pig
> >>> version 0.8.1.
> >>> Effectively, the above script gets transformed into:
> >>>
> >>> X = LOAD 'input' USING MyLoader();
> >>> X = FOREACH X GENERATE ...;
> >>> X = FILTER X BY partition_col == 'some_string';
> >>>
> >>> Subsequently, PartitionFilterPushDownTransformer.check observes that
> >>> the immediate successor of 'load' is _not_ 'filter', whence
> >>> getPartitionKeys is never invoked.
> >>>
> >>> Any suggestions?
> >>>
> >>> Thanks,
> >>>
> >>> stan
> >>>
> >>> P.S. While in the above case the 'foreach' can be avoided, in general
> >>> typecasting may need to be performed if the user-provided schema does
> >>> not match the one returned by the loader.
> >>> I think the general case needs to be handled correctly, perhaps by
> >>> ignoring all synthetic operators after the 'load'.  (This is just a
> >>> wild guess.)
> >>>
>
