Lol, the one we weren't talking about :). Sorry, thought it was related.
This, in PigGenericMapBase:

        for (PhysicalOperator root : roots) {
            if (inIllustrator) {
                if (root != null) {
                    root.attachInput(inpTuple);
                }
            } else {
                // This is the getAll() call I mean:
                root.attachInput(tf.newTupleNoCopy(inpTuple.getAll()));
            }
        }
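
To illustrate why that line hurts, here is a toy sketch. It is not Pig code: LazyTupleSketch, LazyTuple, decodeCount and the two helper methods are all made up for illustration, and Integer.valueOf just stands in for an expensive per-field deserialization. The point is only that a lazy tuple pays for the fields you touch, while getAll() makes it pay for every field up front.

```java
import java.util.ArrayList;
import java.util.List;

public class LazyTupleSketch {

    /** Hypothetical lazy tuple: fields stay in raw form until first access. */
    static class LazyTuple {
        private final String[] raw;      // serialized form of each field
        private final Object[] decoded;  // cache of fields decoded so far
        int decodeCount = 0;             // how many fields were actually decoded

        LazyTuple(String... raw) {
            this.raw = raw;
            this.decoded = new Object[raw.length];
        }

        /** Decodes field i on first access only. */
        Object get(int i) {
            if (decoded[i] == null) {
                decoded[i] = Integer.valueOf(raw[i]); // stand-in for costly deserialization
                decodeCount++;
            }
            return decoded[i];
        }

        /** Returns every field as a plain list, which forces all of them to be decoded. */
        List<Object> getAll() {
            List<Object> all = new ArrayList<>(raw.length);
            for (int i = 0; i < raw.length; i++) {
                all.add(get(i));
            }
            return all;
        }
    }

    /** Reads a single field and reports how many fields were decoded. */
    static int decodesAfterSingleGet() {
        LazyTuple t = new LazyTuple("1", "2", "3");
        t.get(0);
        return t.decodeCount;
    }

    /** Calls getAll(), as when re-wrapping the tuple, and reports how many fields were decoded. */
    static int decodesAfterGetAll() {
        LazyTuple t = new LazyTuple("1", "2", "3");
        t.getAll();
        return t.decodeCount;
    }

    public static void main(String[] args) {
        System.out.println("decoded after get(0):   " + decodesAfterSingleGet());
        System.out.println("decoded after getAll(): " + decodesAfterGetAll());
    }
}
```

In the real code the re-wrapping happens once per input tuple, so a loader that hands back a lazy tuple loses the benefit before any operator sees it.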

On Sun, Jan 1, 2012 at 6:09 PM, Daniel Dai <da...@hortonworks.com> wrote:

> Which getAll() call do you mean?
>
> On Sun, Jan 1, 2012 at 5:34 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
>
> > That getAll() call destroyed our lazy deserialization optimizations, btw...
> > It's unfortunate that even if my loader constructs optimized tuples, they
> > immediately get turned into object-bloated regular tuples :(.
> >
> > D
> >
> > On Sun, Jan 1, 2012 at 12:36 AM, Daniel Dai <da...@hortonworks.com> wrote:
> >
> > > Hi, Stan,
> > > I missed one point in my previous mail. We apply the PushUpFilter rule
> > > first, so the filter will be pushed in front of the added ForEach in
> > > most cases. There was also a bug here before (see PIG-2339), but the
> > > current code should be fixed. So even if you use an "as" clause to
> > > change the name, the partition filter should still apply.
> > >
> > > Daniel
> > >
> > > On Sat, Dec 31, 2011 at 7:37 PM, Stan Rosenberg <
> > > srosenb...@proclivitysystems.com> wrote:
> > >
> > > > Just to be clear, the concrete syntax had a typo; should have been:
> > > >
> > > > A = load 'daily_activity' USING HiveLoader WHERE date_partition >=
> > > > 20110101 and date_partition <= 20110201;
> > > >
> > > > On Sat, Dec 31, 2011 at 10:34 PM, Stan Rosenberg
> > > > <srosenb...@proclivitysystems.com> wrote:
> > > > >
> > > > > A = load 'daily_activity' from HiveLoader where date_partition >=
> > > > > 20110101 and date_partition <= 20110201;
> > > > >
> > > > > stan
> > > > >
> > > > > On Sat, Dec 31, 2011 at 9:42 PM, Daniel Dai <da...@hortonworks.com> wrote:
> > > > >> Hi, Stan,
> > > > >> A foreach is inserted only if you have "as" in the "load" statement.
> > > > >> This is to ensure that the loaded data conforms to the "as" clause.
> > > > >> At some point there was a bug in the implementation; this should be
> > > > >> fixed in PIG-2346, and the fix will be included in all subsequent
> > > > >> releases.
> > > > >>
> > > > >> Thanks,
> > > > >> Daniel
> > > > >>
> > > > >> On Fri, Dec 30, 2011 at 9:54 AM, Stan Rosenberg <
> > > > >> srosenb...@proclivitysystems.com> wrote:
> > > > >>
> > > > >>> Howdy All,
> > > > >>>
> > > > > >>> I am resurrecting my previous message sent to the list on Dec. 7.
> > > > > >>> Let me first summarize.  In a nutshell, as far as I can tell,
> > > > > >>> partition-aware loading is broken in pig, and the culprit is
> > > > > >>> PIG-1188, wherein the final decision was to introduce project &
> > > > > >>> cast, i.e., foreach, after load.  There are two problems with
> > > > > >>> that approach.
> > > > > >>> First, as indicated in my original message, 'getPartitionKeys' is
> > > > > >>> never invoked because instead of the expected instruction
> > > > > >>> sequence 'load; filter', PIG-1188 changed it to 'load; foreach;
> > > > > >>> filter'.  Second, if a loader already happens to project & cast
> > > > > >>> in order to make the data conform to the schema, then the foreach
> > > > > >>> synthesized by pig is a waste of time.
> > > > >>>
> > > > > >>> Essentially, we had to undo the patch in 'PIG-1188' in order to
> > > > > >>> get partition filters to work; this enabled us to implement a
> > > > > >>> HiveLoader very much like HCatLoader, which incidentally is also
> > > > > >>> broken for the very same reason.
> > > > > >>> This is obviously a hack and a real solution is needed.
> > > > > >>> If the decision made in PIG-1188 cannot be reconsidered, then I
> > > > > >>> suggest that we revisit the logic which is used to pass partition
> > > > > >>> filters to partition-aware loaders.
> > > > >>>
> > > > >>> Many thanks!
> > > > >>>
> > > > >>> stan
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> ---------- Forwarded message ----------
> > > > >>> From: Stan Rosenberg <srosenb...@proclivitysystems.com>
> > > > >>> Date: Wed, Dec 7, 2011 at 12:24 PM
> > > > >>> Subject: Partition keys in LoadMetadata is broken in 0.10?
> > > > >>> To: user@pig.apache.org
> > > > >>>
> > > > >>>
> > > > >>> Hi,
> > > > >>>
> > > > >>> I am trying to implement a loader which is partition-aware.  As
> > > > >>> prescribed, my loader implements LoadMetadata, however,
> > > > >>> getPartitionKeys is never invoked.
> > > > >>> The script is of this form:
> > > > >>>
> > > > >>> X = LOAD 'input' USING MyLoader();
> > > > >>> X = FILTER X BY partition_col == 'some_string';
> > > > >>>
> > > > >>> and the schema returned by MyLoader.getSchema includes the column
> > > > >>> 'partition_col' which is of type 'chararray'.
> > > > >>>
> > > > >>>
> > > > > >>> After debugging pig, I have found what appears to be a bug in the
> > > > > >>> new code (version 0.10 snapshot and also in 0.9.1).  The reason
> > > > > >>> MyLoader.getPartitionKeys is never invoked is the wrongfully
> > > > > >>> inserted 'foreach' after the 'load' and before the 'filter'.  The
> > > > > >>> code in TypeCastInserterTransformer.check used to return 'false'
> > > > > >>> if the schemas matched or all fields were of type 'bytearray';
> > > > > >>> cf. pig version 0.8.1.
> > > > >>> Effectively, the above script gets transformed into:
> > > > >>>
> > > > >>> X = LOAD 'input' USING MyLoader();
> > > > >>> X = FOREACH X GENERATE ...;
> > > > >>> X = FILTER X BY partition_col == 'some_string';
> > > > >>>
> > > > > >>> Subsequently, PartitionFilterPushDownTransformer.check observes
> > > > > >>> that the immediate successor of 'load' is _not_ 'filter', whence
> > > > > >>> getPartitionKeys is never invoked.
> > > > >>>
> > > > >>> Any suggestions?
> > > > >>>
> > > > >>> Thanks,
> > > > >>>
> > > > >>> stan
> > > > >>>
> > > > > >>> P.S. While in the above case the 'foreach' can be avoided, in
> > > > > >>> general typecasting may need to be performed if the user-provided
> > > > > >>> schema does not match the one returned by the loader.
> > > > > >>> I think the general case needs to be handled correctly, perhaps
> > > > > >>> by ignoring all synthetic operators after the 'load'.  (This is
> > > > > >>> just a wild guess.)
> > > > >>>
> > > >
> > >
> >
>
