I've created https://issues.apache.org/jira/browse/PIG-1459 to capture the need for a standard serialization method.

Regarding the required field list, it is the last option. I believe the name was included since some loaders may think in terms of names instead of positions. I created https://issues.apache.org/jira/browse/PIG-1460 to fix the documentation on this.

Alan.


On Jun 15, 2010, at 12:36 PM, Dmitriy Ryaboy wrote:

This is a good point and I don't want it to fall off the radar.
Hoping someone can answer the RequiredFieldList question.

-D

On Thu, Jun 10, 2010 at 2:56 PM, Scott Carey <sc...@richrelevance.com> wrote:

I wish there was better documentation on that too.

Looking at the PigStorage code, it serializes an array of Booleans via
UDFContext to the backend.

It would be significantly better if Pig serialized the requested fields for us, provided that pushProjection returned a code that indicated that the
projection would be supported.

Forcing users to do that serialization themselves is bug-prone, especially
in the presence of nested schemas.
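
For reference, the pattern a loader has to implement today looks roughly
like this. This is only a sketch, not the actual PigStorage code; the class
name, property key, and numberOfColumnsInMyFormat() helper are made up for
illustration:

import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.pig.LoadFunc;
import org.apache.pig.LoadPushDown;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.impl.util.ObjectSerializer;
import org.apache.pig.impl.util.UDFContext;

public abstract class MyColumnarLoader extends LoadFunc implements LoadPushDown {

    private static final String PROJECTION_KEY = "myloader.required.columns";
    private String signature;          // set by Pig; identifies this loader instance
    private boolean[] requiredColumns; // true for each column the script actually uses

    @Override
    public void setUDFContextSignature(String signature) {
        // Called on both the front-end and the back-end, so the same
        // UDFContext properties are visible in both places.
        this.signature = signature;
    }

    @Override // front-end only
    public RequiredFieldResponse pushProjection(RequiredFieldList requiredFieldList) {
        boolean[] cols = new boolean[numberOfColumnsInMyFormat()];
        for (RequiredField f : requiredFieldList.getFields()) {
            cols[f.getIndex()] = true; // index into the loader's original schema
        }
        Properties p = UDFContext.getUDFContext()
                .getUDFProperties(getClass(), new String[] { signature });
        try {
            // The loader has to do its own serialization to ship this to the back-end.
            p.setProperty(PROJECTION_KEY, ObjectSerializer.serialize(cols));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return new RequiredFieldResponse(true); // we promise to honor the projection
    }

    @Override // back-end: recover what the front-end stored
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        Properties p = UDFContext.getUDFContext()
                .getUDFProperties(getClass(), new String[] { signature });
        String serialized = p.getProperty(PROJECTION_KEY);
        if (serialized != null) {
            requiredColumns = (boolean[]) ObjectSerializer.deserialize(serialized);
        }
        // getNext() would then emit only the columns flagged in requiredColumns.
    }

    // Placeholder for this sketch: how many columns the underlying format has.
    protected abstract int numberOfColumnsInMyFormat();
}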

The documentation is also poor when it comes to describing what the
RequiredFieldList even is.

It has a name and an index field. The code itself seems to allow for
either of these to be filled.  What do they mean?

For example, say the schema returned by the loader is:
(id: int, name: chararray, department: chararray)

and the RequiredFieldList is [ ("department", 1) , ("id", 0) ]

What does that mean?
* The name is the field name requested, and the index is the location it
should be in the result? So return (id: int, department: chararray)?
* The index is the index in the source schema, and the name is for
renaming? So return (department: chararray, id: int), where the data in
department is actually that from the original's name field?
* The position in the RequiredFieldList array is the 'destination'
requested, the name is optional (if the schema had one), and the index is
the location in the original schema? So the above RequiredFieldList is
actually impossible, since "department" is always index 2.

I think it is the last one, but the first idea might be it too. Either way,
the javadoc and other documentation do not describe what these values mean
or what their possible ranges might be.
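
If the last interpretation is right, reading the list would look roughly
like this (just a sketch of my reading of the code, with a made-up helper):

import org.apache.pig.LoadPushDown.RequiredField;
import org.apache.pig.LoadPushDown.RequiredFieldList;

public class ProjectionDump {
    // Prints one line per requested field: the position in the list is the
    // output position, getIndex() is the position in the loader's original
    // schema, and getAlias() is that field's name (may be null if the source
    // schema is unnamed).
    static void dump(RequiredFieldList requiredFields) {
        int outputPosition = 0;
        for (RequiredField f : requiredFields.getFields()) {
            System.out.println("output position " + (outputPosition++)
                    + " <- source index " + f.getIndex()
                    + " (alias " + f.getAlias() + ")");
        }
    }
}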

On Jun 5, 2010, at 6:34 PM, Andrew Rothstein wrote:

I'm trying to figure out how exactly to appropriately implement the
LoadPushDown interface in my LoadFunc implementation. I need to take
the list of column aliases and pass that from the
LoadPushDown.pushProjection(RequiredFieldList) function to make it
available in the getTuple function. I'm kind of new to this, so forgive
me if this is obvious. From my reading of the mailing list it appears
that the pushProjection function is called in the front-end whereas
the getTuple function is called in the back-end. How does a LoadFunc
pass information from the front-end to the back-end instances?

regards, Andrew

On Thu, Jun 3, 2010 at 7:04 AM, Ankur C. Goel <gan...@yahoo-inc.com>
wrote:
A similar need is being expressed by the zebra folks here:
https://issues.apache.org/jira/browse/PIG-1337
You might want to comment/vote on it, as it is scheduled for the 0.8 release.

Loading data in prepareToRead() is fine. As a workaround I think it
should be OK to read the data directly from HDFS in each of the mappers,
provided you aren't doing any costly namespace operations like 'listStatus'
that can stress the namesystem when thousands of tasks execute them
concurrently.

Regards
-...@nkur

6/2/10 10:36 PM, "Scott Carey" <sc...@richrelevance.com> wrote:



On Jun 2, 2010, at 4:49 AM, Ankur C. Goel wrote:

Scott,
     You can set hadoop properties at the time of running your pig
script with the -D option. So
pig -Dhadoop.property.name=something myscript
essentially sets the property in the job configuration.


So no programmatic configuration of hadoop properties is allowed (where
it's easier to control), but it's allowable to set it at the script level? I
guess I can do that, but it complicates things.
Also, this is a very poor way to do this. My script has 600 lines of Pig
and ~45 M/R jobs. Only three of the jobs need the distributed cache, not
all 45.

Speaking specifically of utilizing the distributed cache feature, you
can just set the filename in the LoadFunc constructor and then load the data
into memory in the getNext() method if it is not already loaded.


That is what the original idea was.

Here is the pig command to set up the distributed cache:

pig \
    -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file/for/distributed-cache#file-name" \
    -Dmapred.create.symlink=yes \
    script.pig

The file-name after the '#' needs to be passed to the UDF constructor so that
it's available in the mapper/reducer's working dir on the compute node.

If that property is set, then the constructor only needs the file-name (the
symlink), right? Right now I'm trying to set those properties using the
DistributedCache static interfaces, which means I need to have access to the
full path.


Implement something like a loadData() method that loads the data only
once and invoke it from the getNext() method. The script will work even in
local mode if the file distributed via the distributed cache resides in the
CWD from which the script is invoked.
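
Roughly something like this (only a sketch; the class name, field names, and
the tab-separated file format are made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CacheBackedLoader /* extends LoadFunc ... */ {

    private final String cacheFileName;  // the symlink name after the '#' in mapred.cache.files
    private Map<String, String> lookup;  // loaded at most once per task JVM

    public CacheBackedLoader(String cacheFileName) {
        this.cacheFileName = cacheFileName;
    }

    private void loadDataIfNeeded() throws IOException {
        if (lookup != null) {
            return; // already loaded
        }
        lookup = new HashMap<String, String>();
        // With mapred.create.symlink=yes the symlink is in the task's working
        // directory on the compute node; in local mode the file just has to
        // sit in the CWD the script was launched from.
        BufferedReader in = new BufferedReader(new FileReader(cacheFileName));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        } finally {
            in.close();
        }
    }

    // getNext() (or prepareToRead()) would call loadDataIfNeeded() before
    // using the lookup map.
}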


I'm loading the data in prepareToRead(), which seems most appropriate.
Do you see any problem with that?

Hope that's helpful.

I think the command line property hack is insufficient. I am left with
a choice of having a couple of jobs read the file from HDFS directly in their
mappers, or having all jobs unnecessarily set up the distributed cache. Job
setup time is already 1/4 of my processing time.
Is there a feature request for Load/Store access to Hadoop job
configuration properties?

Ideally, this would be a method on LoadFunc that passes in a modifiable
Configuration object on the front-end, or a callback through which a user
can optionally provide a Configuration object containing the few properties
to alter, which Pig would apply to the real configuration before it finalizes
its properties.
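
For example, something like this (purely hypothetical; no such hook exists
in Pig today) would cover my case:

import org.apache.hadoop.conf.Configuration;

// Hypothetical interface only -- sketching the kind of front-end hook
// described above, not anything that exists in Pig.
public interface FrontendConfigAware {
    /**
     * Would be called exactly once, on the front-end, before Pig finalizes
     * the job configuration, so a loader could add distributed cache
     * entries or other job-level properties.
     */
    void configureFrontend(Configuration conf);
}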

Thanks for the info Ankur,

-Scott


-...@nkur

On 6/2/10 2:53 PM, "Scott Carey" <sc...@richrelevance.com> wrote:

So, here are some things I'm struggling with now:

In a LoadFunc, I want to load something into the DistributedCache. The
path is passed into the LoadFunc constructor as an argument.
The documentation on getSchema() and all the other metadata methods states
that you can't modify the job or its configuration passed in. I've verified
that changes to the Configuration are ignored if set there.

It appears that I could set these properties in setLocation(), but that
is called a lot on the back-end too, and the documentation does not state
whether setLocation() is called at all on the front-end. Based on my
experimental results, it doesn't seem to be.
Is there no way to modify Hadoop properties on the front-end to utilize
hadoop features? UDFContext seems completely useless for setting hadoop
properties for things other than the UDF itself -- like distributed cache
settings. A stand-alone front-end hook for this would be great. Otherwise,
any hack that works would be acceptable for now.


* The documentation for LoadMetadata could use some information about
when each method gets called -- front-end only? Between what other calls?
* UDFContext's documentation needs help too --
** addJobConf() is public, but not expected to be used by end-users,
right?  Several public methods here look like they need better
documentation, and the class itself could use a javadoc entry with some
example uses.


On May 24, 2010, at 11:06 AM, Alan Gates wrote:

Scott,

I made an effort to address the documentation in
https://issues.apache.org/jira/browse/PIG-1370
If you have a chance take a look and let me know if it deals with
the issues you have or if more work is needed.

Alan.

On May 24, 2010, at 11:00 AM, Scott Carey wrote:

I have been using these documents for a couple of weeks, implementing
various store and load functionality, and they have been very
helpful.

However, there is room for improvement.  What is most unclear is
when the API methods get called. Each method should clearly state in
these documents (and the javadoc) when it is called -- front-end only?
back-end only? both? Sometimes this is obvious, other times
it is not.
For example, without looking at the source code it's not possible to
tell or infer whether pushProjection() is called on the front-end or
back-end, or both. It could be implemented by being called on the
front-end, expecting the loader implementation to persist necessary
state to UDFContext for the back-end, or be called only on the back-end,
or both.  One has to look at the PigStorage source to see that it
persists the pushProjection information into UDFContext, so it's
_probably_ only called on the front-end.

There are also a few types, returned by or provided to these interfaces,
that are completely undocumented.  I had to look at the
source code to figure out what ResourceStatistics does, and how
ResourceSchema should be used. RequiredField, RequiredFieldList,
and RequiredFieldResponse are all poorly documented aspects of a
public interface.


On May 21, 2010, at 11:42 AM, Pradeep Kamath wrote:

To add to this, there is also a how-to document on how to go about
writing load/store functions from scratch in Pig 0.7 at
http://wiki.apache.org/pig/Pig070LoadStoreHowTo.

Pradeep

-----Original Message-----
From: Alan Gates [mailto:ga...@yahoo-inc.com]
Sent: Friday, May 21, 2010 11:33 AM
To: pig-user@hadoop.apache.org
Cc: Eli Collins
Subject: Pig loader 0.6 to 0.7 migration guide

At the Bay Area HUG on Wednesday someone (Eli I think, though I might
be remembering incorrectly) asked if there was a migration guide for
moving Pig load and store functions from 0.6 to 0.7. I said there was,
but I couldn't remember if it had been posted yet or not. In fact it
had already been posted to
http://wiki.apache.org/pig/LoadStoreMigrationGuide
. Also, you can find the list of all incompatible changes for 0.7 at
http://wiki.apache.org/pig/Pig070IncompatibleChanges
. Sorry, I should have included those links in my original slides.

Alan.









