I've created https://issues.apache.org/jira/browse/PIG-1459 to capture the need for a standard serialization method.

Regarding the required field list, it is the last option. I believe the name was included since some loaders may think in terms of names instead of positions. I created https://issues.apache.org/jira/browse/PIG-1460 to fix the documentation on this.

Alan.


On Jun 15, 2010, at 12:36 PM, Dmitriy Ryaboy wrote:

This is a good point and I don't want it to fall off the radar.
Hoping someone can answer the RequiredFieldList question.

-D

On Thu, Jun 10, 2010 at 2:56 PM, Scott Carey <sc...@richrelevance.com> wrote:

I wish there was better documentation on that too.

Looking at the PigStorage code, it serializes an array of Booleans via
UDFContext to the backend.

It would be significantly better if Pig serialized the requested fields for us, provided that pushProjection returned a code that indicated that the
projection would be supported.

Forcing users to do that serialization themselves is bug-prone, especially
in the presence of nested schemas.
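
For reference, the pattern a loader has to implement today looks roughly
like this. This is only a sketch, not the actual PigStorage code; the class
name, property key, and numberOfColumnsInMyFormat() helper are made up for
illustration:

import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.pig.LoadFunc;
import org.apache.pig.LoadPushDown;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.impl.util.ObjectSerializer;
import org.apache.pig.impl.util.UDFContext;

public abstract class MyColumnarLoader extends LoadFunc implements LoadPushDown {

    private static final String PROJECTION_KEY = "myloader.required.columns";
    private String signature;          // set by Pig; identifies this loader instance
    private boolean[] requiredColumns; // true for each column the script actually uses

    @Override
    public void setUDFContextSignature(String signature) {
        // Called on both the front-end and the back-end, so the same
        // UDFContext properties are visible in both places.
        this.signature = signature;
    }

    @Override // front-end only
    public RequiredFieldResponse pushProjection(RequiredFieldList requiredFieldList) {
        boolean[] cols = new boolean[numberOfColumnsInMyFormat()];
        for (RequiredField f : requiredFieldList.getFields()) {
            cols[f.getIndex()] = true; // index into the loader's original schema
        }
        Properties p = UDFContext.getUDFContext()
                .getUDFProperties(getClass(), new String[] { signature });
        try {
            // The loader has to do its own serialization to ship this to the back-end.
            p.setProperty(PROJECTION_KEY, ObjectSerializer.serialize(cols));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return new RequiredFieldResponse(true); // we promise to honor the projection
    }

    @Override // back-end: recover what the front-end stored
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        Properties p = UDFContext.getUDFContext()
                .getUDFProperties(getClass(), new String[] { signature });
        String serialized = p.getProperty(PROJECTION_KEY);
        if (serialized != null) {
            requiredColumns = (boolean[]) ObjectSerializer.deserialize(serialized);
        }
        // getNext() would then emit only the columns flagged in requiredColumns.
    }

    // Placeholder for this sketch: how many columns the underlying format has.
    protected abstract int numberOfColumnsInMyFormat();
}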

The documentation is also poor when it comes to describing what the
RequiredFieldList even is.

It has a name and an index field. The code itself seems to allow for
either of these to be filled.  What do they mean?

For example, say the schema returned by the loader is:
(id: int, name: chararray, department: chararray)

and the RequiredFieldList is [ ("department", 1) , ("id", 0) ]

What does that mean?
* The name is the field name requested, and the index is the location it
should be in the result? So return (id: int, department: chararray)?
* The index is the index in the source schema, and the name is for
renaming? So return (department: chararray, id: int), where the data in
department is actually that from the original's name field?
* The position in the RequiredFieldList array is the 'destination'
requested, the name is optional (if the schema had one), and the index is
the location in the original schema? So the above RequiredFieldList is
actually impossible, since "department" is always index 2.

I think it is the last one, but the first idea might be it too. Either way,
the javadoc and other documentation do not describe what these values mean
or what their possible ranges might be.
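
If the last interpretation is right, reading the list would look roughly
like this (just a sketch of my reading of the code, with a made-up helper):

import org.apache.pig.LoadPushDown.RequiredField;
import org.apache.pig.LoadPushDown.RequiredFieldList;

public class ProjectionDump {
    // Prints one line per requested field: the position in the list is the
    // output position, getIndex() is the position in the loader's original
    // schema, and getAlias() is that field's name (may be null if the source
    // schema is unnamed).
    static void dump(RequiredFieldList requiredFields) {
        int outputPosition = 0;
        for (RequiredField f : requiredFields.getFields()) {
            System.out.println("output position " + (outputPosition++)
                    + " <- source index " + f.getIndex()
                    + " (alias " + f.getAlias() + ")");
        }
    }
}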

On Jun 5, 2010, at 6:34 PM, Andrew Rothstein wrote:

I'm trying to figure out how exactly to appropriately implement the
LoadPushDown interface in my LoadFunc implementation. I need to take
the list of column aliases and pass that from the
LoadPushDown.pushProjection(RequiredFieldList) function to make it
available in the getTuple function. I'm kind of new to this, so forgive
me if this is obvious. From my reading of the mailing list it appears
that the pushProjection function is called in the front-end whereas
the getTuple function is called in the back-end. How does a LoadFunc
pass information from the front-end to the back-end instances?

regards, Andrew

On Thu, Jun 3, 2010 at 7:04 AM, Ankur C. Goel <gan...@yahoo-inc.com>
wrote:
A similar need is being expressed by the zebra folks here:
https://issues.apache.org/jira/browse/PIG-1337
You might want to comment/vote on it, as it is scheduled for the 0.8 release.

Loading data in prepareToRead() is fine. As a workaround I think it
should be OK to read the data directly from HDFS in each of the mappers,
provided you aren't doing any costly namespace operations like 'listStatus'
that can stress the namesystem when thousands of tasks execute them
concurrently.

Regards
-...@nkur

6/2/10 10:36 PM, "Scott Carey" <sc...@richrelevance.com> wrote:



On Jun 2, 2010, at 4:49 AM, Ankur C. Goel wrote:

Scott,
     You can set hadoop properties at the time of running your pig
script with the -D option. So
pig -Dhadoop.property.name=something myscript
essentially sets the property in the job configuration.


So no programmatic configuration of hadoop properties is allowed (where
it's easier to control), but it's allowable to set it at the script level? I
guess I can do that, but it complicates things.
Also, this is a very poor way to do this. My script has 600 lines of Pig
and ~45 M/R jobs. Only three of the jobs need the distributed cache, not
all 45.

Speaking specifically of utilizing the distributed cache feature, you
can just set the filename in the LoadFunc constructor and then load the data
into memory in the getNext() method if it is not already loaded.


That is what the original idea was.

Here is the pig command to set up the distributed cache:

pig \
    -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file/for/distributed-cache#file-name" \
    -Dmapred.create.symlink=yes \
    script.pig

The file-name after the '#' needs to be passed to the UDF constructor so that
it's available in the mapper/reducer's working dir on the compute node.

If that property is set, then the constructor only needs the file-name (the
symlink), right? Right now I'm trying to set those properties using the
DistributedCache static interfaces, which means I need to have access to the
full path.


Implement something like a loadData() method that loads the data only
once and invoke it from the getNext() method. The script will work even in
local mode if the file distributed via the distributed cache resides in the
CWD from which the script is invoked.
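
Roughly something like this (only a sketch; the class name, field names, and
the tab-separated file format are made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CacheBackedLoader /* extends LoadFunc ... */ {

    private final String cacheFileName;  // the symlink name after the '#' in mapred.cache.files
    private Map<String, String> lookup;  // loaded at most once per task JVM

    public CacheBackedLoader(String cacheFileName) {
        this.cacheFileName = cacheFileName;
    }

    private void loadDataIfNeeded() throws IOException {
        if (lookup != null) {
            return; // already loaded
        }
        lookup = new HashMap<String, String>();
        // With mapred.create.symlink=yes the symlink is in the task's working
        // directory on the compute node; in local mode the file just has to
        // sit in the CWD the script was launched from.
        BufferedReader in = new BufferedReader(new FileReader(cacheFileName));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        } finally {
            in.close();
        }
    }

    // getNext() (or prepareToRead()) would call loadDataIfNeeded() before
    // using the lookup map.
}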


I'm loading the data in prepareToRead(), which seems most appropriate.
Do you see any problem with that?

Hope that's helpful.

I think the command line property hack is insufficient. I am left with
a choice of having a couple of jobs read the file from HDFS directly in their
mappers, or having all jobs unnecessarily set up the distributed cache. Job
setup time is already 1/4 of my processing time.
Is there a feature request for Load/Store access to Hadoop job
configuration properties?

Ideally, this would be a method on LoadFunc that passes in a modifiable
Configuration object on the front-end, or a callback through which a user
can optionally provide a Configuration object containing the few properties
to alter, which Pig would apply to the real configuration before it finalizes
its properties.
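
For example, something like this (purely hypothetical; no such hook exists
in Pig today) would cover my case:

import org.apache.hadoop.conf.Configuration;

// Hypothetical interface only -- sketching the kind of front-end hook
// described above, not anything that exists in Pig.
public interface FrontendConfigAware {
    /**
     * Would be called exactly once, on the front-end, before Pig finalizes
     * the job configuration, so a loader could add distributed cache
     * entries or other job-level properties.
     */
    void configureFrontend(Configuration conf);
}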

Thanks for the info Ankur,

-Scott


-...@nkur

On 6/2/10 2:53 PM, "Scott Carey" <sc...@richrelevance.com> wrote:

So, here are some things I'm struggling with now:

In a LoadFunc, I want to load something into the DistributedCache. The
path is passed into the LoadFunc constructor as an argument.
The documentation on getSchema() and all the other metadata methods states
that you can't modify the job or its configuration passed in. I've verified
that changes to the Configuration are ignored if set there.

It appears that I could set these properties in setLocation(), but that
is called a lot on the back-end too, and the documentation does not state
whether setLocation() is called at all on the front-end. Based on my
experimental results, it doesn't seem to be.
Is there no way to modify Hadoop properties on the front-end to utilize
hadoop features? UDFContext seems completely useless for setting hadoop
properties for things other than the UDF itself -- like distributed cache
settings. A stand-alone front-end hook for this would be great. Otherwise,
any hack that works would be acceptable for now.


* The documentation for LoadMetadata could use some information about
when each method gets called -- front-end only? Between what other calls?
* UDFContext's documentation needs help too --
** addJobConf() is public, but not expected to be used by end-users,
right?  Several public methods here look like they need better
documentation, and the class itself could use a javadoc entry with some
example uses.


On May 24, 2010, at 11:06 AM, Alan Gates wrote:

Scott,

I made an effort to address the documentation in
https://issues.apache.org/jira/browse/PIG-1370
If you have a chance take a look and let me know if it deals with
the issues you have or if more work is needed.

Alan.

On May 24, 2010, at 11:00 AM, Scott Carey wrote:

I have been using these documents for a couple of weeks, implementing
various store and load functionality, and they have been very
helpful.

However, there is room for improvement.  What is most unclear is
when the API methods get called. Each method should clearly state in
these documents (and the javadoc) when it is called -- front-end only?
back-end only? both? Sometimes this is obvious, other times
it is not.
For example, without looking at the source code it's not possible to
tell or infer whether pushProjection() is called on the front-end or
back-end, or both. It could be implemented by being called on the
front-end, expecting the loader implementation to persist necessary
state to UDFContext for the back-end, or be called only on the back-end,
or both.  One has to look at the PigStorage source to see that it
persists the pushProjection information into UDFContext, so it's
_probably_ only called on the front-end.

There are also a few types, returned by or provided to these interfaces,
that are completely undocumented.  I had to look at the
source code to figure out what ResourceStatistics does, and how
ResourceSchema should be used. RequiredField, RequiredFieldList,
and RequiredFieldResponse are all poorly documented aspects of a
public interface.


On May 21, 2010, at 11:42 AM, Pradeep Kamath wrote:

To add to this, there is also a how-to document on how to go about
writing load/store functions from scratch in Pig 0.7 at
http://wiki.apache.org/pig/Pig070LoadStoreHowTo.

Pradeep

-----Original Message-----
From: Alan Gates [mailto:ga...@yahoo-inc.com]
Sent: Friday, May 21, 2010 11:33 AM
To: pig-user@hadoop.apache.org
Cc: Eli Collins
Subject: Pig loader 0.6 to 0.7 migration guide

At the Bay Area HUG on Wednesday someone (Eli I think, though I might
be remembering incorrectly) asked if there was a migration guide for
moving Pig load and store functions from 0.6 to 0.7. I said there was,
but I couldn't remember if it had been posted yet or not. In fact it
had already been posted to
http://wiki.apache.org/pig/LoadStoreMigrationGuide
. Also, you can find the list of all incompatible changes for 0.7 at
http://wiki.apache.org/pig/Pig070IncompatibleChanges
. Sorry, I should have included those links in my original slides.

Alan.









