Charlie,
I apologize; I got busy and let this thread drop. Comments inlined below.
Charlie Groves wrote:
On Jan 18, 2008, at 11:45 AM, Alan Gates wrote:
We're definitely interested.
Excellent!
Our thinking on how to provide field metadata (names and eventually
types) for Pig queries was to allow several options:
1) An AS clause in the LOAD, as you can currently do for names.
2) Using an outside metadata service, where we would tell it the
file name and it would tell us the metadata.
3) Support for self-describing data formats such as JSON.
Your suggestion of a very simple schema provided in the first line
of the file falls under category 3. The trick here is that we need
to be able to read that field metadata at parse time (because we'd
like to be able to do type checking and such). So in addition to the
load function itself examining the tuples, we need a way for the load
function to read just enough of the file to tell the front end (on
the client box, not on the map-reduce backend) the schema. Maybe the
best way to implement this is an interface that a load function can
implement to tell the parser it is able to discover the metadata
itself; the parser could then call that load function before
proceeding to type checking.
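As a rough sketch of what I mean (the interface and method names are made up, not anything we've committed to, and the import paths are just my guess at where the classes live):

    import java.io.IOException;
    // Schema's package path is assumed here; use wherever Pig's Schema class lives.
    import org.apache.pig.impl.logicalLayer.schema.Schema;

    /**
     * Hypothetical interface a load function could implement to signal
     * to the parser that it can discover its own metadata.
     */
    public interface SelfDescribingLoader {
        /**
         * Called by the parser on the front end (client box) before type
         * checking.  The loader reads just enough of the file -- a header
         * line, a side metadata file, etc. -- to report the schema of the
         * tuples it will produce.
         */
        Schema determineSchema(String fileName) throws IOException;
    }

The parser would check whether the load function implements this and, if so, call determineSchema() before proceeding to type checking.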
We're also interested in being able to tell the load function which
fields the query needs. Even if you don't have field-per-file
storage (a.k.a. columnar storage), it's useful to be able to
immediately project out fields you know the query won't care about,
as you can avoid translation costs and memory overhead.
It's not clear to me that we need another interface to implement
this. We could just add a method "void neededColumns(Schema s)" to
PigLoader. As a post-parsing step, the parser would then visit the
plan, as you suggest, and submit a schema to the PigLoader function.
It would be up to the specific loader implementation to decide
whether or not to make use of the provided schema.
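Roughly something like this (PigLoader here stands in for the loader interface with everything but the relevant methods omitted; only neededColumns is the proposed addition, and the package paths are assumed):

    import java.io.IOException;
    // Package paths assumed; substitute Pig's actual Tuple and Schema classes.
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.logicalLayer.schema.Schema;

    public interface PigLoader {
        // existing contract: return the next tuple, or null at end of input
        Tuple getNext() throws IOException;

        // proposed addition: after visiting the plan, the parser hands the
        // loader the columns the query actually touches; implementations
        // are free to ignore it
        void neededColumns(Schema s);
    }

A columnar loader could then open only the needed column files; even a plain tab-delimited loader could at least skip converting fields it knows will be thrown away.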
I don't see the use for the first new function in addition to the
second. If a schema is required by the query, the loader must be able
to produce data matching that schema. If the loader can figure out an
internal schema, it can make the check you describe for function 1 in
addition to structuring its data correctly as in function 2. If it
can't determine its internal schema until it loads data, then it can
do neither, and we have to wait until runtime to see if it succeeds.
What about making the call "Schema neededColumns(Schema s) throws
IOException"? The returned Schema is the actual Schema that will be
loaded, which must be a superset of the incoming Schema. If the
loader is unable to create the needed schema, an IOException is
thrown.
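In code, the contract I'm picturing looks like this (ColumnarLoader is just a made-up example, the superset check is only a placeholder, and the imports are the same assumed ones as in the sketch above):

    public class ColumnarLoader {
        private Schema stored;   // schema of the columns actually in the file

        // Revised signature: report the schema that will really be loaded
        // (a superset of what was asked for), or fail at parse time.
        public Schema neededColumns(Schema requested) throws IOException {
            if (!isSupersetOf(stored, requested)) {
                throw new IOException("cannot supply requested fields: " + requested);
            }
            return stored;
        }

        // placeholder; a real check would compare field names and types
        private boolean isSupersetOf(Schema outer, Schema inner) {
            return true;
        }
    }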
Is the necessary Schema known somewhere in the parser, or will I have
to figure that out from the Schemas available at each step? I haven't
seen anything like that.
Charlie
I'm not sure I understand what you're proposing. I was trying to say
that we need two separate things from the load function:
1) A way to discover the schema of the data at parse time, for type
checking and query correctness checking (e.g. the user asked for field
5; is there a field 5?). This is needed for metadata option 3, where
the metadata is described by the data itself (as in JSON) or is
located in a file associated with the data. We want to detect these
kinds of errors before we submit to the backend (i.e. Hadoop) so that
we can give the earliest possible error feedback.
2) A way to tell the load function the schema it needs to load, in
order to support columnar storage schemes (such as you propose) or,
more generally, to push projection down into the load (a rough sketch
of both follows below).
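To make the distinction concrete, here is how I picture the two hooks side by side (sticking with your PigLoader name and the same assumed imports; nothing here is a committed signature):

    public interface PigLoader {
        // (1) parse-time schema discovery: read just enough of the input on
        //     the client side to report the schema, so type checking and
        //     "is there a field 5?" errors surface before we submit to Hadoop
        Schema determineSchema(String fileName) throws IOException;

        // (2) projection push-down: the parser tells the loader which columns
        //     the query needs, so columnar or self-describing loaders can
        //     avoid reading and converting the rest (this could equally be
        //     the Schema-returning variant you proposed)
        void neededColumns(Schema s);
    }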
Were you saying that you don't think one of those is necessary, or are
you saying that you think we can accomplish both with a single function
added to the load function?
Alan.