On Feb 4, 2008, at 1:42 PM, Alan Gates wrote:
Charlie Groves wrote:
On Jan 18, 2008, at 11:45 AM, Alan Gates wrote:
Our thinking on how to provide field metadata (names and
eventually types) for Pig queries was to allow several options:
1) AS in the LOAD, as you can currently do for names.
2) Using an outside metadata service, where we would tell it
the file name and it would tell us the metadata.
3) Supporting self-describing data formats such as JSON.
Your suggestion of a very simple schema provided in the first
line of the file falls under category 3. The trick here is that
we need to be able to read that metadata about the fields at
parse time (because we'd like to be able to do type checking and
such). So in addition to the load function itself needing to
examine the tuples, we need a way for the load function to read
just enough of the file to tell the front end (on the client box,
not on the map-reduce backend) the schema. Maybe the best way to
implement this is an interface that a load function can implement
to tell the parser it is able to discover the metadata itself; the
parser could then call the load function before proceeding to type
checking.
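A rough sketch of what that interface might look like (the names
here are placeholders, not an existing Pig API; Schema is the
schema type we've been discussing):

    import java.io.IOException;

    // Hypothetical interface, sketched for discussion: a load
    // function that implements it is telling the parser it can
    // work out its own schema before execution.
    public interface SchemaDiscoverable {
        // Called on the client box at parse time, before type
        // checking. The load function reads just enough of the
        // named file to determine the schema of the tuples it
        // will produce.
        Schema determineSchema(String fileName) throws IOException;
    }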
We're also interested in being able to tell the load function the
fields needed in the query. Even if you don't have field-per-file
storage (a.k.a. columnar storage), it's useful to be able to
immediately project out fields you know the query won't care
about, as you can avoid translation costs and memory overhead.
It's not clear to me that we need another interface to implement
this. We could just add a method "void neededColumns(Schema s)"
to PigLoader. As a post-parsing step, the parser would then visit
the plan, as you suggest, and submit a schema to the PigLoader
function. It would be up to the specific loader implementation
to decide whether to make use of the provided schema or not.
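Concretely, something like this (the loader class and its
internals are just illustrative; only neededColumns is the
proposed addition):

    // Hypothetical columnar loader, sketched for discussion.
    public class ColumnFileLoader extends PigLoader {
        private Schema wanted; // null = no hint, load all fields

        // Proposed addition: called by the parser after it has
        // visited the plan and worked out which fields the query
        // actually touches.
        public void neededColumns(Schema s) {
            // This loader honors the hint and will skip the
            // per-field files for columns not in s; a loader is
            // equally free to ignore the hint and load everything.
            this.wanted = s;
        }
    }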
I don't see the use for the first new function in addition to the
second. If a schema is required by the query, the loader must be
able to produce data matching that schema. If the loader can
figure out an internal schema, it can perform the check you
describe in function 1 in addition to structuring its data
correctly as in function 2. If it can't determine its internal
schema until it loads data, then it can do neither and we have to
wait until runtime to see if it succeeds. What about making the
call "Schema neededColumns(Schema s) throws IOException"? The
returned Schema is the actual Schema that will be loaded, which
must be a superset of the incoming Schema. If the loader is
unable to create the needed schema, an IOException is thrown.
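In other words, the contract would look roughly like this
(discoverSchema and isSupersetOf are hypothetical helpers, not
existing Pig calls):

    // Sketch of the proposed method for a loader that can inspect
    // its underlying data before execution.
    public Schema neededColumns(Schema s) throws IOException {
        // e.g. read a header row or an associated metadata file
        Schema actual = discoverSchema();
        if (!isSupersetOf(actual, s)) {
            throw new IOException(
                "data cannot satisfy requested schema: " + s);
        }
        return actual; // always a superset of the schema asked for
    }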
I'm not sure I understand what you're proposing. I was trying to
say that we need two separate things from the load function:
1) A way to discover the schema of the data at parse time for type
checking and query correctness checking (e.g., the user asked for
field 5; is there a field 5?). This is needed for metadata option
3, where the metadata is described by the data (as in JSON) or
where the metadata is located in a file associated with the data.
We want to detect these kinds of errors before we submit to the
backend (i.e. Hadoop) so that we can give the earliest possible
error feedback.
2) A way to indicate to the load function the schema it needs to
load, as a way to support columnar storage schemes (such as you
propose) or to push projection down into the load.
Were you saying that you didn't think one of those is necessary, or
are you saying that you think we can accomplish both with one
function being added to the load function?
I'm saying that both can be accomplished with one new function on the
load func: Schema neededColumns(Schema s) throws IOException. s is
the schema derived from the query, and the load func can use it to
satisfy your first requirement. If it can inspect its underlying
data, it can compare that data's schema to s and throw an
IOException if it can't satisfy it. s can also be used to satisfy
your second requirement, as it indicates to the load func what
it's expected to load.
The returned Schema is the form that the actual data returned by the
load func will take. It must be a superset of the passed-in
Schema, and really just exists to allow the load func to say it
isn't going to prune any of the data away at load time and will
just return everything that it finds. Load funcs that don't know
the structure of their data until they actually read it can return
the * schema and wait until runtime to see if things blow up, just
as things work currently.
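For such a loader the method degenerates to something like this
(assuming the * schema can be represented, say by a wildcard
Schema instance; Schema.WILDCARD below is a hypothetical
stand-in):

    // Sketch: a loader that can't know its structure until read
    // time declines to check or prune anything up front and
    // defers any schema errors to runtime, exactly as loaders
    // behave today.
    public Schema neededColumns(Schema s) throws IOException {
        return Schema.WILDCARD; // hypothetical * schema value
    }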
I think this makes more sense as a single function because the two
requirements are essentially the same operation. Loading enough
of the data to check a given schema against what's actually in the
store (requirement one) is almost the same work as determining
what it'll actually load (requirement two).
Make more sense?
Charlie