Re: Getting query information while loading data

Charlie Groves Fri, 18 Jan 2008 15:57:52 -0800


On Jan 18, 2008, at 11:45 AM, Alan Gates wrote:

We're definitely interested.


Excellent!

Our thinking of how to provide field metadata (name and eventually types) for pig queries was to allow several options:
   1) AS in the LOAD, as you can currently do for names.
   2) Using an outside metadata service, where we would tell it the file name and it would tell us the metadata.
   3) Support self-describing data formats such as JSON.
Your suggestion for a very simple schema provided in the first line of the file falls under category 3. The trick here is that we need to be able to read that metadata about the fields at parse time (because we'd like to be able to do type checking and such). So in addition to the load function itself needing to examine the tuples, we need a way for the load function to read just enough of the file to tell the front end (on the client box, not on the map-reduce backend) the schema. Maybe the best way to implement this is to have an interface that the load function would implement that lets the parser know that the load function can discover the metadata for it, and then the parser could call that load function before proceeding to type checking.
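As a rough illustration of the discovery interface described above, here is a minimal sketch of a loader-side hook that reads only the first line of a CSV input and reports the field names. The names SchemaDiscoverer, CsvHeaderDiscoverer and discoverFieldNames are hypothetical and are not part of Pig; a plain list of strings stands in for a real Schema.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class HeaderSchemaSketch {

    /** Hypothetical interface: a loader implements this to say it can discover field names. */
    interface SchemaDiscoverer {
        List<String> discoverFieldNames(Reader input) throws IOException;
    }

    /** Assumes comma-separated data whose first line holds the field names. */
    static final class CsvHeaderDiscoverer implements SchemaDiscoverer {
        @Override
        public List<String> discoverFieldNames(Reader input) throws IOException {
            BufferedReader reader = new BufferedReader(input);
            // Read only the header line; the data rows are left for the backend.
            String header = reader.readLine();
            if (header == null) {
                throw new IOException("Empty input: no header line to read");
            }
            List<String> names = new ArrayList<>();
            for (String part : header.split(",")) {
                names.add(part.trim());
            }
            return names;
        }
    }

    public static void main(String[] args) throws IOException {
        Reader input = new StringReader("user, url,timestamp\nalice,http://example.com,1\n");
        // prints [user, url, timestamp]
        System.out.println(new CsvHeaderDiscoverer().discoverFieldNames(input));
    }
}
```

In this sketch the front end would call discoverFieldNames at parse time, before type checking, and the regular load path would be unchanged.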
We're also interested in being able to tell the load function the fields needed in the query. Even if you don't have field per file storage (aka columnar storage) it's useful to be able to immediately project out fields you know the query won't care about, as you can avoid translation costs and memory storage.
It's not clear to me that we need another interface to implement this. We could just add a method "void neededColumns(Schema s)" to PigLoader. As a post parsing step the parser would then visit the plan, as you suggest, and submit a schema to the PigLoader function. It would be up to the specific loader implementation to decide whether to make use of the provided schema or not.

I don't see the use for the first new function in addition to the second. If a schema is required by the query, the loader must be able to produce data matching that schema. If the loader can figure out an internal schema, it can make the check that you describe in function 1 in addition to structuring its data correctly as in function 2. If it can't determine its internal schema until it loads data, then it can do neither and we have to wait until runtime to see if it succeeds. What about making the call "Schema neededColumns(Schema s) throws IOException"? The returned Schema is the actual Schema that will be loaded, which must be a superset of the incoming Schema. If the loader is unable to create the needed schema, an IOException is thrown.
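To make the proposed contract concrete, here is a minimal sketch under some assumptions: FieldSchema is a hypothetical stand-in for Pig's Schema (just an ordered list of field names), and ColumnAwareLoader and FilePerFieldLoader are made-up names for illustration. Only the shape of neededColumns comes from the proposal above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class NeededColumnsSketch {

    /** Hypothetical stand-in for Pig's Schema: an ordered list of field names. */
    static final class FieldSchema {
        final List<String> names;
        FieldSchema(String... names) { this.names = Arrays.asList(names); }
        FieldSchema(List<String> names) { this.names = new ArrayList<>(names); }
    }

    /** Hypothetical interface carrying the proposed method. */
    interface ColumnAwareLoader {
        /** Returns the schema that will actually be loaded; it must cover every requested field. */
        FieldSchema neededColumns(FieldSchema requested) throws IOException;
    }

    /** A loader that knows its fields up front, e.g. storage with one file per field. */
    static final class FilePerFieldLoader implements ColumnAwareLoader {
        private final Set<String> available;

        FilePerFieldLoader(String... available) {
            this.available = new LinkedHashSet<>(Arrays.asList(available));
        }

        @Override
        public FieldSchema neededColumns(FieldSchema requested) throws IOException {
            for (String name : requested.names) {
                if (!available.contains(name)) {
                    // The query needs a field this loader cannot produce: fail before any data is read.
                    throw new IOException("Cannot load field: " + name);
                }
            }
            // This loader returns exactly what was asked for; another loader could return a superset.
            return new FieldSchema(requested.names);
        }
    }

    public static void main(String[] args) throws IOException {
        ColumnAwareLoader loader = new FilePerFieldLoader("user", "url", "timestamp");
        // prints [user, url]
        System.out.println(loader.neededColumns(new FieldSchema("user", "url")).names);
        try {
            loader.neededColumns(new FieldSchema("user", "referrer"));
        } catch (IOException e) {
            // prints rejected: Cannot load field: referrer
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A loader that cannot know its schema until it reads data could simply return the requested schema unchanged and leave any failure to runtime.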

Is the necessary Schema known somewhere in the parser, or will I have to figure that out from the Schemas available at each step? I haven't seen anything like that.


Charlie

Charlie Groves wrote:
I'd like to expose the running query to my loading code for a few reasons:
- To allow the schema of the loaded data to be specified by its usage in the query, rather than by an explicit AS. I know the names of the fields in my data, so it seems backwards to me to require it to be named in the query. I'd rather use the data access in the query to figure out the names of the fields and pass that to my loader to put the data in the right place in a tuple. This also seems like it could be nice for CSV data since it generally has the names as the first line.
- Following up on using the query to determine the schema, I'd like to use the query-determined schema to decide what to load. My storage is broken out into files by field, so if I know which fields are used in a query, I can read only those fields and save a huge amount of busywork.
- To optimize filter operations using indexes. For some of my fields, I have metadata that tells me the range of values in that file. If I could find all the filter operations on that field, I could reject entire files if their values fell outside the filter range.
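The third use case, rejecting whole files by their value range, might look roughly like the sketch below. It assumes each file carries hypothetical min/max metadata for one numeric field and that the filter has already been reduced to an inclusive range; FileRange, filesToRead and the file names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RangePruningSketch {

    /** Hypothetical per-file metadata: smallest and largest value of the field in that file. */
    static final class FileRange {
        final String path;
        final long min;
        final long max;

        FileRange(String path, long min, long max) {
            this.path = path;
            this.min = min;
            this.max = max;
        }
    }

    /** Keeps only files whose [min, max] overlaps the filter range [lo, hi], both inclusive. */
    static List<String> filesToRead(List<FileRange> files, long lo, long hi) {
        List<String> keep = new ArrayList<>();
        for (FileRange f : files) {
            // A file can hold matching rows only if its range and the filter range overlap.
            if (f.max >= lo && f.min <= hi) {
                keep.add(f.path);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<FileRange> files = Arrays.asList(
            new FileRange("part-0", 0, 99),
            new FileRange("part-1", 100, 199),
            new FileRange("part-2", 200, 299));
        // Filter equivalent to: field >= 150 AND field <= 250
        // prints [part-1, part-2]
        System.out.println(filesToRead(files, 150, 250));
    }
}
```

The overlap test is conservative: a kept file may still contain no matching rows, so the filter itself still has to run on whatever is loaded.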
Are you interested in some patches to do this sort of thing? If so, what's the best way to expose this information to user code? My very basic, initial thinking for the first two use cases is to write a LOVisitor and an EvalSpecVisitor to spider through the built query and build a schema to pass to an interested load func. A load func indicates its interest by implementing a new interface that takes the schema, and it takes responsibility for making a tuple that conforms to the schema. If a load func isn't interested, it just implements the current interface and loads all the data in its input stream.
The final use case seems like it would require exposing EvalFuncs and the LogicalPlan to user code, so I'm fine with just going after the first two for now and figuring that out later. However, if there's a way that's exposed already in the code that I've missed, or if there's a better way to do it, I'd like to check it out since it'd be hugely beneficial for what I'm doing.
Thanks,
Charlie
