Charlie,
I apologize; I got busy and let this thread drop. Comments inlined below.
Charlie Groves wrote:
On Jan 18, 2008, at 11:45 AM, Alan Gates wrote:
We're definitely interested.
Excellent!
Our thinking on how to provide field metadata (names and eventually
types) for Pig queries was to allow several options:
1) An AS clause in the LOAD, as you can currently do for names.
2) Using an outside metadata service, where we would tell it the
file name and it would tell us the metadata.
3) Support for self-describing data formats such as JSON.
Your suggestion of a very simple schema provided in the first line
of the file falls under category 3. The trick here is that we need
to be able to read that field metadata at parse time (because we'd
like to be able to do type checking and such). So in addition to the
load function itself examining the tuples, we need a way for the load
function to read just enough of the file to tell the front end (on
the client box, not on the map-reduce backend) the schema. Maybe the
best way to implement this is an interface that a load function can
implement to tell the parser it is able to discover the metadata
itself; the parser could then call that load function before
proceeding to type checking.
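As a rough sketch of what I mean (the interface and method names are made up, not anything we've committed to, and the import paths are just my guess at where the classes live):

    import java.io.IOException;
    // Schema's package path is assumed here; use wherever Pig's Schema class lives.
    import org.apache.pig.impl.logicalLayer.schema.Schema;

    /**
     * Hypothetical interface a load function could implement to signal
     * to the parser that it can discover its own metadata.
     */
    public interface SelfDescribingLoader {
        /**
         * Called by the parser on the front end (client box) before type
         * checking.  The loader reads just enough of the file -- a header
         * line, a side metadata file, etc. -- to report the schema of the
         * tuples it will produce.
         */
        Schema determineSchema(String fileName) throws IOException;
    }

The parser would check whether the load function implements this and, if so, call determineSchema() before proceeding to type checking.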
We're also interested in being able to tell the load function which
fields the query needs. Even if you don't have field-per-file
storage (a.k.a. columnar storage), it's useful to be able to
immediately project out fields you know the query won't care about,
as you can avoid translation costs and memory overhead.
It's not clear to me that we need another interface to implement
this. We could just add a method "void neededColumns(Schema s)" to
PigLoader. As a post-parsing step, the parser would then visit the
plan, as you suggest, and submit a schema to the PigLoader function.
It would be up to the specific loader implementation to decide
whether or not to make use of the provided schema.
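Roughly something like this (PigLoader here stands in for the loader interface with everything but the relevant methods omitted; only neededColumns is the proposed addition, and the package paths are assumed):

    import java.io.IOException;
    // Package paths assumed; substitute Pig's actual Tuple and Schema classes.
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.logicalLayer.schema.Schema;

    public interface PigLoader {
        // existing contract: return the next tuple, or null at end of input
        Tuple getNext() throws IOException;

        // proposed addition: after visiting the plan, the parser hands the
        // loader the columns the query actually touches; implementations
        // are free to ignore it
        void neededColumns(Schema s);
    }

A columnar loader could then open only the needed column files; even a plain tab-delimited loader could at least skip converting fields it knows will be thrown away.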
I don't see the use for the first new function in addition to the
second. If a schema is required by the query, the loader must be able
to produce data matching that schema. If the loader can figure out an
internal schema, it can make the check you describe for function 1 in
addition to structuring its data correctly as in function 2. If it
can't determine its internal schema until it loads data, then it can
do neither, and we have to wait until runtime to see if it succeeds.
What about making the call "Schema neededColumns(Schema s) throws
IOException"? The returned Schema is the actual Schema that will be
loaded, which must be a superset of the incoming Schema. If the
loader is unable to create the needed schema, an IOException is
thrown.
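In code, the contract I'm picturing looks like this (ColumnarLoader is just a made-up example, the superset check is only a placeholder, and the imports are the same assumed ones as in the sketch above):

    public class ColumnarLoader {
        private Schema stored;   // schema of the columns actually in the file

        // Revised signature: report the schema that will really be loaded
        // (a superset of what was asked for), or fail at parse time.
        public Schema neededColumns(Schema requested) throws IOException {
            if (!isSupersetOf(stored, requested)) {
                throw new IOException("cannot supply requested fields: " + requested);
            }
            return stored;
        }

        // placeholder; a real check would compare field names and types
        private boolean isSupersetOf(Schema outer, Schema inner) {
            return true;
        }
    }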
Is the necessary Schema known somewhere in the parser, or will I have
to figure that out from the Schemas available at each step? I haven't
seen anything like that.
Charlie
I'm not sure I understand what you're proposing. I was trying to say
that we need two separate things from the load function:
1) A way to discover the schema of the data at parse time, for type
checking and query correctness checking (e.g. the user asked for field
5; is there a field 5?). This is needed for metadata option 3, where
the metadata is described by the data itself (as in JSON) or is
located in a file associated with the data. We want to detect these
kinds of errors before we submit to the backend (i.e. Hadoop) so that
we can give the earliest possible error feedback.
2) A way to tell the load function the schema it needs to load, in
order to support columnar storage schemes (such as you propose) or,
more generally, to push projection down into the load (a rough sketch
of both follows below).
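To make the distinction concrete, here is how I picture the two hooks side by side (sticking with your PigLoader name and the same assumed imports; nothing here is a committed signature):

    public interface PigLoader {
        // (1) parse-time schema discovery: read just enough of the input on
        //     the client side to report the schema, so type checking and
        //     "is there a field 5?" errors surface before we submit to Hadoop
        Schema determineSchema(String fileName) throws IOException;

        // (2) projection push-down: the parser tells the loader which columns
        //     the query needs, so columnar or self-describing loaders can
        //     avoid reading and converting the rest (this could equally be
        //     the Schema-returning variant you proposed)
        void neededColumns(Schema s);
    }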
Were you saying that you don't think one of those is necessary, or are
you saying that you think we can accomplish both with a single function
added to the load function?
Alan.