[ https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757979#action_12757979 ]
Dmitriy V. Ryaboy commented on PIG-966:
---------------------------------------

The comments below are from both me and Ashutosh. We'd like to preface this by saying that, overall, we think the proposed changes are very useful and important, and are likely to significantly reduce the barriers to Pig adoption in the broader Hadoop user community. There are a lot of suggestions and critiques below, but that's just because we care :-) On to the notes.

*Names of interfaces*

Can you explain why everything has a Load prefix? It seems to limit the interfaces unnecessarily, and it is a bit semantically inconsistent: LoadMetadata does not represent metadata associated with loading -- it loads metadata. LoadStatistics does not load statistics -- it represents statistics, which are loaded using LoadMetadata. How about:

- LoadCaster -> PigTypeCaster
- LoadPushDown -> Filterable, Projectionable (the latter may need a better name; clearly, we are also suggesting breaking this interface into multiple interfaces -- more on that later)
- LoadSchema -> ResourceSchema
- LoadFieldSchema -> FieldSchema or ResourceFieldSchema
- LoadMetadata -> MetadataReader
- StoreMetadata -> MetadataWriter
- LoadStatistics -> ResourceStatistics

*LoadFunc*

Regarding the appropriate parameters for setURI: can you explain the advantage of this over Strings in more detail? I think the current setLocation approach is preferable; it gives users more flexibility. Plus, Hadoop Paths are constructed from strings, not URIs, so we would be forcing a string -> URI -> string conversion in the common case.

The _getLoadCaster_ method: perhaps _getTypeCaster_ or _getPigTypeCaster_ is a better name?

_prepareToRead_: does it need a _finishReading()_ mate?

I would like to see a "standard" method for getting the jobconf (or whatever it is called in Hadoop 0.20/0.21), for both LoadFunc and StoreFunc.

*LoadCaster (or PigTypeCaster)*

This interface is implemented by UTF8StorageConverter.
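On the String-vs-URI point in the LoadFunc notes above: beyond the extra conversion, java.net.URI also rejects characters that are perfectly legal in Hadoop path strings, such as glob braces and spaces. A small self-contained sketch of the difference (plain Java, no Pig or Hadoop types; the class and method names are illustrative only):

```java
import java.net.URI;

// Illustrative only: a URI-based setURI would reject location strings
// that a String-based setLocation could pass straight through to a Path.
public class LocationDemo {
    // Returns true iff the location survives a String -> URI conversion.
    static boolean isValidUri(String location) {
        try {
            URI.create(location);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUri("/data/logs/part-00000"));    // plain path: fine
        System.out.println(isValidUri("/data/part-{00000,00001}")); // '{' is illegal in a URI
        System.out.println(isValidUri("/data/with space"));         // spaces are illegal too
    }
}
```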
Let's decide what these classes are -- _casters_ or _converters_ -- and use one term consistently.

*LoadMetadata (or MetadataReader)*

Some thoughts on what happens when the loader is loading multiple resources, or a resource with multiple partitions. We think the schema should be uniform across everything a single instance of a loader is responsible for loading (the loader can fill in nulls or defaults where appropriate if some resources are missing fields). Statistics should be aggregated, since the collection of resources will be treated as one (knowledge of relevant partitions would be used by a Filterable/Projectionable loader to push selections down, not, I think, by downstream operators). So we have two options:

1. getStatistics returns a collection (lowercase c) of stats associated with the resources the loader is loading, perhaps as a Map of String -> ResourceStatistics. These would then go through a stat aggregator of some sort that knows how to unify statistics across multiple resources in a generic way.
2. getStatistics is responsible for its own implementation of aggregation, which gives it flexibility in how such aggregation is done. Since we don't expect many stat stores, this seems preferable to us, as generic aggregation is going to be hard to get right.

(Of course there is option 3, where we have a default stat aggregator class that can be extended or overridden by individual MetadataReaders, but I imagine this would be a hard sell.)

*LoadSchema (or ResourceSchema)*

Should org.apache.pig.impl.logicalLayer.schema.Schema be changed to use this as its internal representation?

I like how sort information is handled here. Perhaps we can consider using this approach instead of _SortInfo_ in PIG-953; if _PigSchema_ implements or contains _ResourceSchema_, _SortInfo_ will no longer be needed.

_PartitionKeys_ aren't really part of the schema; they are a storage/distribution property.
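For concreteness, option 1 from the LoadMetadata notes above might look like the sketch below. All names here (the ResourceStatistics fields, the aggregate method) are assumptions for illustration, not part of the proposal; the point is that only trivially additive statistics aggregate generically, which is why option 2 seems preferable:

```java
import java.util.Map;

public class StatsAggregation {
    // Hypothetical per-resource stats holder; field names are assumptions, not the proposal's.
    static class ResourceStatistics {
        long numRecords;
        long sizeInBytes;
        ResourceStatistics(long numRecords, long sizeInBytes) {
            this.numRecords = numRecords;
            this.sizeInBytes = sizeInBytes;
        }
    }

    // Option 1: a generic aggregator unifies per-resource stats into one view.
    // Additive stats sum cleanly; distributions, averages, etc. would not.
    static ResourceStatistics aggregate(Map<String, ResourceStatistics> perResource) {
        long records = 0;
        long bytes = 0;
        for (ResourceStatistics s : perResource.values()) {
            records += s.numRecords;
            bytes += s.sizeInBytes;
        }
        return new ResourceStatistics(records, bytes);
    }
}
```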
Partition keys should go into the metadata and refer to the schema.

*LoadStatistics (or ResourceStatistics)*

Why the public fields? Not that I am a huge fan of getters and setters, but I sense findbugs warnings heading our way.

We need to account for some statistics being missing. What should numRecords be set to if we don't know the number of records? We can use a Long and set it to null; we can use a magic value (-1?); or we can wrap it in a getter and throw an exception (ugh).

I had envisioned statistics as more of a key-value thing, with some keys predefined in a separate class. So we would have:

{code}
ResourceStats.NUM_RECORDS
ResourceStats.SIZE_IN_BYTES
// etc
{code}

and to get the stats we would call:

{code}
myResourceStats.getLong(ResourceStats.NUM_RECORDS)
myResourceStats.getFloat(ResourceStats.SOMETHING_THAT_IS_A_FLOAT)
// etc
{code}

This allows us to be far more flexible with regard to the things marked as "// probably more here".

Something that is probably worth its own topic: we have been considering extending Pig Latin to allow users to specify some dataset properties on the fly. For example, when performing a filter, a user could help planning by saying something like:

{code}
bar = FILTER foo BY myUdf($1) EXPECT SELECTIVITY=0.95 -- just getting rid of some noise!
foobar = JOIN foo by id, bar by id EXPECT SELECTIVITY=0.1 -- small cross-section!
{code}

Having the user provide a rough estimate of operator selectivity would be far more effective than trying to estimate result sizes based on stats. In fact, no less an authority than Microsoft's Surajit Chaudhuri advocates opening up these sorts of communication channels in SQL to help optimizers do a better job (see his SIGMOD 2009 paper). This implies that statistics can (and should) be associated not just with loaded resources, but also with operator outputs.

*LoadPushdown*

As alluded to above, I am not sure this is a good interface.
The idea is that we allow loaders to declare which operations can be pushed down to them; but the concept of a pushdown is really a Pig concept, not a Load concept. I think breaking this out into two interfaces would be more advisable. So, I propose instead:

{code}
interface Filterable {
    /**
     * Checks whether the plan consists of operations that this loader can
     * push down to the physical level.
     */
    public boolean checkFilterPlan(OperatorPlan plan);

    /**
     * Sets a filter plan that will be pushed down to this loader.
     * Throws an exception if the plan is malformed (what does this mean? tbd)
     * or if this loader cannot accept the plan for some reason.
     */
    public void setFilterPlan(OperatorPlan plan)
        throws MalformedPlanException, UnacceptablePlanException;

    /**
     * Same as setFilterPlan, but allows multiple filters to be applied in order.
     */
    public void addFilterPlan(OperatorPlan plan)
        throws MalformedPlanException, UnacceptablePlanException;
}
{code}

I took out getFeatures, as it is implied by implementing one or the other interface (or both). I was hoping a logical plan would be OK for communicating the filter plan. Agreed that passing SQL and having users parse it is kludgy. Can't a simple AST be represented as a (very plain) LogicalPlan? Not sure what our alternatives are here.

The Projectionable (or Selectable?) interface below is quite simplified -- it only allows specifying which columns are needed, and does not do things like applying arbitrary UDFs. I think this is better because it is very straightforward to implement. I expect the common case for projection pushdown is a column-store storage engine; in that case, all one really needs to know is the indexes of the requested columns, to avoid touching irrelevant data. There may be some edge-case on-disk structures that allow pushing down projections of nested data, or even some special UDFs (materialized views?), but they do not seem worth the added complexity of requiring Pig users to deal with OperatorPlans.
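To make the Filterable proposal above concrete, here is a toy loader that only accepts pushdowns it can actually honor. OperatorPlan here is a stand-in reduced to a single comparison predicate, and every name is a hypothetical assumption; the point is just the check-then-add handshake:

```java
import java.util.ArrayList;
import java.util.List;

public class FilterableLoader {
    // Stand-in for Pig's OperatorPlan: a single "column op constant" predicate.
    static class OperatorPlan {
        final int column;
        final String op;
        final String constant;
        OperatorPlan(int column, String op, String constant) {
            this.column = column;
            this.op = op;
            this.constant = constant;
        }
    }

    private final List<OperatorPlan> accepted = new ArrayList<>();

    // This toy loader can only push down equality filters on column 0
    // (imagine column 0 is a partition key it can prune on).
    public boolean checkFilterPlan(OperatorPlan plan) {
        return plan.column == 0 && plan.op.equals("==");
    }

    // Mirrors addFilterPlan: rejects plans the loader cannot honor.
    public void addFilterPlan(OperatorPlan plan) {
        if (!checkFilterPlan(plan)) {
            throw new IllegalArgumentException(
                "cannot push down " + plan.op + " on column " + plan.column);
        }
        accepted.add(plan);
    }

    public int acceptedPlanCount() {
        return accepted.size();
    }
}
```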
Dealing with OperatorPlans is not avoidable for filters, but I think we can make life easier for folks with projections:

{code}
interface Projectionable {
    /**
     * Sets the fields that this loader should return.
     * @param keys an array of indexes into the loader's true schema. For example,
     * if keys is set to [2,4,1], then the tuple returned by loader.getNext() will
     * have three fields: the second, the fourth, and the first.
     */
    public void setProjectionKeys(int[] keys) throws IllegalArgumentException;
}
{code}

*StoreFunc*

setURI -- same comment as for LoadFunc; a location as a String seems better.

prepareToWrite -- does it need a close() / finish() mate?

*StoreMetadata (or MetadataWriter)*

Where does one specify which MetadataWriter to use? Is it inside the StoreFunc? In that case, StoreFunc needs a method to return its implementation of MetadataWriter. Is it global? Then we need to specify how it gets set. The same applies to MetadataReader and LoadFunc.

Since we may want to insulate the Pig user from the metadata business unless he really wants to get into it, it probably makes sense to assume a global default Pig metadata repository. In cases where the user writes a loader for data managed by something else (Hive, HBase), they can specify a MetadataReader at the loader. We may also want to allow them to specify the MetadataReader at the script level:

{code}
a = load '/dw/hive/foo' using HiveLoader() with metadataReader HiveMetadataReader(some params) as (....)
{code}

I don't really have any particularly good ideas on the questions you raise regarding the specifics of how the interface should work.

*InputFormatLoader*

I think the types table can be extended to support ArrayWritable and MapWritable, as long as array members and key/value types are among the types listed.
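As a sketch of why a plain int[] suffices for the column-store case described under Projectionable: a reader that holds one array per on-disk column only ever touches the projected columns. Everything here is a hypothetical illustration (not Pig code), using the same 1-based indexing as the javadoc example above:

```java
public class ColumnReader {
    private final String[][] columns; // columns[c][row]: one stored column per index
    private int[] keys;               // 1-based requested columns, in output order

    public ColumnReader(String[][] columns) {
        this.columns = columns;
    }

    // Analogue of setProjectionKeys: validate, then remember which columns to read.
    public void setProjectionKeys(int[] keys) throws IllegalArgumentException {
        for (int k : keys) {
            if (k < 1 || k > columns.length) {
                throw new IllegalArgumentException("no such column: " + k);
            }
        }
        this.keys = keys;
    }

    // Materializes one tuple, touching only the projected columns on "disk".
    public String[] getNext(int row) {
        String[] tuple = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            tuple[i] = columns[keys[i] - 1][row];
        }
        return tuple;
    }
}
```

With keys = [2,4,1], getNext returns a three-field tuple holding the second, fourth, and first columns, and the other columns are never read.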
As far as needing to do something special for loaders like BinStorage and JSONLoader -- can't they get at an underlying InputFormat on the front end the same way the side files are proposed to be handled (new instance of the InputFormat, getSplits, create a RecordReader, read)?

Regarding using InputFormat et al. without Hadoop: I can't claim a detailed understanding of the underlying issues, but it seems to me that those things are just interfaces, and can be divorced from HDFS by creating implementations that deal with the local filesystem directly. Or is the idea to be able to run completely without the Hadoop libraries?

Whew. That's all for now. Thanks for reading the lengthy comments, and looking forward to continued discussion.

> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> ---------------------------------------------------------------
>
> Key: PIG-966
> URL: https://issues.apache.org/jira/browse/PIG-966
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Alan Gates
> Assignee: Alan Gates
>
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces
> significantly. See http://wiki.apache.org/pig/LoadStoreRedesignProposal for
> full details