[ https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757979#action_12757979 ]
Dmitriy V. Ryaboy commented on PIG-966:
---------------------------------------

The comments below are from both me and Ashutosh. We'd like to preface this by saying that, overall, we think the proposed changes are very useful and important, and are likely to significantly reduce the barriers to Pig adoption in the broader Hadoop user community. There are a lot of suggestions and critiques below, but that's just because we care :-) On to the notes.

*Names of interfaces*

Can you explain why everything has a Load prefix? It seems to limit the interfaces unnecessarily, and it is a bit semantically inconsistent: LoadMetadata does not represent metadata associated with loading -- it loads metadata. LoadStatistics does not load statistics -- it represents statistics, which are loaded using LoadMetadata. How about:

- LoadCaster -> PigTypeCaster
- LoadPushDown -> Filterable, Projectionable (the latter may need a better name; clearly, we are also suggesting breaking this interface into multiple interfaces -- more on that later)
- LoadSchema -> ResourceSchema
- LoadFieldSchema -> FieldSchema or ResourceFieldSchema
- LoadMetadata -> MetadataReader
- StoreMetadata -> MetadataWriter
- LoadStatistics -> ResourceStatistics

*LoadFunc*

Regarding the appropriate parameters for setURI: can you explain the advantage of this over Strings in more detail? I think the current setLocation approach is preferable; it gives users more flexibility. Plus, Hadoop Paths are constructed from strings, not URIs, so we would be forcing a string -> URI -> string conversion in the common case.

The _getLoadCaster_ method: perhaps _getTypeCaster_ or _getPigTypeCaster_ is a better name?

_prepareToRead_: does it need a _finishReading()_ mate?

I would like to see a "standard" method for getting the jobconf (or whatever it is called in Hadoop 0.20/0.21), for both LoadFunc and StoreFunc.

*LoadCaster (or PigTypeCaster)*

This interface is implemented by UTF8StorageConverter.
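On the String-vs-URI point in the LoadFunc notes above: beyond the extra conversion, java.net.URI also rejects characters that are perfectly legal in Hadoop path strings, such as glob braces and spaces. A small self-contained sketch of the difference (plain Java, no Pig or Hadoop types; the class and method names are illustrative only):

```java
import java.net.URI;

// Illustrative only: a URI-based setURI would reject location strings
// that a String-based setLocation could pass straight through to a Path.
public class LocationDemo {
    // Returns true iff the location survives a String -> URI conversion.
    static boolean isValidUri(String location) {
        try {
            URI.create(location);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUri("/data/logs/part-00000"));    // plain path: fine
        System.out.println(isValidUri("/data/part-{00000,00001}")); // '{' is illegal in a URI
        System.out.println(isValidUri("/data/with space"));         // spaces are illegal too
    }
}
```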
Let's decide what these classes are -- _casters_ or _converters_ -- and use one term consistently.

*LoadMetadata (or MetadataReader)*

Some thoughts on what happens when the loader is loading multiple resources, or a resource with multiple partitions. We think the schema should be uniform across everything a single instance of a loader is responsible for loading (the loader can fill in nulls or defaults where appropriate if some resources are missing fields). Statistics should be aggregated, since the collection of resources will be treated as one (knowledge of relevant partitions would be used by a Filterable/Projectionable loader to push selections down, not, I think, by downstream operators). So we have two options:

1. getStatistics returns a collection (lowercase c) of stats associated with the resources the loader is loading, perhaps as a Map of String -> ResourceStatistics. These would then go through a stat aggregator of some sort that knows how to unify statistics across multiple resources in a generic way.
2. getStatistics is responsible for its own implementation of aggregation, which gives it flexibility in how such aggregation is done. Since we don't expect many stat stores, this seems preferable to us, as generic aggregation is going to be hard to get right.

(Of course there is option 3, where we have a default stat aggregator class that can be extended or overridden by individual MetadataReaders, but I imagine this would be a hard sell.)

*LoadSchema (or ResourceSchema)*

Should org.apache.pig.impl.logicalLayer.schema.Schema be changed to use this as its internal representation?

I like how sort information is handled here. Perhaps we can consider using this approach instead of _SortInfo_ in PIG-953; if _PigSchema_ implements or contains _ResourceSchema_, _SortInfo_ will no longer be needed.

_PartitionKeys_ aren't really part of the schema; they are a storage/distribution property.
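For concreteness, option 1 from the LoadMetadata notes above might look like the sketch below. All names here (the ResourceStatistics fields, the aggregate method) are assumptions for illustration, not part of the proposal; the point is that only trivially additive statistics aggregate generically, which is why option 2 seems preferable:

```java
import java.util.Map;

public class StatsAggregation {
    // Hypothetical per-resource stats holder; field names are assumptions, not the proposal's.
    static class ResourceStatistics {
        long numRecords;
        long sizeInBytes;
        ResourceStatistics(long numRecords, long sizeInBytes) {
            this.numRecords = numRecords;
            this.sizeInBytes = sizeInBytes;
        }
    }

    // Option 1: a generic aggregator unifies per-resource stats into one view.
    // Additive stats sum cleanly; distributions, averages, etc. would not.
    static ResourceStatistics aggregate(Map<String, ResourceStatistics> perResource) {
        long records = 0;
        long bytes = 0;
        for (ResourceStatistics s : perResource.values()) {
            records += s.numRecords;
            bytes += s.sizeInBytes;
        }
        return new ResourceStatistics(records, bytes);
    }
}
```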
Partition keys should go into the metadata and refer to the schema.

*LoadStatistics (or ResourceStatistics)*

Why the public fields? Not that I am a huge fan of getters and setters, but I sense findbugs warnings heading our way.

We need to account for some statistics being missing. What should numRecords be set to if we don't know the number of records? We can use a Long and set it to null; we can use a magic value (-1?); or we can wrap it in a getter and throw an exception (ugh).

I had envisioned statistics as more of a key-value thing, with some keys predefined in a separate class. So we would have:

{code}
ResourceStats.NUM_RECORDS
ResourceStats.SIZE_IN_BYTES
// etc
{code}

and to get the stats we would call:

{code}
myResourceStats.getLong(ResourceStats.NUM_RECORDS)
myResourceStats.getFloat(ResourceStats.SOMETHING_THAT_IS_A_FLOAT)
// etc
{code}

This allows us to be far more flexible with regard to the things marked as "// probably more here".

Something that is probably worth its own topic: we have been considering extending Pig Latin to allow users to specify some dataset properties on the fly. For example, when performing a filter, a user could help planning by saying something like:

{code}
bar = FILTER foo BY myUdf($1) EXPECT SELECTIVITY=0.95 -- just getting rid of some noise!
foobar = JOIN foo by id, bar by id EXPECT SELECTIVITY=0.1 -- small cross-section!
{code}

Having the user provide a rough estimate of operator selectivity would be far more effective than trying to estimate result sizes based on stats. In fact, no less an authority than Microsoft's Surajit Chaudhuri advocates opening up these sorts of communication channels in SQL to help optimizers do a better job (see his SIGMOD 2009 paper). This implies that statistics can (and should) be associated not just with loaded resources, but also with operator outputs.

*LoadPushdown*

As alluded to above, I am not sure this is a good interface.
The idea is that we allow loaders to declare which operations can be pushed down to them; but the concept of a pushdown is really a Pig concept, not a Load concept. I think breaking this out into two interfaces would be more advisable. So, I propose instead:

{code}
interface Filterable {
    /**
     * Checks whether the plan consists of operations that this loader can
     * push down to the physical level.
     */
    public boolean checkFilterPlan(OperatorPlan plan);

    /**
     * Sets a filter plan that will be pushed down to this loader.
     * Throws an exception if the plan is malformed (what does this mean? tbd)
     * or if this loader cannot accept the plan for some reason.
     */
    public void setFilterPlan(OperatorPlan plan)
        throws MalformedPlanException, UnacceptablePlanException;

    /**
     * Same as setFilterPlan, but allows multiple filters to be applied in order.
     */
    public void addFilterPlan(OperatorPlan plan)
        throws MalformedPlanException, UnacceptablePlanException;
}
{code}

I took out getFeatures, as it is implied by implementing one or the other interface (or both). I was hoping a logical plan would be OK for communicating the filter plan. Agreed that passing SQL and having users parse it is kludgy. Can't a simple AST be represented as a (very plain) LogicalPlan? Not sure what our alternatives are here.

The Projectionable (or Selectable?) interface below is quite simplified -- it only allows specifying which columns are needed, and does not do things like applying arbitrary UDFs. I think this is better because it is very straightforward to implement. I expect the common case for projection pushdown is a column-store storage engine; in that case, all one really needs to know is the indexes of the requested columns, to avoid touching irrelevant data. There may be some edge-case on-disk structures that allow pushing down projections of nested data, or even some special UDFs (materialized views?), but they do not seem worth the added complexity of requiring Pig users to deal with OperatorPlans.
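To make the Filterable proposal above concrete, here is a toy loader that only accepts pushdowns it can actually honor. OperatorPlan here is a stand-in reduced to a single comparison predicate, and every name is a hypothetical assumption; the point is just the check-then-add handshake:

```java
import java.util.ArrayList;
import java.util.List;

public class FilterableLoader {
    // Stand-in for Pig's OperatorPlan: a single "column op constant" predicate.
    static class OperatorPlan {
        final int column;
        final String op;
        final String constant;
        OperatorPlan(int column, String op, String constant) {
            this.column = column;
            this.op = op;
            this.constant = constant;
        }
    }

    private final List<OperatorPlan> accepted = new ArrayList<>();

    // This toy loader can only push down equality filters on column 0
    // (imagine column 0 is a partition key it can prune on).
    public boolean checkFilterPlan(OperatorPlan plan) {
        return plan.column == 0 && plan.op.equals("==");
    }

    // Mirrors addFilterPlan: rejects plans the loader cannot honor.
    public void addFilterPlan(OperatorPlan plan) {
        if (!checkFilterPlan(plan)) {
            throw new IllegalArgumentException(
                "cannot push down " + plan.op + " on column " + plan.column);
        }
        accepted.add(plan);
    }

    public int acceptedPlanCount() {
        return accepted.size();
    }
}
```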
Dealing with OperatorPlans is not avoidable for filters, but I think we can make life easier for folks with projections:

{code}
interface Projectionable {
    /**
     * Sets the fields that this loader should return.
     * @param keys an array of indexes into the loader's true schema. For example,
     * if keys is set to [2,4,1], then the tuple returned by loader.getNext() will
     * have three fields: the second, the fourth, and the first.
     */
    public void setProjectionKeys(int[] keys) throws IllegalArgumentException;
}
{code}

*StoreFunc*

setURI -- same comment as for LoadFunc; a location as a String seems better.

prepareToWrite -- does it need a close() / finish() mate?

*StoreMetadata (or MetadataWriter)*

Where does one specify which MetadataWriter to use? Is it inside the StoreFunc? In that case, StoreFunc needs a method to return its implementation of MetadataWriter. Is it global? Then we need to specify how it gets set. The same applies to MetadataReader and LoadFunc.

Since we may want to insulate the Pig user from the metadata business unless he really wants to get into it, it probably makes sense to assume a global default Pig metadata repository. In cases where the user writes a loader for data managed by something else (Hive, HBase), they can specify a MetadataReader at the loader. We may also want to allow them to specify the MetadataReader at the script level:

{code}
a = load '/dw/hive/foo' using HiveLoader() with metadataReader HiveMetadataReader(some params) as (....)
{code}

I don't really have any particularly good ideas on the questions you raise regarding the specifics of how the interface should work.

*InputFormatLoader*

I think the types table can be extended to support ArrayWritable and MapWritable, as long as array members and key/value types are among the types listed.
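As a sketch of why a plain int[] suffices for the column-store case described under Projectionable: a reader that holds one array per on-disk column only ever touches the projected columns. Everything here is a hypothetical illustration (not Pig code), using the same 1-based indexing as the javadoc example above:

```java
public class ColumnReader {
    private final String[][] columns; // columns[c][row]: one stored column per index
    private int[] keys;               // 1-based requested columns, in output order

    public ColumnReader(String[][] columns) {
        this.columns = columns;
    }

    // Analogue of setProjectionKeys: validate, then remember which columns to read.
    public void setProjectionKeys(int[] keys) throws IllegalArgumentException {
        for (int k : keys) {
            if (k < 1 || k > columns.length) {
                throw new IllegalArgumentException("no such column: " + k);
            }
        }
        this.keys = keys;
    }

    // Materializes one tuple, touching only the projected columns on "disk".
    public String[] getNext(int row) {
        String[] tuple = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            tuple[i] = columns[keys[i] - 1][row];
        }
        return tuple;
    }
}
```

With keys = [2,4,1], getNext returns a three-field tuple holding the second, fourth, and first columns, and the other columns are never read.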
As far as needing to do something special for loaders like BinStorage and JSONLoader -- can't they get at an underlying InputFormat on the front end the same way the side files are proposed to be handled (new instance of the InputFormat, getSplits, create a RecordReader, read)?

Regarding using InputFormat et al. without Hadoop: I can't claim a detailed understanding of the underlying issues, but it seems to me that those things are just interfaces, and can be divorced from HDFS by creating implementations that deal with the local filesystem directly. Or is the idea to be able to run completely without the Hadoop libraries?

Whew. That's all for now. Thanks for reading the lengthy comments, and looking forward to continued discussion.

> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> ---------------------------------------------------------------
>
> Key: PIG-966
> URL: https://issues.apache.org/jira/browse/PIG-966
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Alan Gates
> Assignee: Alan Gates
>
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces
> significantly. See http://wiki.apache.org/pig/LoadStoreRedesignProposal for
> full details