Hi Sushanth, Thanks for your comments. See my response inline.
-Francis On 7/20/11 1:26 PM, "Sushanth Sowmyan" <[email protected]> wrote: > Hi Francis, > > Apologies for not following up earlier, I've been a bit busy off late, > and while I saw your questions and had a look at the classes, I > couldn't follow up at greater length, especially considering that > HCATALOG-42, which has been waiting on committing changes a bunch of > those classes. And on that end, I've not looked at your patches yet, > but will soon. > > Regarding your comments: > > a) For configuration parameters - HCatTableInfo by itself is not > supposed to contain any parameters specific to any storage drivers. > The reason for this is that HCatTableInfo is how the M/R programmer > passes on info to HCatInputFormat, and thus, should not contain > anything specific to any storage driver implementation as you mention. > So, there is already a place for that, and that is stored in the > table(and partition metadata), in the Table and Partition objects, as > Table.getStorageDescriptor.getParameters() and > Partition.getStorageDescriptor.getParameters(). This is read by > HCatInputFormat/HCatOutputFormat and passed on to the respective ISD > OSD as part of the initialize() call and also in the getInputFormat() > and getOutputFormat() calls, and all properties have a hcat.* keyname. > Have a look at PigStorageInputDriver as an example - it reads a delim > parameter. Or the RCFileInput/OutputDriver. > What we are looking for is a way for the M/R programmer to pass implementation specific parameters to the actually input/output format. In the case of Hbase a user might want to specify a specific timestamp/version number to use for all the writes as part of the M/R job. This could also be looked as a storage type specific (versioned data stores). Also as a possible in enhancement we might also want to expose more of the Hbase api to the user ie specifying their own scan object. My colleague Vandana was kind enough to export an internal twiki which is one of the drafts describing what we are trying to do. Though I don’t agree with putting more specific properties as part of HCatTableInfo class (setScan,OutputVersion, etc) that’s just an implementation issue. It should give you a better idea on what we are trying to do. https://cwiki.apache.org/confluence/display/HCATALOG/HCatalog+HBase+Integration+Design > b) JobInfo/OutputJobInfo - Agreed completely, and this is historical, > and in need of cleaning up. We used to have two different JobInfos, > and we tried to refactor it to be just one, and called it JobInfo. As > time progressed, someone needed something extra, and introduced > OutputJobInfo. This does need a proper refactor, and I'm glad to take > suggestions. > The way I did it in the patch is have HCatTableInfo be an abstraction of Hive’s Table class. And the actual job specific parameters are stored in the *JobInfo class. The change is not that drastic but it does make the roles of each class much more consistent. As for *JobInfo It does make sense to have two classes since there are parameters that is unique to each mode. > c) StorageDriver.initialize() is called at several portions of the > code - I have to check to see if passing it separately makes sense, > especially if it's already wrapped inside the context, as it currently > is. I'll get back to you on this after some checking. Thanks. BTW I noticed that hcatProperties only contains properties that start with “hcat.” from what is stored in the metastore. What was the reasoning behind this? > Thanks again for the interest, and I hope we can refactor this to be > cleaner and easier. :) > Thanks as well. I’m hoping we can come to some understanding as to how the user supplied implementation specific parameters could be passed so we won’t diverge to much. > Thanks, > -Sushanth > > On Wed, Jul 20, 2011 at 10:24 AM, Francis Christopher Liu > <[email protected]> wrote: >> Hi, >> >> I've submitted a patch for my concern about the three classes: >> >> HCATALOG-64 >> >> I've also submitted a small patch which is needed to support tables which >> are not stored as files: >> >> HCATALOG-60 >> >> Let me know what you guys think. >> >> -Francis >> >> >> On 7/15/11 3:09 PM, "Francis Christopher Liu" <[email protected]> wrote: >> >>> Hi, >>> >>> I’m part of a team that’s working on adding Hbase support to hcatalog. We’re >>> just getting our feet wet with the source code. And have some questions. Any >>> help would be appreciated to get things going. >>> >>> As part of writing the storage drivers for Hbase we need to add a few more >>> configuration parameters (ie range of versions to read, version number to >>> use >>> when writing, etc). Since setInput/setOutput takes in HCatTableInfo as a >>> parameter. It would seem this is the right place to put it? Also when adding >>> parameters it wouldn’t be good design to put implementation specific >>> parameters into HCatTableInfo. So would it be better to subclass this class >>> or >>> add a Properties field to store such information? >>> >>> JobInfo seems to only be used for Input as OutputJobInfo is for output. >>> Shouldn’t we rename the class to InputJobInfo? Also JobInfo doesn’t have a >>> reference to HCatTableInfo while OutJobInfo does info does. Given this is >>> the >>> Hcat context used by the storage drivers shouldn’t it be there? >>> >>> As for the role of the classes it seems to me that it would make much more >>> sense to have *JobInfo passed as the parameter for setInput/setOutput. Looks >>> to me, HCatTableInfo should contain the state of things as persisted in the >>> metastore while *JobInfo classes should contain the job-specific >>> information? >>> We could have a factory method which creates *JobInfo object as well as it’s >>> referenced HCaTableInfo object. >>> >>> Also *StorageDriver.initialize() is not passed the *JobInfo. I know it’s >>> possible to deserialize the object from Context object but wouldn’t it be >>> cleaner to just pass it? >>> >>> Let me know what you guys think. Feel free to point out misinterpretations I >>> have made this’ll help us understand better how things work together. >>> >>> -Francis >>> >>> >> >>
