Thanks a lot. I will work these things out in more detail and share the docs via Google Docs. Or is there a better place in the HCatalog project space for sharing requirements and design docs?
Best wishes,
Mirko

On Wed, Aug 1, 2012 at 8:13 PM, Travis Crawford <[email protected]> wrote:

> Hey Mirko -
>
> From my perspective the primary goals are:
>
> * Open up the HiveMetaStore to all processing frameworks, so users
>   don't need to know the data format or where it's physically stored.
> * Standardize the data read/write path for all processing frameworks.
>
> Basically, any data usable by Hive is usable by other processing
> frameworks, and vice versa.
>
> Currently the HiveMetaStore contains metadata about your data (input
> format, serde, columns, ...) but does not contain metadata about who is
> using the data. That's actually something I'm very interested in
> looking at in the future.
>
> Regarding your current issue, arbitrary properties can be stored
> per-table and per-partition. These could perhaps be used to link your
> data to particular wiki pages, but at this time there's no such
> functionality in HCat itself.
>
> --travis
>
>
> On Wed, Aug 1, 2012 at 10:14 AM, Mirko Kämpf <[email protected]> wrote:
> > Hello,
> >
> > I have a question about the focus of the HCatalog project, as I am not
> > sure whether I am looking at the right place here in the HCatalog project.
> >
> > My task is to solve problems of project-specific metadata handling and
> > data life cycles. For a research project we have a collaborative wiki
> > solution to edit our dataset descriptions and procedure documentation
> > for data analysis and data preparation. That means we do not just have
> > data for different time periods; we also use different algorithms to
> > aggregate or filter the data into different shapes for a later comparison.
> >
> > One possible solution would be to write well-documented Hive or Pig
> > scripts to do the work, but then we have to track all the scripts, and
> > over time the head will explode...
> > So the question is whether we could map the descriptions in our
> > documentation system directly to metadata in Hive (I am not sure if Pig
> > has such metadata as well), or whether the HCatalog project would be the
> > right place to link our documentation workspace to.
> >
> > Did I understand the aim of HCatalog right: it is a toolset to provide
> > fluent interaction between data sources and several processing systems
> > (Pig, Hive, MR), and it is not a tool for storing metadata (e.g. by what
> > tool was a dataset created, from what raw dataset, at what time, on what
> > machine?)
> >
> > For a programmer these questions might not be so interesting, but for
> > someone who wants to optimize business use cases it would be helpful to
> > have such metadata generated by the script or job. Based on this metadata
> > we could compare cluster simulation results to real-world (meta)data.
> >
> > Is there something like this already, or would it be a good point to
> > start such a project based on our (semi-)manual experience with data
> > life cycle tools?
> >
> > Best wishes,
> >
> > Mirko

--
Mirko Kämpf
+49 176 20 63 51 99
[email protected]
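
As Travis notes above, arbitrary key/value properties can be attached to a table (or partition) in the Hive metastore, so a wiki/documentation URL could be stored there today. Below is a minimal sketch of that idea using the HiveMetaStoreClient API; the database name, table name, and property keys are made-up placeholders, and the exact client setup depends on your Hive version and site configuration.

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    // Sketch: attach documentation/provenance links to a table as free-form
    // table properties. "research_db", "page_views_daily" and the property
    // keys are placeholders, not anything defined by HCatalog itself.
    public class LinkDocsToTable {
        public static void main(String[] args) throws Exception {
            HiveConf conf = new HiveConf(); // reads hive-site.xml from the classpath
            HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
            try {
                Table table = client.getTable("research_db", "page_views_daily");

                // Table parameters are an arbitrary String -> String map.
                table.putToParameters("wiki.docs.url",
                        "https://wiki.example.org/datasets/page_views_daily");
                table.putToParameters("generated.by", "aggregate_daily.pig");

                // Write the updated metadata back to the metastore.
                client.alter_table("research_db", "page_views_daily", table);
            } finally {
                client.close();
            }
        }
    }

The same effect should be achievable from HiveQL with ALTER TABLE ... SET TBLPROPERTIES, and the stored pairs show up in DESCRIBE EXTENDED output, so scripts and people can both read them.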
