Hey Mirko -

From my perspective the primary goals are:
* Open up the HiveMetaStore to all processing frameworks, so users don't
  need to know the data format or where it's physically stored.
* Standardize the data read/write path for all processing frameworks.
  Basically, any data usable by Hive is usable by other processing
  frameworks, and vice versa.

Currently the HiveMetaStore contains metadata about your data (input
format, serde, columns, ...) but does not contain metadata about who's
using the data. That's actually something that I'm very interested in
looking at in the future.

Regarding your current issue, arbitrary properties can be stored
per-table and per-partition. These could perhaps be used to link your
data to particular wiki pages, but at this time there's no such
functionality in HCat itself. (A rough sketch of how such a property
could be attached is at the bottom of this mail, below your quoted
message.)

--travis

On Wed, Aug 1, 2012 at 10:14 AM, Mirko Kämpf <[email protected]> wrote:
> Hello,
>
> I have a question about the focus of the HCatalog project, as I am not
> sure whether I am looking at the right place, here in the HCatalog
> project.
>
> My task is to solve problems of project-specific metadata handling and
> data life cycles. For a research project we have a collaborative wiki
> solution to edit our dataset descriptions and procedure documentation
> for data analysis and data preparation. That means we do not just have
> data for different time periods, we also use different algorithms to
> aggregate or filter the data into different shapes for later
> comparison.
>
> One possible solution would be to write well-documented Hive or Pig
> scripts to do the work, but then we have to track all the scripts and
> over time the head will explode...
> So the question is whether we could map the descriptions in our
> docu-system directly to metadata in Hive (not sure if Pig has such
> metadata as well), or whether the HCatalog project would be the right
> place to link our doc workspace to.
>
> Did I understand the aim of HCatalog right: it is a toolset to provide
> fluent interaction between data sources and several processing systems
> (Pig, Hive, MR), and it is not a tool for storing metadata (e.g. by
> what tool was a dataset created, from what raw dataset, at what time,
> on what machine)?
>
> For a programmer these questions might not be so interesting, but if
> one wants to optimize business use cases it would be helpful to have
> such metadata generated by the script or job. Based on this metadata we
> could compare cluster simulation results to real-world (meta)data.
>
> Is something like this already known, or would it be a good starting
> point for such a project, based on our (semi)manual experience with
> data life cycle tools?
>
> Best wishes,
>
> Mirko
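
P.S. The sketch I mentioned above: this is only an illustration of the
generic per-table properties, not an HCat feature. It uses the Hive
metastore's Java client; the property key ("wiki.page.url") and the
database/table names are made up for the example.

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class LinkTableToWiki {
        public static void main(String[] args) throws Exception {
            // Connects to the metastore configured in hive-site.xml on the classpath.
            HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());

            // "research_db", "clickstream_2012" and the "wiki.page.url" key are
            // placeholders -- any free-form key/value pair can be stored here.
            Table table = client.getTable("research_db", "clickstream_2012");
            table.putToParameters("wiki.page.url",
                "https://wiki.example.org/datasets/clickstream");

            // Write the updated table definition back to the metastore.
            client.alter_table("research_db", "clickstream_2012", table);
            client.close();
        }
    }

The same kind of key/value pair can also be set from the Hive CLI with
ALTER TABLE ... SET TBLPROPERTIES, and partitions carry an analogous
parameters map, so per-partition annotations would work the same way.
What HCat does not do today is interpret or manage those links for you.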
