Either sounds fine. I'm interested in what your use case is. --travis
On Wed, Aug 1, 2012 at 11:41 AM, Mirko Kämpf <[email protected]> wrote:
> Thanks a lot.
>
> I will work these things out in more detail and share docs
> via Google Docs, or is there a better place in the HCatalog
> project space for sharing requirements and design docs?
>
> Best wishes,
> Mirko
>
>
> On Wed, Aug 1, 2012 at 8:13 PM, Travis Crawford
> <[email protected]> wrote:
>
>> Hey Mirko -
>>
>> From my perspective the primary goals are:
>>
>> * Open up the HiveMetaStore to all processing frameworks, so users
>> don't need to know the data format or where it's physically stored.
>> * Standardize the data read/write path for all processing frameworks.
>>
>> Basically, any data usable by Hive is usable by other processing
>> frameworks, and vice versa.
>>
>> Currently the HiveMetaStore contains metadata about your data (input
>> format, serde, columns, ...) but does not contain metadata about who's
>> using the data. That's actually something that I'm very interested in
>> looking at in the future.
>>
>> Regarding your current issue, arbitrary properties can be stored
>> per-table and per-partition. These could perhaps be used to link your
>> data to particular wiki pages, but at this time there's no such
>> functionality in HCat itself.
>>
>> --travis
>>
>>
>>
>> On Wed, Aug 1, 2012 at 10:14 AM, Mirko Kämpf <[email protected]>
>> wrote:
>> > Hello,
>> >
>> > I have a question about the focus of the HCatalog project, as I am
>> > not sure if I am looking at the right place here in the HCatalog
>> > project.
>> >
>> > My task is to solve problems of project-specific metadata handling
>> > and data life cycles. For a research project we have a collaborative
>> > wiki solution to edit our dataset descriptions and procedure
>> > documentation for data analysis and data preparation. That means we
>> > do not just have data for different time periods; we also use
>> > different algorithms to aggregate or filter the data into different
>> > shapes for later comparison.
>> >
>> > One possible solution would be to write well-documented Hive or Pig
>> > scripts to do the work, but then we have to track all the scripts,
>> > and over time our heads will explode...
>> > So the question is whether we could map the descriptions in our
>> > documentation system directly to metadata in Hive (not sure if Pig
>> > has such metadata as well), or whether the HCatalog project would be
>> > the right place to link our documentation workspace to.
>> >
>> > Did I understand the aim of HCatalog correctly: it is a toolset to
>> > provide fluent interaction between data sources and several
>> > processing systems (Pig, Hive, MR), and it is not a tool for storing
>> > metadata (e.g. by what tool was a dataset created, from what raw
>> > dataset, in what time, on what machine)?
>> >
>> > For a programmer these questions might not be so interesting, but
>> > when one wants to optimize business use cases it would be helpful to
>> > have such metadata generated by the script or job. Based on this
>> > metadata we could compare cluster simulation results to real-world
>> > (meta)data.
>> >
>> > Is there something like this already known, or would it be a good
>> > starting point for such a project, based on our (semi-)manual
>> > experience with data life cycle tools?
>> >
>> > Best wishes,
>> >
>> > Mirko
>
>
>
> --
> --
> Mirko Kämpf
>
> +49 176 20 63 51 99
> [email protected]
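
[As a minimal sketch of the per-table properties mentioned above: Hive lets
you attach arbitrary key/value pairs to a table via TBLPROPERTIES, which is
one way to link a dataset to a wiki page. The table name "wiki_events", the
property key "wiki.page", and the URL below are hypothetical, chosen only for
illustration.]

    -- Attach a pointer to the wiki page that documents this dataset
    -- (table name, key, and URL are made up for this example).
    ALTER TABLE wiki_events
      SET TBLPROPERTIES ('wiki.page' = 'https://wiki.example.org/datasets/wiki_events');

    -- The stored properties appear in the extended table description.
    DESCRIBE EXTENDED wiki_events;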
