Thanks a lot. I will work these things out in more detail and share the docs via Google Docs. Or is there a better place in the HCatalog project space for sharing requirements and design docs?
Best wishes,
Mirko

On Wed, Aug 1, 2012 at 8:13 PM, Travis Crawford <[email protected]> wrote:

> Hey Mirko -
>
> From my perspective the primary goals are:
>
> * Open up the HiveMetaStore to all processing frameworks, so users
>   don't need to know the data format or where it's physically stored.
> * Standardize the data read/write path for all processing frameworks.
>
> Basically, any data usable by Hive is usable by other processing
> frameworks, and vice versa.
>
> Currently the HiveMetaStore contains metadata about your data (input
> format, serde, columns, ...) but does not contain metadata about who is
> using the data. That's actually something I'm very interested in
> looking at in the future.
>
> Regarding your current issue, arbitrary properties can be stored
> per-table and per-partition. These could perhaps be used to link your
> data to particular wiki pages, but at this time there's no such
> functionality in HCat itself.
>
> --travis
>
>
> On Wed, Aug 1, 2012 at 10:14 AM, Mirko Kämpf <[email protected]> wrote:
> > Hello,
> >
> > I have a question about the focus of the HCatalog project, as I am not
> > sure whether I am looking at the right place here in the HCatalog project.
> >
> > My task is to solve problems of project-specific metadata handling and
> > data life cycles. For a research project we have a collaborative wiki
> > solution to edit our dataset descriptions and procedure documentation
> > for data analysis and data preparation. That means we do not just have
> > data for different time periods; we also use different algorithms to
> > aggregate or filter the data into different shapes for a later comparison.
> >
> > One possible solution would be to write well-documented Hive or Pig
> > scripts to do the work, but then we have to track all the scripts, and
> > over time the head will explode...
> > So the question is whether we could map the descriptions in our
> > documentation system directly to metadata in Hive (I am not sure if Pig
> > has such metadata as well), or whether the HCatalog project would be the
> > right place to link our documentation workspace to.
> >
> > Did I understand the aim of HCatalog right: it is a toolset to provide
> > fluent interaction between data sources and several processing systems
> > (Pig, Hive, MR), and it is not a tool for storing metadata (e.g. by what
> > tool was a dataset created, from what raw dataset, at what time, on what
> > machine?)
> >
> > For a programmer these questions might not be so interesting, but for
> > someone who wants to optimize business use cases it would be helpful to
> > have such metadata generated by the script or job. Based on this metadata
> > we could compare cluster simulation results to real-world (meta)data.
> >
> > Is there something like this already, or would it be a good point to
> > start such a project based on our (semi-)manual experience with data
> > life cycle tools?
> >
> > Best wishes,
> >
> > Mirko

--
Mirko Kämpf
+49 176 20 63 51 99
[email protected]
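
As Travis notes above, arbitrary key/value properties can be attached to a table (or partition) in the Hive metastore, so a wiki/documentation URL could be stored there today. Below is a minimal sketch of that idea using the HiveMetaStoreClient API; the database name, table name, and property keys are made-up placeholders, and the exact client setup depends on your Hive version and site configuration.

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    // Sketch: attach documentation/provenance links to a table as free-form
    // table properties. "research_db", "page_views_daily" and the property
    // keys are placeholders, not anything defined by HCatalog itself.
    public class LinkDocsToTable {
        public static void main(String[] args) throws Exception {
            HiveConf conf = new HiveConf(); // reads hive-site.xml from the classpath
            HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
            try {
                Table table = client.getTable("research_db", "page_views_daily");

                // Table parameters are an arbitrary String -> String map.
                table.putToParameters("wiki.docs.url",
                        "https://wiki.example.org/datasets/page_views_daily");
                table.putToParameters("generated.by", "aggregate_daily.pig");

                // Write the updated metadata back to the metastore.
                client.alter_table("research_db", "page_views_daily", table);
            } finally {
                client.close();
            }
        }
    }

The same effect should be achievable from HiveQL with ALTER TABLE ... SET TBLPROPERTIES, and the stored pairs show up in DESCRIBE EXTENDED output, so scripts and people can both read them.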
