Either sounds fine. I'm interested in what your use case is. --travis
On Wed, Aug 1, 2012 at 11:41 AM, Mirko Kämpf <[email protected]> wrote:
> Thanks a lot.
>
> I will work these things out in more detail and share docs
> via Google Docs, or is there a better place in the HCatalog
> project space for sharing requirements and design docs?
>
> Best wishes,
> Mirko
>
>
> On Wed, Aug 1, 2012 at 8:13 PM, Travis Crawford
> <[email protected]> wrote:
>
>> Hey Mirko -
>>
>> From my perspective the primary goals are:
>>
>> * Open up the HiveMetaStore to all processing frameworks, so users
>> don't need to know the data format or where it's physically stored.
>> * Standardize the data read/write path for all processing frameworks.
>>
>> Basically, any data usable by Hive is usable by other processing
>> frameworks, and vice versa.
>>
>> Currently the HiveMetaStore contains metadata about your data (input
>> format, serde, columns, ...) but does not contain metadata about who's
>> using the data. That's actually something that I'm very interested in
>> looking at in the future.
>>
>> Regarding your current issue, arbitrary properties can be stored
>> per-table and per-partition. These could perhaps be used to link your
>> data to particular wiki pages, but at this time there's no such
>> functionality in HCat itself.
>>
>> --travis
>>
>>
>>
>> On Wed, Aug 1, 2012 at 10:14 AM, Mirko Kämpf <[email protected]>
>> wrote:
>> > Hello,
>> >
>> > I have a question about the focus of the HCatalog project, as I am
>> > not sure if I am looking at the right place here in the HCatalog
>> > project.
>> >
>> > My task is to solve problems of project-specific metadata handling
>> > and data life cycles. For a research project we have a collaborative
>> > wiki solution to edit our dataset descriptions and procedure
>> > documentation for data analysis and data preparation. That means we
>> > do not just have data for different time periods; we also use
>> > different algorithms to aggregate or filter the data into different
>> > shapes for later comparison.
>> >
>> > One possible solution would be to write well-documented Hive or Pig
>> > scripts to do the work, but then we have to track all the scripts,
>> > and over time our heads will explode...
>> > So the question is whether we could map the descriptions in our
>> > documentation system directly to metadata in Hive (not sure if Pig
>> > has such metadata as well), or whether the HCatalog project would be
>> > the right place to link our documentation workspace to.
>> >
>> > Did I understand the aim of HCatalog correctly: it is a toolset to
>> > provide fluent interaction between data sources and several
>> > processing systems (Pig, Hive, MR), and it is not a tool for storing
>> > metadata (e.g. by what tool was a dataset created, from what raw
>> > dataset, in what time, on what machine)?
>> >
>> > For a programmer these questions might not be so interesting, but
>> > when one wants to optimize business use cases it would be helpful to
>> > have such metadata generated by the script or job. Based on this
>> > metadata we could compare cluster simulation results to real-world
>> > (meta)data.
>> >
>> > Is there something like this already known, or would it be a good
>> > starting point for such a project, based on our (semi-)manual
>> > experience with data life cycle tools?
>> >
>> > Best wishes,
>> >
>> > Mirko
>
>
>
> --
> --
> Mirko Kämpf
>
> +49 176 20 63 51 99
> [email protected]
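
[As a minimal sketch of the per-table properties mentioned above: Hive lets
you attach arbitrary key/value pairs to a table via TBLPROPERTIES, which is
one way to link a dataset to a wiki page. The table name "wiki_events", the
property key "wiki.page", and the URL below are hypothetical, chosen only for
illustration.]

    -- Attach a pointer to the wiki page that documents this dataset
    -- (table name, key, and URL are made up for this example).
    ALTER TABLE wiki_events
      SET TBLPROPERTIES ('wiki.page' = 'https://wiki.example.org/datasets/wiki_events');

    -- The stored properties appear in the extended table description.
    DESCRIBE EXTENDED wiki_events;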
