Very good list, Rohini. Since the thread is getting a bit long, I
consolidated the suggestions so far into a Wishlist section of the wiki
page so we don't lose track of them.

    
https://cwiki.apache.org/confluence/display/HCATALOG/HCatalog+Roadmap#HCatalogRoadmap-Wishlist

Some comments inline:


On Mon, Oct 15, 2012 at 10:44 AM, Rohini Palaniswamy
<[email protected]> wrote:
> Alan,
>
>   The proposed roadmap looks good. I especially like the idea of having a
> registry so that every user does not have to try and find the correct
> serde jars, as is happening with pig loader/storer jars today. Also
> +1 on hosting thrift + webhcat servers in one jvm.
>
>   But there are a few other things that would be good to have in the roadmap.
> Have listed them below.
>
> Multiple cluster capabilities:
>    * Process data from one cluster and store it into another cluster, i.e.
> HCatLoader() reads from one cluster's hcat server and HCatStorer() writes
> to another hcat server. This is something that we would be needing very
> soon.

This is something we'll need too. Related to your export/import
comments, this is the direction I hope to take our data replication
tools in, rather than running a separate distcp for the data plus a
metadata import/export. A copy tool would have some policy about what
to pull from remote clusters, would scan for or be notified of new
partitions to copy, and would kick off an HCat MR job to pull the data
over. An added benefit is that data provenance info can be recorded to
show the full history for a particular analysis. Does this approach
sound good to you, or is there a particular reason you want to copy
data & metadata separately?
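
To make that concrete, here's a rough sketch of the driver loop I have
in mind. The metastore URI, table names, and the policy/job-launch/
provenance helpers are hypothetical placeholders, not existing HCatalog
APIs; only the HCatClient calls are from the client API:

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hcatalog.api.HCatClient;
import org.apache.hcatalog.api.HCatPartition;

// Hypothetical driver for a cross-cluster copy tool.  It polls the remote
// cluster's HCatalog server for partitions that match a replication policy,
// launches a copy job per new partition, and records provenance for each.
public class PartitionReplicator {

  public static void main(String[] args) throws Exception {
    Configuration remoteConf = new Configuration();
    // Point the client at the remote cluster's metastore (URI is made up).
    remoteConf.set("hive.metastore.uris",
        "thrift://remote-hcat.example.com:9083");
    HCatClient remote = HCatClient.create(remoteConf);

    // Policy: which table to replicate.  A real tool would read this from
    // configuration and also filter on partition values or timestamps.
    String db = "logs";
    String table = "click_data";

    List<HCatPartition> partitions = remote.getPartitions(db, table);
    for (HCatPartition part : partitions) {
      if (alreadyCopied(part)) {
        continue;  // skip partitions replicated on a previous run
      }
      // Kick off an MR job that reads the partition with HCatInputFormat on
      // the remote cluster and writes it with HCatOutputFormat locally.
      launchCopyJob(db, table, part);
      // Record provenance so the full history of an analysis can be shown.
      recordProvenance(db, table, part.getLocation());
    }
    remote.close();
  }

  // Placeholder helpers -- not part of any existing HCatalog API.
  private static boolean alreadyCopied(HCatPartition part) { return false; }
  private static void launchCopyJob(String db, String table,
      HCatPartition part) { }
  private static void recordProvenance(String db, String table,
      String location) { }
}

Driving the copy from the metastore like this, rather than from raw
distcp paths, is what would let the tool move the metadata and the data
together.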


>    * HCat Metastore server to handle multiple cluster metadata. We would
> eventually like to have one hcat server per colo.

Didn't the Hive folks look at this at some point, then abandon it? Any
context here?


>
> Export/Import:
>    * Ability to import/export metadata out of hcat. Some level of support
> for this was added in hcat 0.2 but is currently broken. This can be useful
> for users backing up their hcat server metadata, importing into another
> hcat metastore server, etc., instead of having to replay add/alter
> partitions. For example, we would like to move a project from one cluster
> to another and it has 1+ years' worth of data in the hcat server. Copying
> the data can be easily done with distcp by copying the top-level table
> directory, but copying the metadata is going to be more cumbersome.
>
> Data Provenance/Lineage and Data Discovery:
>    * Ability to store data provenance/lineage apart from statistics on the
> data.
>    * Ability to discover data. For example, if a new user needs to know where
> click data for search ads is stored, he needs to go through twikis or mail
> userlists to find where exactly it is stored. Need the ability to query on
> keywords/producer of data and find which table contains the data.

This would be interesting to discuss further in the future. We're
already doing some of this, but it's certainly something any HCat user
would be interested in.
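
As a strawman for the kind of discovery query I'm thinking of (the
"keywords" table property and the tagging convention here are made up),
producers could tag tables and anyone could scan the metastore for a
term:

import java.util.Map;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

// Hypothetical keyword search over table metadata.  Assumes data producers
// set a free-form "keywords" table property when creating tables; the
// property name and convention are illustrative only.
public class TableSearch {

  public static void main(String[] args) throws Exception {
    String term = args.length > 0 ? args[0] : "search-ads-clicks";
    HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());

    for (String db : client.getAllDatabases()) {
      for (String tableName : client.getAllTables(db)) {
        Table table = client.getTable(db, tableName);
        Map<String, String> props = table.getParameters();
        String keywords = props.get("keywords");  // assumed convention
        if (keywords != null && keywords.contains(term)) {
          System.out.println(db + "." + tableName + " -> " + keywords);
        }
      }
    }
    client.close();
  }
}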

--travis



> Regards,
> Rohini
>
> On Mon, Oct 15, 2012 at 8:27 AM, Alan Gates <[email protected]> wrote:
>
>> Travis,
>>
>> Thanks for the thought provoking feedback.  Comments inline.
>>
>>
>> On Oct 13, 2012, at 9:55 AM, Travis Crawford wrote:
>>
>> > Hey Alan -
>> >
>> > Thanks for putting this roadmap together, it's definitely well thought
>> > out and appreciated that you took the time to do this. Some
>> > questions/comments below:
>> >
>> >
>> > METADATA
>> >
>> > These seem like the key items that make hcat attractive – shared table
>> > service & read/write path for all processing frameworks. Something I'm
>> > curious about are your thoughts for dealing with
>> > unstructured/semi-structured data. So far I've personally considered
>> > these outside the scope of hcat because the filesystem already does
>> > that well. The table service really helps when you know stuff about
>> > your data. But if you don't know anything about it, why put it in the
>> > table service?
>>
>> A few thoughts:
>> 1) There's a range between relationally structured data and data I know
>> nothing about.  I may know some things about the data (most rows have a
>> user column) without knowing everything.
>> 2) Schema knowledge is not the only value of a table service.  Shared
>> access paradigms and data models are also valuable.  So even for data where
>> the schema is not known to the certainty you would expect in a relational
>> store there is value in having a single access paradigm between tools.
>> 3) The most interesting case here is self-describing data (JSON, Avro,
>> Thrift, etc.).  Hive, Pig, and MR can all handle this, but I think HCatalog
>> could do a better job supporting this between the tools.
>>
>> >
>> > 1. Enable sharing of Hive table data between diverse tools.
>> > 2. Present users of these tools an abstraction that removes them from
>> > details of where and in what format data and metadata are stored.
>> > 6. Support data in all its forms in Hadoop. This includes structured,
>> > semi-structured, and unstructured data. It also includes handling
>> > schema transitions over time and HBase- or Mongo-like tables where each
>> > row can present a different set of fields.
>> > 7. Provide a shared data type model across tools that includes the
>> > data types that users expect in modern SQL.
>> >
>> >
>> > INTERACTING WITH DATA
>> >
>> > What do you see the difference between 3 & 4 - they seem like the same
>> > thing, or very similar.
>> The difference here is in what the API is for.  3 is for tools that will
>> clean, archive, replicate, and compact the data.  These will need to access
>> all or most of the data and ask questions like, "what data should I be
>> replicating to another cluster right now?"  4 is for systems like external
>> data stores that want to push or pull data or push processing to Hadoop.
>>  For example consider connecting a multi-processor database to Hadoop and
>> allowing it to push down simple project/select queries and get the answers
>> in parallel streams of records.
>>
>> >
>> > 3. Provide APIs to enable tools that manage the lifecycle of data in
>> > Hadoop.
>> > 4. Provide APIs to external systems and external users that allow them
>> > to interact efficiently with Hive table data in Hadoop. This includes
>> > creating, altering, removing, exploring, reading, and writing table
>> > data.
>> >
>> >
>> > PHYSICAL STORAGE LAYER:
>> >
>> > Some goals are about the physical storage layer. Can you elaborate on
>> > how this differs from HiveStorageHandler which already exists?
>>
>> I didn't mean to imply this was a different interface.  HiveStorageHandler
>> just needs beefing up and maturing.  It needs to be able to do alter table,
>> it needs implementations against more data stores, etc.
>>
>> >
>> > 5. Provide APIs to allow Hive and other HCatalog clients to
>> > transparently connect to external data stores and use them as Hive
>> > tables (e.g. S3, HBase, or any database or NoSQL store could be used
>> > to store a Hive table).
>> > 9. Provide tables that can accept streams of records and/or row
>> > updates efficiently.
>> >
>> >
>> > OTHER
>> >
>> > The registry stuff sounds like it's getting into processing-framework
>> > land, rather than being a table service. Being in the dependency
>> > management business sounds pretty painful. Is this something people
>> > are actually asking for? I understand having a server like webhcat
>> > that you submit a query to and it executes it (so you just manage the
>> > dependencies on that set of hosts), but having hooks to install code
>> > in the processing framework submitting queries doesn't sound like
>> > something I've heard people asking for.
>> People definitely want to be able to store their UDFs in a shared place.
>>  This is also true for code needed to access data (SerDes, IF/OF,
>> load/store functions).  In any commercial database a user can register a
>> UDF and have it stored for later use or for use by other users.
>>
>> I don't see this extending to dependency management though.  It would be
>> fine to allow the user to register a jar that is needed for a particular
>> UDF.  That jar must contain everything needed for that UDF; HCat won't
>> figure out which jars that jar needs.
>>
>> >
>> > 8. Embrace the Hive security model, while extending it to provide
>> > needed protection to Hadoop data accessed via any tool or UDF.
>> > 10. Provide a registry of SerDes, InputFormats and OutputFormats, and
>> > StorageHandlers that allow HCatalog clients to reference any Hive
>> > table in any storage format or on any data source without needing to
>> > install code on their machines.
>> > 11. Provide a Registry of UDFs and Table Functions that allow clients
>> > to utilize registered UDFs from compatible tools by invoking them by
>> > name.
>> >
>> >
>> >
>> > HOW DOES THIS AFFECT HIVE?
>> >
>> > One area we might expand on is how this roadmap affects Hive and its
>> > users. Not much changes, really. If anything, changes are typical
>> > engineering good practices, like having clear interfaces, separation
>> > between components, so some components can be reused by other query
>> > languages.
>> >
>> > Some areas of overlap are the Hive server & Hive web interfaces. How
>> > do you see these compared to webhcat? Would it make sense to
>> > potentially merge these into a single FE daemon that people use to
>> > interact with the cluster?
>>
>> Between HCat and Hive we have at least four servers (HiveServer, thrift
>> metastore, Hive web interfaces, and webhcat).  This is absurd.  I
>> definitely agree that we need to figure out how to rationalize these into
>> fewer servers.  I think we also need to think about where job management
>> services (like webhcat and Hive web interfaces have) go.  I've been clear
>> in the past that the job management services in webhcat are an historical
>> artifact and they need to move, probably to Hadoop core.  It makes sense to
>> me though that the DDL REST interfaces in webhcat become part of Hive and
>> are hosted by one of the existing servers rather than requiring yet another
>> server.
>>
>> >
>> >
>> > OTHER THOUGHTS
>> >
>> > Overall I agree with the roadmap, and like everyone I probably have
>> > parts I'm more interested in using than others. The parts "below" Hive
>> > (table handlers basically) seem more like "these are contributions to
>> > Hive that we're interested in making" rather than saying they're part
>> > of hcat. The package management stuff seems like it's getting into the
>> > processing frameworks turf.
>>
>> I should have made clear I was thinking of this as a roadmap for the HCat
>> community, not necessarily the HCat code.  Due to our tight integration
>> with Hive some of this will be accomplished in Hive.
>>
>> >
>> > Do we see a goal of HCat as storing metadata about who's using the
>> > data? Currently it's only keeping track of metadata about the data
>> > itself, not who's using it. We're already using a complementary system
>> > that keeps that data, and Jakob mentioned interest in this too. I'm
>> > curious to hear if others are interested in this too.
>>
>> This is an open question.  Clearly Hadoop needs job metadata.  Whether
>> it's best to combine that with table metadata or implement it in a separate
>> system, as you have done, is unclear to me.  I lean towards agreeing with
>> your implementation that separation here is best.  But I suspect many users
>> would like one project to solve both, rather than having to deploy HCatalog
>> and a separate job metadata service.  We could put an item in the roadmap
>> noting that we need to consider this.
>>
>> Alan.
>>
>> >
>> >
>> > --travis
>> >
>> >
>> >
>> >
>> >
>> > On Fri, Oct 12, 2012 at 2:59 PM, Alan Gates <[email protected]> wrote:
>> >> In thinking about approaching the Hive community to accept HCatalog as
>> >> a subproject, one thing that came out is it would be useful to have a
>> >> roadmap for HCatalog.  This will help the Hive community understand what
>> >> HCatalog is and plans to be.
>> >>
>> >> It is also good for us as a community to discuss this.  And it is good
>> >> for potential contributors and users to understand what we want HCatalog
>> >> to be.  Pig did this when it was a young project and it helped us
>> >> tremendously.
>> >>
>> >> I have published a proposed roadmap at
>> >> https://cwiki.apache.org/confluence/display/HCATALOG/HCatalog+Roadmap
>> >> Please provide feedback on items I have put there plus any you believe
>> >> should be added.
>> >>
>> >> Alan.
>>
>>
