Hi Alan,

The roadmap looks good, and thanks for laying out the vision in a
well-thought-out structure.

A few comments:

HCatalog's Goals:
> 5. Provide APIs to allow Hive and other HCatalog clients to transparently
> connect to external data stores and use them as Hive tables. (e.g. S3,
> HBase, or any database or NoSQL store could be used to store a Hive table)

* Does it mean the ability to store Hive tables on external data stores
other than HDFS, or is it the ability to use already *existing data* in
external systems via HCatalog?
* +1 on federated HCatalog as suggested by Rohini.
* HCatalog currently seems to have the functionality to publish events
(data availability, creation of a table, etc.) via JMS. Is there any
thought on strengthening the API and adding other methods of publishing
events? In short, making it a core feature, as I didn't see it mentioned
in the goals. (A rough sketch of the current mechanism follows this list.)
* Regarding the registry for UDFs and table functions, won't maven (or a
similar system) outside of HCatalog let users share UDFs or the code
required to access data? Doing it in HCatalog seems to step a bit outside
its boundary.
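
On the JMS point, a minimal consumer sketch of what listening to those
events can look like today. The broker URL and topic name are placeholders,
and it assumes the metastore already has a JMS notification listener
configured against an ActiveMQ broker; this is only meant to illustrate the
existing mechanism, not a proposed API:

    import javax.jms.*;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class HCatEventTail {
      public static void main(String[] args) throws Exception {
        // Placeholder broker URL and topic name; the metastore must already be
        // configured to publish notifications to this broker for anything to arrive.
        ConnectionFactory factory =
            new ActiveMQConnectionFactory("tcp://broker.example.com:61616");
        Connection conn = factory.createConnection();
        conn.start();
        Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer =
            session.createConsumer(session.createTopic("hcat.mydb.clicks"));
        consumer.setMessageListener(new MessageListener() {
          public void onMessage(Message msg) {
            try {
              // Print whatever the listener publishes (e.g. add-partition notices);
              // no particular message format is assumed here.
              if (msg instanceof TextMessage) {
                System.out.println("HCatalog event: " + ((TextMessage) msg).getText());
              }
            } catch (JMSException e) {
              e.printStackTrace();
            }
          }
        });
        Thread.sleep(Long.MAX_VALUE); // stay alive and keep receiving events
      }
    }
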
Thanks,
Arup

On Mon, Oct 15, 2012 at 10:44 AM, Rohini Palaniswamy <[email protected]> wrote:

> Alan,
>
> The proposed roadmap looks good. I especially like the idea of having a
> registry so that every user does not have to try to find the correct
> serde jars, as is happening with pig loader/storer jars today. Also +1 on
> hosting the thrift + webhcat servers in one jvm.
>
> But there are a few other things that would be good to have in the
> roadmap. I have listed them below.
>
> Multiple cluster capabilities:
> * Process data from one cluster and store into another cluster, i.e.
> HCatLoader() reads from one cluster's hcat server and HCatStorer() writes
> to another hcat server. This is something that we will need very soon.
> * HCat metastore server to handle metadata for multiple clusters. We
> would eventually like to have one hcat server per colo.
>
> Export/Import:
> * Ability to import/export metadata out of hcat. Some level of support
> for this was added in hcat 0.2 but is currently broken. This can be
> useful for users backing up their hcat server metadata, importing it into
> another hcat metastore server, etc., instead of having to replay
> add/alter partitions. For example: we would like to move a project from
> one cluster to another, and it has 1+ years' worth of data in the hcat
> server. Copying the data can be done easily with distcp by copying the
> top-level table directory, but copying the metadata is going to be more
> cumbersome.
>
> Data Provenance/Lineage and Data Discovery:
> * Ability to store data provenance/lineage apart from statistics on the
> data.
> * Ability to discover data. For example: if a new user needs to know
> where click data for search ads is stored, he has to go through twikis or
> mail user lists to find where exactly it is stored. We need the ability
> to query on keywords/producer of data and find which table contains the
> data.
>
> Regards,
> Rohini
>
> On Mon, Oct 15, 2012 at 8:27 AM, Alan Gates <[email protected]> wrote:
>
> > Travis,
> >
> > Thanks for the thought-provoking feedback. Comments inline.
> >
> > On Oct 13, 2012, at 9:55 AM, Travis Crawford wrote:
> >
> > > Hey Alan -
> > >
> > > Thanks for putting this roadmap together; it's definitely well
> > > thought out, and I appreciate that you took the time to do this. Some
> > > questions/comments below:
> > >
> > > METADATA
> > >
> > > These seem like the key items that make hcat attractive – shared
> > > table service & read/write path for all processing frameworks.
> > > Something I'm curious about is your thoughts on dealing with
> > > unstructured/semi-structured data. So far I've personally considered
> > > these outside the scope of hcat because the filesystem already does
> > > that well. The table service really helps when you know stuff about
> > > your data. But if you don't know anything about it, why put it in the
> > > table service?
> >
> > A few thoughts:
> > 1) There's a range between relationally structured data and data I know
> > nothing about. I may know some things about the data (most rows have a
> > user column) without knowing everything.
> > 2) Schema knowledge is not the only value of a table service. Shared
> > access paradigms and data models are also valuable. So even for data
> > where the schema is not known to the certainty you would expect in a
> > relational store, there is value in having a single access paradigm
> > between tools.
> > 3) The most interesting case here is self-describing data (JSON, Avro,
> > Thrift, etc.). Hive, Pig, and MR can all handle this, but I think
> > HCatalog could do a better job supporting this between the tools.
> >
> > > 1. Enable sharing of Hive table data between diverse tools.
> > > 2. Present users of these tools an abstraction that removes them from
> > > details of where and in what format data and metadata are stored.
> > > 6. Support data in all its forms in Hadoop. This includes structured,
> > > semi-structured, and unstructured data. It also includes handling
> > > schema transitions over time and HBase- or Mongo-like tables where
> > > each row can present a different set of fields.
> > > 7. Provide a shared data type model across tools that includes the
> > > data types that users expect in modern SQL.
> > >
> > > INTERACTING WITH DATA
> > >
> > > What do you see as the difference between 3 & 4? They seem like the
> > > same thing, or very similar.
> >
> > The difference here is in what the API is for. 3 is for tools that will
> > clean, archive, replicate, and compact the data. These will need to
> > access all or most of the data and ask questions like, "what data
> > should I be replicating to another cluster right now?" 4 is for systems
> > like external data stores that want to push or pull data or push
> > processing to Hadoop. For example, consider connecting a multi-processor
> > database to Hadoop and allowing it to push down simple project/select
> > queries and get the answers in parallel streams of records.
> >
> > > 3. Provide APIs to enable tools that manage the lifecycle of data in
> > > Hadoop.
> > > 4. Provide APIs to external systems and external users that allow
> > > them to interact efficiently with Hive table data in Hadoop. This
> > > includes creating, altering, removing, exploring, reading, and
> > > writing table data.
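
As a rough illustration of the lifecycle-tool case above, a minimal sketch
of a client going straight at the thrift metastore to see which partitions
exist. The database and table names ("mydb", "clicks") are placeholders and
error handling is omitted; a real replication or retention tool would
compare this listing against what the target cluster or its policy already
knows about:

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Partition;

    public class ListPartitions {
      public static void main(String[] args) throws Exception {
        // Connects to whatever metastore the hive-site.xml on the classpath
        // points at; "mydb" and "clicks" are placeholder names.
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
        for (Partition p : client.listPartitions("mydb", "clicks", (short) 100)) {
          // Partition key values and the HDFS location backing each partition.
          System.out.println(p.getValues() + " -> " + p.getSd().getLocation());
        }
        client.close();
      }
    }
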
> > > PHYSICAL STORAGE LAYER:
> > >
> > > Some goals are about the physical storage layer. Can you elaborate on
> > > how this differs from HiveStorageHandler, which already exists?
> >
> > I didn't mean to imply this was a different interface.
> > HiveStorageHandler just needs beefing up and maturing. It needs to be
> > able to do alter table, it needs implementations against more data
> > stores, etc.
> >
> > > 5. Provide APIs to allow Hive and other HCatalog clients to
> > > transparently connect to external data stores and use them as Hive
> > > tables. (e.g. S3, HBase, or any database or NoSQL store could be used
> > > to store a Hive table)
> > > 9. Provide tables that can accept streams of records and/or row
> > > updates efficiently.
> > >
> > > OTHER
> > >
> > > The registry stuff sounds like it's getting into processing-framework
> > > land, rather than being a table service. Being in the dependency
> > > management business sounds pretty painful. Is this something people
> > > are actually asking for? I understand having a server like webhcat
> > > that you submit a query to and it executes it (so you just manage the
> > > dependencies on that set of hosts), but having hooks to install code
> > > in the processing framework submitting queries doesn't sound like
> > > something I've heard people asking for.
> >
> > People definitely want to be able to store their UDFs in a shared
> > place. This is also true for the code needed to access data (SerDes,
> > IF/OF, load/store functions). In any commercial database a user can
> > register a UDF and have it stored for later use or for use by other
> > users.
> >
> > I don't see this extending to dependency management though. It would be
> > fine to allow the user to register a jar that is needed for a
> > particular UDF. That jar must contain everything needed for that UDF;
> > HCat won't figure out which jars that jar needs.
> >
> > > 8. Embrace the Hive security model, while extending it to provide
> > > needed protection to Hadoop data accessed via any tool or UDF.
> > > 10. Provide a registry of SerDes, InputFormats and OutputFormats, and
> > > StorageHandlers that allows HCatalog clients to reference any Hive
> > > table in any storage format or on any data source without needing to
> > > install code on their machines.
> > > 11. Provide a registry of UDFs and Table Functions that allows
> > > clients to utilize registered UDFs from compatible tools by invoking
> > > them by name.
> > >
> > > HOW DOES THIS AFFECT HIVE?
> > >
> > > One area we might expand on is how this roadmap affects Hive and its
> > > users. Not much changes, really. If anything, the changes are typical
> > > engineering good practices, like having clear interfaces and
> > > separation between components, so some components can be reused by
> > > other query languages.
> > >
> > > Some areas of overlap are the Hive server & Hive web interfaces. How
> > > do you see these compared to webhcat? Would it make sense to
> > > potentially merge these into a single FE daemon that people use to
> > > interact with the cluster?
> >
> > Between HCat and Hive we have at least four servers (HiveServer, the
> > thrift metastore, the Hive web interface, and webhcat). This is absurd.
> > I definitely agree that we need to figure out how to rationalize these
> > into fewer servers. I think we also need to think about where job
> > management services (like webhcat and the Hive web interface have) go.
> > I've been clear in the past that the job management services in webhcat
> > are an historical artifact and they need to move, probably to Hadoop
> > core. It makes sense to me, though, that the DDL REST interfaces in
> > webhcat become part of Hive and are hosted by one of the existing
> > servers rather than requiring yet another server.
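
For reference, a minimal sketch of hitting the webhcat DDL REST interface
discussed above to describe a table over plain HTTP. The host, user, and
table names are placeholders, 50111 is webhcat's usual default port, and
the response is a JSON description of the table:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class DescribeTable {
      public static void main(String[] args) throws Exception {
        // Placeholder host/user/table; the path is webhcat's DDL resource.
        URL url = new URL("http://webhcat.example.com:50111/templeton/v1/"
            + "ddl/database/default/table/clicks?user.name=arup");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        BufferedReader in =
            new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line); // JSON describing the table's columns and location
        }
        in.close();
        conn.disconnect();
      }
    }
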
> > > OTHER THOUGHTS
> > >
> > > Overall I agree with the roadmap, and probably like everyone I have
> > > parts I'm more interested in using than others. The parts "below"
> > > Hive (table handlers, basically) seem more like "these are
> > > contributions to Hive that we're interested in making" rather than
> > > saying they're part of hcat. The package management stuff seems like
> > > it's getting into the processing frameworks' turf.
> >
> > I should have made clear I was thinking of this as a roadmap for the
> > HCat community, not necessarily the HCat code. Due to our tight
> > integration with Hive, some of this will be accomplished in Hive.
> >
> > > Do we see a goal of HCat as storing metadata about who's using the
> > > data? Currently it's only keeping track of metadata about the data
> > > itself, not who's using it. We're already using a complementary
> > > system that keeps that data, and Jakob mentioned interest in this
> > > too. I'm curious to hear if others are interested in this as well.
> >
> > This is an open question. Clearly Hadoop needs job metadata. Whether
> > it's best to combine that with table metadata or implement it in a
> > separate system, as you have done, is unclear to me. I lean towards
> > agreeing with your implementation that separation here is best. But I
> > suspect many users would like one project to solve both, rather than
> > having to deploy HCatalog and a separate job metadata service. We
> > could put an item in the roadmap noting that we need to consider this.
> >
> > Alan.
> >
> > > --travis
> > >
> > > On Fri, Oct 12, 2012 at 2:59 PM, Alan Gates <[email protected]> wrote:
> > >> In thinking about approaching the Hive community to accept HCatalog
> > >> as a subproject, one thing that came out is that it would be useful
> > >> to have a roadmap for HCatalog. This will help the Hive community
> > >> understand what HCatalog is and plans to be.
> > >>
> > >> It is also good for us as a community to discuss this. And it is
> > >> good for potential contributors and users to understand what we want
> > >> HCatalog to be. Pig did this when it was a young project and it
> > >> helped us tremendously.
> > >>
> > >> I have published a proposed roadmap at
> > >> https://cwiki.apache.org/confluence/display/HCATALOG/HCatalog+Roadmap
> > >> Please provide feedback on the items I have put there, plus any you
> > >> believe should be added.
> > >>
> > >> Alan.

--
Arup Malakar
