Thanks Travis. Replies inline. Had forgotten to hit the send button on this one.
On Tue, Oct 16, 2012 at 8:41 AM, Travis Crawford <[email protected]> wrote:

> Very good list Rohini. Since the thread is getting a bit long I consolidated the suggestions so far in a Wishlist section of the wiki page, so we don't lose track of them.
>
> https://cwiki.apache.org/confluence/display/HCATALOG/HCatalog+Roadmap#HCatalogRoadmap-Wishlist
>
> Some comments inline:
>
> On Mon, Oct 15, 2012 at 10:44 AM, Rohini Palaniswamy <[email protected]> wrote:
> > Alan,
> >
> > The proposed roadmap looks good. I especially like the idea of having a registry so that every user does not have to try and find the required correct serde jars, as is happening with pig loader/storer jars today. Also +1 on hosting thrift + webhcat servers in one jvm.
> >
> > But there are a few other things that would be good to have in the roadmap. Have listed them below.
> >
> > Multiple cluster capabilities:
> > * Process data from one cluster and store it into another cluster, i.e. HCatLoader() reads from one cluster's hcat server and HCatStorer() writes to another hcat server. This is something that we would be needing very soon.
>
> This is something we'll need too. Related to your export/import comments, this is the direction I hope to take our data replicating tools in, rather than separate distcp for the data and metadata import/export. A copy tool would have some policy about what to pull from remote clusters, would scan or be notified of new partitions to copy, and would kick off an HCat MR job to pull the data over. An added benefit is that data provenance info can be recorded to show the full history for a particular analysis. Is this approach one that sounds good to you, or is there a particular reason you want to copy data & metadata separately?

Rohini: You are thinking along the lines of Goal 5 that Alan had mentioned about data lifecycle management tools and replication/BCP. In that case you need to back up both metadata and data, or rather copy data to another hdfs and metadata to another hcat server. But there will be cases where the option to separately back up (export) metadata is useful, like when data is managed externally (external tables). Another case is archival of data. You could be mounting your hdfs through fuse and archiving data directly to tape, but you will also need to get the metadata and store it along with the archived data so that during restore you can restore the state in the hcat server.

> > * HCat Metastore server to handle multiple cluster metadata. We would eventually like to have one hcat server per colo.
>
> Didn't the Hive folks look at this at some point, then abandon it? Any context here?

Rohini: Not familiar with the history. It would be good to know why they abandoned the idea. But in general you will have to do a lot of classloader magic to separate out different versions of hadoop jars. It is painful but feasible, and we have a product in production that interacts with all the clusters that we have.

> > Export/Import:
> > * Ability to import/export metadata out of hcat. Some level of support for this was added in hcat 0.2 but is currently broken. This can be useful for users backing up their hcat server metadata, for importing into another hcat metastore server, etc., instead of having to replay add/alter partitions. For eg: we would like to move a project from one cluster to another and it has 1+ years worth of data in the hcat server.
> > Copying the data can be easily done with distcp by copying the toplevel table directory, but copying the metadata is going to be more cumbersome.
> >
> > Data Provenance/Lineage and Data Discovery:
> > * Ability to store data provenance/lineage apart from statistics on the data.
> > * Ability to discover data. For eg: if a new user needs to know where click data for search ads is stored, he needs to go through twikis or mail user lists to find where exactly it is stored. Need the ability to query on keywords/producer of data and find which table contains the data.
>
> This would be interesting to discuss further in the future. We're already doing some of this, but it's certainly something any HCat user would be interested in.
>
> --travis
>
> > Regards,
> > Rohini
> >
> > On Mon, Oct 15, 2012 at 8:27 AM, Alan Gates <[email protected]> wrote:
> >
> >> Travis,
> >>
> >> Thanks for the thought provoking feedback. Comments inline.
> >>
> >> On Oct 13, 2012, at 9:55 AM, Travis Crawford wrote:
> >>
> >> > Hey Alan -
> >> >
> >> > Thanks for putting this roadmap together, it's definitely well thought out and appreciated that you took the time to do this. Some questions/comments below:
> >> >
> >> > METADATA
> >> >
> >> > These seem like the key items that make hcat attractive – shared table service & read/write path for all processing frameworks. Something I'm curious about is your thoughts for dealing with unstructured/semi-structured data. So far I've personally considered these outside the scope of hcat because the filesystem already does that well. The table service really helps when you know stuff about your data. But if you don't know anything about it, why put it in the table service?
> >>
> >> A few thoughts:
> >> 1) There's a range between relationally structured data and data I know nothing about. I may know some things about the data (most rows have a user column) without knowing everything.
> >> 2) Schema knowledge is not the only value of a table service. Shared access paradigms and data models are also valuable. So even for data where the schema is not known to the certainty you would expect in a relational store there is value in having a single access paradigm between tools.
> >> 3) The most interesting case here is self describing data (JSON, Avro, Thrift, etc.). Hive, Pig, and MR can all handle this, but I think HCatalog could do a better job supporting this between the tools.
> >>
> >> > 1. Enable sharing of Hive table data between diverse tools.
> >> > 2. Present users of these tools an abstraction that removes them from details of where and in what format data and metadata are stored.
> >> > 6. Support data in all its forms in Hadoop. This includes structured, semi-structured, and unstructured data. It also includes handling schema transitions over time and HBase or Mongo like tables where each row can present a different set of fields.
> >> > 7. Provide a shared data type model across tools that includes the data types that users expect in modern SQL.
> >> >
> >> > INTERACTING WITH DATA
> >> >
> >> > What do you see as the difference between 3 & 4 - they seem like the same thing, or very similar.
> >>
> >> The difference here is in what the API is for. 3 is for tools that will clean, archive, replicate, and compact the data.
> >> These will need to access all or most of the data and ask questions like, "what data should I be replicating to another cluster right now?" 4 is for systems like external data stores that want to push or pull data or push processing to Hadoop. For example, consider connecting a multi-processor database to Hadoop and allowing it to push down simple project/select queries and get the answers in parallel streams of records.
> >>
> >> > 3. Provide APIs to enable tools that manage the lifecycle of data in Hadoop.
> >> > 4. Provide APIs to external systems and external users that allow them to interact efficiently with Hive table data in Hadoop. This includes creating, altering, removing, exploring, reading, and writing table data.
> >> >
> >> > PHYSICAL STORAGE LAYER:
> >> >
> >> > Some goals are about the physical storage layer. Can you elaborate on how this differs from HiveStorageHandler which already exists?
> >>
> >> I didn't mean to imply this was a different interface. HiveStorageHandler just needs beefing up and maturing. It needs to be able to do alter table, it needs implementations against more data stores, etc.
> >>
> >> > 5. Provide APIs to allow Hive and other HCatalog clients to transparently connect to external data stores and use them as Hive tables. (e.g. S3, HBase, or any database or NoSQL store could be used to store a Hive table)
> >> > 9. Provide tables that can accept streams of records and/or row updates efficiently.
> >> >
> >> > OTHER
> >> >
> >> > The registry stuff sounds like it's getting into processing-framework land, rather than being a table service. Being in the dependency management business sounds pretty painful. Is this something people are actually asking for? I understand having a server like webhcat that you submit a query to and it executes it (so you just manage the dependencies on that set of hosts), but having hooks to install code in the processing framework submitting queries doesn't sound like something I've heard people asking for.
> >>
> >> People definitely want to be able to store their UDFs in a shared place. This is also true for code needed to access data (SerDes, IF/OF, load/store functions). In any commercial database a user can register a UDF and have it stored for later use or for use by other users.
> >>
> >> I don't see this extending to dependency management though. It would be fine to allow the user to register a jar that is needed for a particular UDF. That jar must contain everything needed for that UDF; HCat won't figure out which jars that jar needs.
> >>
> >> > 8. Embrace the Hive security model, while extending it to provide needed protection to Hadoop data accessed via any tool or UDF.
> >> > 10. Provide a registry of SerDes, InputFormats and OutputFormats, and StorageHandlers that allow HCatalog clients to reference any Hive table in any storage format or on any data source without needing to install code on their machines.
> >> > 11. Provide a registry of UDFs and Table Functions that allow clients to utilize registered UDFs from compatible tools by invoking them by name.
> >> >
> >> > HOW DOES THIS AFFECT HIVE?
> >> >
> >> > One area we might expand on is how this roadmap affects Hive and its users. Not much changes, really.
> >> > If anything, changes are typical engineering good practices, like having clear interfaces and separation between components, so some components can be reused by other query languages.
> >> >
> >> > Some areas of overlap are the Hive server & Hive web interfaces. How do you see these compared to webhcat? Would it make sense to potentially merge these into a single FE daemon that people use to interact with the cluster?
> >>
> >> Between HCat and Hive we have at least four servers (HiveServer, thrift metastore, Hive web interfaces, and webhcat). This is absurd. I definitely agree that we need to figure out how to rationalize these into fewer servers. I think we also need to think about where job management services (like webhcat and Hive web interfaces have) go. I've been clear in the past that the job management services in webhcat are an historical artifact and they need to move, probably to Hadoop core. It makes sense to me though that the DDL REST interfaces in webhcat become part of Hive and are hosted by one of the existing servers rather than requiring yet another server.
> >>
> >> > OTHER THOUGHTS
> >> >
> >> > Overall I agree with the roadmap, and probably like everyone have the parts I'm more interested in using than others. The parts "below" Hive (table handlers basically) seem more like "these are contributions to Hive that we're interested in making" rather than saying they're part of hcat. The package management stuff seems like it's getting into the processing frameworks' turf.
> >>
> >> I should have made clear I was thinking of this as a roadmap for the HCat community, not necessarily the HCat code. Due to our tight integration with Hive some of this will be accomplished in Hive.
> >>
> >> > Do we see a goal of HCat as storing metadata about who's using the data? Currently it's only keeping track of metadata about the data itself, not who's using it. We're already using a complementary system that keeps that data, and Jakob mentioned interest in this too. I'm curious to hear if others are interested in this too.
> >>
> >> This is an open question. Clearly Hadoop needs job metadata. Whether it's best to combine that with table metadata or implement it in a separate system, as you have done, is unclear to me. I lean towards agreeing with your implementation that separation here is best. But I suspect many users would like one project to solve both, rather than having to deploy HCatalog and a separate job metadata service. We could put an item in the roadmap noting that we need to consider this.
> >>
> >> Alan.
> >>
> >> > --travis
> >> >
> >> > On Fri, Oct 12, 2012 at 2:59 PM, Alan Gates <[email protected]> wrote:
> >> >> In thinking about approaching the Hive community to accept HCatalog as a subproject, one thing that came out is that it would be useful to have a roadmap for HCatalog. This will help the Hive community understand what HCatalog is and plans to be.
> >> >>
> >> >> It is also good for us as a community to discuss this. And it is good for potential contributors and users to understand what we want HCatalog to be. Pig did this when it was a young project and it helped us tremendously.
> >> >> I have published a proposed roadmap at https://cwiki.apache.org/confluence/display/HCATALOG/HCatalog+Roadmap. Please provide feedback on items I have put there plus any you believe should be added.
> >> >>
> >> >> Alan.
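
To make the multi-cluster copy item a bit more concrete, below is a rough sketch of the kind of "HCat MR job to pull the data over" that Travis describes, written against the HCatalog MapReduce classes as I read them in the 0.4 docs. Treat it as illustration only: the database/table names, partition key and filter are made up, and the interesting part, pointing the input at one cluster's metastore and the output at another's, is exactly the wishlist item that does not exist yet. Today both ends resolve to whatever hive.metastore.uris is configured on the submitting machine.

    // Sketch only: copy one partition of a table through HCatalog's MR interface.
    // Class/method names follow the HCatalog 0.4 InputOutput docs; table names,
    // partition key and filter are hypothetical. Pointing input and output at two
    // different metastores is the wishlist item, not something HCat supports today.
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hcatalog.data.DefaultHCatRecord;
    import org.apache.hcatalog.data.schema.HCatSchema;
    import org.apache.hcatalog.mapreduce.HCatInputFormat;
    import org.apache.hcatalog.mapreduce.HCatOutputFormat;
    import org.apache.hcatalog.mapreduce.InputJobInfo;
    import org.apache.hcatalog.mapreduce.OutputJobInfo;

    public class PartitionCopy {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "hcat-partition-copy");
        job.setJarByClass(PartitionCopy.class);

        // Read one partition of the source table (filter syntax as in the HCat docs).
        HCatInputFormat.setInput(job,
            InputJobInfo.create("default", "click_events", "dt=\"20121016\""));
        job.setInputFormatClass(HCatInputFormat.class);

        // Write the same partition into the destination table.
        Map<String, String> partition = new HashMap<String, String>();
        partition.put("dt", "20121016");
        HCatOutputFormat.setOutput(job,
            OutputJobInfo.create("default", "click_events_copy", partition));
        HCatSchema schema = HCatOutputFormat.getTableSchema(job);
        HCatOutputFormat.setSchema(job, schema);
        job.setOutputFormatClass(HCatOutputFormat.class);

        // Identity map, no reduce: records pass straight from the reader to the writer.
        job.setMapperClass(org.apache.hadoop.mapreduce.Mapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(WritableComparable.class);
        job.setOutputValueClass(DefaultHCatRecord.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }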

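Similarly, for the export/import item: until that support is fixed, the fallback really is replaying add/alter partitions against the destination metastore, which is what makes it cumbersome for a table with 1+ years of partitions. Below is a minimal sketch of the dump side using the plain metastore Thrift client; the table name and the print-to-stdout "format" are only for illustration, and a real tool would page through all partitions and write the Thrift objects somewhere durable.

    // Sketch only: dump a table's partition metadata so it could later be replayed
    // against another metastore. Table name and output handling are hypothetical.
    import java.util.List;

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Partition;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class MetadataDump {
      public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();   // picks up hive-site.xml / hive.metastore.uris
        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
          Table table = client.getTable("default", "click_events");
          System.out.println(table);      // Thrift objects print all their fields

          // Grab the first 100 partitions as a sample; a real tool would page through all.
          List<Partition> parts =
              client.listPartitions("default", "click_events", (short) 100);
          for (Partition p : parts) {
            System.out.println(p.getValues() + " -> " + p.getSd().getLocation());
          }
        } finally {
          client.close();
        }
      }
    }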