Travis,

Thanks for the thought-provoking feedback. Comments inline.
On Oct 13, 2012, at 9:55 AM, Travis Crawford wrote:

> Hey Alan -
>
> Thanks for putting this roadmap together, it's definitely well thought out and appreciated that you took the time to do this. Some questions/comments below:
>
>
> METADATA
>
> These seem like the key items that make hcat attractive – shared table service & read/write path for all processing frameworks. Something I'm curious about are your thoughts for dealing with unstructured/semi-structured data. So far I've personally considered these outside the scope of hcat because the filesystem already does that well. The table service really helps when you know stuff about your data. But if you don't know anything about it, why put it in the table service?

A few thoughts:

1) There's a range between relationally structured data and data I know nothing about. I may know some things about the data (most rows have a user column) without knowing everything.

2) Schema knowledge is not the only value of a table service. Shared access paradigms and data models are also valuable. So even for data where the schema is not known to the certainty you would expect in a relational store, there is value in having a single access paradigm across tools.

3) The most interesting case here is self-describing data (JSON, Avro, Thrift, etc.). Hive, Pig, and MR can all handle this, but I think HCatalog could do a better job supporting this between the tools.

> 1. Enable sharing of Hive table data between diverse tools.
> 2. Present users of these tools an abstraction that removes them from details of where and in what format data and metadata are stored.
> 6. Support data in all its forms in Hadoop. This includes structured, semi-structured, and unstructured data. It also includes handling schema transitions over time and HBase- or Mongo-like tables where each row can present a different set of fields.
> 7. Provide a shared data type model across tools that includes the data types that users expect in modern SQL.
>
>
> INTERACTING WITH DATA
>
> What do you see as the difference between 3 & 4 - they seem like the same thing, or very similar.

The difference here is in what the API is for. 3 is for tools that will clean, archive, replicate, and compact the data. These will need to access all or most of the data and ask questions like, "what data should I be replicating to another cluster right now?" 4 is for systems like external data stores that want to push or pull data or push processing to Hadoop. For example, consider connecting a multi-processor database to Hadoop and allowing it to push down simple project/select queries and get the answers in parallel streams of records. (A rough sketch of what that could look like follows the quoted items below.)

> 3. Provide APIs to enable tools that manage the lifecycle of data in Hadoop.
> 4. Provide APIs to external systems and external users that allow them to interact efficiently with Hive table data in Hadoop. This includes creating, altering, removing, exploring, reading, and writing table data.
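Here is that rough sketch: an external system pulling records out of a Hive table through HCatalog's MapReduce input format, restricted to a single partition (a simple select pushdown) and to two columns (a project pushdown). It assumes roughly the current HCatInputFormat/InputJobInfo/HCatSchema APIs, so exact signatures may vary by release, and the database, table, and column names (web, page_views, ds, user, url) are made up for illustration.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.schema.HCatFieldSchema;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class PullFromHCat {

  // Each map task becomes one parallel stream of records back to the caller.
  public static class StreamOut
      extends Mapper<WritableComparable, HCatRecord, NullWritable, NullWritable> {
    @Override
    protected void map(WritableComparable key, HCatRecord value, Context ctx) {
      // Ship 'value' to the external store here; it carries only the
      // projected columns set up in main().
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "pull-from-hcat");
    job.setJarByClass(PullFromHCat.class);

    // "Select" pushdown: only the 2012-10-13 partition of web.page_views is read.
    HCatInputFormat.setInput(job,
        InputJobInfo.create("web", "page_views", "ds=\"2012-10-13\""));

    // "Project" pushdown: ask HCatalog for just the user and url columns.
    List<HCatFieldSchema> cols = new ArrayList<HCatFieldSchema>();
    cols.add(new HCatFieldSchema("user", HCatFieldSchema.Type.STRING, null));
    cols.add(new HCatFieldSchema("url", HCatFieldSchema.Type.STRING, null));
    HCatInputFormat.setOutputSchema(job, new HCatSchema(cols));

    job.setInputFormatClass(HCatInputFormat.class);
    job.setMapperClass(StreamOut.class);
    job.setNumReduceTasks(0);
    // Nothing is written back into Hadoop; the maps stream records out.
    job.setOutputFormatClass(NullOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The point is not the particulars of the job, but that an external engine gets partition pruning, column projection, and parallel record streams without knowing anything about file formats or data layout.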
> PHYSICAL STORAGE LAYER:
>
> Some goals are about the physical storage layer. Can you elaborate on how this differs from HiveStorageHandler which already exists?

I didn't mean to imply this was a different interface. HiveStorageHandler just needs beefing up and maturing. It needs to be able to do alter table, it needs implementations against more data stores, etc. (A small sketch of what a handler looks like today follows the quoted items below.)

> 5. Provide APIs to allow Hive and other HCatalog clients to transparently connect to external data stores and use them as Hive tables. (e.g. S3, HBase, or any database or NoSQL store could be used to store a Hive table)
> 9. Provide tables that can accept streams of records and/or row updates efficiently.
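Here is that sketch, showing roughly the shape of a storage handler today. ExampleStoreHandler is a made-up name; DefaultStorageHandler and HiveMetaHook are the existing Hive classes, so take the exact method list as approximate for whatever release you are on.

import java.util.Map;

import org.apache.hadoop.hive.metastore.HiveMetaHook;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
import org.apache.hadoop.hive.ql.plan.TableDesc;

// Hypothetical handler for some external store. DefaultStorageHandler supplies
// stock input/output format and SerDe choices; a real handler would override
// those with classes that talk to the store.
public class ExampleStoreHandler extends DefaultStorageHandler implements HiveMetaHook {

  @Override
  public HiveMetaHook getMetaHook() {
    return this;
  }

  @Override
  public void configureTableJobProperties(TableDesc tableDesc, Map<String, String> jobProperties) {
    // Copy connection info (host, namespace, ...) from the table definition
    // into the job so that tasks can reach the external store.
  }

  // HiveMetaHook only has hooks around CREATE TABLE and DROP TABLE. There is
  // no equivalent for ALTER TABLE, which is one of the gaps mentioned above.
  public void preCreateTable(Table table) throws MetaException { /* create it in the store */ }
  public void commitCreateTable(Table table) throws MetaException { }
  public void rollbackCreateTable(Table table) throws MetaException { /* undo the create */ }
  public void preDropTable(Table table) throws MetaException { }
  public void commitDropTable(Table table, boolean deleteData) throws MetaException { /* drop it in the store */ }
  public void rollbackDropTable(Table table) throws MetaException { }
}

Binding a table to a handler with CREATE TABLE ... STORED BY already works; what's missing is the alter table path, more implementations like this against other stores, and general maturing.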
> OTHER
>
> The registry stuff sounds like it's getting into processing-framework land, rather than being a table service. Being in the dependency management business sounds pretty painful. Is this something people are actually asking for? I understand having a server like webhcat that you submit a query to and it executes it (so you just manage the dependencies on that set of hosts), but having hooks to install code in the processing framework submitting queries doesn't sound like something I've heard people asking for.

People definitely want to be able to store their UDFs in a shared place. This is also true for the code needed to access data (SerDes, InputFormats/OutputFormats, load/store functions). In any commercial database a user can register a UDF and have it stored for later use or for use by other users. I don't see this extending to dependency management, though. It would be fine to allow the user to register a jar that is needed for a particular UDF. That jar must contain everything needed for that UDF; HCat won't figure out which jars that jar needs.

> 8. Embrace the Hive security model, while extending it to provide needed protection to Hadoop data accessed via any tool or UDF.
> 10. Provide a registry of SerDes, InputFormats and OutputFormats, and StorageHandlers that allow HCatalog clients to reference any Hive table in any storage format or on any data source without needing to install code on their machines.
> 11. Provide a registry of UDFs and Table Functions that allow clients to utilize registered UDFs from compatible tools by invoking them by name.
>
>
> HOW DOES THIS AFFECT HIVE?
>
> One area we might expand on is how this roadmap affects Hive and its users. Not much changes, really. If anything, changes are typical engineering good practices, like having clear interfaces, separation between components, so some components can be reused by other query languages.
>
> Some areas of overlap are the Hive server & Hive web interfaces. How do you see these compared to webhcat? Would it make sense to potentially merge these into a single FE daemon that people use to interact with the cluster?

Between HCat and Hive we have at least four servers (HiveServer, the Thrift metastore, the Hive web interface, and webhcat). This is absurd. I definitely agree that we need to figure out how to rationalize these into fewer servers. I think we also need to think about where job management services (like those webhcat and the Hive web interface have) go. I've been clear in the past that the job management services in webhcat are a historical artifact and need to move, probably to Hadoop core. It makes sense to me, though, that the DDL REST interfaces in webhcat become part of Hive and are hosted by one of the existing servers rather than requiring yet another server.

> OTHER THOUGHTS
>
> Overall I agree with the roadmap, and probably like everyone has the parts I'm more interested in using than others. The parts "below" Hive (table handlers basically) seem more like "these are contributions to Hive that we're interested in making" rather than saying they're part of hcat. The package management stuff seems like it's getting into the processing frameworks' turf.

I should have made clear I was thinking of this as a roadmap for the HCat community, not necessarily the HCat code. Due to our tight integration with Hive, some of this will be accomplished in Hive.

> Do we see a goal of HCat as storing metadata about who's using the data? Currently it's only keeping track of metadata about the data itself, not who's using it. We're already using a complementary system that keeps that data, and Jakob mentioned interest in this too. I'm curious to hear if others are interested in this too.

This is an open question. Clearly Hadoop needs job metadata. Whether it's best to combine that with table metadata or implement it in a separate system, as you have done, is unclear to me. I lean towards agreeing with your implementation that separation here is best. But I suspect many users would like one project to solve both, rather than having to deploy HCatalog and a separate job metadata service. We could put an item in the roadmap noting that we need to consider this.

Alan.

> --travis
>
>
> On Fri, Oct 12, 2012 at 2:59 PM, Alan Gates <[email protected]> wrote:
>> In thinking about approaching the Hive community to accept HCatalog as a subproject, one thing that came out is it would be useful to have a roadmap for HCatalog. This will help the Hive community understand what HCatalog is and plans to be.
>>
>> It is also good for us as a community to discuss this. And it is good for potential contributors and users to understand what we want HCatalog to be. Pig did this when it was a young project and it helped us tremendously.
>>
>> I have published a proposed roadmap at https://cwiki.apache.org/confluence/display/HCATALOG/HCatalog+Roadmap
>> Please provide feedback on items I have put there plus any you believe should be added.
>>
>> Alan.
