Hey Alan - Thanks for putting this roadmap together; it's clearly well thought out, and I appreciate that you took the time to do this. Some questions/comments below:
METADATA

These seem like the key items that make HCat attractive: a shared table service and read/write path for all processing frameworks. (To make that concrete, see the HCatInputFormat sketch at the end of my comments.) Something I'm curious about is your thinking on unstructured/semi-structured data. So far I've personally considered these outside the scope of HCat because the filesystem already handles them well. The table service really helps when you know things about your data; if you don't know anything about it, why put it in the table service?

1. Enable sharing of Hive table data between diverse tools.
2. Present users of these tools an abstraction that removes them from details of where and in what format data and metadata are stored.
6. Support data in all its forms in Hadoop. This includes structured, semi-structured, and unstructured data. It also includes handling schema transitions over time and HBase- or Mongo-like tables where each row can present a different set of fields.
7. Provide a shared data type model across tools that includes the data types that users expect in modern SQL.

INTERACTING WITH DATA

What do you see as the difference between 3 and 4? They seem like the same thing, or very similar.

3. Provide APIs to enable tools that manage the lifecycle of data in Hadoop.
4. Provide APIs to external systems and external users that allow them to interact efficiently with Hive table data in Hadoop. This includes creating, altering, removing, exploring, reading, and writing table data.

PHYSICAL STORAGE LAYER

Some goals are about the physical storage layer. Can you elaborate on how this differs from HiveStorageHandler, which already exists? (There's a storage handler sketch at the end of my comments showing what that extension point already covers.)

5. Provide APIs to allow Hive and other HCatalog clients to transparently connect to external data stores and use them as Hive tables. (e.g. S3, HBase, or any database or NoSQL store could be used to store a Hive table)
9. Provide tables that can accept streams of records and/or row updates efficiently.

OTHER

The registry stuff sounds like it's getting into processing-framework land, rather than being a table service. Being in the dependency-management business sounds pretty painful. Is this something people are actually asking for? I understand having a server like WebHCat that you submit a query to and it executes it (so you only manage the dependencies on that set of hosts), but having hooks to install code in the processing framework submitting queries doesn't sound like something I've heard people asking for.

8. Embrace the Hive security model, while extending it to provide needed protection to Hadoop data accessed via any tool or UDF.
10. Provide a registry of SerDes, InputFormats, OutputFormats, and StorageHandlers that allows HCatalog clients to reference any Hive table in any storage format or on any data source without needing to install code on their machines.
11. Provide a registry of UDFs and table functions that allows clients to utilize registered UDFs from compatible tools by invoking them by name.

HOW DOES THIS AFFECT HIVE?

One area we might expand on is how this roadmap affects Hive and its users. Not much changes, really. If anything, the changes are typical engineering good practices: clear interfaces and separation between components, so that some components can be reused by other query languages. Some areas of overlap are the Hive server and the Hive web interfaces. How do you see these compared to WebHCat? Would it make sense to merge them into a single front-end daemon that people use to interact with the cluster?
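HCatInputFormat sketch: here's roughly what the "shared read/write path" already buys a plain MapReduce job today. This is from memory of the current 0.4 API, so the exact signatures may be slightly off, and the "web_logs" table, its field layout, and the output path argument are all made up for illustration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class HCatReadExample {

  // Map-only job that projects one field out of a hypothetical "web_logs" table.
  public static class LogMapper
      extends Mapper<WritableComparable, HCatRecord, Text, NullWritable> {
    @Override
    protected void map(WritableComparable key, HCatRecord value, Context ctx)
        throws IOException, InterruptedException {
      // Field position 2 is assumed to hold the status code in this sketch.
      ctx.write(new Text(String.valueOf(value.get(2))), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "hcat-read-example");
    // The metastore, not the submitting job, supplies the table's location,
    // storage format, and SerDe -- that's the shared-table-service win.
    HCatInputFormat.setInput(job,
        InputJobInfo.create("default", "web_logs", null /* partition filter */));
    job.setInputFormatClass(HCatInputFormat.class);
    job.setJarByClass(HCatReadExample.class);
    job.setMapperClass(LogMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same table is readable from Pig via HCatLoader and queryable from Hive directly, with none of the tools hard-coding paths or SerDes.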
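Storage handler sketch: for comparison with goal 5, this is roughly what plugging an external store into Hive already looks like via the existing HiveStorageHandler extension point. Method names are from memory of the current interface; every FooDb* class is a hypothetical placeholder for the store-specific code you'd have to write:

import java.util.Map;
import org.apache.hadoop.hive.metastore.HiveMetaHook;
import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
import org.apache.hadoop.hive.ql.plan.TableDesc;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.mapred.InputFormat;

// Sketch of wiring a hypothetical external store ("FooDb") into Hive with
// the existing storage handler extension point. All FooDb* classes are
// placeholders and would not compile as-is.
public class FooDbStorageHandler extends DefaultStorageHandler {

  @Override
  public Class<? extends InputFormat> getInputFormatClass() {
    return FooDbInputFormat.class;   // hypothetical: splits map to FooDb shards
  }

  @Override
  public Class<? extends SerDe> getSerDeClass() {
    return FooDbSerDe.class;         // hypothetical: FooDb record <-> Hive row
  }

  @Override
  public HiveMetaHook getMetaHook() {
    return new FooDbMetaHook();      // hypothetical: mirrors CREATE/DROP TABLE
  }                                  // from the metastore into FooDb

  // getOutputFormatClass() would likewise supply the write path.

  @Override
  public void configureTableJobProperties(TableDesc tableDesc,
                                          Map<String, String> jobProperties) {
    // Hand connection details from the table definition to the MR job.
    jobProperties.put("foodb.connection",
        tableDesc.getProperties().getProperty("foodb.connection"));
  }
}

So unless goal 5 is aiming at something beyond this (predicate pushdown? statistics?), it reads as already mostly covered, which is why I'm asking.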
OTHER THOUGHTS

Overall I agree with the roadmap and, probably like everyone, have the parts I'm more interested in using than others. The parts "below" Hive (table handlers, basically) seem more like "these are contributions to Hive that we're interested in making" rather than part of HCat. The package-management stuff seems like it's getting into the processing frameworks' turf.

Do we see a goal of HCat as storing metadata about who's using the data? Currently it only keeps track of metadata about the data itself, not who's using it. We're already using a complementary system that keeps that data, and Jakob mentioned interest in this too. I'm curious to hear if others are interested as well.

--travis

On Fri, Oct 12, 2012 at 2:59 PM, Alan Gates <[email protected]> wrote:
> In thinking about approaching the Hive community to accept HCatalog as a
> subproject one thing that came out is it would be useful to have a roadmap
> for HCatalog. This will help the Hive community understand what HCatalog is
> and plans to be.
>
> It is also good for us as a community to discuss this. And it is good for
> potential contributors and users to understand what we want HCatalog to be.
> Pig did this when it was a young project and it helped us tremendously.
>
> I have published a proposed roadmap at
> https://cwiki.apache.org/confluence/display/HCATALOG/HCatalog+Roadmap
> Please provide feedback on items I have put there plus any you believe
> that should be added.
>
> Alan.
