Hi Alan,

The roadmap looks good, and thanks for laying out the vision in a
well-thought-out structure.

A few comments:

HCatalog's Goals:
> 5. Provide APIs to allow Hive and other HCatalog clients to transparently
> connect to external data stores and use them as Hive tables. (e.g. S3,
> HBase, or any database or NoSQL store could be used to store a Hive table)

* Does it mean the ability to store Hive tables on external data stores
other than HDFS, or is it the ability to use already *existing data* in
external systems via HCatalog?
* +1 on federated HCatalog as suggested by Rohini.
* HCatalog currently seems to have the functionality to publish events
(data availability, creation of a table, etc.) via JMS. Is there any
thought on strengthening the API and adding other methods of publishing
events? In short, making it a core feature, as I didn't see it mentioned
in the goals. (A rough sketch of the current mechanism follows this list.)
* Regarding the registry for UDFs and table functions, won't maven (or a
similar system) outside of HCatalog let users share UDFs or the code
required to access data? Doing it in HCatalog seems to step a bit outside
its boundary.
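
On the JMS point, a minimal consumer sketch of what listening to those
events can look like today. The broker URL and topic name are placeholders,
and it assumes the metastore already has a JMS notification listener
configured against an ActiveMQ broker; this is only meant to illustrate the
existing mechanism, not a proposed API:

    import javax.jms.*;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class HCatEventTail {
      public static void main(String[] args) throws Exception {
        // Placeholder broker URL and topic name; the metastore must already be
        // configured to publish notifications to this broker for anything to arrive.
        ConnectionFactory factory =
            new ActiveMQConnectionFactory("tcp://broker.example.com:61616");
        Connection conn = factory.createConnection();
        conn.start();
        Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer =
            session.createConsumer(session.createTopic("hcat.mydb.clicks"));
        consumer.setMessageListener(new MessageListener() {
          public void onMessage(Message msg) {
            try {
              // Print whatever the listener publishes (e.g. add-partition notices);
              // no particular message format is assumed here.
              if (msg instanceof TextMessage) {
                System.out.println("HCatalog event: " + ((TextMessage) msg).getText());
              }
            } catch (JMSException e) {
              e.printStackTrace();
            }
          }
        });
        Thread.sleep(Long.MAX_VALUE); // stay alive and keep receiving events
      }
    }
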
Thanks,
Arup

On Mon, Oct 15, 2012 at 10:44 AM, Rohini Palaniswamy <[email protected]> wrote:

> Alan,
>
> The proposed roadmap looks good. I especially like the idea of having a
> registry so that every user does not have to try to find the correct
> serde jars, as is happening with pig loader/storer jars today. Also +1 on
> hosting the thrift + webhcat servers in one jvm.
>
> But there are a few other things that would be good to have in the
> roadmap. I have listed them below.
>
> Multiple cluster capabilities:
> * Process data from one cluster and store into another cluster, i.e.
> HCatLoader() reads from one cluster's hcat server and HCatStorer() writes
> to another hcat server. This is something that we will need very soon.
> * HCat metastore server to handle metadata for multiple clusters. We
> would eventually like to have one hcat server per colo.
>
> Export/Import:
> * Ability to import/export metadata out of hcat. Some level of support
> for this was added in hcat 0.2 but is currently broken. This can be
> useful for users backing up their hcat server metadata, importing it into
> another hcat metastore server, etc., instead of having to replay
> add/alter partitions. For example: we would like to move a project from
> one cluster to another, and it has 1+ years' worth of data in the hcat
> server. Copying the data can be done easily with distcp by copying the
> top-level table directory, but copying the metadata is going to be more
> cumbersome.
>
> Data Provenance/Lineage and Data Discovery:
> * Ability to store data provenance/lineage apart from statistics on the
> data.
> * Ability to discover data. For example: if a new user needs to know
> where click data for search ads is stored, he has to go through twikis or
> mail user lists to find where exactly it is stored. We need the ability
> to query on keywords/producer of data and find which table contains the
> data.
>
> Regards,
> Rohini
>
> On Mon, Oct 15, 2012 at 8:27 AM, Alan Gates <[email protected]> wrote:
>
> > Travis,
> >
> > Thanks for the thought-provoking feedback. Comments inline.
> >
> > On Oct 13, 2012, at 9:55 AM, Travis Crawford wrote:
> >
> > > Hey Alan -
> > >
> > > Thanks for putting this roadmap together; it's definitely well
> > > thought out, and I appreciate that you took the time to do this. Some
> > > questions/comments below:
> > >
> > > METADATA
> > >
> > > These seem like the key items that make hcat attractive – shared
> > > table service & read/write path for all processing frameworks.
> > > Something I'm curious about is your thoughts on dealing with
> > > unstructured/semi-structured data. So far I've personally considered
> > > these outside the scope of hcat because the filesystem already does
> > > that well. The table service really helps when you know stuff about
> > > your data. But if you don't know anything about it, why put it in the
> > > table service?
> >
> > A few thoughts:
> > 1) There's a range between relationally structured data and data I know
> > nothing about. I may know some things about the data (most rows have a
> > user column) without knowing everything.
> > 2) Schema knowledge is not the only value of a table service. Shared
> > access paradigms and data models are also valuable. So even for data
> > where the schema is not known to the certainty you would expect in a
> > relational store, there is value in having a single access paradigm
> > between tools.
> > 3) The most interesting case here is self-describing data (JSON, Avro,
> > Thrift, etc.). Hive, Pig, and MR can all handle this, but I think
> > HCatalog could do a better job supporting this between the tools.
> >
> > > 1. Enable sharing of Hive table data between diverse tools.
> > > 2. Present users of these tools an abstraction that removes them from
> > > details of where and in what format data and metadata are stored.
> > > 6. Support data in all its forms in Hadoop. This includes structured,
> > > semi-structured, and unstructured data. It also includes handling
> > > schema transitions over time and HBase- or Mongo-like tables where
> > > each row can present a different set of fields.
> > > 7. Provide a shared data type model across tools that includes the
> > > data types that users expect in modern SQL.
> > >
> > > INTERACTING WITH DATA
> > >
> > > What do you see as the difference between 3 & 4? They seem like the
> > > same thing, or very similar.
> >
> > The difference here is in what the API is for. 3 is for tools that will
> > clean, archive, replicate, and compact the data. These will need to
> > access all or most of the data and ask questions like, "what data
> > should I be replicating to another cluster right now?" 4 is for systems
> > like external data stores that want to push or pull data or push
> > processing to Hadoop. For example, consider connecting a multi-processor
> > database to Hadoop and allowing it to push down simple project/select
> > queries and get the answers in parallel streams of records.
> >
> > > 3. Provide APIs to enable tools that manage the lifecycle of data in
> > > Hadoop.
> > > 4. Provide APIs to external systems and external users that allow
> > > them to interact efficiently with Hive table data in Hadoop. This
> > > includes creating, altering, removing, exploring, reading, and
> > > writing table data.
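
As a rough illustration of the lifecycle-tool case above, a minimal sketch
of a client going straight at the thrift metastore to see which partitions
exist. The database and table names ("mydb", "clicks") are placeholders and
error handling is omitted; a real replication or retention tool would
compare this listing against what the target cluster or its policy already
knows about:

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Partition;

    public class ListPartitions {
      public static void main(String[] args) throws Exception {
        // Connects to whatever metastore the hive-site.xml on the classpath
        // points at; "mydb" and "clicks" are placeholder names.
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
        for (Partition p : client.listPartitions("mydb", "clicks", (short) 100)) {
          // Partition key values and the HDFS location backing each partition.
          System.out.println(p.getValues() + " -> " + p.getSd().getLocation());
        }
        client.close();
      }
    }
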
> > > PHYSICAL STORAGE LAYER:
> > >
> > > Some goals are about the physical storage layer. Can you elaborate on
> > > how this differs from HiveStorageHandler, which already exists?
> >
> > I didn't mean to imply this was a different interface.
> > HiveStorageHandler just needs beefing up and maturing. It needs to be
> > able to do alter table, it needs implementations against more data
> > stores, etc.
> >
> > > 5. Provide APIs to allow Hive and other HCatalog clients to
> > > transparently connect to external data stores and use them as Hive
> > > tables. (e.g. S3, HBase, or any database or NoSQL store could be used
> > > to store a Hive table)
> > > 9. Provide tables that can accept streams of records and/or row
> > > updates efficiently.
> > >
> > > OTHER
> > >
> > > The registry stuff sounds like it's getting into processing-framework
> > > land, rather than being a table service. Being in the dependency
> > > management business sounds pretty painful. Is this something people
> > > are actually asking for? I understand having a server like webhcat
> > > that you submit a query to and it executes it (so you just manage the
> > > dependencies on that set of hosts), but having hooks to install code
> > > in the processing framework submitting queries doesn't sound like
> > > something I've heard people asking for.
> >
> > People definitely want to be able to store their UDFs in a shared
> > place. This is also true for the code needed to access data (SerDes,
> > IF/OF, load/store functions). In any commercial database a user can
> > register a UDF and have it stored for later use or for use by other
> > users.
> >
> > I don't see this extending to dependency management though. It would be
> > fine to allow the user to register a jar that is needed for a
> > particular UDF. That jar must contain everything needed for that UDF;
> > HCat won't figure out which jars that jar needs.
> >
> > > 8. Embrace the Hive security model, while extending it to provide
> > > needed protection to Hadoop data accessed via any tool or UDF.
> > > 10. Provide a registry of SerDes, InputFormats and OutputFormats, and
> > > StorageHandlers that allows HCatalog clients to reference any Hive
> > > table in any storage format or on any data source without needing to
> > > install code on their machines.
> > > 11. Provide a registry of UDFs and Table Functions that allows
> > > clients to utilize registered UDFs from compatible tools by invoking
> > > them by name.
> > >
> > > HOW DOES THIS AFFECT HIVE?
> > >
> > > One area we might expand on is how this roadmap affects Hive and its
> > > users. Not much changes, really. If anything, the changes are typical
> > > engineering good practices, like having clear interfaces and
> > > separation between components, so some components can be reused by
> > > other query languages.
> > >
> > > Some areas of overlap are the Hive server & Hive web interfaces. How
> > > do you see these compared to webhcat? Would it make sense to
> > > potentially merge these into a single FE daemon that people use to
> > > interact with the cluster?
> >
> > Between HCat and Hive we have at least four servers (HiveServer, the
> > thrift metastore, the Hive web interface, and webhcat). This is absurd.
> > I definitely agree that we need to figure out how to rationalize these
> > into fewer servers. I think we also need to think about where job
> > management services (like webhcat and the Hive web interface have) go.
> > I've been clear in the past that the job management services in webhcat
> > are an historical artifact and they need to move, probably to Hadoop
> > core. It makes sense to me, though, that the DDL REST interfaces in
> > webhcat become part of Hive and are hosted by one of the existing
> > servers rather than requiring yet another server.
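
For reference, a minimal sketch of hitting the webhcat DDL REST interface
discussed above to describe a table over plain HTTP. The host, user, and
table names are placeholders, 50111 is webhcat's usual default port, and
the response is a JSON description of the table:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class DescribeTable {
      public static void main(String[] args) throws Exception {
        // Placeholder host/user/table; the path is webhcat's DDL resource.
        URL url = new URL("http://webhcat.example.com:50111/templeton/v1/"
            + "ddl/database/default/table/clicks?user.name=arup");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        BufferedReader in =
            new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line); // JSON describing the table's columns and location
        }
        in.close();
        conn.disconnect();
      }
    }
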
> > > OTHER THOUGHTS
> > >
> > > Overall I agree with the roadmap, and probably like everyone I have
> > > parts I'm more interested in using than others. The parts "below"
> > > Hive (table handlers, basically) seem more like "these are
> > > contributions to Hive that we're interested in making" rather than
> > > saying they're part of hcat. The package management stuff seems like
> > > it's getting into the processing frameworks' turf.
> >
> > I should have made clear I was thinking of this as a roadmap for the
> > HCat community, not necessarily the HCat code. Due to our tight
> > integration with Hive, some of this will be accomplished in Hive.
> >
> > > Do we see a goal of HCat as storing metadata about who's using the
> > > data? Currently it's only keeping track of metadata about the data
> > > itself, not who's using it. We're already using a complementary
> > > system that keeps that data, and Jakob mentioned interest in this
> > > too. I'm curious to hear if others are interested in this as well.
> >
> > This is an open question. Clearly Hadoop needs job metadata. Whether
> > it's best to combine that with table metadata or implement it in a
> > separate system, as you have done, is unclear to me. I lean towards
> > agreeing with your implementation that separation here is best. But I
> > suspect many users would like one project to solve both, rather than
> > having to deploy HCatalog and a separate job metadata service. We
> > could put an item in the roadmap noting that we need to consider this.
> >
> > Alan.
> >
> > > --travis
> > >
> > > On Fri, Oct 12, 2012 at 2:59 PM, Alan Gates <[email protected]> wrote:
> > >> In thinking about approaching the Hive community to accept HCatalog
> > >> as a subproject, one thing that came out is that it would be useful
> > >> to have a roadmap for HCatalog. This will help the Hive community
> > >> understand what HCatalog is and plans to be.
> > >>
> > >> It is also good for us as a community to discuss this. And it is
> > >> good for potential contributors and users to understand what we want
> > >> HCatalog to be. Pig did this when it was a young project and it
> > >> helped us tremendously.
> > >>
> > >> I have published a proposed roadmap at
> > >> https://cwiki.apache.org/confluence/display/HCATALOG/HCatalog+Roadmap
> > >> Please provide feedback on the items I have put there, plus any you
> > >> believe should be added.
> > >>
> > >> Alan.

--
Arup Malakar
