Thanks Travis. Replies inline. Had forgotten to hit the send button on this one.
On Tue, Oct 16, 2012 at 8:41 AM, Travis Crawford <[email protected]> wrote:

> Very good list Rohini. Since the thread is getting a bit long I consolidated the suggestions so far in a Wishlist section of the wiki page, so we don't lose track of them.
>
> https://cwiki.apache.org/confluence/display/HCATALOG/HCatalog+Roadmap#HCatalogRoadmap-Wishlist
>
> Some comments inline:
>
> On Mon, Oct 15, 2012 at 10:44 AM, Rohini Palaniswamy <[email protected]> wrote:
> > Alan,
> >
> > The proposed roadmap looks good. I especially like the idea of having a registry so that every user does not have to try and find the required correct serde jars, as is happening with pig loader/storer jars today. Also +1 on hosting thrift + webhcat servers in one jvm.
> >
> > But there are a few other things that would be good to have in the roadmap. Have listed them below.
> >
> > Multiple cluster capabilities:
> > * Process data from one cluster and store it into another cluster, i.e. HCatLoader() reads from one cluster's hcat server and HCatStorer() writes to another hcat server. This is something that we would be needing very soon.
>
> This is something we'll need too. Related to your export/import comments, this is the direction I hope to take our data replicating tools in, rather than separate distcp for the data and metadata import/export. A copy tool would have some policy about what to pull from remote clusters, would scan or be notified of new partitions to copy, and would kick off an HCat MR job to pull the data over. An added benefit is that data provenance info can be recorded to show the full history for a particular analysis. Is this approach one that sounds good to you, or is there a particular reason you want to copy data & metadata separately?

Rohini: You are thinking along the lines of Goal 5 that Alan had mentioned about data lifecycle management tools and replication/BCP. In that case you need to back up both metadata and data, or rather copy data to another hdfs and metadata to another hcat server. But there will be cases where the option to separately back up (export) metadata is useful, like when data is managed externally (external tables). Another case is archival of data. You could be mounting your hdfs through fuse and archiving data directly to tape, but you will also need to get the metadata and store it along with the archived data so that during restore you can restore the state in the hcat server.

> > * HCat Metastore server to handle multiple cluster metadata. We would eventually like to have one hcat server per colo.
>
> Didn't the Hive folks look at this at some point, then abandon it? Any context here?

Rohini: Not familiar with the history. It would be good to know why they abandoned the idea. But in general you will have to do a lot of classloader magic to separate out different versions of hadoop jars. It is painful but feasible, and we have a product in production that interacts with all the clusters that we have.

> > Export/Import:
> > * Ability to import/export metadata out of hcat. Some level of support for this was added in hcat 0.2 but is currently broken. This can be useful for users backing up their hcat server metadata, for importing into another hcat metastore server, etc., instead of having to replay add/alter partitions. For eg: we would like to move a project from one cluster to another and it has 1+ years worth of data in the hcat server.
> > Copying the data can be easily done with distcp by copying the toplevel table directory, but copying the metadata is going to be more cumbersome.
> >
> > Data Provenance/Lineage and Data Discovery:
> > * Ability to store data provenance/lineage apart from statistics on the data.
> > * Ability to discover data. For eg: if a new user needs to know where click data for search ads is stored, he needs to go through twikis or mail user lists to find where exactly it is stored. Need the ability to query on keywords/producer of data and find which table contains the data.
>
> This would be interesting to discuss further in the future. We're already doing some of this, but it's certainly something any HCat user would be interested in.
>
> --travis
>
> > Regards,
> > Rohini
> >
> > On Mon, Oct 15, 2012 at 8:27 AM, Alan Gates <[email protected]> wrote:
> >
> >> Travis,
> >>
> >> Thanks for the thought provoking feedback. Comments inline.
> >>
> >> On Oct 13, 2012, at 9:55 AM, Travis Crawford wrote:
> >>
> >> > Hey Alan -
> >> >
> >> > Thanks for putting this roadmap together, it's definitely well thought out and appreciated that you took the time to do this. Some questions/comments below:
> >> >
> >> > METADATA
> >> >
> >> > These seem like the key items that make hcat attractive – shared table service & read/write path for all processing frameworks. Something I'm curious about is your thoughts for dealing with unstructured/semi-structured data. So far I've personally considered these outside the scope of hcat because the filesystem already does that well. The table service really helps when you know stuff about your data. But if you don't know anything about it, why put it in the table service?
> >>
> >> A few thoughts:
> >> 1) There's a range between relationally structured data and data I know nothing about. I may know some things about the data (most rows have a user column) without knowing everything.
> >> 2) Schema knowledge is not the only value of a table service. Shared access paradigms and data models are also valuable. So even for data where the schema is not known to the certainty you would expect in a relational store there is value in having a single access paradigm between tools.
> >> 3) The most interesting case here is self describing data (JSON, Avro, Thrift, etc.). Hive, Pig, and MR can all handle this, but I think HCatalog could do a better job supporting this between the tools.
> >>
> >> > 1. Enable sharing of Hive table data between diverse tools.
> >> > 2. Present users of these tools an abstraction that removes them from details of where and in what format data and metadata are stored.
> >> > 6. Support data in all its forms in Hadoop. This includes structured, semi-structured, and unstructured data. It also includes handling schema transitions over time and HBase or Mongo like tables where each row can present a different set of fields.
> >> > 7. Provide a shared data type model across tools that includes the data types that users expect in modern SQL.
> >> >
> >> > INTERACTING WITH DATA
> >> >
> >> > What do you see as the difference between 3 & 4 - they seem like the same thing, or very similar.
> >>
> >> The difference here is in what the API is for. 3 is for tools that will clean, archive, replicate, and compact the data.
> >> These will need to access all or most of the data and ask questions like, "what data should I be replicating to another cluster right now?" 4 is for systems like external data stores that want to push or pull data or push processing to Hadoop. For example, consider connecting a multi-processor database to Hadoop and allowing it to push down simple project/select queries and get the answers in parallel streams of records.
> >>
> >> > 3. Provide APIs to enable tools that manage the lifecycle of data in Hadoop.
> >> > 4. Provide APIs to external systems and external users that allow them to interact efficiently with Hive table data in Hadoop. This includes creating, altering, removing, exploring, reading, and writing table data.
> >> >
> >> > PHYSICAL STORAGE LAYER:
> >> >
> >> > Some goals are about the physical storage layer. Can you elaborate on how this differs from HiveStorageHandler which already exists?
> >>
> >> I didn't mean to imply this was a different interface. HiveStorageHandler just needs beefing up and maturing. It needs to be able to do alter table, it needs implementations against more data stores, etc.
> >>
> >> > 5. Provide APIs to allow Hive and other HCatalog clients to transparently connect to external data stores and use them as Hive tables. (e.g. S3, HBase, or any database or NoSQL store could be used to store a Hive table)
> >> > 9. Provide tables that can accept streams of records and/or row updates efficiently.
> >> >
> >> > OTHER
> >> >
> >> > The registry stuff sounds like it's getting into processing-framework land, rather than being a table service. Being in the dependency management business sounds pretty painful. Is this something people are actually asking for? I understand having a server like webhcat that you submit a query to and it executes it (so you just manage the dependencies on that set of hosts), but having hooks to install code in the processing framework submitting queries doesn't sound like something I've heard people asking for.
> >>
> >> People definitely want to be able to store their UDFs in a shared place. This is also true for code needed to access data (SerDes, IF/OF, load/store functions). In any commercial database a user can register a UDF and have it stored for later use or for use by other users.
> >>
> >> I don't see this extending to dependency management though. It would be fine to allow the user to register a jar that is needed for a particular UDF. That jar must contain everything needed for that UDF; HCat won't figure out which jars that jar needs.
> >>
> >> > 8. Embrace the Hive security model, while extending it to provide needed protection to Hadoop data accessed via any tool or UDF.
> >> > 10. Provide a registry of SerDes, InputFormats and OutputFormats, and StorageHandlers that allow HCatalog clients to reference any Hive table in any storage format or on any data source without needing to install code on their machines.
> >> > 11. Provide a registry of UDFs and Table Functions that allow clients to utilize registered UDFs from compatible tools by invoking them by name.
> >> >
> >> > HOW DOES THIS AFFECT HIVE?
> >> >
> >> > One area we might expand on is how this roadmap affects Hive and its users. Not much changes, really.
> >> > If anything, changes are typical engineering good practices, like having clear interfaces and separation between components, so some components can be reused by other query languages.
> >> >
> >> > Some areas of overlap are the Hive server & Hive web interfaces. How do you see these compared to webhcat? Would it make sense to potentially merge these into a single FE daemon that people use to interact with the cluster?
> >>
> >> Between HCat and Hive we have at least four servers (HiveServer, thrift metastore, Hive web interfaces, and webhcat). This is absurd. I definitely agree that we need to figure out how to rationalize these into fewer servers. I think we also need to think about where job management services (like webhcat and Hive web interfaces have) go. I've been clear in the past that the job management services in webhcat are an historical artifact and they need to move, probably to Hadoop core. It makes sense to me though that the DDL REST interfaces in webhcat become part of Hive and are hosted by one of the existing servers rather than requiring yet another server.
> >>
> >> > OTHER THOUGHTS
> >> >
> >> > Overall I agree with the roadmap, and probably like everyone have the parts I'm more interested in using than others. The parts "below" Hive (table handlers basically) seem more like "these are contributions to Hive that we're interested in making" rather than saying they're part of hcat. The package management stuff seems like it's getting into the processing frameworks' turf.
> >>
> >> I should have made clear I was thinking of this as a roadmap for the HCat community, not necessarily the HCat code. Due to our tight integration with Hive some of this will be accomplished in Hive.
> >>
> >> > Do we see a goal of HCat as storing metadata about who's using the data? Currently it's only keeping track of metadata about the data itself, not who's using it. We're already using a complementary system that keeps that data, and Jakob mentioned interest in this too. I'm curious to hear if others are interested in this too.
> >>
> >> This is an open question. Clearly Hadoop needs job metadata. Whether it's best to combine that with table metadata or implement it in a separate system, as you have done, is unclear to me. I lean towards agreeing with your implementation that separation here is best. But I suspect many users would like one project to solve both, rather than having to deploy HCatalog and a separate job metadata service. We could put an item in the roadmap noting that we need to consider this.
> >>
> >> Alan.
> >>
> >> > --travis
> >> >
> >> > On Fri, Oct 12, 2012 at 2:59 PM, Alan Gates <[email protected]> wrote:
> >> >> In thinking about approaching the Hive community to accept HCatalog as a subproject, one thing that came out is that it would be useful to have a roadmap for HCatalog. This will help the Hive community understand what HCatalog is and plans to be.
> >> >>
> >> >> It is also good for us as a community to discuss this. And it is good for potential contributors and users to understand what we want HCatalog to be. Pig did this when it was a young project and it helped us tremendously.
> >> >> I have published a proposed roadmap at https://cwiki.apache.org/confluence/display/HCATALOG/HCatalog+Roadmap. Please provide feedback on items I have put there plus any you believe should be added.
> >> >>
> >> >> Alan.
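
To make the multi-cluster copy item a bit more concrete, below is a rough sketch of the kind of "HCat MR job to pull the data over" that Travis describes, written against the HCatalog MapReduce classes as I read them in the 0.4 docs. Treat it as illustration only: the database/table names, partition key and filter are made up, and the interesting part, pointing the input at one cluster's metastore and the output at another's, is exactly the wishlist item that does not exist yet. Today both ends resolve to whatever hive.metastore.uris is configured on the submitting machine.

    // Sketch only: copy one partition of a table through HCatalog's MR interface.
    // Class/method names follow the HCatalog 0.4 InputOutput docs; table names,
    // partition key and filter are hypothetical. Pointing input and output at two
    // different metastores is the wishlist item, not something HCat supports today.
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hcatalog.data.DefaultHCatRecord;
    import org.apache.hcatalog.data.schema.HCatSchema;
    import org.apache.hcatalog.mapreduce.HCatInputFormat;
    import org.apache.hcatalog.mapreduce.HCatOutputFormat;
    import org.apache.hcatalog.mapreduce.InputJobInfo;
    import org.apache.hcatalog.mapreduce.OutputJobInfo;

    public class PartitionCopy {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "hcat-partition-copy");
        job.setJarByClass(PartitionCopy.class);

        // Read one partition of the source table (filter syntax as in the HCat docs).
        HCatInputFormat.setInput(job,
            InputJobInfo.create("default", "click_events", "dt=\"20121016\""));
        job.setInputFormatClass(HCatInputFormat.class);

        // Write the same partition into the destination table.
        Map<String, String> partition = new HashMap<String, String>();
        partition.put("dt", "20121016");
        HCatOutputFormat.setOutput(job,
            OutputJobInfo.create("default", "click_events_copy", partition));
        HCatSchema schema = HCatOutputFormat.getTableSchema(job);
        HCatOutputFormat.setSchema(job, schema);
        job.setOutputFormatClass(HCatOutputFormat.class);

        // Identity map, no reduce: records pass straight from the reader to the writer.
        job.setMapperClass(org.apache.hadoop.mapreduce.Mapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(WritableComparable.class);
        job.setOutputValueClass(DefaultHCatRecord.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }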

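Similarly, for the export/import item: until that support is fixed, the fallback really is replaying add/alter partitions against the destination metastore, which is what makes it cumbersome for a table with 1+ years of partitions. Below is a minimal sketch of the dump side using the plain metastore Thrift client; the table name and the print-to-stdout "format" are only for illustration, and a real tool would page through all partitions and write the Thrift objects somewhere durable.

    // Sketch only: dump a table's partition metadata so it could later be replayed
    // against another metastore. Table name and output handling are hypothetical.
    import java.util.List;

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Partition;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class MetadataDump {
      public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();   // picks up hive-site.xml / hive.metastore.uris
        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
          Table table = client.getTable("default", "click_events");
          System.out.println(table);      // Thrift objects print all their fields

          // Grab the first 100 partitions as a sample; a real tool would page through all.
          List<Partition> parts =
              client.listPartitions("default", "click_events", (short) 100);
          for (Partition p : parts) {
            System.out.println(p.getValues() + " -> " + p.getSd().getLocation());
          }
        } finally {
          client.close();
        }
      }
    }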