Hey Alan -

Thanks for putting this roadmap together; it's definitely well thought
out, and I appreciate that you took the time to do this. Some
questions/comments below:


METADATA

These seem like the key items that make hcat attractive – shared table
service & read/write path for all processing frameworks. One thing I'm
curious about is your thinking on dealing with
unstructured/semi-structured data. So far I've personally considered
these outside the scope of hcat because the filesystem already does
that well. The table service really helps when you know stuff about
your data. But if you don't know anything about it, why put it in the
table service?

1. Enable sharing of Hive table data between diverse tools.
2. Present users of these tools with an abstraction that removes them
from the details of where and in what format data and metadata are stored.
6. Support data in all its forms in Hadoop. This includes structured,
semi-structured, and unstructured data. It also includes handling
schema transitions over time, as well as HBase- or Mongo-like tables
where each row can present a different set of fields.
7. Provide a shared data type model across tools that includes the
data types that users expect in modern SQL.


INTERACTING WITH DATA

What do you see as the difference between 3 & 4? They seem like the
same thing, or at least very similar.

3. Provide APIs to enable tools that manage the lifecycle of data in Hadoop.
4. Provide APIs to external systems and external users that allow them
to interact efficiently with Hive table data in Hadoop. This includes
creating, altering, removing, exploring, reading, and writing table
data.


PHYSICAL STORAGE LAYER

Some goals are about the physical storage layer. Can you elaborate on
how this differs from HiveStorageHandler, which already exists?

5. Provide APIs to allow Hive and other HCatalog clients to
transparently connect to external data stores and use them as Hive
tables (e.g. S3, HBase, or any database or NoSQL store could be used
to store a Hive table).
9. Provide tables that can accept streams of records and/or row
updates efficiently.
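
To make the overlap concrete, here's the kind of thing the existing
HiveStorageHandler mechanism already supports today, using Hive's
built-in HBase integration (the table and column names below are just
for illustration):

```sql
-- Back a Hive table with HBase via the existing storage handler;
-- the column mapping ties Hive columns to HBase row key / column family.
CREATE TABLE hbase_backed (key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")
TBLPROPERTIES ("hbase.table.name" = "my_hbase_table");
```

So I'd like to understand what goal 5 adds beyond this existing path.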


OTHER

The registry stuff sounds like it's getting into processing-framework
land, rather than being a table service. Being in the dependency
management business sounds pretty painful. Is this something people
are actually asking for? I understand having a server like webhcat
that you submit a query to and it executes it (so you only manage the
dependencies on that set of hosts), but hooks to install code in the
processing framework that submits queries don't sound like something
I've heard people asking for.

8. Embrace the Hive security model, while extending it to provide
needed protection to Hadoop data accessed via any tool or UDF.
10. Provide a registry of SerDes, InputFormats, OutputFormats, and
StorageHandlers that allows HCatalog clients to reference any Hive
table in any storage format or on any data source without needing to
install code on their machines.
11. Provide a registry of UDFs and table functions that allows clients
to utilize registered UDFs from compatible tools by invoking them by
name.



HOW DOES THIS AFFECT HIVE?

One area we might expand on is how this roadmap affects Hive and its
users. Not much changes, really. If anything, the changes are typical
good engineering practices: clear interfaces and separation between
components, so that some components can be reused by other query
languages.

Some areas of overlap are the Hive server & Hive web interfaces. How
do you see these compared to webhcat? Would it make sense to
potentially merge these into a single FE daemon that people use to
interact with the cluster?


OTHER THOUGHTS

Overall I agree with the roadmap, and, like everyone, I probably have
parts I'm more interested in using than others. The parts "below" Hive
(table handlers, basically) seem more like "these are contributions to
Hive that we're interested in making" rather than part of hcat proper.
The package management stuff seems like it's getting into the
processing frameworks' turf.

Do we see a goal of HCat as storing metadata about who's using the
data? Currently it only keeps track of metadata about the data
itself, not who's using it. We're already using a complementary system
that keeps that data, and Jakob mentioned interest in this as well. I'm
curious to hear whether others are interested too.


--travis





On Fri, Oct 12, 2012 at 2:59 PM, Alan Gates <[email protected]> wrote:
> In thinking about approaching the Hive community to accept HCatalog as a 
> subproject one thing that came out is it would be useful to have a roadmap 
> for HCatalog.  This will help the Hive community understand what HCatalog is 
> and plans to be.
>
> It is also good for us as a community to discuss this.  And it is good for 
> potential contributors and users to understand what we want HCatalog to be.  
> Pig did this when it was a young project and it helped us tremendously.
>
> I have published a proposed roadmap at 
> https://cwiki.apache.org/confluence/display/HCATALOG/HCatalog+Roadmap  Please 
> provide feedback on items I have put there plus any you believe that should 
> be added.
>
> Alan.
