Thanks for the update, Todd!

-- Philip

On Tue, Aug 7, 2018 at 12:34 AM Todd Lipcon <[email protected]>
wrote:

> Hi folks,
>
> It's been a few weeks since the last update on this topic, so I wanted to
> check in and share the progress as well as some plans for the coming weeks.
>
> As far as progress is concerned, most of the work committed (or
> nearly committed) so far has been extensive refactoring. You may notice that
> the majority of the Frontend now refers to catalog objects by interfaces
> such as FeTable, FeDb, FeFsTable, etc, rather than the concrete classes
> Table, Db, HdfsTable, etc. In addition, the catalog itself uses an
> interface instead of a specific implementation. A new impalad flag
> --use_local_catalog can be flipped on to switch in the "new" impalad
> catalog implementation.
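As a rough illustration of the interface extraction described above (the FeTable name is from this thread, but the methods and the CatalogdTable/LocalTable classes here are hypothetical stand-ins, not Impala's actual API):

```java
import java.util.List;

// Sketch of the interface-extraction pattern: planner code depends only on
// the FeTable interface, so either catalog implementation can be swapped in.
interface FeTable {
    String getName();
    List<String> getColumnNames();
}

// Stand-in for the existing concrete implementation, backed by the
// full catalog objects broadcast by catalogd.
class CatalogdTable implements FeTable {
    private final String name;
    private final List<String> columns;
    CatalogdTable(String name, List<String> columns) {
        this.name = name;
        this.columns = columns;
    }
    public String getName() { return name; }
    public List<String> getColumnNames() { return columns; }
}

// Stand-in for the "new" implementation selected by --use_local_catalog,
// which would resolve metadata on demand rather than from pushed objects.
class LocalTable implements FeTable {
    private final String name;
    LocalTable(String name) { this.name = name; }
    public String getName() { return name; }
    public List<String> getColumnNames() {
        // The real implementation would fetch lazily; stubbed here.
        return List.of("id", "col1");
    }
}
```

Everything downstream of the planner can then be written once against FeTable, which is what makes the metadata source swappable later in this email.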
>
> One notable exception to the above refactoring is DDL statements such as
> CREATE/DROP/ALTER. As we've worked on the project it's become clear that
> the DDL implementation is quite tightly interwoven with the concrete
> implementations since most of the operations make some attempt to
> incrementally modify catalog objects in-place to reflect the changes being
> made in external source systems. Additionally, the treatment of the catalog
> versioning and metadata propagation protocol is pretty delicate. Given
> that, we made a decision a few weeks back to continue to delegate DDL
> functions to the catalogd by the existing mechanisms.
>
> This turned up an interesting new set of problems. Specifically, if the
> impalad is fetching metadata directly from source systems, but the catalogd
> continues to operate with its "snapshotted" table-granularity caches of
> objects, it's possible (and even likely) that an impalad will have a
> *newer* view of metadata than the catalogd. This can result in very
> confusing scenarios, like:
>
> 1) a user creates a table through some external system like Hive or Spark
> 2) a user is able to see and query the table on the impalad, since it
> fetched the table metadata direct from HMS
> 3) the user wants to drop the table from Impala. When it sends the request
> to the catalogd, the catalogd will respond with a "table does not exist"
> error.
> 4) the user, confused, may run "show tables" again, and will
> continue to see the table listed.
>
> At this point the only way to resolve the inconsistency would be some set
> of INVALIDATE METADATA statements. There are many other similar scenarios where a user
> may get unexpected results or errors from DDLs based on the catalogd having
> an older cached copy than the impalad is showing.
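A toy simulation of that skew (all names hypothetical; two sets stand in for the impalad's fresh HMS view and the catalogd's stale snapshot):

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the inconsistency: the impalad resolves tables directly
// against HMS, while DDL is delegated to a catalogd serving an older snapshot.
class CatalogSkewDemo {
    static Set<String> hms = new HashSet<>();              // source of truth
    static Set<String> catalogdSnapshot = new HashSet<>(); // stale cache

    // Step 1: an external system (Hive/Spark) creates the table in HMS only.
    static void externalCreate(String table) { hms.add(table); }

    // Step 2: the impalad fetches metadata straight from HMS, so it sees it.
    static boolean impaladSees(String table) { return hms.contains(table); }

    // Step 3: DROP goes to catalogd, which only consults its snapshot.
    static String catalogdDrop(String table) {
        if (!catalogdSnapshot.contains(table)) return "table does not exist";
        catalogdSnapshot.remove(table);
        hms.remove(table);
        return "ok";
    }
}
```

Running the four steps in order reproduces the confusing behavior: the DROP fails even though the table is visible, and remains visible afterwards.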
>
> One way to fix these issues would be to fully re-implement all of the DDL
> statements without the tight interleaving with catalogd code. However, that
> will take a certain amount of time and bring more risk due to the
> complexity of those code paths.
>
> Additionally, during the design phase there were various concerns raised
> around the "fetch directly from source systems" approach, including:
> 1) increased load on source systems
> 2) potential for small semantic changes, such as users who rely on
> REFRESH "freezing" the view of a table
> 3) loss of various fringe features such as non-persistent functions, which
> currently are _only_ saved in catalogd memory
> 4) potentially more difficult to later implement "subscribe to source
> system" type functionality where new files or objects are discovered
> automatically
>
> Given the above, plus the risks/effort of re-implementing DDL operations
> decoupled from catalogd, I'm currently proposing that we shift the design
> of the project a bit. Namely, instead of being "fetch granular metadata
> on-demand from source systems with no catalogd", it will become "fetch
> granular metadata on-demand from catalogd". The vast majority of the
> refactoring work detailed at the top of this email is still relevant: the
> planner still needs to operate on lighter-weight objects with different
> properties than the catalog, and the machinery to switch around the
> behavior still makes sense. It's just that the actual _source_ of metadata
> will be a new RPC on the catalogd which allows on-demand granular metadata
> retrieval.
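A plausible shape for that new RPC (purely illustrative; the real request/response types and method names would differ) is an interface that returns one granular piece of metadata along with the catalog version it came from:

```java
// Hypothetical sketch of the on-demand metadata RPC on catalogd.
// A request names one piece of a table's metadata (schema, a subset of
// partitions, etc.) instead of pulling the whole catalog object.
interface CatalogMetaService {
    MetaResponse getTableMeta(String dbName, String tableName, String piece);
}

class MetaResponse {
    final long catalogVersion; // lets the impalad detect read skew
    final String payload;      // stand-in for the requested metadata piece
    MetaResponse(long catalogVersion, String payload) {
        this.catalogVersion = catalogVersion;
        this.payload = payload;
    }
}
```

Tagging every response with a catalog version is what enables the consistency properties described next.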
>
> A few other key points of this design are:
>
> - the impalads, when they fetch metadata, can note the version number of
> the catalog object that they fetched. This allows them to be sure that they
> will never get "read skew" in their cache -- all pieces of metadata for a
> given table need to have been read from the same version. Invalidation is
> also much easier with these consistent version numbers.
> - the catalogd, instead of broadcasting full catalog objects through the
> statestore, now just needs to send out notices of object version number
> changes. For example "table:functional.alltypes -> version 123". This
> message can cause the impalads to invalidate the cache for that table for
> any version less than 123, and all future queries are sure to re-fetch the
> latest data on-demand.
> - ideally we can make partition objects immutable, so that, given a
> particular partition ID, we know that that ID will never be reused with
> different data. On a REFRESH or other partition metadata change, a new ID
> can be assigned to the partition. This allows us to safely cache partitions
> (the largest bit of metadata) with long TTLs, and to apply very cheap
> updates when a change affects only a subset of partitions.
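The versioning and invalidation points above can be sketched as a single per-table cache on the impalad (names hypothetical; the real statestore protocol carries more than shown here):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a versioned metadata cache on the impalad. Entries record the
// catalog version they were fetched at; a statestore notice like
// "table:functional.alltypes -> version 123" drops any older entry, so the
// next query is sure to re-fetch the latest data from catalogd.
class VersionedMetaCache {
    static class Entry {
        final long version;
        final String metadata; // stands in for table/partition objects
        Entry(long version, String metadata) {
            this.version = version;
            this.metadata = metadata;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    // Called when a fetch from catalogd completes.
    void put(String table, long version, String metadata) {
        cache.put(table, new Entry(version, metadata));
    }

    // Called on a statestore version notice; only stale entries are dropped.
    void onVersionNotice(String table, long newVersion) {
        Entry e = cache.get(table);
        if (e != null && e.version < newVersion) cache.remove(table);
    }

    // A query reads the cache; null means "re-fetch from catalogd".
    String get(String table) {
        Entry e = cache.get(table);
        return e == null ? null : e.metadata;
    }
}
```

Because an entry at version 123 survives a notice for version 123, broadcasts stay idempotent, and a fetch that raced ahead of the notice is not needlessly invalidated.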
>
> In terms of the original "fetch from source systems" design, I do continue
> to think it has promise longer-term as it simplifies the overall
> architecture and can help with cross-system metadata consistency. But we
> can treat fetch-from-catalogd as a nice interim step that should bring most of
> the performance and scalability benefits to users sooner and with less
> risk.
>
> I'll plan to update the original design document to reflect this in coming
> days.
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
