Re: DataSourceV2 community sync #3

Xiao Li Sat, 01 Dec 2018 12:37:06 -0800

Hi, Ryan,

I have to emphasize that the catalog is a really important component of
Spark SQL, or of any analytics platform. A careful design is therefore
needed to ensure it works as expected. Based on my previous discussions with
many community members, Spark SQL needs a catalog interface so that we can
mount multiple external physical catalogs and present them as a single
logical catalog (a so-called global federated catalog). In the future, we
can use this interface to develop our own catalog (instead of the Hive
metastore) for more efficient metadata management. We can also plug in ACL
management if needed.
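
To make the idea of a single logical catalog over several mounted physical catalogs concrete, here is a minimal sketch in Java. It is an illustration only, not a proposed API: `PhysicalCatalog`, `FederatedCatalog`, `mount`, and `resolve` are hypothetical names, and the `[catalog.]db.table` naming scheme is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical contract for one mounted physical catalog (a Hive metastore,
// Glue, ...). None of these names come from Spark.
interface PhysicalCatalog {
    // Returns a description of the table, or empty if it is unknown here.
    Optional<String> lookupTable(String database, String table);
}

// Hypothetical logical catalog that exposes every mounted catalog under one
// namespace: "catalog.db.table", or "db.table" for the default catalog.
final class FederatedCatalog {
    private final Map<String, PhysicalCatalog> mounts = new LinkedHashMap<>();
    private final String defaultCatalog;

    FederatedCatalog(String defaultCatalog) {
        this.defaultCatalog = defaultCatalog;
    }

    void mount(String name, PhysicalCatalog catalog) {
        mounts.put(name, catalog);
    }

    Optional<String> resolve(String qualifiedName) {
        String[] parts = qualifiedName.split("\\.");
        if (parts.length != 2 && parts.length != 3) {
            throw new IllegalArgumentException("Expected [catalog.]db.table: " + qualifiedName);
        }
        // A two-part name carries no catalog, so fall back to the default one.
        String catalogName = parts.length == 3 ? parts[0] : defaultCatalog;
        String database = parts[parts.length - 2];
        String table = parts[parts.length - 1];
        PhysicalCatalog catalog = mounts.get(catalogName);
        return catalog == null ? Optional.empty() : catalog.lookupTable(database, table);
    }
}
```

With a "hive" catalog mounted as the default and a "glue" catalog mounted beside it, `resolve("sales.orders")` would go to the default catalog and `resolve("glue.logs.events")` to the Glue one. Views, functions, and ACL checks are deliberately left out of this sketch.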


Based on your previous answers, it sounds like you have many ideas in mind
about building a Catalog interface for Spark SQL, but they are not shown in
the design doc. Could you write them down in a single doc? We can then leave
comments in the design doc instead of discussing various issues across PRs,
emails, and meetings. It would also help the whole community understand your
proposal and post their comments.

Thanks,

Xiao



On Thu, Nov 29, 2018 at 7:06 PM Ryan Blue <rb...@netflix.com> wrote:

> Xiao,
>
> For the questions in this last email about how catalogs interact and how
> functions and other future features work: we discussed those last night. As
> I said then, I think that the right approach is incremental. We don’t want
> to design all of that in one gigantic proposal up front. To do that is to
> put ourselves into analysis paralysis.
>
> We don’t have a design for how catalogs interact with one another, but I
> think we made a strong case for two points: first, that the proposed
> structure doesn’t preclude any of those future decisions (hence we should
> proceed incrementally). Second, that those situations aren’t that hard to
> think through if you’re concerned about them: functions that can run in
> Spark can be run on any data, functions that run in external sources cannot
> be run on any data.
>
> You’re right that I haven’t completely covered your *new* questions. But
> to the questions in your first email:
>
>    - You asked how, for example, Glue may be plugged in. That is well
>    covered in the PR that adds catalogs as a plugin
>    <https://github.com/apache/spark/pull/21306#issue-187572913>, the
>    response I sent to Wenchen’s questions, and the earlier discussion thread I
>    posted to this list with the subject “[DISCUSS] Multiple catalog support”.
>    The short answer is that implementations are configured with Spark config
>    properties and loaded with reflection.
>    - You asked how users implement an external catalog without adding new
>    data sources. That’s also covered in the “Multiple catalog support”
>    proposal, the table catalog PR, and ongoing discussions on the v2 redesign.
>    The answer is that a catalog returns a table instance that implements the
>    various interfaces from Wenchen’s work. A table may implement them directly
>    or return other existing implementations. Here’s how it worked in the
>    old API
>    <https://github.com/apache/spark/pull/21306/files#diff-db51e7934b9ee539ad599197a935cb86R35>.
>
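As a rough illustration of "configured with Spark config properties and loaded with reflection", the following Java sketch looks up a class name under a config key, instantiates it reflectively, and hands it the options that share its prefix. Everything here is an assumption made for illustration: `CatalogPlugin`, `InMemoryCatalog`, `CatalogLoader`, and the `example.catalog.` key prefix are hypothetical and are not the interfaces or property names from the PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical plugin contract; not the interface from the PR.
interface CatalogPlugin {
    void initialize(String name, Map<String, String> options);
    String describe();
}

// Toy implementation with the no-arg constructor that reflection requires.
final class InMemoryCatalog implements CatalogPlugin {
    private String name;
    private Map<String, String> options = new HashMap<>();

    @Override
    public void initialize(String name, Map<String, String> options) {
        this.name = name;
        this.options = new HashMap<>(options);
    }

    @Override
    public String describe() {
        return name + options;
    }
}

final class CatalogLoader {
    // Assumed key layout: "<PREFIX><name>" names the class and
    // "<PREFIX><name>.<option>" carries its options.
    static final String PREFIX = "example.catalog.";

    static CatalogPlugin load(String name, Map<String, String> conf) {
        String className = conf.get(PREFIX + name);
        if (className == null) {
            throw new IllegalArgumentException("No catalog configured as: " + name);
        }
        CatalogPlugin plugin;
        try {
            // Instantiate by class name so the implementation can live outside this code.
            Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
            plugin = (CatalogPlugin) instance;
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalStateException("Cannot load catalog class: " + className, e);
        }
        // Strip the prefix so the plugin only sees its own options.
        String optionPrefix = PREFIX + name + ".";
        Map<String, String> options = new HashMap<>();
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            if (entry.getKey().startsWith(optionPrefix)) {
                options.put(entry.getKey().substring(optionPrefix.length()), entry.getValue());
            }
        }
        plugin.initialize(name, options);
        return plugin;
    }
}
```

Under these assumptions, plugging in Glue would mean shipping a class that implements the plugin contract and pointing one config key at it, with no change to the loader itself.
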
> I hope that you don’t think I expect you to go “without seeing the design”!
>
> rb
>
> On Thu, Nov 29, 2018 at 3:17 PM Xiao Li <gatorsm...@gmail.com> wrote:
>
>> Ryan,
>>
>> All the proposals I have read relate only to table metadata. A catalog
>> contains the metadata of databases, functions, columns, views, and so on.
>> When we have multiple catalogs, how do these catalogs interact with each
>> other? How does the global catalog work? How are views, tables, functions,
>> databases, and columns resolved? Do we have nicknames, mappings, wrappers?
>>
>> Or might I have missed the design docs you sent? Could you post the doc?
>>
>> Thanks,
>>
>> Xiao
>>
>>
>>
>>
>> On Thu, Nov 29, 2018 at 3:06 PM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> Xiao,
>>>
>>> Please have a look at the pull requests and documents I've posted over
>>> the last few months.
>>>
>>> If you still have questions about how you might plug in Glue, let me
>>> know and I can clarify.
>>>
>>> rb
>>>
>>> On Thu, Nov 29, 2018 at 2:56 PM Xiao Li <gatorsm...@gmail.com> wrote:
>>>
>>>> Ryan,
>>>>
>>>> Thanks for leading the discussion and sending out the memo!
>>>>
>>>>
>>>>> Xiao suggested that there are restrictions for how tables and
>>>>> functions interact. Because of this, he doesn’t think that separate
>>>>> TableCatalog and FunctionCatalog APIs are feasible.
>>>>
>>>>
>>>> Anything is possible. It depends on how we design the two interfaces.
>>>> Right now, most of it is unknown to me because I have not seen the design.
>>>>
>>>> I think we need to see the user stories and a high-level design before
>>>> working on a small portion of catalog federation. We do not need an
>>>> exhaustive design at the current stage, but we need to know how the new
>>>> proposal works. For example, how do we plug in a new Hive metastore? How
>>>> do we plug in Glue? How do users implement a new external catalog without
>>>> adding any new data sources? Without knowing more details, it is hard to
>>>> say whether this TableCatalog can satisfy all the requirements.
>>>>
>>>> Cheers,
>>>>
>>>> Xiao
>>>>
>>>>
>>>> On Thu, Nov 29, 2018 at 2:32 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Here are my notes from last night’s sync. Some attendees who joined
>>>>> during the discussion may be missing, since I made the list while we
>>>>> were waiting for people to join.
>>>>>
>>>>> If you have topic suggestions for the next sync, please start sending
>>>>> them to me. Thank you!
>>>>>
>>>>> *Attendees:*
>>>>>
>>>>> Ryan Blue
>>>>> John Zhuge
>>>>> Jamison Bennett
>>>>> Yuanjian Li
>>>>> Xiao Li
>>>>> stczwd
>>>>> Matt Cheah
>>>>> Wenchen Fan
>>>>> Genglian Wang
>>>>> Kevin Yu
>>>>> Maryann Xue
>>>>> Cody Koeninger
>>>>> Bruce Robbins
>>>>> Rohit Karlupia
>>>>>
>>>>> *Agenda:*
>>>>>
>>>>>    - Follow-up issues or discussion on Wenchen’s PR #23086
>>>>>    - TableCatalog proposal
>>>>>    - CatalogTableIdentifier
>>>>>
>>>>> *Notes:*
>>>>>
>>>>>    - Discussion about PR #23086
>>>>>       - Where should the catalog API live since it needs to be
>>>>>       accessible to catalyst rules, but the catalyst module is private?
>>>>>       - Wenchen suggested creating a sql-api module for v2 API
>>>>>       interfaces, making catalyst depend on it
>>>>>       - Consensus was to use Wenchen’s suggestion
>>>>>    - In discussion about #23086, Xiao asked how adding catalog to a
>>>>>    table identifier will work
>>>>>       - Background from Ryan: existing code paths use TableIdentifier
>>>>>       and don’t expect a catalog portion. If an identifier with a catalog
>>>>>       were passed to existing code, that code may use the default catalog
>>>>>       not knowing that a different one was requested, which would be
>>>>>       incorrect behavior.
>>>>>       - Ryan: The proposal for CatalogTableIdentifier addresses this
>>>>>       problem. TableIdentifier is used for identifiers that have no
>>>>>       catalog set. By enforcing that requirement, passing a
>>>>>       TableIdentifier to old code ensures that no catalogs leak into that
>>>>>       code. This is also used when the catalog is set from context. For
>>>>>       example, the TableCatalog API accepts only TableIdentifier because
>>>>>       the catalog is already determined.
>>>>>    - Xiao asked whether FunctionIdentifier needs to be updated in the
>>>>>    same way as CatalogTableIdentifier.
>>>>>       - Ryan: Yes, when a FunctionCatalog API is added
>>>>>    - The remaining time was spent discussing whether the plan to
>>>>>    incrementally replace the current catalog API will work. [Not great
>>>>>    notes here, feel free to add your take in a reply]
>>>>>       - Xiao suggested that there are restrictions for how tables and
>>>>>       functions interact. Because of this, he doesn’t think that separate
>>>>>       TableCatalog and FunctionCatalog APIs are feasible.
>>>>>       - Wenchen and Ryan think that functions should be orthogonal to
>>>>>       data sources
>>>>>       - Matt and Ryan think that catalog design can be done
>>>>>       incrementally as new interfaces (i.e. FunctionCatalog) are added
>>>>>       and that the proposed TableCatalog does not preclude designing for
>>>>>       Xiao’s concerns later
>>>>>       - [I forget who] pointed out that there are restrictions in
>>>>>       some databases for views from different sources
>>>>>       - There was some discussion about when functions or views
>>>>>       cannot be orthogonal. For example, where the code runs is important.
>>>>>       Functions pushed to sources cannot necessarily be run on other
>>>>>       sources and Spark functions cannot necessarily be pushed down to
>>>>>       sources.
>>>>>       - Xiao would like a full catalog replacement design, including
>>>>>       views, databases, and functions and how they interact, before moving
>>>>>       forward with the proposed TableCatalog API
>>>>>       - Ryan [and Matt, I think] think that TableCatalog is
>>>>>       compatible with future decisions and the best path forward is to
>>>>>       build incrementally. An exhaustive design process blocks progress on
>>>>>       v2.
>>>>>
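The identifier separation described in the notes can be sketched as follows, assuming a much simplified model. These classes are hypothetical stand-ins written for illustration: they reuse the names `TableIdentifier` and `CatalogTableIdentifier` from the discussion but are not Spark's classes, and `asTableIdentifier` and `LegacyLookup` are invented here.

```java
import java.util.Optional;

// Hypothetical catalog-free identifier: the only type legacy (v1) code accepts.
final class TableIdentifier {
    final Optional<String> database;
    final String table;

    TableIdentifier(Optional<String> database, String table) {
        this.database = database;
        this.table = table;
    }
}

// Hypothetical identifier that may name a catalog.
final class CatalogTableIdentifier {
    final Optional<String> catalog;
    final Optional<String> database;
    final String table;

    CatalogTableIdentifier(Optional<String> catalog, Optional<String> database, String table) {
        this.catalog = catalog;
        this.database = database;
        this.table = table;
    }

    // Narrowing fails loudly instead of silently dropping the catalog.
    TableIdentifier asTableIdentifier() {
        if (catalog.isPresent()) {
            throw new IllegalStateException("Identifier names catalog: " + catalog.get());
        }
        return new TableIdentifier(database, table);
    }
}

final class LegacyLookup {
    // Stand-in for an existing v1 code path that always uses the default catalog.
    static String describe(TableIdentifier ident) {
        return "default:" + ident.database.orElse("default") + "." + ident.table;
    }
}
```

Because the legacy method only accepts the catalog-free type, an identifier that names a catalog has to be rejected or routed elsewhere before it reaches that code, which is the "no catalogs leak into that code" property from the notes.
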
>>>>>
>>>>> On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I just sent out an invite for the next DSv2 community sync for
>>>>>> Wednesday, 28 Nov at 5PM PST.
>>>>>>
>>>>>> We have a few topics left over from last time to cover. A few people
>>>>>> wanted to cover catalog APIs, so I put two items on the agenda:
>>>>>>
>>>>>>    - The TableCatalog proposal (and other catalog APIs)
>>>>>>    - Using CatalogTableIdentifier to separate v1 and v2 code paths
>>>>>>    and avoid unintended behavior changes
>>>>>>
>>>>>> As I noted in the summary last time, please send topics ahead of time
>>>>>> so we can get started more quickly.
>>>>>>
>>>>>> If you would like to be added to the google hangout invite, please
>>>>>> let me know and I’ll add you. Thanks!
>>>>>>
>>>>>> rb
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
