Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23086#discussion_r236487464
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/Table.java ---
    @@ -0,0 +1,51 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.sources.v2;
    --- End diff --
    
    #21306 (TableCatalog support) adds this class as 
`org.apache.spark.sql.catalog.v2.Table` in the `spark-catalyst` module. I think 
it needs to be in the catalyst module and should probably be in the 
`o.a.s.sql.catalog.v2` package as well.
    
    Of the two, the important change is moving this class to the catalyst module. The analyzer lives in catalyst, and all of the v2 logical plans and analysis rules will be in catalyst as well, because we are standardizing behavior. The standard validation rules should live in catalyst, not in a source-specific or hive-specific package in the sql-core or hive modules.
    
    Because the logical plans and validation rules are in the catalyst module, the `TableCatalog` API needs to be there as well. For example, when a [catalog table identifier](https://github.com/apache/spark/pull/21978) is resolved for a read query, one of the results is a `TableCatalog` instance for the catalog portion of the identifier. That catalog is used to load the v2 table, which is then wrapped in a v2 relation for further analysis. Similarly, the write path should validate during analysis that the catalog exists by loading it, and would then pass the catalog along in a v2 logical plan for `CreateTable` or `CreateTableAsSelect`. A rough sketch of the read-path flow is below.
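    To make that flow concrete, here is a simplified sketch. The stand-in types below only mirror the shape of the `TableCatalog` and `Table` interfaces proposed in #21306; they are not the proposed API itself, and the helper names are hypothetical.

```scala
// Illustrative stand-ins only; the real interfaces are the ones proposed
// in #21306, not these definitions.
trait Table { def name: String }
trait TableCatalog { def loadTable(ident: Seq[String]): Table }

// Hypothetical v2 relation wrapper handed to later analysis rules.
case class DataSourceV2Relation(table: Table)

// Read-path resolution: resolve the catalog portion of the identifier,
// load the table through that catalog, and wrap it for further analysis.
def resolveRead(
    lookupCatalog: String => TableCatalog,
    identifier: Seq[String]): DataSourceV2Relation = {
  val catalog = lookupCatalog(identifier.head)   // catalog portion
  val table = catalog.loadTable(identifier.tail) // remaining name parts
  DataSourceV2Relation(table)
}
```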
    
    I also think that it makes sense to use the 
`org.apache.spark.sql.catalog.v2` package for `Table` because `Table` is more 
closely tied to the `TableCatalog` API than to the data source API. The link to 
DSv2 is that `Table` carries `newScanBuilder`, but the rest of the methods 
exposed by `Table` are for catalog functions, like inspecting a table's 
partitioning or table properties.
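    For illustration, this is roughly the split I mean. Again, a simplified sketch rather than the committed interface; method names only mirror what has been proposed.

```scala
// Simplified sketch of the split described above; names mirror the
// proposed interfaces but are not the committed API.
trait ScanBuilder // placeholder for the v2 read-side builder

trait Table {
  // Catalog-facing metadata, naturally paired with the TableCatalog API.
  def name: String
  def partitioning: Seq[String]          // e.g. partition columns/expressions
  def properties: Map[String, String]

  // The single DSv2 hook: reads start from a scan builder.
  def newScanBuilder(options: Map[String, String]): ScanBuilder
}
```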
    
    Moving this class would make adding `TableCatalog` less intrusive.

