Stuart,

Thanks for pointing this out; I was not aware that we use the Spark
database concept this way. Actually, this confuses me a lot. As far as I
understand, the catalog is created in the scope of a particular
IgniteSparkSession, which in turn is assigned to a particular
IgniteContext and therefore to a single Ignite client. If that's the
case, I don't think it should be aware of other Ignite clients that are
connected to other clusters. This doesn't look like correct behavior to
me, not to mention that with this approach having multiple databases
would be a very rare case. I believe we should get rid of this logic and
use the Ignite schema name as the database name in Spark's catalog.
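
Roughly, I would expect the catalog to behave like this (a hypothetical
sketch only: listDatabases/listTables are the standard Spark
ExternalCatalog methods, while igniteSQLSchemas and igniteSQLTables are
made-up helper names):

    override def listDatabases(): Seq[String] =
      igniteSQLSchemas(ignite)  // one Spark "database" per Ignite schema

    override def listTables(db: String): Seq[String] =
      igniteSQLTables(ignite, schema = db).map(_.tableName)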

Nikolay, what do you think?

-Val

On Tue, Aug 21, 2018 at 8:17 AM Stuart Macdonald <stu...@stuwee.org> wrote:

> Nikolay, Val,
>
> The JDBC Spark datasource[1] -- as far as I can tell -- has no
> ExternalCatalog implementation; it just uses the database specified in
> the JDBC URL. So I don't believe there is any way to call listTables()
> or listDatabases() for the JDBC provider.
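>
> For illustration, with the JDBC provider the schema only ever appears
> embedded in the connection options (the values here are made up):
>
>     spark.read
>       .format("jdbc")
>       .option("url", "jdbc:postgresql://host/db")
>       .option("dbtable", "mySchema.myTable")
>       .load()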
>
> The Hive ExternalCatalog[2] makes the distinction between database and
> table using the actual database and table mechanisms built into the
> catalog, which works because Hive itself has a clear distinction and
> hierarchy of databases and tables.
>
> *However*, Ignite already uses the "database" concept in the Ignite
> ExternalCatalog[3] to mean the name of an Ignite instance. So in Ignite
> we have instances containing schemas containing tables, while Spark
> only has the concepts of databases and tables. It seems we must either
> ignore one of the three Ignite concepts or combine two of them into a
> single database or table attribute. The current implementation in the
> pull request combines the Ignite schema and table names into the Spark
> table attribute.
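>
> Schematically, the current mapping looks like this (names are
> illustrative only):
>
>     catalog.listDatabases()                 // => Seq("myIgniteInstance")
>     catalog.listTables("myIgniteInstance")  // => Seq("mySchema.myTable")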
>
> Stuart.
>
> [1]
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
> [2]
>
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
> [3]
>
> https://github.com/apache/ignite/blob/master/modules/spark/src/main/scala/org/apache/spark/sql/ignite/IgniteExternalCatalog.scala
>
> On Tue, Aug 21, 2018 at 9:31 AM, Nikolay Izhikov <nizhi...@apache.org>
> wrote:
>
> > Hello, Stuart.
> >
> > Can you do some research and find out how the schema is handled in
> > DataFrames for a regular RDBMS such as Oracle, MySQL, etc.?
> >
> > On Mon, 20/08/2018 at 15:37 -0700, Valentin Kulichenko wrote:
> > > Stuart, Nikolay,
> > >
> > > I see that the 'Table' class (returned by the listTables method)
> > > has a 'database' field. Can we use this one to report the schema
> > > name?
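> > >
> > > For reference, that field is exposed through the public catalog
> > > API; a quick sketch (assuming a SparkSession named 'spark'):
> > >
> > >     spark.catalog.listTables().collect().foreach { t =>
> > >       // 'database' could carry the Ignite schema name
> > >       println(s"${t.database}.${t.name}")
> > >     }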
> > >
> > > In any case, I think we should look into how this is done in data
> > > source implementations for other databases. Any relational database
> > > has a notion of schema, and I'm sure Spark integrations take this
> > > into account somehow.
> > >
> > > -Val
> > >
> > > On Mon, Aug 20, 2018 at 6:12 AM Nikolay Izhikov <nizhi...@apache.org>
> > > wrote:
> > > > Hello, Stuart.
> > > >
> > > > Personally, I think we should change the current table naming and
> > > > return tables in the form of `schema.table`.
> > > >
> > > > Valentin, could you share your opinion?
> > > >
> > > >
> > > > On Mon, 20/08/2018 at 10:04 +0100, Stuart Macdonald wrote:
> > > > > Igniters,
> > > > >
> > > > > While reviewing the changes for IGNITE-9228 [1,2], Nikolay and
> > > > > I are discussing whether to introduce a change which may impact
> > > > > backwards compatibility; Nikolay suggested we take the
> > > > > discussion to this list.
> > > > >
> > > > > Ignite implements a custom Spark catalog which provides an API
> > > > > for Spark users to list the tables available in Ignite that can
> > > > > be queried via Spark SQL. Currently that list includes just the
> > > > > names of the tables, but IGNITE-9228 introduces a change which
> > > > > allows optional prefixing of schema names to table names to
> > > > > disambiguate multiple tables with the same name in different
> > > > > schemas. For the "list tables" API we therefore have two options
> > > > > (sketched below):
> > > > >
> > > > > 1. List the tables using both their table names and their
> > > > > schema-qualified table names (eg. [ "myTable",
> > > > > "mySchema.myTable" ]) even though they are the same underlying
> > > > > table. This retains backwards compatibility with users who
> > > > > expect "myTable" to appear in the catalog.
> > > > > 2. List the tables using only their schema-qualified names. This
> > > > > eliminates duplication of names in the catalog but will
> > > > > potentially break compatibility with users who expect the plain
> > > > > table name in the catalog.
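> > > > >
> > > > > Illustrative listTables() results for the two options (names
> > > > > are made up):
> > > > >
> > > > >     // Option 1: plain and qualified names for the same table
> > > > >     catalog.listTables("db") // => Seq("myTable", "mySchema.myTable")
> > > > >
> > > > >     // Option 2: qualified names only
> > > > >     catalog.listTables("db") // => Seq("mySchema.myTable")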
> > > > >
> > > > > With either option we will allow Spark SQL SELECT statements to
> > > > > use either plain or schema-qualified table names; this change
> > > > > would purely impact the API which is used to list available
> > > > > tables.
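> > > > >
> > > > > For example, both of these would work regardless of the option
> > > > > chosen (table names are illustrative):
> > > > >
> > > > >     spark.sql("SELECT id FROM myTable")          // unqualified
> > > > >     spark.sql("SELECT id FROM mySchema.myTable") // qualified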
> > > > >
> > > > > Any opinions would be welcome.
> > > > >
> > > > > Thanks,
> > > > > Stuart.
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9228
> > > > > [2] https://github.com/apache/ignite/pull/4551
> >
>
