Re: Table Names in Spark Catalog

Valentin Kulichenko Fri, 24 Aug 2018 10:24:02 -0700

Nikolay,

If there are multiple configurations in the XML, IgniteContext will always
use only one of them. It looks like the current approach simply doesn't
work. I propose to report the schema name as 'database' in Spark. If there
are multiple clients, you would create multiple sessions and multiple catalogs.
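
To make the proposal concrete, here is a minimal sketch in plain Python. It
does not call the real Spark or Ignite APIs: the `Table` tuple, the
`ignite_schemas` dictionary and both functions are hypothetical stand-ins
that only model how a catalog scoped to one Ignite client could report each
schema as a Spark 'database'.

```python
from collections import namedtuple

# Hypothetical stand-in for a catalog entry that carries a table name
# and the 'database' it is reported under (not the real Spark class).
Table = namedtuple("Table", ["name", "database"])

# Assumed metadata for ONE Ignite client: schema name -> table names.
# The schema and table names are illustrative only.
ignite_schemas = {
    "PUBLIC": ["person", "city"],
    "mySchema": ["person"],
}


def list_databases(schemas):
    """Report every Ignite schema as one Spark 'database'."""
    return sorted(schemas)


def list_tables(schemas, database):
    """List the tables of one schema, tagged with that schema as database."""
    return [Table(name, database) for name in sorted(schemas[database])]


# Two tables named "person" no longer clash: each lives in its own database.
print(list_databases(ignite_schemas))
print(list_tables(ignite_schemas, "PUBLIC"))
print(list_tables(ignite_schemas, "mySchema"))
```

With more than one client, each session would hold its own mapping of this
kind, so one catalog never sees the schemas of another cluster.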


Makes sense?

-Val

On Fri, Aug 24, 2018 at 12:33 AM Nikolay Izhikov <[email protected]>
wrote:

> Hello, Valentin.
>
> > catalog exist in scope of a single IgniteSparkSession (and therefore
> > single IgniteContext and single Ignite instance)?
>
> Yes.
> Actually, I was thinking about the use case where we have several Ignite
> configurations in one XML file.
> Now I see that this may be too rare a use case to support.
>
> Stuart, Valentin, what is your proposal?
>
> On Wed, 22/08/2018 at 08:56 -0700, Valentin Kulichenko wrote:
> > Nikolay,
> >
> > Whatever we decide on would be right :) Basically, we need to answer this
> > question: does the catalog exist in scope of a single IgniteSparkSession
> > (and therefore single IgniteContext and single Ignite instance)? In other
> > words, in case of a rare use case when a single Spark application connects
> > to multiple Ignite clusters, would there be a catalog created per cluster?
> >
> > If the answer is yes, current logic doesn't make sense.
> >
> > -Val
> >
> >
> > On Wed, Aug 22, 2018 at 1:44 AM Nikolay Izhikov <[email protected]> wrote:
> >
> > > Hello, Valentin.
> > >
> > > > I believe we should get rid of this logic and use Ignite schema name as
> > > > database name in Spark's catalog.
> > >
> > > When I developed the Ignite integration with Spark Data Frames, I used the
> > > following abstraction described by Vladimir Ozerov:
> > >
> > > "1) Let's consider Ignite cluster as a single database ("catalog" in ANSI
> > > SQL'92 terms)." [1]
> > >
> > > Was I wrong? If so, let's fix it.
> > >
> > > [1]
> > > http://apache-ignite-developers.2346864.n4.nabble.com/SQL-usability-catalogs-schemas-and-tables-td17148.html
> > >
> > > On Wed, 22/08/2018 at 09:26 +0100, Stuart Macdonald wrote:
> > > > Hi Val, yes that's correct. I'd be happy to make the change to have the
> > > > database reference the schema if Nikolay agrees. (I'll first need to do a
> > > > bit of research into how to obtain the list of all available schemata...)
> > > >
> > > > Thanks,
> > > > Stuart.
> > > >
> > > > On Tue, Aug 21, 2018 at 9:43 PM, Valentin Kulichenko <
> > > > [email protected]> wrote:
> > > >
> > > > > Stuart,
> > > > >
> > > > > Thanks for pointing this out, I was not aware that we use Spark database
> > > > > concept this way. Actually, this confuses me a lot. As far as I understand,
> > > > > catalog is created in the scope of a particular IgniteSparkSession, which
> > > > > in turn is assigned to a particular IgniteContext and therefore single
> > > > > Ignite client. If that's the case, I don't think it should be aware of
> > > > > other Ignite clients that are connected to other clusters. This doesn't
> > > > > look like correct behavior to me, not to mention that with this approach
> > > > > having multiple databases would be a very rare case. I believe we should
> > > > > get rid of this logic and use Ignite schema name as database name in
> > > > > Spark's catalog.
> > > > >
> > > > > Nikolay, what do you think?
> > > > >
> > > > > -Val
> > > > >
> > > > > On Tue, Aug 21, 2018 at 8:17 AM Stuart Macdonald <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Nikolay, Val,
> > > > > >
> > > > > > The JDBC Spark datasource[1] -- as far as I can tell -- has no
> > > > > > ExternalCatalog implementation, it just uses the database specified in
> > > > > > the JDBC URL. So I don't believe there is any way to call listTables() or
> > > > > > listDatabases() for JDBC provider.
> > > > > >
> > > > > > The Hive ExternalCatalog[2] makes the distinction between database and
> > > > > > table using the actual database and table mechanisms built into the
> > > > > > catalog, which is fine because Hive has the clear distinction and
> > > > > > hierarchy of databases and tables.
> > > > > >
> > > > > > *However* Ignite already uses the "database" concept in the Ignite
> > > > > > ExternalCatalog[3] to mean the name of an Ignite instance. So in Ignite we
> > > > > > have instances containing schemas containing tables, and Spark only has
> > > > > > the concept of databases and tables so it seems like either we ignore one
> > > > > > of the three Ignite concepts or combine two of them into database or
> > > > > > table. The current implementation in the pull request combines Ignite
> > > > > > schema and table attributes into the Spark table attribute.
> > > > > >
> > > > > > Stuart.
> > > > > >
> > > > > > [1]
> > > > > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
> > > > > > [2]
> > > > > > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
> > > > > > [3]
> > > > > > https://github.com/apache/ignite/blob/master/modules/spark/src/main/scala/org/apache/spark/sql/ignite/IgniteExternalCatalog.scala
> > > > > >
> > > > > > On Tue, Aug 21, 2018 at 9:31 AM, Nikolay Izhikov <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello, Stuart.
> > > > > > >
> > > > > > > Can you do some research and find out how schema is handled in Data
> > > > > > > Frames for a regular RDBMS such as Oracle, MySQL, etc?
> > > > > > >
> > > > > > > On Mon, 20/08/2018 at 15:37 -0700, Valentin Kulichenko wrote:
> > > > > > > > Stuart, Nikolay,
> > > > > > > >
> > > > > > > > I see that the 'Table' class (returned by listTables method) has a
> > > > > > > > 'database' field. Can we use this one to report schema name?
> > > > > > > >
> > > > > > > > In any case, I think we should look into how this is done in data
> > > > > > > > source implementations for other databases. Any relational database
> > > > > > > > has a notion of schema, and I'm sure Spark integrations take this into
> > > > > > > > account somehow.
> > > > > > > >
> > > > > > > > -Val
> > > > > > > >
> > > > > > > > On Mon, Aug 20, 2018 at 6:12 AM Nikolay Izhikov <[email protected]>
> > > > > > > > wrote:
> > > > > > > > > Hello, Stuart.
> > > > > > > > >
> > > > > > > > > Personally, I think we should change current tables naming and return
> > > > > > > > > table in form of `schema.table`.
> > > > > > > > >
> > > > > > > > > Valentin, could you share your opinion?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, 20/08/2018 at 10:04 +0100, Stuart Macdonald wrote:
> > > > > > > > > > Igniters,
> > > > > > > > > >
> > > > > > > > > > While reviewing the changes for IGNITE-9228 [1,2], Nikolay and I are
> > > > > > > > > > discussing whether to introduce a change which may impact backwards
> > > > > > > > > > compatibility; Nikolay suggested we take the discussion to this list.
> > > > > > > > > >
> > > > > > > > > > Ignite implements a custom Spark catalog which provides an API by which
> > > > > > > > > > Spark users can list the tables which are available in Ignite which
> > > > > > > > > > can be queried via Spark SQL. Currently that table name list includes
> > > > > > > > > > just the names of the tables, but IGNITE-9228 is introducing a change
> > > > > > > > > > which allows optional prefixing of schema names to table names to
> > > > > > > > > > disambiguate multiple tables with the same name in different schemas.
> > > > > > > > > > For the "list tables" API we therefore have two options:
> > > > > > > > > >
> > > > > > > > > > 1. List the tables using both their table names and schema-qualified
> > > > > > > > > > table names (eg. [ "myTable", "mySchema.myTable" ]) even though they
> > > > > > > > > > are the same underlying table. This retains backwards compatibility
> > > > > > > > > > with users who expect "myTable" to appear in the catalog.
> > > > > > > > > > 2. List the tables using only their schema-qualified names. This
> > > > > > > > > > eliminates duplication of names in the catalog but will potentially
> > > > > > > > > > break compatibility with users who expect the table name in the
> > > > > > > > > > catalog.
> > > > > > > > > >
> > > > > > > > > > With either option we will allow for Spark SQL SELECT statements to use
> > > > > > > > > > either table name or schema-qualified table names, this change would
> > > > > > > > > > purely impact the API which is used to list available tables.
> > > > > > > > > >
> > > > > > > > > > Any opinions would be welcome.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Stuart.
> > > > > > > > > >
> > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9228
> > > > > > > > > > [2] https://github.com/apache/ignite/pull/4551
