Hello, Stuart. Do you need any assistance with this task from me or any other community member?
On Tue, 04/09/2018 at 19:03 +0300, Nikolay Izhikov wrote:

Hello, Stuart.

Sorry for the silence; I was swamped over the last couple of days.

I think you can go ahead and implement the suggested solution. I'm -0 on it, so there is no block from my side, but I'm still not happy with the abstractions :).

On Mon, 03/09/2018 at 09:35 +0100, Stuart Macdonald wrote:

Nikolay, Val, it would be good if we could reach agreement here so that I can make the necessary modifications before the 2.7 cutoff.

Nikolay, would you be comfortable if I went ahead and made database=schema?

Stuart.

On Mon, Aug 27, 2018 at 10:22 PM Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:

Hi Nikolay,

I think it's actually unfortunate that Spark uses the term "database" here, as in my view it essentially refers to a schema. Usually a database is something you create a physical connection to, and the connection is bound to that database; to connect to another database you need to create a new connection. In Spark, however, you can switch between "databases" within a single session, which looks really weird to me because that is usually a characteristic of a schema. Having said that, I understand your concern, but I don't think there is an ideal solution.

As for your approach, I still don't understand how it will allow us to fully support schemas in the catalog:
- How will you get a list of tables within a particular schema? In other words, what would the listTables() method return?
- How will you switch between schemas?
- Etc.

I still think assuming database=schema is the best we can do here, but I would be happy to hear other opinions from community members.

OPTION_SCHEMA should definitely be introduced though (I thought we already did that, no?).
CREATE TABLE will be supported with this ticket: https://issues.apache.org/jira/browse/IGNITE-5780. For now we will have to throw an exception if a custom schema name is provided when creating a Spark session but the table does not exist yet.

-Val

On Sun, Aug 26, 2018 at 7:56 AM Nikolay Izhikov <nizhi...@apache.org> wrote:

Igniters,

Personally, I don't like the solution with database == schema name.

1. I think we should try to use the right abstractions, and schema == database doesn't sound right to me. Do we really want to answer all of our users with something like:

- "How can I change the Ignite SQL schema?"
- "That's obvious, just use setDatabase("MY_SCHEMA_NAME")."

2. I think we restrict the whole solution with that decision. If Ignite supports multiple databases in the future, we simply won't have a place for them.

I think we should do the following:

1. IgniteExternalCatalog should be able to return *all* tables within the Ignite instance. We shouldn't restrict the table list by schema by default, and we should return tables with their schema name: `schema.table`.

2. We should introduce `OPTION_SCHEMA` for a DataFrame to specify a schema.

There is an issue with the second step: we can't use a schema name in the `CREATE TABLE` clause. This is a restriction of current Ignite SQL.

I propose the following:

1. For all write modes that require the creation of a table, we should disallow use of tables outside of `SQL_PUBLIC` and use of `OPTION_SCHEMA`, and throw a proper exception in this case.

2. Create a ticket to support `CREATE TABLE` with a custom schema name.

3.
After resolving the ticket from step 2, we can add full support for custom schemas to the Spark integration.

4. We should throw an exception if a user tries to use setDatabase.

Does that make sense to you?

On Sun, 26/08/2018 at 14:09 +0100, Stuart Macdonald wrote:

I'll go ahead and make the changes to represent the schema name as the database name for the purposes of the Spark catalog.

If anyone knows of an existing way to list all available schemata within an Ignite instance, please let me know; otherwise the first task will be creating that mechanism.

Stuart.

On Fri, Aug 24, 2018 at 6:23 PM Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:

Nikolay,

If there are multiple configurations in the XML, IgniteContext will always use only one of them, so it looks like the current approach simply doesn't work. I propose to report the schema name as the 'database' in Spark. If there are multiple clients, you would create multiple sessions and multiple catalogs.

Makes sense?

-Val

On Fri, Aug 24, 2018 at 12:33 AM Nikolay Izhikov <nizhi...@apache.org> wrote:

Hello, Valentin.

> catalog exist in scope of a single IgniteSparkSession (and therefore single IgniteContext and single Ignite instance)?

Yes. Actually, I was thinking about the use case where we have several Ignite configurations in one XML file. Now I see that this may be too rare a use case to support.

Stuart, Valentin, what is your proposal?
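Nikolay's proposal above can be sketched as DataFrame options. This is illustrative only: `FORMAT_IGNITE`, `OPTION_TABLE`, and `OPTION_CONFIG_FILE` exist in the ignite-spark module's `IgniteDataFrameSettings` today, but `OPTION_SCHEMA` is the proposed addition, and the config path is hypothetical, so the final API may differ.

```scala
// Sketch of the proposed OPTION_SCHEMA usage; not the final API.
import org.apache.ignite.spark.IgniteDataFrameSettings._
import org.apache.spark.sql.SparkSession

object SchemaOptionSketch extends App {
  val spark = SparkSession.builder()
    .appName("ignite-schema-sketch")
    .master("local")
    .getOrCreate()

  // Step 2 of the proposal: select the schema explicitly instead of
  // encoding it into the table name.
  val persons = spark.read
    .format(FORMAT_IGNITE)
    .option(OPTION_CONFIG_FILE, "ignite-config.xml") // hypothetical config path
    .option(OPTION_TABLE, "person")
    .option("schema", "MY_SCHEMA") // proposed OPTION_SCHEMA, name assumed
    .load()

  // Step 1 of the write-mode restriction: a save that would require
  // CREATE TABLE outside SQL_PUBLIC (or with the schema option set)
  // would throw a proper exception until IGNITE-5780 is resolved.
}
```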
On Wed, 22/08/2018 at 08:56 -0700, Valentin Kulichenko wrote:

Nikolay,

Whatever we decide on would be right :) Basically, we need to answer this question: does the catalog exist in the scope of a single IgniteSparkSession (and therefore a single IgniteContext and a single Ignite instance)? In other words, in the rare case where a single Spark application connects to multiple Ignite clusters, would there be a catalog created per cluster?

If the answer is yes, the current logic doesn't make sense.

-Val

On Wed, Aug 22, 2018 at 1:44 AM Nikolay Izhikov <nizhi...@apache.org> wrote:

Hello, Valentin.

> I believe we should get rid of this logic and use Ignite schema name as database name in Spark's catalog.

When I developed the Ignite integration with Spark Data Frames, I used the following abstraction described by Vladimir Ozerov:

"1) Let's consider Ignite cluster as a single database ("catalog" in ANSI SQL'92 terms)." [1]

Was I wrong? If yes, let's fix it.
[1] http://apache-ignite-developers.2346864.n4.nabble.com/SQL-usability-catalogs-schemas-and-tables-td17148.html

On Wed, 22/08/2018 at 09:26 +0100, Stuart Macdonald wrote:

Hi Val, yes that's correct. I'd be happy to make the change to have the database reference the schema if Nikolay agrees. (I'll first need to do a bit of research into how to obtain the list of all available schemata...)

Thanks,
Stuart.

On Tue, Aug 21, 2018 at 9:43 PM, Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:

Stuart,

Thanks for pointing this out; I was not aware that we use the Spark database concept this way. Actually, this confuses me a lot. As far as I understand, the catalog is created in the scope of a particular IgniteSparkSession, which in turn is assigned to a particular IgniteContext and therefore a single Ignite client. If that's the case, I don't think it should be aware of other Ignite clients that are connected to other clusters.
This doesn't look like correct behavior to me, not to mention that with this approach having multiple databases would be a very rare case. I believe we should get rid of this logic and use the Ignite schema name as the database name in Spark's catalog.

Nikolay, what do you think?

-Val

On Tue, Aug 21, 2018 at 8:17 AM Stuart Macdonald <stu...@stuwee.org> wrote:

Nikolay, Val,

The JDBC Spark datasource [1] -- as far as I can tell -- has no ExternalCatalog implementation; it just uses the database specified in the JDBC URL. So I don't believe there is any way to call listTables() or listDatabases() for the JDBC provider.

The Hive ExternalCatalog [2] makes the distinction between database and table using the actual database and table mechanisms built into the catalog, which is fine because Hive has a clear distinction and hierarchy of databases and tables.
*However*, Ignite already uses the "database" concept in the Ignite ExternalCatalog [3] to mean the name of an Ignite instance. So in Ignite we have instances containing schemas containing tables, while Spark only has the concepts of databases and tables, so it seems we must either ignore one of the three Ignite concepts or combine two of them into database or table. The current implementation in the pull request combines the Ignite schema and table attributes into the Spark table attribute.

Stuart.
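The combination described above (folding the Ignite schema into the Spark table attribute) can be sketched in a few lines of plain Scala. The names `IgniteTableRef` and `toSparkTableName` are illustrative only, not the actual IgniteExternalCatalog code:

```scala
// Sketch of the name mapping discussed above: Ignite has
// instance -> schema -> table, while Spark only has database -> table,
// so the pull request folds the Ignite schema into the Spark table name.
case class IgniteTableRef(schema: String, table: String)

def toSparkTableName(ref: IgniteTableRef): String =
  if (ref.schema.equalsIgnoreCase("PUBLIC")) ref.table // default schema stays unqualified
  else s"${ref.schema}.${ref.table}"                   // otherwise schema-qualified

println(toSparkTableName(IgniteTableRef("PUBLIC", "person")))   // person
println(toSparkTableName(IgniteTableRef("mySchema", "person"))) // mySchema.person
```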
[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
[2] https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
[3] https://github.com/apache/ignite/blob/master/modules/spark/src/main/scala/org/apache/spark/sql/ignite/IgniteExternalCatalog.scala

On Tue, Aug 21, 2018 at 9:31 AM, Nikolay Izhikov <nizhi...@apache.org> wrote:

Hello, Stuart.

Can you do some research and find out how the schema is handled in Data Frames for a regular RDBMS such as Oracle, MySQL, etc.?

On Mon, 20/08/2018 at 15:37 -0700, Valentin Kulichenko wrote:

Stuart, Nikolay,

I see that the 'Table' class (returned by the listTables method) has a 'database' field. Can we use this one to report the schema name?
In any case, I think we should look into how this is done in data source implementations for other databases. Any relational database has a notion of schema, and I'm sure Spark integrations take this into account somehow.

-Val

On Mon, Aug 20, 2018 at 6:12 AM Nikolay Izhikov <nizhi...@apache.org> wrote:

Hello, Stuart.

Personally, I think we should change the current table naming and return tables in the form `schema.table`.

Valentin, could you share your opinion?
On Mon, 20/08/2018 at 10:04 +0100, Stuart Macdonald wrote:

Igniters,

While reviewing the changes for IGNITE-9228 [1,2], Nikolay and I have been discussing whether to introduce a change which may impact backwards compatibility; Nikolay suggested we take the discussion to this list.

Ignite implements a custom Spark catalog which provides an API by which Spark users can list the tables available in Ignite that can be queried via Spark SQL.
Currently that table name list includes just the names of the tables, but IGNITE-9228 introduces a change which allows optional prefixing of schema names to table names, to disambiguate multiple tables with the same name in different schemas. For the "list tables" API we therefore have two options:

1. List the tables using both their table names and their schema-qualified table names (e.g. [ "myTable", "mySchema.myTable" ]), even though they are the same underlying table. This retains backwards compatibility with users who expect "myTable" to appear in the catalog.

2. List the tables using only their schema-qualified names.
This eliminates duplication of names in the catalog but will potentially break compatibility with users who expect the plain table name in the catalog.

With either option we will allow Spark SQL SELECT statements to use either table names or schema-qualified table names; this change would purely impact the API used to list the available tables.

Any opinions would be welcome.

Thanks,
Stuart.

[1] https://issues.apache.org/jira/browse/IGNITE-9228
[2] https://github.com/apache/ignite/pull/4551
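The two listing options can be sketched in plain Scala as follows. `Tbl` and the two functions are illustrative names, not the actual catalog implementation:

```scala
// Sketch of the two "list tables" options for IGNITE-9228.
case class Tbl(schema: String, name: String)

// Option 1: both the plain and the schema-qualified name, preserving
// backwards compatibility at the cost of duplicate entries per table.
def listTablesOption1(tables: Seq[Tbl]): Seq[String] =
  tables.flatMap(t => Seq(t.name, s"${t.schema}.${t.name}")).distinct

// Option 2: only the schema-qualified name, unambiguous but potentially
// breaking for users who expect the plain table name.
def listTablesOption2(tables: Seq[Tbl]): Seq[String] =
  tables.map(t => s"${t.schema}.${t.name}")

val tables = Seq(Tbl("mySchema", "myTable"))
println(listTablesOption1(tables)) // List(myTable, mySchema.myTable)
println(listTablesOption2(tables)) // List(mySchema.myTable)
```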