[
https://issues.apache.org/jira/browse/SPARK-52854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sanford Ryza updated SPARK-52854:
---------------------------------
Description:
Setting the Spark session's default catalog and database is an imperative
construct that can cause friction and unexpected behavior within a
pipeline declaration. For example, it makes pipeline behavior sensitive to the
order in which Python files are imported, which can be unpredictable.
Mechanisms already exist for setting the Spark catalog and database for
pipelines:
* The catalog and database settings in the pipeline spec
* The name argument on the dataset decorators, which accepts a fully qualified name
Raising an error when someone tries to set a catalog or database in
this situation would avoid this unpredictable behavior.
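
The import-order sensitivity described above can be illustrated with a small,
self-contained simulation. This is only a sketch: FakeSession, file_a, file_b,
file_b_qualified, and run are hypothetical stand-ins rather than Spark or
Declarative Pipelines APIs, and names are simplified to database.table instead
of catalog.database.table.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSession:
    """Hypothetical stand-in for a Spark session; tracks only the current database."""

    current_database: str = "default"
    registered: list = field(default_factory=list)

    def register(self, name: str) -> None:
        # Unqualified names resolve against whatever the current database is
        # at the moment of registration.
        qualified = name if "." in name else f"{self.current_database}.{name}"
        self.registered.append(qualified)


def file_a(session: FakeSession) -> None:
    # Simulates a definition file that imperatively changes the default database.
    session.current_database = "sales"


def file_b(session: FakeSession) -> None:
    # Simulates a definition file that declares a dataset with an unqualified name.
    session.register("orders")


def file_b_qualified(session: FakeSession) -> None:
    # Same dataset, declared with a fully qualified name instead.
    session.register("reporting.orders")


def run(files) -> list:
    # "Import" each simulated file in the given order against a fresh session.
    session = FakeSession()
    for load_file in files:
        load_file(session)
    return session.registered
```

In this sketch, the unqualified dataset lands in a different database depending
on which file runs first, while the fully qualified declaration resolves to the
same name in either order.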
The ways to set the catalog and database from Python are:
* spark.catalog.setCurrentCatalog
* spark.sql("USE CATALOG")
* spark.catalog.setCurrentDatabase
* spark.sql("USE DATABASE")
The spark.sql usages will be covered by a separate JIRA.
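
One possible shape for the proposed check is sketched below. It is not the
implementation planned for this issue: FakeCatalog stands in for spark.catalog,
and PipelineDefinitionError, BLOCKED, and block_session_mutation are
hypothetical names introduced only for illustration. The sketch assumes the
setters can be shadowed for the duration of definition-file loading.

```python
from contextlib import contextmanager


class FakeCatalog:
    """Hypothetical stand-in for spark.catalog; not the real PySpark class."""

    def __init__(self) -> None:
        self.current_catalog = "spark_catalog"
        self.current_database = "default"

    def setCurrentCatalog(self, name: str) -> None:
        self.current_catalog = name

    def setCurrentDatabase(self, name: str) -> None:
        self.current_database = name


class PipelineDefinitionError(RuntimeError):
    """Hypothetical error type for calls disallowed in definition files."""


# The two Python setters named in this issue.
BLOCKED = ("setCurrentCatalog", "setCurrentDatabase")


@contextmanager
def block_session_mutation(catalog):
    """Make the blocked setters raise while pipeline definitions are loading."""

    def make_blocker(method_name):
        def blocker(*args, **kwargs):
            raise PipelineDefinitionError(
                f"{method_name} is not allowed in a pipeline definition file; "
                "use the pipeline spec or a fully qualified dataset name instead."
            )

        return blocker

    for method_name in BLOCKED:
        # Instance attributes shadow the class methods while the guard is active.
        setattr(catalog, method_name, make_blocker(method_name))
    try:
        yield catalog
    finally:
        for method_name in BLOCKED:
            # Removing the instance attribute restores the class method.
            delattr(catalog, method_name)
```

Inside the with block, either setter raises PipelineDefinitionError; once the
block exits, the original behavior is restored, so code outside definition
files is unaffected.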
> Prevent setCurrentDatabase and setCurrentCatalog within Pipelines Python
> definition files
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-52854
> URL: https://issues.apache.org/jira/browse/SPARK-52854
> Project: Spark
> Issue Type: Sub-task
> Components: Declarative Pipelines
> Affects Versions: 4.1.0
> Reporter: Sandy Ryza
> Assignee: Jacky Wang
> Priority: Major