[ 
https://issues.apache.org/jira/browse/SPARK-52854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanford Ryza updated SPARK-52854:
---------------------------------
    Description: 
Setting the Spark session's default catalog and database is an imperative 
construct that can cause friction and unexpected behavior within a 
pipeline declaration. For example, it makes pipeline behavior sensitive to the 
order in which Python files are imported, which can be unpredictable. Pipelines 
already have mechanisms for setting the Spark catalog and database:
 * The catalog and database settings in the pipeline spec
 * The name argument on the dataset decorators, which accepts a fully-qualified name

Raising an error when someone tries to set a catalog or database in 
this situation would avoid this unpredictable behavior.

The ways to set the catalog and database from Python are:
 * spark.catalog.setCurrentCatalog
 * spark.sql("USE CATALOG")
 * spark.catalog.setCurrentDatabase
 * spark.sql("USE DATABASE")

This issue covers the two spark.catalog methods. The spark.sql usages will be 
covered by a separate JIRA.
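
The order sensitivity and the proposed error can be illustrated with a small, self-contained sketch. It does not use Spark: FakeCatalog is a hypothetical stand-in for spark.catalog that only tracks the current catalog and database, and block_setters, qualify, load, file_a and file_b are illustrative names invented here, not part of PySpark or of the actual fix. The real implementation may look quite different.

```python
from contextlib import contextmanager


class FakeCatalog:
    """Hypothetical stand-in for spark.catalog (not the PySpark class)."""

    def __init__(self, catalog="spark_catalog", database="default"):
        self._catalog = catalog
        self._database = database
        self._blocked = False

    def currentCatalog(self):
        return self._catalog

    def currentDatabase(self):
        return self._database

    def setCurrentCatalog(self, name):
        self._check("setCurrentCatalog")
        self._catalog = name

    def setCurrentDatabase(self, name):
        self._check("setCurrentDatabase")
        self._database = name

    def _check(self, method):
        # Reject the call only while definition files are being loaded.
        if self._blocked:
            raise RuntimeError(
                f"{method} is not allowed in a pipeline definition file; "
                "use the pipeline spec or a fully-qualified dataset name."
            )

    @contextmanager
    def block_setters(self):
        """Assumed guard: setters raise for the duration of the block."""
        previous = self._blocked
        self._blocked = True
        try:
            yield
        finally:
            self._blocked = previous


def qualify(catalog, name):
    """Fill in missing name parts from the session's current defaults."""
    parts = name.split(".")
    if len(parts) == 3:
        return name
    if len(parts) == 2:
        return f"{catalog.currentCatalog()}.{name}"
    return f"{catalog.currentCatalog()}.{catalog.currentDatabase()}.{name}"


def file_a(catalog, registered):
    # Simulates a definition file that imperatively changes the database.
    catalog.setCurrentDatabase("sales")


def file_b(catalog, registered):
    # Simulates a definition file that declares an unqualified dataset.
    registered.append(qualify(catalog, "orders"))


def load(files, guarded=False):
    """Run the simulated definition files in order; return resolved names."""
    catalog = FakeCatalog()
    registered = []
    if guarded:
        with catalog.block_setters():
            for f in files:
                f(catalog, registered)
    else:
        for f in files:
            f(catalog, registered)
    return registered
```

Without the guard, load([file_a, file_b]) and load([file_b, file_a]) resolve the same "orders" dataset to different databases, which is the import-order problem described above. With guarded=True, the call in file_a raises instead of silently changing how later files resolve names.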


> Prevent setCurrentDatabase and setCurrentCatalog within Pipelines Python 
> definition files
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-52854
>                 URL: https://issues.apache.org/jira/browse/SPARK-52854
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Declarative Pipelines
>    Affects Versions: 4.1.0
>            Reporter: Sandy Ryza
>            Assignee: Jacky Wang
>            Priority: Major


