[ 
https://issues.apache.org/jira/browse/SPARK-55830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-55830:
-----------------------------------
    Labels: pull-request-available  (was: )

> JDBC predicate pushdown drops driver properties causing connection failure
> --------------------------------------------------------------------------
>
>                 Key: SPARK-55830
>                 URL: https://issues.apache.org/jira/browse/SPARK-55830
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.2
>            Reporter: Andreas Franz
>            Priority: Major
>              Labels: pull-request-available
>
> When reading data via JDBC using *predicate pushdown (predicates)*, Spark 
> appears to lose custom *JDBC driver properties* during connection creation.
>  
> This becomes problematic when using a *custom SocketFactory*, such as the 
> *Google Cloud SQL JDBC SocketFactory*, which relies on additional driver 
> properties for establishing the connection.
>  
> As a result, Spark fails to connect to the database when predicates are used.
>  
> Without predicates, the connection works correctly.
>  
> *Examples to reproduce:*
> {code:java}
> def get_jdbc_parameters():
>     return {
>         "url": f"jdbc:postgresql:///<DB_NAME>",
>         "socketFactory": "com.google.cloud.sql.postgres.SocketFactory",
>         "user": "<SERVICE ACCOUNT AUTHORIZED TO ACCESS CLOUDSQL DATABASE>",
>         "cloudSqlInstance": "<INSTANCE_CONNECTION_NAME>",
>         "cloudSqlTargetPrincipal": "<SERVICE ACCOUNT AUTHORIZED TO ACCESS 
> CLOUDSQL DATABASE>",
>         "enableIamAuth": "true",
>         "password": "disabled",
>         "sslmode": "disable",
>     }
> def get_table_name():
>     return "<TABLE>"
> # CASE 1: No predicates -> returns up to 10 rows as expected
> jdbc_parameters = get_jdbc_parameters()
> spark.read.jdbc(
>     url=jdbc_parameters.pop("url"),
>     table=get_table_name(),
>     properties=jdbc_parameters,
> ).limit(10).display()
> # CASE 2:  Empty predicated -> fails with exception below
> jdbc_parameters = get_jdbc_parameters()
> predicates = []
> spark.read.jdbc(
>     url=jdbc_parameters.pop("url"),
>     table=get_table_name(),
>     properties=jdbc_parameters,
>     predicates=predicates,
> ).limit(10).display()
> '''
> Py4JJavaError: An error occurred while calling o488.jdbc.
> : org.postgresql.util.PSQLException: Connection to :5432 refused. Check that 
> the hostname and port are correct and that the postmaster is accepting TCP/IP 
> connections.
> at 
> org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:342)
> at 
> org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
> at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
> at org.postgresql.Driver.makeConnection(Driver.java:443)
> at org.postgresql.Driver.connect(Driver.java:297)
> '''
> # CASE 3: Non-empty predicates -> fails with exception below
> jdbc_parameters = get_jdbc_parameters()
> predicates = ["id BETWEEN 1000 AND 1010 "]
> spark.read.jdbc(
>     url=jdbc_parameters.pop("url"),
>     table=get_table_name(),
>     properties=jdbc_parameters,
>     predicates=predicates,
> ).limit(10).display()
> '''
> Py4JJavaError: An error occurred while calling o488.jdbc.
> : org.postgresql.util.PSQLException: Connection to :5432 refused. Check that 
> the hostname and port are correct and that the postmaster is accepting TCP/IP 
> connections.
> at 
> org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:342)
> at 
> org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
> at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
> at org.postgresql.Driver.makeConnection(Driver.java:443)
> at org.postgresql.Driver.connect(Driver.java:297)
> '''
> # CASE 4: Predicates with JDBC parameters encoded in URL as workaround -> 
> only returns entries of the given predicate range
> jdbc_parameters = get_jdbc_parameters()
> predicates = ["id BETWEEN 1000 AND 1010 "]
> url = jdbc_parameters.pop("url")
> params = "&".join(f"{k}={v}" for k, v in jdbc_parameters.items())
> spark.read.jdbc(
>     url=f"{url}?{params}",
>     table=get_table_name(),
>     predicates=predicates,
> ).limit(10).display()
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to