[jira] [Created] (SPARK-38642) spark-sql can not enable isolatedClientLoader to extend dsv2 catalog when using builtin hiveMetastoreJar
suheng.cloud created SPARK-38642:

Summary: spark-sql can not enable isolatedClientLoader to extend dsv2 catalog when using builtin hiveMetastoreJar
Key: SPARK-38642
URL: https://issues.apache.org/jira/browse/SPARK-38642
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.2.1, 3.1.2
Reporter: suheng.cloud

Hi, all:

I use IsolatedClientLoader to enable a DataSource V2 catalog on Hive. It works well through the API and spark-shell, but fails with the spark-sql CLI.

After digging into the source, I found that SparkSQLCLIDriver (spark-sql) initializes differently: it uses a CliSessionState that is reused for the lifetime of the CLI. Because of this, the IsolatedClientLoader creation path in HiveUtils turns isolation off when it encounters a global SessionState of that type. In my case, namespaces/tables from the other Hive catalog are never recognized, since the CliSessionState in the SparkSession is always the one used for the connection.

I am aware of [SPARK-21428|https://issues.apache.org/jira/browse/SPARK-21428], but since the DataSource V2 API is becoming more widely used, shouldn't SparkSQLCLIDriver be adjusted as well?

My environment:
spark-3.1.2
hadoop-cdh5.13.0
hive-2.3.6
For each V2 catalog we set spark.sql.hive.metastore.jars=builtin (we have no permission to deploy jars on the target clusters).

To work around this, we currently have to deploy the jars on HDFS and use the 'path' option instead, which causes a significant delay in catalog initialization.

Any help is appreciated, thanks.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
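For context, a minimal sketch of the kind of configuration involved. The catalog name (hive2) and the connector implementation class are hypothetical placeholders; spark.sql.hive.metastore.jars is the real Spark setting the report refers to:

```properties
# spark-defaults.conf sketch -- catalog name and implementation class
# are hypothetical; only spark.sql.hive.metastore.jars is a stock key.
spark.sql.catalog.hive2=com.example.HiveV2Catalog
spark.sql.catalog.hive2.hive.metastore.uris=thrift://other-cluster:9083
spark.sql.hive.metastore.jars=builtin
```

With jars=builtin, HiveUtils is expected to build an isolated client for the external metastore; the report is that the spark-sql CLI instead reuses its CliSessionState and disables isolation.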
[jira] [Commented] (SPARK-36776) Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan
[ https://issues.apache.org/jira/browse/SPARK-36776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416987#comment-17416987 ]

suheng.cloud commented on SPARK-36776:
--

Thank you Hyukjin & Huaxin~

> Partition filter of DataSourceV2ScanRelation can not push down when select
> none dataSchema from FileScan
[jira] [Created] (SPARK-36776) Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan
suheng.cloud created SPARK-36776:

Summary: Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan
Key: SPARK-36776
URL: https://issues.apache.org/jira/browse/SPARK-36776
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.1.2
Reporter: suheng.cloud

In the PruneFileSourcePartitions rule, FileScan::withFilters is called to push down the partition pruning filter (and this is the only place the function is called), but it is guarded by the constraint "scan.readDataSchema.nonEmpty" [source code here|https://github.com/apache/spark/blob/de351e30a90dd988b133b3d00fa6218bfcaba8b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala#L114]

We use Spark SQL with a custom catalog and run a count query such as: select count( * ) from catalog.db.tbl where dt='0812' (the same happens in other SQL statements that reference no columns of the table), where dt is a partition key.

In this case scan.readDataSchema is indeed empty, so no partition pruning is applied to the scan, which causes all partitions to be scanned in the end.
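The effect of the nonEmpty guard can be illustrated with a small self-contained sketch. This is not Spark code: prune, read_data_schema, and partition_filter are hypothetical stand-ins for the rule's logic, showing how an empty read schema (as in count( * )) skips pruning entirely:

```python
# Hypothetical stand-in for the pruning rule's guard: partition
# filters are only applied when the scan reads at least one data
# column, so a query with an empty read schema keeps every partition.
def prune(partitions, read_data_schema, partition_filter):
    if read_data_schema:  # the "readDataSchema.nonEmpty" guard
        return [p for p in partitions if partition_filter(p)]
    return partitions     # guard fails: no pruning, full scan

parts = [{"dt": "0811"}, {"dt": "0812"}]
flt = lambda p: p["dt"] == "0812"

# SELECT dt ... WHERE dt='0812': schema non-empty, pruning happens
assert prune(parts, ["dt"], flt) == [{"dt": "0812"}]
# SELECT count(*) ... WHERE dt='0812': schema empty, all partitions scanned
assert prune(parts, [], flt) == parts
```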
[jira] [Created] (SPARK-36706) OverwriteByExpression conversion in DataSourceV2Strategy use wrong deleteExpr translation
suheng.cloud created SPARK-36706:

Summary: OverwriteByExpression conversion in DataSourceV2Strategy use wrong deleteExpr translation
Key: SPARK-36706
URL: https://issues.apache.org/jira/browse/SPARK-36706
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.1.2
Reporter: suheng.cloud

Spark version: release-3.1.2.

We developed a Hive DataSource V2 plugin to support joins across multiple Hive clusters, and found what appears to be a bug in the OverwriteByExpression conversion code.

Debugging points at https://github.com/apache/spark/blob/de351e30a90dd988b133b3d00fa6218bfcaba8b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala#L216 where the wrong parameter `deleteExpr` is used, which results in duplicate filters.
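A hedged, non-Spark sketch of the bug class described here (split_conjuncts, translate, and the filter strings are all hypothetical): if the delete condition should be translated conjunct by conjunct, but the whole delete_expr is accidentally passed to the translation once per conjunct, the same filter comes out repeated:

```python
# Hypothetical illustration of the reported pattern: translating the
# whole expression once per conjunct (wrong parameter) instead of
# translating each conjunct individually yields duplicated filters.
def split_conjuncts(expr):
    return expr.split(" AND ")

def translate(expr):
    return f"Filter({expr})"

delete_expr = "dt = '0812' AND region = 'us'"

# Buggy: feeds the whole delete_expr to translate for every conjunct
buggy = [translate(delete_expr) for _ in split_conjuncts(delete_expr)]
# Correct: translates each conjunct on its own
fixed = [translate(c) for c in split_conjuncts(delete_expr)]

assert buggy == ["Filter(dt = '0812' AND region = 'us')"] * 2  # duplicates
assert fixed == ["Filter(dt = '0812')", "Filter(region = 'us')"]
```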