[jira] [Created] (SPARK-38642) spark-sql can not enable isolatedClientLoader to extend dsv2 catalog when using builtin hiveMetastoreJar

2022-03-24 Thread suheng.cloud (Jira)
suheng.cloud created SPARK-38642:


 Summary: spark-sql can not enable isolatedClientLoader to extend 
dsv2 catalog when using builtin hiveMetastoreJar
 Key: SPARK-38642
 URL: https://issues.apache.org/jira/browse/SPARK-38642
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1, 3.1.2
Reporter: suheng.cloud


Hi, all:

I use IsolatedClientLoader to enable a DataSource V2 catalog on Hive. It works 
well via the API and spark-shell, but fails with the spark-sql CLI.

After digging into the source, I found that SparkSQLCLIDriver (spark-sql) 
initializes differently: it creates a CliSessionState that is reused throughout 
the CLI's lifecycle.

As a result, the IsolatedClientLoader factory in HiveUtils turns isolation off 
when it encounters a global SessionState of that type. In my case, 
namespaces/tables from another Hive catalog are not recognized, because the 
CliSessionState held by the SparkSession is always the one used to connect.

I noticed [SPARK-21428|https://issues.apache.org/jira/browse/SPARK-21428], but 
since the DataSource V2 API is becoming more popular, shouldn't 
SparkSQLCLIDriver be adjusted for this as well?
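For reference, a simplified sketch of the decision in HiveUtils (paraphrased from the Spark 3.1.x source; parameter names and order are approximate, not verbatim):

```scala
// Paraphrased sketch of HiveUtils.newClientForMetadata (Spark 3.1.x).
// With spark.sql.hive.metastore.jars=builtin, isolation is turned off
// whenever the global SessionState is a CliSessionState -- which is
// always the case when running under the spark-sql CLI.
val isCliSessionState = SessionState.get().isInstanceOf[CliSessionState]

new IsolatedClientLoader(
  version = IsolatedClientLoader.hiveVersion(hiveMetastoreVersion),
  sparkConf = conf,
  hadoopConf = hadoopConf,
  config = configurations,
  isolationOn = !isCliSessionState,  // <- isolation disabled for spark-sql
  barrierPrefixes = hiveMetastoreBarrierPrefixes,
  sharedPrefixes = hiveMetastoreSharedPrefixes)
```

So even when a second Hive catalog is configured, the CLI path never gets an isolated client loader for it.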

my env:

spark-3.1.2
hadoop-cdh5.13.0
hive-2.3.6
for each v2 catalog we set spark.sql.hive.metastore.jars=builtin (we have no 
permission to deploy jars on the target clusters)

As a workaround, we currently have to deploy the jars on HDFS and use the 
'path' approach, which causes a significant delay in catalog initialization.

Any help is appreciated, thanks.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36776) Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan

2021-09-17 Thread suheng.cloud (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416987#comment-17416987
 ] 

suheng.cloud commented on SPARK-36776:
--

Thank you Hyukjin & Huaxin~

> Partition filter of DataSourceV2ScanRelation can not push down when select 
> none dataSchema from FileScan
> 
>
> Key: SPARK-36776
> URL: https://issues.apache.org/jira/browse/SPARK-36776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: suheng.cloud
>Priority: Major
>
> In the PruneFileSourcePartitions rule, FileScan::withFilters is called to 
> push down the partition pruning filter (and this is the only place that 
> function is called), but it is guarded by the constraint 
> "scan.readDataSchema.nonEmpty" 
>  [source code 
> here|https://github.com/apache/spark/blob/de351e30a90dd988b133b3d00fa6218bfcaba8b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala#L114]
>  We use Spark SQL with a custom catalog and execute a count query such as: 
> select count(*) from catalog.db.tbl where dt='0812' (the same happens in 
> other SQL statements that select no columns from the table), where dt is a 
> partition key.
> In this case scan.readDataSchema is indeed empty, so no partition pruning is 
> performed, which causes a full scan of all partitions.






[jira] [Created] (SPARK-36776) Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan

2021-09-16 Thread suheng.cloud (Jira)
suheng.cloud created SPARK-36776:


 Summary: Partition filter of DataSourceV2ScanRelation can not push 
down when select none dataSchema from FileScan
 Key: SPARK-36776
 URL: https://issues.apache.org/jira/browse/SPARK-36776
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2
Reporter: suheng.cloud


In the PruneFileSourcePartitions rule, FileScan::withFilters is called to push 
down the partition pruning filter (and this is the only place that function is 
called), but it is guarded by the constraint "scan.readDataSchema.nonEmpty" 
 [source code 
here|https://github.com/apache/spark/blob/de351e30a90dd988b133b3d00fa6218bfcaba8b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala#L114]
 We use Spark SQL with a custom catalog and execute a count query such as: 
select count(*) from catalog.db.tbl where dt='0812' (the same happens in other 
SQL statements that select no columns from the table), where dt is a partition 
key.

In this case scan.readDataSchema is indeed empty, so no partition pruning is 
performed, which causes a full scan of all partitions.
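The guard can be seen in the rule's pattern match (paraphrased from the linked source, not verbatim):

```scala
// Paraphrased from PruneFileSourcePartitions (Spark 3.1.2), DSv2 branch.
// The rule only fires when readDataSchema is non-empty, so a query like
//   select count(*) from catalog.db.tbl where dt='0812'
// -- which reads no data columns -- never gets its partition filter
// pushed down, and every partition is scanned.
case op @ PhysicalOperation(projects, filters,
    v2Relation @ DataSourceV2ScanRelation(_, scan: FileScan, output))
    if filters.nonEmpty && scan.readDataSchema.nonEmpty =>
  // ... partition pruning via scan.withFilters(...) happens only here ...
```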






[jira] [Created] (SPARK-36706) OverwriteByExpression conversion in DataSourceV2Strategy use wrong deleteExpr translation

2021-09-09 Thread suheng.cloud (Jira)
suheng.cloud created SPARK-36706:


 Summary: OverwriteByExpression conversion in DataSourceV2Strategy 
use wrong deleteExpr translation
 Key: SPARK-36706
 URL: https://issues.apache.org/jira/browse/SPARK-36706
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2
Reporter: suheng.cloud


Spark version: release-3.1.2

We develop a Hive DataSource V2 plugin to support joins among multiple Hive 
clusters, and found what may be a bug in the OverwriteByExpression conversion.

Code location: 
https://github.com/apache/spark/blob/de351e30a90dd988b133b3d00fa6218bfcaba8b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala#L216

The wrong parameter `deleteExpr` is used there, which results in duplicate 
filters.
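For illustration, a paraphrase of the pattern at the linked line (not verbatim; names approximate):

```scala
// Paraphrased from DataSourceV2Strategy (Spark 3.1.2), OverwriteByExpression case.
case OverwriteByExpression(r: DataSourceV2Relation, deleteExpr, query, options, _) =>
  val filters = splitConjunctivePredicates(deleteExpr).map { filter =>
    // Suspected bug: the whole conjunction `deleteExpr` is translated for
    // every split predicate, instead of the individual `filter`, so each
    // element of `filters` ends up holding the full (duplicated) condition.
    DataSourceStrategy.translateFilter(deleteExpr, supportNestedPredicatePushdown = true)
      .getOrElse(throw new AnalysisException(
        s"Cannot translate expression to source filter: $filter"))
  }.toArray
```

Passing `filter` rather than `deleteExpr` to translateFilter would appear to avoid the duplication.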


