[jira] [Created] (SPARK-42992) Introduce PySparkRuntimeError
Haejoon Lee created SPARK-42992: --- Summary: Introduce PySparkRuntimeError Key: SPARK-42992 URL: https://issues.apache.org/jira/browse/SPARK-42992 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Haejoon Lee Introduce PySparkRuntimeError to cover RuntimeError in a PySpark-specific way. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42991) Disable string +/- interval in ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-42991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707158#comment-17707158 ] Yuming Wang commented on SPARK-42991: - https://github.com/apache/spark/pull/40616 > Disable string +/- interval in ANSI mode > > > Key: SPARK-42991 > URL: https://issues.apache.org/jira/browse/SPARK-42991 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42991) Disable string type +/- interval in ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-42991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-42991: Summary: Disable string type +/- interval in ANSI mode (was: Disable string +/- interval in ANSI mode) > Disable string type +/- interval in ANSI mode > - > > Key: SPARK-42991 > URL: https://issues.apache.org/jira/browse/SPARK-42991 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-42972) ExecutorAllocationManager cannot allocate new instances when all executors down.
[ https://issues.apache.org/jira/browse/SPARK-42972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707143#comment-17707143 ] liang yu edited comment on SPARK-42972 at 3/31/23 7:13 AM: --- When using Structured Streaming with dynamic allocation enabled, there is a bug that can make the program hang. Here is how it happens: {code:scala} // Some comments here private def manageAllocation(): Unit = synchronized { logInfo(s"Managing executor allocation with ratios = [$scalingUpRatio, $scalingDownRatio]") if (batchProcTimeCount > 0) { val averageBatchProcTime = batchProcTimeSum / batchProcTimeCount val ratio = averageBatchProcTime.toDouble / batchDurationMs // When the ratio is lower than scalingDownRatio, the client will try to kill executors, but if all executors have died unexpectedly, the program will hang, because there are no executors to kill. logInfo(s"Average: $averageBatchProcTime, ratio = $ratio") if (ratio >= scalingUpRatio) { logDebug("Requesting executors") val numNewExecutors = math.max(math.round(ratio).toInt, 1) requestExecutors(numNewExecutors) } else if (ratio <= scalingDownRatio) { logDebug("Killing executors") killExecutor() } } batchProcTimeSum = 0 batchProcTimeCount = 0 // Then there will be no more batch jobs to complete, batchProcTimeCount will always be 0, and the program will be stuck in suspended animation. } {code} was (Author: JIRAUSER299608): When the ratio is lower than scalingDownRatio, the client will try to kill executors, but if all executors have died unexpectedly, the program will hang, because there are no executors to kill. Then there will be no more batch jobs to complete, batchProcTimeCount will always be 0, and the program will be stuck in suspended animation. > ExecutorAllocationManager cannot allocate new instances when all executors > down. > > > Key: SPARK-42972 > URL: https://issues.apache.org/jira/browse/SPARK-42972 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.2 >Reporter: Jiandan Yang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-42972) ExecutorAllocationManager cannot allocate new instances when all executors down.
[ https://issues.apache.org/jira/browse/SPARK-42972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707143#comment-17707143 ] liang yu edited comment on SPARK-42972 at 3/31/23 7:16 AM: --- When using Structured Streaming with dynamic allocation enabled, there is a bug that can make the program hang. Here is how it happens: {code:scala} // Some comments here private def manageAllocation(): Unit = synchronized { logInfo(s"Managing executor allocation with ratios = [$scalingUpRatio, $scalingDownRatio]") if (batchProcTimeCount > 0) { val averageBatchProcTime = batchProcTimeSum / batchProcTimeCount val ratio = averageBatchProcTime.toDouble / batchDurationMs // When the ratio is lower than scalingDownRatio, the client will try to kill executors, but if all executors have died unexpectedly, the program will hang, because there are no executors to kill. logInfo(s"Average: $averageBatchProcTime, ratio = $ratio") if (ratio >= scalingUpRatio) { logDebug("Requesting executors") val numNewExecutors = math.max(math.round(ratio).toInt, 1) requestExecutors(numNewExecutors) } else if (ratio <= scalingDownRatio) { logDebug("Killing executors") killExecutor() } } batchProcTimeSum = 0 batchProcTimeCount = 0 // Then there will be no more batch jobs to complete, batchProcTimeCount will always be 0, and the program will be stuck in suspended animation. } {code} When the ratio is lower than scalingDownRatio, the client will try to kill executors, but if all executors have died unexpectedly at the same time, the program will hang, because there are no executors to kill. Then there will be no more batch jobs to complete, batchProcTimeCount will always be 0, and the program will be stuck in suspended animation, because the last action was to kill executors and the requestExecutors function will never be triggered. was (Author: JIRAUSER299608): When using Structured Streaming with dynamic allocation enabled, there is a bug that can make the program hang. Here is how it happens: {code:scala} // Some comments here private def manageAllocation(): Unit = synchronized { logInfo(s"Managing executor allocation with ratios = [$scalingUpRatio, $scalingDownRatio]") if (batchProcTimeCount > 0) { val averageBatchProcTime = batchProcTimeSum / batchProcTimeCount val ratio = averageBatchProcTime.toDouble / batchDurationMs // When the ratio is lower than scalingDownRatio, the client will try to kill executors, but if all executors have died unexpectedly, the program will hang, because there are no executors to kill. logInfo(s"Average: $averageBatchProcTime, ratio = $ratio") if (ratio >= scalingUpRatio) { logDebug("Requesting executors") val numNewExecutors = math.max(math.round(ratio).toInt, 1) requestExecutors(numNewExecutors) } else if (ratio <= scalingDownRatio) { logDebug("Killing executors") killExecutor() } } batchProcTimeSum = 0 batchProcTimeCount = 0 // Then there will be no more batch jobs to complete, batchProcTimeCount will always be 0, and the program will be stuck in suspended animation. } {code} > ExecutorAllocationManager cannot allocate new instances when all executors > down. 
> > > Key: SPARK-42972 > URL: https://issues.apache.org/jira/browse/SPARK-42972 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.2 >Reporter: Jiandan Yang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42972) ExecutorAllocationManager cannot allocate new instances when all executors down.
[ https://issues.apache.org/jira/browse/SPARK-42972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707177#comment-17707177 ] liang yu commented on SPARK-42972: -- I created a PR: [PR-40621|https://github.com/apache/spark/pull/40621]. [~vkolpakov], please help me review it. > ExecutorAllocationManager cannot allocate new instances when all executors > down. > > > Key: SPARK-42972 > URL: https://issues.apache.org/jira/browse/SPARK-42972 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.2 >Reporter: Jiandan Yang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
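A minimal sketch of one possible direction for the fix, based only on the snippet quoted above; it is not taken from the linked PR, and `numRegisteredExecutors` is a hypothetical helper standing in for however the allocation manager tracks live executors. The idea: never kill the last executor, and request one when none are registered, so batches can complete again and the ratio-based logic resumes.

{code:scala}
// Hypothetical sketch only -- names mirror the snippet above, not necessarily Spark's sources.
private def manageAllocation(): Unit = synchronized {
  if (batchProcTimeCount > 0) {
    val averageBatchProcTime = batchProcTimeSum / batchProcTimeCount
    val ratio = averageBatchProcTime.toDouble / batchDurationMs
    if (ratio >= scalingUpRatio) {
      requestExecutors(math.max(math.round(ratio).toInt, 1))
    } else if (ratio <= scalingDownRatio && numRegisteredExecutors > 1) {
      // Keep at least one executor alive, otherwise no batch can ever complete
      // and this method would never see batchProcTimeCount > 0 again.
      killExecutor()
    }
  } else if (numRegisteredExecutors == 0) {
    // All executors were lost at once: request one so that batches can resume.
    requestExecutors(1)
  }
  batchProcTimeSum = 0
  batchProcTimeCount = 0
}
{code}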
[jira] [Created] (SPARK-42993) Make Torch Distributor support Spark Connect
Ruifeng Zheng created SPARK-42993: - Summary: Make Torch Distributor support Spark Connect Key: SPARK-42993 URL: https://issues.apache.org/jira/browse/SPARK-42993 Project: Spark Issue Type: Sub-task Components: Connect, ML Affects Versions: 3.5.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42994) Support sc.resources in Connect
Ruifeng Zheng created SPARK-42994: - Summary: Support sc.resources in Connect Key: SPARK-42994 URL: https://issues.apache.org/jira/browse/SPARK-42994 Project: Spark Issue Type: Sub-task Components: Connect, ML Affects Versions: 3.5.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42978) Derby&PG: RENAME cannot qualify a new-table-Name with a schema-Name.
[ https://issues.apache.org/jira/browse/SPARK-42978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-42978. -- Fix Version/s: 3.5.0 Assignee: Kent Yao Resolution: Fixed issue resolved by [https://github.com/apache/spark/pull/40602] > Derby&PG: RENAME cannot qualify a new-table-Name with a schema-Name. > - > > Key: SPARK-42978 > URL: https://issues.apache.org/jira/browse/SPARK-42978 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.5.0 > > > https://db.apache.org/derby/docs/10.2/ref/rrefnewtablename.html#rrefnewtablename -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42995) Migrate Spark Connect DataFrame errors into error class
Haejoon Lee created SPARK-42995: --- Summary: Migrate Spark Connect DataFrame errors into error class Key: SPARK-42995 URL: https://issues.apache.org/jira/browse/SPARK-42995 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Haejoon Lee We should migrate all errors into error classes to leverage the PySpark error framework. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42935) Optimze shuffle for union spark plan
[ https://issues.apache.org/jira/browse/SPARK-42935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Min updated SPARK-42935: - Description: The Union plan does not take full advantage of its children's output partitionings when its output partitioning can't match the parent plan's required distribution. For example, table1 and table2 are both bucketed tables with bucket column id and 100 buckets. We apply a row_number window function after unioning the two tables. {code:sql} create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table1 values(1, "s1"); insert into table1 values(2, "s2"); create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table2 values(1, "s3"); set spark.sql.shuffle.partitions=100; set spark.sql.unionRequiredDistributionPushdown.enabled=true; explain select *, row_number() over(partition by id order by name desc) id_row_number from (select * from table1 union all select * from table2);{code} The physical plan is {code:bash} AdaptiveSparkPlan isFinalPlan=false +- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#28, id#35, name#36 DESC NULLS LAST +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0 +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88] +- Union :- FileScan csv spark_catalog.default.table1id#35,name#36 +- FileScan csv spark_catalog.default.table2id#37,name#38 {code} Although the two tables are bucketed by the id column, there is still an Exchange after the union. The reason is that the Union plan's output partitioning is null. We can introduce a new idea to optimize away the exchange: # First, introduce a new RDD built from parent RDDs that have the same number of partitions; its ith partition corresponds to the ith partition of each parent RDD. # Then, push the required distribution down to the Union plan's children. If a child's output partitioning matches the required distribution, we can eliminate that child's shuffle. After these changes, the physical plan no longer contains an Exchange: {code:bash} AdaptiveSparkPlan isFinalPlan=false +- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#0, id#7, name#8 DESC NULLS LAST +- Sort id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST, false, 0 +- UnionZip ClusteredDistribution(ArrayBuffer(id#7),false,None), ClusteredDistribution(ArrayBuffer(id#9),false,None), hashpartitioning(id#7, 200) :- FileScan csv spark_catalog.default.table1id#7,name#8 +- FileScan csv spark_catalog.default.table2id#9,name#10 {code} was: Union plan does not take full advantage of children plan output partitionings when output partitoning can't match parent plan's required distribution. For example, Table1 and table2 are all bucketed table with bucket column id and bucket number 100. We will do row_number window function after union the two tables. 
{code:sql} create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table1 values(1, "s1"); insert into table1 values(2, "s2"); create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table2 values(1, "s3"); set spark.sql.shuffle.partitions=100; explain select *, row_number() over(partition by id order by name desc) id_row_number from (select * from table1 union all select * from table2);{code} The physical plan is {code:bash} AdaptiveSparkPlan isFinalPlan=false +- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#28, id#35, name#36 DESC NULLS LAST +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0 +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88] +- Union :- FileScan csv spark_catalog.default.table1id#35,name#36 +- FileScan csv spark_catalog.default.table2id#37,name#38 {code} Although the two tables are bucketed by id column, there's still a exchange plan after union.The reason is that union plan's output partitioning is null. We can indroduce a new idea to optimize exchange plan: # First introduce a new RDD, it consists of parent rdds that has the same partition size. The ith parttition corresponds to ith partition of each parent rdd. # Then push the required distribution to union plan's children. If any child output partitioning matches the required distribution , we can reduce this child shuffle operation. After doing these, the physical plan does not contain exchange shuffle plan {code:bash} AdaptiveSparkPlan isFinalPlan=false +- Window row_number()
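To make the partition-zipping idea concrete, below is a rough, hypothetical sketch of such an RDD; the class and partition names are invented for illustration, and a real implementation would also have to report the preserved output partitioning back to the planner.

{code:scala}
import scala.reflect.ClassTag
import org.apache.spark.{OneToOneDependency, Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// The i-th output partition is the concatenation of the i-th partitions of all parents,
// so co-partitioned (e.g. bucketed) children keep their layout and need no Exchange.
case class UnionZipPartition(index: Int) extends Partition

class UnionZipRDD[T: ClassTag](sc: SparkContext, parents: Seq[RDD[T]])
  extends RDD[T](sc, parents.map(p => new OneToOneDependency(p))) {

  require(parents.map(_.getNumPartitions).distinct.size == 1,
    "All parents must have the same number of partitions")

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](parents.head.getNumPartitions)(i => UnionZipPartition(i))

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    parents.iterator.flatMap(p => p.iterator(p.partitions(split.index), context))
}
{code}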
[jira] [Created] (SPARK-42996) Adding reason for test failure on Spark Connect parity tests.
Haejoon Lee created SPARK-42996: --- Summary: Adding reason for test failure on Spark Connect parity tests. Key: SPARK-42996 URL: https://issues.apache.org/jira/browse/SPARK-42996 Project: Spark Issue Type: Sub-task Components: Connect, Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee Add the specific reason for each failure to the parity tests, instead of just "Fails in Spark Connect, should enable". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy
[ https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42918. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40545 [https://github.com/apache/spark/pull/40545] > Generalize handling of metadata attributes in FileSourceStrategy > > > Key: SPARK-42918 > URL: https://issues.apache.org/jira/browse/SPARK-42918 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.4.1 >Reporter: Johan Lasperas >Priority: Minor > Fix For: 3.5.0 > > > A first step towards allowing file format implementations to inject custom > metadata fields into plans is to make the handling of metadata attributes in > `FileSourceStrategy` more generic. > Today in `FileSourceStrategy` , the lists of constant and generated metadata > fields are created manually, checking for known generated fields on one hand > and considering the remaining fields as constant metadata fields. We need > instead to introduce a way of declaring metadata fields as generated or > constant directly in `FileFormat` and propagate that information to > `FileSourceStrategy`. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy
[ https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42918: --- Assignee: Johan Lasperas > Generalize handling of metadata attributes in FileSourceStrategy > > > Key: SPARK-42918 > URL: https://issues.apache.org/jira/browse/SPARK-42918 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.4.1 >Reporter: Johan Lasperas >Assignee: Johan Lasperas >Priority: Minor > Fix For: 3.5.0 > > > A first step towards allowing file format implementations to inject custom > metadata fields into plans is to make the handling of metadata attributes in > `FileSourceStrategy` more generic. > Today in `FileSourceStrategy` , the lists of constant and generated metadata > fields are created manually, checking for known generated fields on one hand > and considering the remaining fields as constant metadata fields. We need > instead to introduce a way of declaring metadata fields as generated or > constant directly in `FileFormat` and propagate that information to > `FileSourceStrategy`. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
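As a rough illustration of the direction described above, the declaration could hypothetically live on the file format itself; the trait below is a stand-in for illustration only, not Spark's actual FileFormat API.

{code:scala}
import org.apache.spark.sql.types.StructField

// Stand-in trait: a file format declares which of its metadata fields are constant per
// file and which are generated per row, so FileSourceStrategy can split them generically
// instead of hard-coding the known field names.
trait FileFormatMetadataSupport {
  /** Metadata fields whose value is constant for a whole file (e.g. file path, size). */
  def constantMetadataFields: Seq[StructField] = Nil

  /** Metadata fields whose value is generated per row while scanning (e.g. row index). */
  def generatedMetadataFields: Seq[StructField] = Nil
}
{code}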
[jira] [Commented] (SPARK-42977) spark sql Disable vectorized faild
[ https://issues.apache.org/jira/browse/SPARK-42977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707256#comment-17707256 ] Jacek Laskowski commented on SPARK-42977: - Unless you can reproduce it without Iceberg, it's probably an Iceberg issue and should be reported in https://github.com/apache/iceberg/issues. > spark sql Disable vectorized faild > --- > > Key: SPARK-42977 > URL: https://issues.apache.org/jira/browse/SPARK-42977 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 3.3.2 >Reporter: liu >Priority: Major > Fix For: 3.3.2 > > > spark-sql config > {code:java} > ./spark-sql --packages > org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.0\ > --conf > spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions > \ > --conf > spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \ > --conf spark.sql.catalog.spark_catalog.type=hive \ > --conf spark.sql.iceberg.handle-timestamp-without-timezone=true \ > --conf spark.sql.parquet.binaryAsString=true \ > --conf spark.sql.parquet.enableVectorizedReader=false \ > --conf spark.sql.parquet.enableNestedColumnVectorizedReader=true \ > --conf spark.sql.parquet.recordLevelFilter=true {code} > > Now that I have configured spark. sql. queue. > enableVectorizedReader=false,but i query a iceberg parquet table,the > following error occurred: > > > {code:java} > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:498) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:286) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at > org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at > org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: > java.lang.UnsupportedOperationException: Cannot support vectorized reads for > column [hzxm] optional binary hzxm = 8 with encoding DELTA_BYTE_ARRAY. 
> Disable vectorized reads to read this table/file at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator.initDataReader(VectorizedPageIterator.java:100) > at > org.apache.iceberg.parquet.BasePageIterator.initFromPage(BasePageIterator.java:140) > at > org.apache.iceberg.parquet.BasePageIterator$1.visit(BasePageIterator.java:105) > at > org.apache.iceberg.parquet.BasePageIterator$1.visit(BasePageIterator.java:96) > at > org.apache.iceberg.shaded.org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192) > at > org.apache.iceberg.parquet.BasePageIterator.setPage(BasePageIterator.java:95) > at > org.apache.iceberg.parquet.BaseColumnIterator.advance(BaseColumnIterator.java:61) > at > org.apache.iceberg.parquet.BaseColumnIterator.setPageSource(BaseColumnIterator.java:50) > at > org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator.setRowGroupInfo(Vec > {code} > > > *{color:#FF}Caused by: java.lang.UnsupportedOperationException: Cannot > support vectorized reads for column [hzxm] optional binary hzxm = 8 with > encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this > table/file{color}* > > > Now it seems that this parameter has not worked. How can I turn off this > function so that I can successfully query the table -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42663) Fix `default_session` to work properly
[ https://issues.apache.org/jira/browse/SPARK-42663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707279#comment-17707279 ] Wencong Liu commented on SPARK-42663: - Hello [~itholic] , I am a beginner in Spark and I am very interested in this ticket. I understand that the key point of this issue is to make Spark remember the previous key in the config. I would like to ask if this problem only applies to "default_index_type" or if it is just an example. If you could help me by providing the specific code path, I would be very happy and willing to take on this issue. :) > Fix `default_session` to work properly > -- > > Key: SPARK-42663 > URL: https://issues.apache.org/jira/browse/SPARK-42663 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Currently, default_session is not working properly in Spark Connect as below: > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > Traceback (most recent call last): > ... > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (java.util.NoSuchElementException) default_index_type > {code} > It should work as expected in regular PySpark as below: > {code:java} > >>> spark = default_session() > >>> spark.conf.set("default_index_type", "sequence") > >>> spark.conf.get("default_index_type") > 'sequence' > >>> > >>> spark = default_session() > >>> spark.conf.get("default_index_type") > 'sequence'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42973) Upgrade buf to v1.16.0
[ https://issues.apache.org/jira/browse/SPARK-42973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42973: - Assignee: BingKun Pan > Upgrade buf to v1.16.0 > -- > > Key: SPARK-42973 > URL: https://issues.apache.org/jira/browse/SPARK-42973 > Project: Spark > Issue Type: Improvement > Components: Build, Connect >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42973) Upgrade buf to v1.16.0
[ https://issues.apache.org/jira/browse/SPARK-42973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42973: -- Affects Version/s: 3.5.0 (was: 3.4.1) > Upgrade buf to v1.16.0 > -- > > Key: SPARK-42973 > URL: https://issues.apache.org/jira/browse/SPARK-42973 > Project: Spark > Issue Type: Improvement > Components: Build, Connect >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42973) Upgrade buf to v1.16.0
[ https://issues.apache.org/jira/browse/SPARK-42973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42973. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40596 [https://github.com/apache/spark/pull/40596] > Upgrade buf to v1.16.0 > -- > > Key: SPARK-42973 > URL: https://issues.apache.org/jira/browse/SPARK-42973 > Project: Spark > Issue Type: Improvement > Components: Build, Connect >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42997) TableOutputResolver must use correct column paths in error messages for arrays and maps
Anton Okolnychyi created SPARK-42997: Summary: TableOutputResolver must use correct column paths in error messages for arrays and maps Key: SPARK-42997 URL: https://issues.apache.org/jira/browse/SPARK-42997 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.2, 3.3.1, 3.3.0, 3.3.3, 3.4.0, 3.4.1, 3.5.0 Reporter: Anton Okolnychyi TableOutputResolver must use correct column paths in error messages for arrays and maps. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42860) Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode
[ https://issues.apache.org/jira/browse/SPARK-42860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707402#comment-17707402 ] xiaochen zhou commented on SPARK-42860: --- https://github.com/apache/spark/pull/40626 > Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode > --- > > Key: SPARK-42860 > URL: https://issues.apache.org/jira/browse/SPARK-42860 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: arindam patra >Priority: Blocker > > We have a service that submits spark sql jobs to a spark cluster . > we want to validate the sql query before submitting the job . We are > currently using df.explain(extended=true) which generates parsed , analysed , > optimised logical plan and physical plan . > But generating optimised logical plan sometimes takes more time for e.g if > you have applied a filter on a partitioned column , spark will list all > directories and take the required ones . > For our query validation purpose this doesnt make sense and it would be great > if there is a explain mode that will only print the parsed and analysed > logical plans only -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42998) Fix DataFrame.collect with null struct.
Takuya Ueshin created SPARK-42998: - Summary: Fix DataFrame.collect with null struct. Key: SPARK-42998 URL: https://issues.apache.org/jira/browse/SPARK-42998 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Takuya Ueshin In Spark Connect: {code:python} >>> df = spark.sql("values (1, struct('a' as x)), (null, null) as t(a, b)") >>> df.show() +++ | a| b| +++ | 1| {a}| |null|null| +++ >>> df.collect() [Row(a=1, b=Row(x='a')), Row(a=None, b=)] {code} whereas PySpark: {code:python} >>> df.collect() [Row(a=1, b=Row(x='a')), Row(a=None, b=None)] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42999) Impl Dataset#foreach, foreachPartitions
Zhen Li created SPARK-42999: --- Summary: Impl Dataset#foreach, foreachPartitions Key: SPARK-42999 URL: https://issues.apache.org/jira/browse/SPARK-42999 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Zhen Li Implement the missing methods in the Scala client Dataset API. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
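For reference, a usage sketch of the call shapes these methods are expected to have, mirroring the classic sql.Dataset API; it assumes a Spark Connect SparkSession named `spark` is already available.

{code:scala}
// foreach runs the function once per element on the executors.
val ds = spark.range(5)
ds.foreach((n: java.lang.Long) => println(s"row $n"))

// foreachPartition runs once per partition, which is useful for per-partition setup
// such as opening a connection before writing the rows out.
ds.foreachPartition((it: Iterator[java.lang.Long]) => println(s"partition size = ${it.size}"))
{code}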
[jira] [Comment Edited] (SPARK-42860) Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode
[ https://issues.apache.org/jira/browse/SPARK-42860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707402#comment-17707402 ] xiaochen zhou edited comment on SPARK-42860 at 4/1/23 1:16 AM: --- [https://github.com/apache/spark/pull/40631] was (Author: zxcoccer): https://github.com/apache/spark/pull/40626 > Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode > --- > > Key: SPARK-42860 > URL: https://issues.apache.org/jira/browse/SPARK-42860 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: arindam patra >Priority: Blocker > > We have a service that submits spark sql jobs to a spark cluster . > we want to validate the sql query before submitting the job . We are > currently using df.explain(extended=true) which generates parsed , analysed , > optimised logical plan and physical plan . > But generating optimised logical plan sometimes takes more time for e.g if > you have applied a filter on a partitioned column , spark will list all > directories and take the required ones . > For our query validation purpose this doesnt make sense and it would be great > if there is a explain mode that will only print the parsed and analysed > logical plans only -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-42860) Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode
[ https://issues.apache.org/jira/browse/SPARK-42860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707402#comment-17707402 ] xiaochen zhou edited comment on SPARK-42860 at 4/1/23 1:19 AM: --- [hi,I would like to deal with it. Can you assign this ticket to me ?|https://github.com/apache/spark/pull/40631] [https://github.com/apache/spark/pull/40631] was (Author: zxcoccer): [https://github.com/apache/spark/pull/40631] > Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode > --- > > Key: SPARK-42860 > URL: https://issues.apache.org/jira/browse/SPARK-42860 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: arindam patra >Priority: Blocker > > We have a service that submits spark sql jobs to a spark cluster . > we want to validate the sql query before submitting the job . We are > currently using df.explain(extended=true) which generates parsed , analysed , > optimised logical plan and physical plan . > But generating optimised logical plan sometimes takes more time for e.g if > you have applied a filter on a partitioned column , spark will list all > directories and take the required ones . > For our query validation purpose this doesnt make sense and it would be great > if there is a explain mode that will only print the parsed and analysed > logical plans only -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-42860) Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode
[ https://issues.apache.org/jira/browse/SPARK-42860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707402#comment-17707402 ] xiaochen zhou edited comment on SPARK-42860 at 4/1/23 1:20 AM: --- [https://github.com/apache/spark/pull/40631] was (Author: zxcoccer): [hi,I would like to deal with it. Can you assign this ticket to me ?|https://github.com/apache/spark/pull/40631] [https://github.com/apache/spark/pull/40631] > Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode > --- > > Key: SPARK-42860 > URL: https://issues.apache.org/jira/browse/SPARK-42860 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: arindam patra >Priority: Blocker > > We have a service that submits spark sql jobs to a spark cluster . > we want to validate the sql query before submitting the job . We are > currently using df.explain(extended=true) which generates parsed , analysed , > optimised logical plan and physical plan . > But generating optimised logical plan sometimes takes more time for e.g if > you have applied a filter on a partitioned column , spark will list all > directories and take the required ones . > For our query validation purpose this doesnt make sense and it would be great > if there is a explain mode that will only print the parsed and analysed > logical plans only -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
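Until such a mode exists, one possible workaround on the Scala side is to stop at analysis and print those plans directly; a sketch, assuming a DataFrame `df`:

{code:scala}
// queryExecution.analyzed runs parsing and analysis but not optimization or physical
// planning, so the query can be validated without the cost of building the optimized plan.
val qe = df.queryExecution
println(qe.logical)   // parsed (unresolved) logical plan
println(qe.analyzed)  // analyzed logical plan
{code}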
[jira] [Resolved] (SPARK-42998) Fix DataFrame.collect with null struct.
[ https://issues.apache.org/jira/browse/SPARK-42998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-42998. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40627 [https://github.com/apache/spark/pull/40627] > Fix DataFrame.collect with null struct. > --- > > Key: SPARK-42998 > URL: https://issues.apache.org/jira/browse/SPARK-42998 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.4.0 > > > In Spark Connect: > {code:python} > >>> df = spark.sql("values (1, struct('a' as x)), (null, null) as t(a, b)") > >>> df.show() > +++ > | a| b| > +++ > | 1| {a}| > |null|null| > +++ > >>> df.collect() > [Row(a=1, b=Row(x='a')), Row(a=None, b=)] > {code} > whereas PySpark: > {code:python} > >>> df.collect() > [Row(a=1, b=Row(x='a')), Row(a=None, b=None)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42998) Fix DataFrame.collect with null struct.
[ https://issues.apache.org/jira/browse/SPARK-42998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-42998: - Assignee: Takuya Ueshin > Fix DataFrame.collect with null struct. > --- > > Key: SPARK-42998 > URL: https://issues.apache.org/jira/browse/SPARK-42998 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > > In Spark Connect: > {code:python} > >>> df = spark.sql("values (1, struct('a' as x)), (null, null) as t(a, b)") > >>> df.show() > +++ > | a| b| > +++ > | 1| {a}| > |null|null| > +++ > >>> df.collect() > [Row(a=1, b=Row(x='a')), Row(a=None, b=)] > {code} > whereas PySpark: > {code:python} > >>> df.collect() > [Row(a=1, b=Row(x='a')), Row(a=None, b=None)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42993) Make Torch Distributor compatible with Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-42993: -- Summary: Make Torch Distributor compatible with Spark Connect (was: Make Torch Distributor support Spark Connect) > Make Torch Distributor compatible with Spark Connect > > > Key: SPARK-42993 > URL: https://issues.apache.org/jira/browse/SPARK-42993 > Project: Spark > Issue Type: Sub-task > Components: Connect, ML >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42993) Make Torch Distributor compatible with Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707493#comment-17707493 ] Snoot.io commented on SPARK-42993: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40607 > Make Torch Distributor compatible with Spark Connect > > > Key: SPARK-42993 > URL: https://issues.apache.org/jira/browse/SPARK-42993 > Project: Spark > Issue Type: Sub-task > Components: Connect, ML >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41628) Support async query execution
[ https://issues.apache.org/jira/browse/SPARK-41628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707501#comment-17707501 ] Jia Fan commented on SPARK-41628: - I'm working on it > Support async query execution > - > > Key: SPARK-41628 > URL: https://issues.apache.org/jira/browse/SPARK-41628 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major > > Today the query execution is completely synchronous; add an additional > asynchronous API that allows clients to submit a query and poll for the result. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37829) An outer-join using joinWith on DataFrames returns Rows with null fields instead of null values
[ https://issues.apache.org/jira/browse/SPARK-37829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707502#comment-17707502 ] Clément de Groc commented on SPARK-37829: - I'm not planning to resume. I don't know that part of the codebase well enough to submit a better fix other than the one I already submitted in my PR. > An outer-join using joinWith on DataFrames returns Rows with null fields > instead of null values > --- > > Key: SPARK-37829 > URL: https://issues.apache.org/jira/browse/SPARK-37829 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0 >Reporter: Clément de Groc >Priority: Major > > Doing an outer-join using {{joinWith}} on {{{}DataFrame{}}}s used to return > missing values as {{null}} in Spark 2.4.8, but returns them as {{Rows}} with > {{null}} values in Spark 3+. > The issue can be reproduced with [the following > test|https://github.com/cdegroc/spark/commit/79f4d6a1ec6c69b10b72dbc8f92ab6490d5ef5e5] > that succeeds on Spark 2.4.8 but fails starting from Spark 3.0.0. > The problem only arises when working with DataFrames: Datasets of case > classes work as expected as demonstrated by [this other > test|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala#L1200-L1223]. > I couldn't find an explanation for this change in the Migration guide so I'm > assuming this is a bug. > A {{git bisect}} pointed me to [that > commit|https://github.com/apache/spark/commit/cd92f25be5a221e0d4618925f7bc9dfd3bb8cb59]. > Reverting the commit solves the problem. > A similar solution, but without reverting, is shown > [here|https://github.com/cdegroc/spark/commit/684c675bf070876a475a9b225f6c2f92edce4c8a]. > Happy to help if you think of another approach / can provide some guidance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
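A minimal reproduction sketch of the behaviour described in this ticket, assuming a plain local SparkSession; the expected outputs are taken from the report above rather than re-verified here.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val left = Seq((1, "a"), (2, "b")).toDF("id", "v")
val right = Seq((1, "x")).toDF("id", "w")

// joinWith keeps each side as a whole value of the resulting tuple.
val joined = left.joinWith(right, left("id") === right("id"), "left_outer")
joined.collect().foreach(println)
// Spark 2.4.8 (per the report): ([2,b],null)
// Spark 3.x   (per the report): ([2,b],[null,null])  <- a Row of nulls instead of null
{code}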