[jira] [Commented] (SPARK-36440) Spark3 fails to read hive table with mixed format

2021-08-05 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394529#comment-17394529
 ] 

Chao Sun commented on SPARK-36440:
--

Hmm, really? Spark 2.x supports this? I'm not sure why Spark is still expected to 
work in this case, since the serde has been changed to Parquet while the underlying 
data files are still in RCFile. It seems like an error that users should avoid.
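
For reference, one way to see the mismatch is to compare the table-level and 
partition-level storage info from the Spark shell (a minimal check, assuming the 
table from the repro below exists):
{code:java}
// After the ALTER, the table-level serde reports Parquet while the partition
// created before the change still reports RCFile.
spark.sql("describe formatted tmp.test_table").show(100, truncate = false)
spark.sql("describe formatted tmp.test_table partition (pt = 1)").show(100, truncate = false)
{code}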

> Spark3 fails to read hive table with mixed format
> -
>
> Key: SPARK-36440
> URL: https://issues.apache.org/jira/browse/SPARK-36440
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.1, 3.1.2
>Reporter: Jason Xu
>Priority: Major
>
> Spark 3 fails to read a Hive table with mixed formats through the Hive serde; 
> this is a regression compared to Spark 2.4.
> Reproduction steps:
>  1. In a Spark 3 (3.0 or 3.1) spark-shell:
> {code:java}
> scala> spark.sql("create table tmp.test_table (id int, name string) 
> partitioned by (pt int) stored as rcfile")
> scala> spark.sql("insert into tmp.test_table partition (pt = 1) values 
> (1, 'Alice'), (2, 'Bob')")
> {code}
> 2. Run a Hive command to change the table file format (from RCFile to Parquet):
> {code:java}
> hive (default)> alter table tmp.test_table set fileformat Parquet;
> {code}
> 3. Try to read the partition (still in RCFile format) with the Hive serde from the Spark shell:
> {code:java}
> scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
> scala> spark.sql("select * from tmp.test_table where pt=1").show{code}
> Exception (file path anonymized):
> {code:java}
> Caused by: java.lang.RuntimeException: 
> s3a:///data/part-0-22112178-5dd7-4065-89d7-2ee550296909-c000 is not 
> a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [5, 
> 96, 1, -33]
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:524)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:433)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:79)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:286)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:285)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   {code}
>  
>  
>  






[jira] [Commented] (SPARK-36162) extractJoinKeysWithColStats support EqualNullSafe

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394495#comment-17394495
 ] 

Apache Spark commented on SPARK-36162:
--

User 'changvvb' has created a pull request for this issue:
https://github.com/apache/spark/pull/33662

> extractJoinKeysWithColStats support EqualNullSafe
> -
>
> Key: SPARK-36162
> URL: https://issues.apache.org/jira/browse/SPARK-36162
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> sql("select * from date_dim join item on d_date_sk = 
> i_item_sk").explain("cost")
> {noformat}
> == Optimized Logical Plan ==
> Join Inner, (d_date_sk#0 = i_item_sk#28), Statistics(sizeInBytes=1.0 B, 
> rowCount=0)
> :- Relation 
> default.date_dim[d_date_sk#0,d_date_id#1,d_date#2,d_month_seq#3,d_week_seq#4,d_quarter_seq#5,d_year#6,d_dow#7,d_moy#8,d_dom#9,d_qoy#10,d_fy_year#11,d_fy_quarter_seq#12,d_fy_week_seq#13,d_day_name#14,d_quarter_name#15,d_holiday#16,d_weekend#17,d_following_holiday#18,d_first_dom#19,d_last_dom#20,d_same_day_ly#21,d_same_day_lq#22,d_current_day#23,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.6 MiB, rowCount=7.30E+4)
> +- Relation 
> default.item[i_item_sk#28,i_item_id#29,i_rec_start_date#30,i_rec_end_date#31,i_item_desc#32,i_current_price#33,i_wholesale_cost#34,i_brand_id#35,i_brand#36,i_class_id#37,i_class#38,i_category_id#39,i_category#40,i_manufact_id#41,i_manufact#42,i_size#43,i_formulation#44,i_color#45,i_units#46,i_container#47,i_manager_id#48,i_product_name#49]
>  parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5)
> {noformat}
> sql("select * from date_dim join item on d_date_sk <=> 
> i_item_sk").explain("cost")
> {noformat}
> == Optimized Logical Plan ==
> Join Inner, (d_date_sk#0 <=> i_item_sk#28), Statistics(sizeInBytes=9.2 TiB, 
> rowCount=1.49E+10)
> :- Relation 
> default.date_dim[d_date_sk#0,d_date_id#1,d_date#2,d_month_seq#3,d_week_seq#4,d_quarter_seq#5,d_year#6,d_dow#7,d_moy#8,d_dom#9,d_qoy#10,d_fy_year#11,d_fy_quarter_seq#12,d_fy_week_seq#13,d_day_name#14,d_quarter_name#15,d_holiday#16,d_weekend#17,d_following_holiday#18,d_first_dom#19,d_last_dom#20,d_same_day_ly#21,d_same_day_lq#22,d_current_day#23,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.6 MiB, rowCount=7.30E+4)
> +- Relation 
> default.item[i_item_sk#28,i_item_id#29,i_rec_start_date#30,i_rec_end_date#31,i_item_desc#32,i_current_price#33,i_wholesale_cost#34,i_brand_id#35,i_brand#36,i_class_id#37,i_class#38,i_category_id#39,i_category#40,i_manufact_id#41,i_manufact#42,i_size#43,i_formulation#44,i_color#45,i_units#46,i_container#47,i_manager_id#48,i_product_name#49]
>  parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5)
> {noformat}
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala#L329-L339
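
For context: with = the estimator uses the column statistics (rowCount=0 above), 
while <=> falls back to the default estimate (rowCount=1.49E+10), because the 
key-extraction code linked above only matches EqualTo. A minimal sketch of the 
proposed extension (illustrative only, not the actual JoinEstimation code):
{code:java}
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualNullSafe, EqualTo, Expression}

// Collect join-key attribute pairs for estimation, treating null-safe
// equality (<=>) the same way as plain equality (=).
def extractKeyPairs(joinKeys: Seq[Expression]): Seq[(AttributeReference, AttributeReference)] =
  joinKeys.collect {
    case EqualTo(l: AttributeReference, r: AttributeReference) => (l, r)
    case EqualNullSafe(l: AttributeReference, r: AttributeReference) => (l, r)
  }
{code}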






[jira] [Assigned] (SPARK-36162) extractJoinKeysWithColStats support EqualNullSafe

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36162:


Assignee: Apache Spark

> extractJoinKeysWithColStats support EqualNullSafe
> -
>
> Key: SPARK-36162
> URL: https://issues.apache.org/jira/browse/SPARK-36162
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> sql("select * from date_dim join item on d_date_sk = 
> i_item_sk").explain("cost")
> {noformat}
> == Optimized Logical Plan ==
> Join Inner, (d_date_sk#0 = i_item_sk#28), Statistics(sizeInBytes=1.0 B, 
> rowCount=0)
> :- Relation 
> default.date_dim[d_date_sk#0,d_date_id#1,d_date#2,d_month_seq#3,d_week_seq#4,d_quarter_seq#5,d_year#6,d_dow#7,d_moy#8,d_dom#9,d_qoy#10,d_fy_year#11,d_fy_quarter_seq#12,d_fy_week_seq#13,d_day_name#14,d_quarter_name#15,d_holiday#16,d_weekend#17,d_following_holiday#18,d_first_dom#19,d_last_dom#20,d_same_day_ly#21,d_same_day_lq#22,d_current_day#23,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.6 MiB, rowCount=7.30E+4)
> +- Relation 
> default.item[i_item_sk#28,i_item_id#29,i_rec_start_date#30,i_rec_end_date#31,i_item_desc#32,i_current_price#33,i_wholesale_cost#34,i_brand_id#35,i_brand#36,i_class_id#37,i_class#38,i_category_id#39,i_category#40,i_manufact_id#41,i_manufact#42,i_size#43,i_formulation#44,i_color#45,i_units#46,i_container#47,i_manager_id#48,i_product_name#49]
>  parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5)
> {noformat}
> sql("select * from date_dim join item on d_date_sk <=> 
> i_item_sk").explain("cost")
> {noformat}
> == Optimized Logical Plan ==
> Join Inner, (d_date_sk#0 <=> i_item_sk#28), Statistics(sizeInBytes=9.2 TiB, 
> rowCount=1.49E+10)
> :- Relation 
> default.date_dim[d_date_sk#0,d_date_id#1,d_date#2,d_month_seq#3,d_week_seq#4,d_quarter_seq#5,d_year#6,d_dow#7,d_moy#8,d_dom#9,d_qoy#10,d_fy_year#11,d_fy_quarter_seq#12,d_fy_week_seq#13,d_day_name#14,d_quarter_name#15,d_holiday#16,d_weekend#17,d_following_holiday#18,d_first_dom#19,d_last_dom#20,d_same_day_ly#21,d_same_day_lq#22,d_current_day#23,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.6 MiB, rowCount=7.30E+4)
> +- Relation 
> default.item[i_item_sk#28,i_item_id#29,i_rec_start_date#30,i_rec_end_date#31,i_item_desc#32,i_current_price#33,i_wholesale_cost#34,i_brand_id#35,i_brand#36,i_class_id#37,i_class#38,i_category_id#39,i_category#40,i_manufact_id#41,i_manufact#42,i_size#43,i_formulation#44,i_color#45,i_units#46,i_container#47,i_manager_id#48,i_product_name#49]
>  parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5)
> {noformat}
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala#L329-L339






[jira] [Assigned] (SPARK-36162) extractJoinKeysWithColStats support EqualNullSafe

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36162:


Assignee: (was: Apache Spark)

> extractJoinKeysWithColStats support EqualNullSafe
> -
>
> Key: SPARK-36162
> URL: https://issues.apache.org/jira/browse/SPARK-36162
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> sql("select * from date_dim join item on d_date_sk = 
> i_item_sk").explain("cost")
> {noformat}
> == Optimized Logical Plan ==
> Join Inner, (d_date_sk#0 = i_item_sk#28), Statistics(sizeInBytes=1.0 B, 
> rowCount=0)
> :- Relation 
> default.date_dim[d_date_sk#0,d_date_id#1,d_date#2,d_month_seq#3,d_week_seq#4,d_quarter_seq#5,d_year#6,d_dow#7,d_moy#8,d_dom#9,d_qoy#10,d_fy_year#11,d_fy_quarter_seq#12,d_fy_week_seq#13,d_day_name#14,d_quarter_name#15,d_holiday#16,d_weekend#17,d_following_holiday#18,d_first_dom#19,d_last_dom#20,d_same_day_ly#21,d_same_day_lq#22,d_current_day#23,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.6 MiB, rowCount=7.30E+4)
> +- Relation 
> default.item[i_item_sk#28,i_item_id#29,i_rec_start_date#30,i_rec_end_date#31,i_item_desc#32,i_current_price#33,i_wholesale_cost#34,i_brand_id#35,i_brand#36,i_class_id#37,i_class#38,i_category_id#39,i_category#40,i_manufact_id#41,i_manufact#42,i_size#43,i_formulation#44,i_color#45,i_units#46,i_container#47,i_manager_id#48,i_product_name#49]
>  parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5)
> {noformat}
> sql("select * from date_dim join item on d_date_sk <=> 
> i_item_sk").explain("cost")
> {noformat}
> == Optimized Logical Plan ==
> Join Inner, (d_date_sk#0 <=> i_item_sk#28), Statistics(sizeInBytes=9.2 TiB, 
> rowCount=1.49E+10)
> :- Relation 
> default.date_dim[d_date_sk#0,d_date_id#1,d_date#2,d_month_seq#3,d_week_seq#4,d_quarter_seq#5,d_year#6,d_dow#7,d_moy#8,d_dom#9,d_qoy#10,d_fy_year#11,d_fy_quarter_seq#12,d_fy_week_seq#13,d_day_name#14,d_quarter_name#15,d_holiday#16,d_weekend#17,d_following_holiday#18,d_first_dom#19,d_last_dom#20,d_same_day_ly#21,d_same_day_lq#22,d_current_day#23,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.6 MiB, rowCount=7.30E+4)
> +- Relation 
> default.item[i_item_sk#28,i_item_id#29,i_rec_start_date#30,i_rec_end_date#31,i_item_desc#32,i_current_price#33,i_wholesale_cost#34,i_brand_id#35,i_brand#36,i_class_id#37,i_class#38,i_category_id#39,i_category#40,i_manufact_id#41,i_manufact#42,i_size#43,i_formulation#44,i_color#45,i_units#46,i_container#47,i_manager_id#48,i_product_name#49]
>  parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5)
> {noformat}
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala#L329-L339






[jira] [Commented] (SPARK-36162) extractJoinKeysWithColStats support EqualNullSafe

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394494#comment-17394494
 ] 

Apache Spark commented on SPARK-36162:
--

User 'changvvb' has created a pull request for this issue:
https://github.com/apache/spark/pull/33662

> extractJoinKeysWithColStats support EqualNullSafe
> -
>
> Key: SPARK-36162
> URL: https://issues.apache.org/jira/browse/SPARK-36162
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> sql("select * from date_dim join item on d_date_sk = 
> i_item_sk").explain("cost")
> {noformat}
> == Optimized Logical Plan ==
> Join Inner, (d_date_sk#0 = i_item_sk#28), Statistics(sizeInBytes=1.0 B, 
> rowCount=0)
> :- Relation 
> default.date_dim[d_date_sk#0,d_date_id#1,d_date#2,d_month_seq#3,d_week_seq#4,d_quarter_seq#5,d_year#6,d_dow#7,d_moy#8,d_dom#9,d_qoy#10,d_fy_year#11,d_fy_quarter_seq#12,d_fy_week_seq#13,d_day_name#14,d_quarter_name#15,d_holiday#16,d_weekend#17,d_following_holiday#18,d_first_dom#19,d_last_dom#20,d_same_day_ly#21,d_same_day_lq#22,d_current_day#23,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.6 MiB, rowCount=7.30E+4)
> +- Relation 
> default.item[i_item_sk#28,i_item_id#29,i_rec_start_date#30,i_rec_end_date#31,i_item_desc#32,i_current_price#33,i_wholesale_cost#34,i_brand_id#35,i_brand#36,i_class_id#37,i_class#38,i_category_id#39,i_category#40,i_manufact_id#41,i_manufact#42,i_size#43,i_formulation#44,i_color#45,i_units#46,i_container#47,i_manager_id#48,i_product_name#49]
>  parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5)
> {noformat}
> sql("select * from date_dim join item on d_date_sk <=> 
> i_item_sk").explain("cost")
> {noformat}
> == Optimized Logical Plan ==
> Join Inner, (d_date_sk#0 <=> i_item_sk#28), Statistics(sizeInBytes=9.2 TiB, 
> rowCount=1.49E+10)
> :- Relation 
> default.date_dim[d_date_sk#0,d_date_id#1,d_date#2,d_month_seq#3,d_week_seq#4,d_quarter_seq#5,d_year#6,d_dow#7,d_moy#8,d_dom#9,d_qoy#10,d_fy_year#11,d_fy_quarter_seq#12,d_fy_week_seq#13,d_day_name#14,d_quarter_name#15,d_holiday#16,d_weekend#17,d_following_holiday#18,d_first_dom#19,d_last_dom#20,d_same_day_ly#21,d_same_day_lq#22,d_current_day#23,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.6 MiB, rowCount=7.30E+4)
> +- Relation 
> default.item[i_item_sk#28,i_item_id#29,i_rec_start_date#30,i_rec_end_date#31,i_item_desc#32,i_current_price#33,i_wholesale_cost#34,i_brand_id#35,i_brand#36,i_class_id#37,i_class#38,i_category_id#39,i_category#40,i_manufact_id#41,i_manufact#42,i_size#43,i_formulation#44,i_color#45,i_units#46,i_container#47,i_manager_id#48,i_product_name#49]
>  parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5)
> {noformat}
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala#L329-L339






[jira] [Resolved] (SPARK-36415) Add docs for try_cast/try_add/try_divide

2021-08-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36415.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33638
[https://github.com/apache/spark/pull/33638]

> Add docs for try_cast/try_add/try_divide
> 
>
> Key: SPARK-36415
> URL: https://issues.apache.org/jira/browse/SPARK-36415
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Add docs for try_cast/try_add/try_divide
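
For reference, a quick usage sketch of the functions being documented (assuming a 
Spark 3.2 session; each returns NULL on failure instead of raising an error under 
ANSI mode):
{code:java}
// try_cast returns NULL on an invalid cast, try_add on integer overflow,
// and try_divide on division by zero.
spark.sql("select try_cast('abc' as int), try_add(2147483647, 1), try_divide(1, 0)").show()
{code}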






[jira] [Updated] (SPARK-36443) Demote BroadcastJoin causes performance regression and increases OOM risks

2021-08-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-36443:
-
Description: 
 

 

!image-2021-08-06-11-24-34-122.png!

  was:
 

!image-2021-08-06-11-24-34-122.png!


> Demote BroadcastJoin causes performance regression and increases OOM risks
> --
>
> Key: SPARK-36443
> URL: https://issues.apache.org/jira/browse/SPARK-36443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kent Yao
>Priority: Major
> Attachments: image-2021-08-06-11-24-34-122.png
>
>
>  
>  
> !image-2021-08-06-11-24-34-122.png!






[jira] [Updated] (SPARK-36443) Demote BroadcastJoin causes performance regression and increases OOM risks

2021-08-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-36443:
-
Description: 
 

!image-2021-08-06-11-24-34-122.png!

  was:!image-2021-08-06-11-24-34-122.png!


> Demote BroadcastJoin causes performance regression and increases OOM risks
> --
>
> Key: SPARK-36443
> URL: https://issues.apache.org/jira/browse/SPARK-36443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kent Yao
>Priority: Major
> Attachments: image-2021-08-06-11-24-34-122.png
>
>
>  
> !image-2021-08-06-11-24-34-122.png!






[jira] [Updated] (SPARK-36443) Demote BroadcastJoin causes performance regression and increases OOM risks

2021-08-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-36443:
-
Attachment: image-2021-08-06-11-24-34-122.png

> Demote BroadcastJoin causes performance regression and increases OOM risks
> --
>
> Key: SPARK-36443
> URL: https://issues.apache.org/jira/browse/SPARK-36443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kent Yao
>Priority: Major
> Attachments: image-2021-08-06-11-24-34-122.png
>
>
> !image-2021-08-06-11-19-00-105.png!






[jira] [Updated] (SPARK-36443) Demote BroadcastJoin causes performance regression and increases OOM risks

2021-08-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-36443:
-
Description: !image-2021-08-06-11-24-34-122.png!  (was: 
!image-2021-08-06-11-19-00-105.png!)

> Demote BroadcastJoin causes performance regression and increases OOM risks
> --
>
> Key: SPARK-36443
> URL: https://issues.apache.org/jira/browse/SPARK-36443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kent Yao
>Priority: Major
> Attachments: image-2021-08-06-11-24-34-122.png
>
>
> !image-2021-08-06-11-24-34-122.png!






[jira] [Created] (SPARK-36443) Demote BroadcastJoin causes performance regression and increases OOM risks

2021-08-05 Thread Kent Yao (Jira)
Kent Yao created SPARK-36443:


 Summary: Demote BroadcastJoin causes performance regression and 
increases OOM risks
 Key: SPARK-36443
 URL: https://issues.apache.org/jira/browse/SPARK-36443
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.2
Reporter: Kent Yao


!image-2021-08-06-11-19-00-105.png!






[jira] [Resolved] (SPARK-36420) Use `isEmpty` to improve performance in Pregel's superstep

2021-08-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36420.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33648
[https://github.com/apache/spark/pull/33648]

> Use `isEmpty` to improve performance in Pregel's superstep
> --
>
> Key: SPARK-36420
> URL: https://issues.apache.org/jira/browse/SPARK-36420
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.4.7
>Reporter: xiepengjie
>Assignee: xiepengjie
>Priority: Minor
> Fix For: 3.3.0
>
>
> When I was running `Graphx.connectedComponents` with 20+ billion vertices and 
> edges, I found that count is very slow.
> {code:java}
> object Pregel extends Logging {
>   ...
>   def apply[VD: ClassTag, ED: ClassTag, A: ClassTag] (...): Graph[VD, ED] = {
> ...
> // Maybe messages.isEmpty() is better than messages.count()
> var activeMessages = messages.count()
> // Loop
> var prevG: Graph[VD, ED] = null
> var i = 0
> while (activeMessages > 0 && i < maxIterations) {
>   ...
>   activeMessages = messages.count()
>   ...
> }
> ...
> g
>   } // end of apply
> } // end of class Pregel
> {code}
> We only need an action here to check whether any active messages remain, not 
> their exact count, so it's better to use isEmpty than count. I verified this 
> and it worked very well.
>  
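
A minimal sketch of the proposed loop shape (the names here are illustrative, not 
Pregel's actual internals): RDD.isEmpty() only has to find a single element, while 
count() scans every partition.
{code:java}
import org.apache.spark.rdd.RDD

// Illustrative superstep loop: check for remaining messages with isEmpty()
// instead of materializing a full count on every iteration.
def runSupersteps[A](initial: RDD[A], step: RDD[A] => RDD[A], maxIterations: Int): RDD[A] = {
  var messages = initial
  var i = 0
  while (!messages.isEmpty() && i < maxIterations) {
    messages = step(messages)
    i += 1
  }
  messages
}
{code}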






[jira] [Assigned] (SPARK-36420) Use `isEmpty` to improve performance in Pregel's superstep

2021-08-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36420:


Assignee: xiepengjie

> Use `isEmpty` to improve performance in Pregel's superstep
> --
>
> Key: SPARK-36420
> URL: https://issues.apache.org/jira/browse/SPARK-36420
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.4.7
>Reporter: xiepengjie
>Assignee: xiepengjie
>Priority: Minor
>
> When I was running `Graphx.connectedComponents` with 20+ billion vertices and 
> edges, I found that count is very slow.
> {code:java}
> object Pregel extends Logging {
>   ...
>   def apply[VD: ClassTag, ED: ClassTag, A: ClassTag] (...): Graph[VD, ED] = {
> ...
> // Maybe messages.isEmpty() is better than messages.count()
> var activeMessages = messages.count()
> // Loop
> var prevG: Graph[VD, ED] = null
> var i = 0
> while (activeMessages > 0 && i < maxIterations) {
>   ...
>   activeMessages = messages.count()
>   ...
> }
> ...
> g
>   } // end of apply
> } // end of class Pregel
> {code}
> We only need an action here to check whether any active messages remain, not 
> their exact count, so it's better to use isEmpty than count. I verified this 
> and it worked very well.
>  






[jira] [Created] (SPARK-36442) Do not reference the deprecated UserDefinedAggregateFunction

2021-08-05 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-36442:
---

 Summary: Do not reference the deprecated 
UserDefinedAggregateFunction
 Key: SPARK-36442
 URL: https://issues.apache.org/jira/browse/SPARK-36442
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wenchen Fan


https://spark.apache.org/docs/3.1.2/sql-ref-syntax-ddl-create-function.html#parameters
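
For reference, the non-deprecated replacement the docs should point to instead is 
a typed Aggregator registered through functions.udaf; a minimal sketch:
{code:java}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders, functions}

// A trivial typed aggregator (sum of longs), replacing the deprecated
// UserDefinedAggregateFunction.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buf: Long, in: Long): Long = buf + in
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(buf: Long): Long = buf
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// Register it as a SQL function (inside a SparkSession):
// spark.udf.register("long_sum", functions.udaf(LongSum))
{code}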






[jira] [Issue Comment Deleted] (SPARK-36429) JacksonParser should throw exception when data type unsupported.

2021-08-05 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-36429:
---
Comment: was deleted

(was: I'm working on.)

> JacksonParser should throw exception when data type unsupported.
> 
>
> Key: SPARK-36429
> URL: https://issues.apache.org/jira/browse/SPARK-36429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, when spark.sql.timestampType=TIMESTAMP_NTZ is set, the behavior 
> differs between from_json and from_csv.
> {code:java}
> -- !query
> select from_json('{"t":"26/October/2015"}', 't Timestamp', 
> map('timestampFormat', 'dd/M/'))
> -- !query schema
> struct>
> -- !query output
> {"t":null}
> {code}
> {code:java}
> -- !query
> select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 
> 'dd/M/'))
> -- !query schema
> struct<>
> -- !query output
> java.lang.Exception
> Unsupported type: timestamp_ntz
> {code}
> We should make from_json throw an exception too.
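
A self-contained sketch of the intended shape (an assumption about the fix, not 
Spark's actual parser code): the converter factory should fail loudly for types it 
cannot handle instead of silently yielding null, matching what from_csv does:
{code:java}
import org.apache.spark.sql.types._

// Illustrative converter factory: unsupported types throw at construction
// time rather than producing null values at parse time.
def makeConverter(dataType: DataType): String => Any = dataType match {
  case StringType  => (s: String) => s
  case IntegerType => (s: String) => s.toInt
  case other => throw new UnsupportedOperationException(
    s"Unsupported type: ${other.typeName}")
}
{code}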






[jira] [Assigned] (SPARK-36431) Support comparison of ANSI intervals with different fields

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36431:


Assignee: (was: Apache Spark)

> Support comparison of ANSI intervals with different fields
> --
>
> Key: SPARK-36431
> URL: https://issues.apache.org/jira/browse/SPARK-36431
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Support comparison of:
> - a day-time interval with another day-time interval that has different fields
> - a year-month interval with another year-month interval that has different fields.
> The example below shows the issue:
> {code:sql}
> spark-sql> select interval '1' day > interval '1' hour;
> Error in query: cannot resolve '(INTERVAL '1' DAY > INTERVAL '01' HOUR)' due 
> to data type mismatch: differing types in '(INTERVAL '1' DAY > INTERVAL '01' 
> HOUR)' (interval day and interval hour).; line 1 pos 7;
> 'Project [unresolvedalias((INTERVAL '1' DAY > INTERVAL '01' HOUR), None)]
> +- OneRowRelation
> spark-sql> select interval '2' year > interval '11' month;
> Error in query: cannot resolve '(INTERVAL '2' YEAR > INTERVAL '11' MONTH)' 
> due to data type mismatch: differing types in '(INTERVAL '2' YEAR > INTERVAL 
> '11' MONTH)' (interval year and interval month).; line 1 pos 7;
> 'Project [unresolvedalias((INTERVAL '2' YEAR > INTERVAL '11' MONTH), None)]
> +- OneRowRelation
> {code}
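
Until that is supported, a workaround sketch (assuming Spark 3.2 ANSI interval 
literals) is to write both sides with the same set of fields so they resolve to 
one type:
{code:java}
// Both literals use DAY TO HOUR (resp. YEAR TO MONTH), so each pair shares
// a single interval type and the comparison resolves.
spark.sql("select interval '1 0' day to hour > interval '0 1' day to hour").show()
spark.sql("select interval '2-0' year to month > interval '0-11' year to month").show()
{code}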






[jira] [Assigned] (SPARK-36431) Support comparison of ANSI intervals with different fields

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36431:


Assignee: Apache Spark

> Support comparison of ANSI intervals with different fields
> --
>
> Key: SPARK-36431
> URL: https://issues.apache.org/jira/browse/SPARK-36431
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Support comparison of:
> - a day-time interval with another day-time interval that has different fields
> - a year-month interval with another year-month interval that has different fields.
> The example below shows the issue:
> {code:sql}
> spark-sql> select interval '1' day > interval '1' hour;
> Error in query: cannot resolve '(INTERVAL '1' DAY > INTERVAL '01' HOUR)' due 
> to data type mismatch: differing types in '(INTERVAL '1' DAY > INTERVAL '01' 
> HOUR)' (interval day and interval hour).; line 1 pos 7;
> 'Project [unresolvedalias((INTERVAL '1' DAY > INTERVAL '01' HOUR), None)]
> +- OneRowRelation
> spark-sql> select interval '2' year > interval '11' month;
> Error in query: cannot resolve '(INTERVAL '2' YEAR > INTERVAL '11' MONTH)' 
> due to data type mismatch: differing types in '(INTERVAL '2' YEAR > INTERVAL 
> '11' MONTH)' (interval year and interval month).; line 1 pos 7;
> 'Project [unresolvedalias((INTERVAL '2' YEAR > INTERVAL '11' MONTH), None)]
> +- OneRowRelation
> {code}






[jira] [Commented] (SPARK-36431) Support comparison of ANSI intervals with different fields

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394434#comment-17394434
 ] 

Apache Spark commented on SPARK-36431:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33661

> Support comparison of ANSI intervals with different fields
> --
>
> Key: SPARK-36431
> URL: https://issues.apache.org/jira/browse/SPARK-36431
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Support comparison of:
> - a day-time interval with another day-time interval that has different fields
> - a year-month interval with another year-month interval that has different fields.
> The example below shows the issue:
> {code:sql}
> spark-sql> select interval '1' day > interval '1' hour;
> Error in query: cannot resolve '(INTERVAL '1' DAY > INTERVAL '01' HOUR)' due 
> to data type mismatch: differing types in '(INTERVAL '1' DAY > INTERVAL '01' 
> HOUR)' (interval day and interval hour).; line 1 pos 7;
> 'Project [unresolvedalias((INTERVAL '1' DAY > INTERVAL '01' HOUR), None)]
> +- OneRowRelation
> spark-sql> select interval '2' year > interval '11' month;
> Error in query: cannot resolve '(INTERVAL '2' YEAR > INTERVAL '11' MONTH)' 
> due to data type mismatch: differing types in '(INTERVAL '2' YEAR > INTERVAL 
> '11' MONTH)' (interval year and interval month).; line 1 pos 7;
> 'Project [unresolvedalias((INTERVAL '2' YEAR > INTERVAL '11' MONTH), None)]
> +- OneRowRelation
> {code}






[jira] [Updated] (SPARK-36420) Use `isEmpty` to improve performance in Pregel's superstep

2021-08-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36420:
-
Fix Version/s: (was: 3.3.0)

> Use `isEmpty` to improve performance in Pregel's superstep
> --
>
> Key: SPARK-36420
> URL: https://issues.apache.org/jira/browse/SPARK-36420
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.4.7
>Reporter: xiepengjie
>Priority: Minor
>
> When I was running `Graphx.connectedComponents` with 20+ billion vertices and 
> edges, I found that count is very slow.
> {code:java}
> object Pregel extends Logging {
>   ...
>   def apply[VD: ClassTag, ED: ClassTag, A: ClassTag] (...): Graph[VD, ED] = {
> ...
> // Maybe messages.isEmpty() is better than messages.count()
> var activeMessages = messages.count()
> // Loop
> var prevG: Graph[VD, ED] = null
> var i = 0
> while (activeMessages > 0 && i < maxIterations) {
>   ...
>   activeMessages = messages.count()
>   ...
> }
> ...
> g
>   } // end of apply
> } // end of class Pregel
> {code}
> We only need an action here to check whether any active messages remain, not 
> their exact count, so it's better to use isEmpty than count. I verified this 
> and it worked very well.
>  






[jira] [Commented] (SPARK-36386) Fix DataFrame groupby-expanding to follow pandas 1.3

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394430#comment-17394430
 ] 

Apache Spark commented on SPARK-36386:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/33646

> Fix DataFrame groupby-expanding to follow pandas 1.3
> 
>
> Key: SPARK-36386
> URL: https://issues.apache.org/jira/browse/SPARK-36386
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>







[jira] [Commented] (SPARK-36386) Fix DataFrame groupby-expanding to follow pandas 1.3

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394429#comment-17394429
 ] 

Apache Spark commented on SPARK-36386:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/33646

> Fix DataFrame groupby-expanding to follow pandas 1.3
> 
>
> Key: SPARK-36386
> URL: https://issues.apache.org/jira/browse/SPARK-36386
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>







[jira] [Assigned] (SPARK-36386) Fix DataFrame groupby-expanding to follow pandas 1.3

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36386:


Assignee: (was: Apache Spark)

> Fix DataFrame groupby-expanding to follow pandas 1.3
> 
>
> Key: SPARK-36386
> URL: https://issues.apache.org/jira/browse/SPARK-36386
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>







[jira] [Assigned] (SPARK-36386) Fix DataFrame groupby-expanding to follow pandas 1.3

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36386:


Assignee: Apache Spark

> Fix DataFrame groupby-expanding to follow pandas 1.3
> 
>
> Key: SPARK-36386
> URL: https://issues.apache.org/jira/browse/SPARK-36386
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-36421) Validate all SQL configs to prevent from wrong use for ConfigEntry

2021-08-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36421:


Assignee: Kent Yao

> Validate all SQL configs to prevent from wrong use for ConfigEntry
> --
>
> Key: SPARK-36421
> URL: https://issues.apache.org/jira/browse/SPARK-36421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.2.0, 3.3.0
>
>
> An entry such as ConfigEntry(key=spark.sql.hive.metastore.version, 
> defaultValue=2.3.7, doc=Version)
> should not leak into the generated docs or the output of the SET -v command.
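
For illustration, the shape a well-formed entry is expected to have (buildConf is 
Spark's internal config builder, so this is a hypothetical sketch rather than the 
actual definition):
{code:java}
// The .doc(...) text is what surfaces in the docs and in SET -v output, so
// it must be a real description string, never a ConfigEntry's toString.
val HIVE_METASTORE_VERSION = buildConf("spark.sql.hive.metastore.version")
  .doc("Version of the Hive metastore client.")
  .stringConf
  .createWithDefault("2.3.7")
{code}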






[jira] [Updated] (SPARK-36421) Validate all SQL configs to prevent from wrong use for ConfigEntry

2021-08-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36421:
-
Fix Version/s: 3.3.0
   3.2.0

> Validate all SQL configs to prevent from wrong use for ConfigEntry
> --
>
> Key: SPARK-36421
> URL: https://issues.apache.org/jira/browse/SPARK-36421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Kent Yao
>Priority: Minor
> Fix For: 3.2.0, 3.3.0
>
>
> An entry such as ConfigEntry(key=spark.sql.hive.metastore.version, 
> defaultValue=2.3.7, doc=Version)
> should not leak into the generated docs or the output of the SET -v command.






[jira] [Resolved] (SPARK-36421) Validate all SQL configs to prevent from wrong use for ConfigEntry

2021-08-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36421.
--
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/33647

> Validate all SQL configs to prevent from wrong use for ConfigEntry
> --
>
> Key: SPARK-36421
> URL: https://issues.apache.org/jira/browse/SPARK-36421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Kent Yao
>Priority: Minor
> Fix For: 3.2.0, 3.3.0
>
>
> An entry such as ConfigEntry(key=spark.sql.hive.metastore.version, 
> defaultValue=2.3.7, doc=Version)
> should not leak into the generated docs or the output of the SET -v command.






[jira] [Resolved] (SPARK-36441) Downloading lintr dependencies fail on GA

2021-08-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36441.
--
Fix Version/s: 3.1.3
   3.2.0
   3.0.4
   Resolution: Fixed

Issue resolved by pull request 33660
[https://github.com/apache/spark/pull/33660]

> Downloading lintr dependencies fail on GA
> -
>
> Key: SPARK-36441
> URL: https://issues.apache.org/jira/browse/SPARK-36441
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.0.4, 3.2.0, 3.1.3
>
>
> Downloading lintr dependencies on GA fails.
> I re-triggered the GA job but it still fails with the same error.
> {code}
>  * installing *source* package ‘devtools’ ...
> ** package ‘devtools’ successfully unpacked and MD5 sums checked
> ** using staged installation
> ** R
> ** inst
> ** byte-compile and prepare package for lazy loading
> ** help
> *** installing help indices
> *** copying figures
> ** building package indices
> ** installing vignettes
> ** testing if installed package can be loaded from temporary location
> ** testing if installed package can be loaded from final location
> ** testing if installed package keeps a record of temporary installation path
> * DONE (devtools)
> The downloaded source packages are in
>   ‘/tmp/Rtmpv53Ix4/downloaded_packages’
> Using bundled GitHub PAT. Please add your own PAT to the env var `GITHUB_PAT`
> Error: Failed to install 'unknown package' from GitHub:
>   HTTP error 401.
>   Bad credentials
>   Rate limit remaining: 59/60
>   Rate limit reset at: 2021-08-06 01:37:46 UTC
>   
> Execution halted
> Error: Process completed with exit code 1.
> {code}
> https://github.com/apache/spark/runs/3257853825






[jira] [Assigned] (SPARK-36441) Downloading lintr dependencies fail on GA

2021-08-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36441:


Assignee: Kousuke Saruta

> Downloading lintr dependencies fail on GA
> -
>
> Key: SPARK-36441
> URL: https://issues.apache.org/jira/browse/SPARK-36441
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Downloading lintr dependencies on GA fails.
> I re-triggered the GA job but it still fails with the same error.
> {code}
>  * installing *source* package ‘devtools’ ...
> ** package ‘devtools’ successfully unpacked and MD5 sums checked
> ** using staged installation
> ** R
> ** inst
> ** byte-compile and prepare package for lazy loading
> ** help
> *** installing help indices
> *** copying figures
> ** building package indices
> ** installing vignettes
> ** testing if installed package can be loaded from temporary location
> ** testing if installed package can be loaded from final location
> ** testing if installed package keeps a record of temporary installation path
> * DONE (devtools)
> The downloaded source packages are in
>   ‘/tmp/Rtmpv53Ix4/downloaded_packages’
> Using bundled GitHub PAT. Please add your own PAT to the env var `GITHUB_PAT`
> Error: Failed to install 'unknown package' from GitHub:
>   HTTP error 401.
>   Bad credentials
>   Rate limit remaining: 59/60
>   Rate limit reset at: 2021-08-06 01:37:46 UTC
>   
> Execution halted
> Error: Process completed with exit code 1.
> {code}
> https://github.com/apache/spark/runs/3257853825






[jira] [Assigned] (SPARK-36441) Downloading lintr dependencies fail on GA

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36441:


Assignee: Apache Spark

> Downloading lintr dependencies fail on GA
> -
>
> Key: SPARK-36441
> URL: https://issues.apache.org/jira/browse/SPARK-36441
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> Downloading lintr dependencies on GA fails.
> I re-triggered the GA job but it still fails with the same error.
> {code}
>  * installing *source* package ‘devtools’ ...
> ** package ‘devtools’ successfully unpacked and MD5 sums checked
> ** using staged installation
> ** R
> ** inst
> ** byte-compile and prepare package for lazy loading
> ** help
> *** installing help indices
> *** copying figures
> ** building package indices
> ** installing vignettes
> ** testing if installed package can be loaded from temporary location
> ** testing if installed package can be loaded from final location
> ** testing if installed package keeps a record of temporary installation path
> * DONE (devtools)
> The downloaded source packages are in
>   ‘/tmp/Rtmpv53Ix4/downloaded_packages’
> Using bundled GitHub PAT. Please add your own PAT to the env var `GITHUB_PAT`
> Error: Failed to install 'unknown package' from GitHub:
>   HTTP error 401.
>   Bad credentials
>   Rate limit remaining: 59/60
>   Rate limit reset at: 2021-08-06 01:37:46 UTC
>   
> Execution halted
> Error: Process completed with exit code 1.
> {code}
> https://github.com/apache/spark/runs/3257853825






[jira] [Commented] (SPARK-36441) Downloading lintr dependencies fail on GA

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394413#comment-17394413
 ] 

Apache Spark commented on SPARK-36441:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33660

> Downloading lintr dependencies fail on GA
> -
>
> Key: SPARK-36441
> URL: https://issues.apache.org/jira/browse/SPARK-36441
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Priority: Major
>
> Downloading lintr dependencies on GA fails.
> I re-triggered the GA job but it still fails with the same error.
> {code}
>  * installing *source* package ‘devtools’ ...
> ** package ‘devtools’ successfully unpacked and MD5 sums checked
> ** using staged installation
> ** R
> ** inst
> ** byte-compile and prepare package for lazy loading
> ** help
> *** installing help indices
> *** copying figures
> ** building package indices
> ** installing vignettes
> ** testing if installed package can be loaded from temporary location
> ** testing if installed package can be loaded from final location
> ** testing if installed package keeps a record of temporary installation path
> * DONE (devtools)
> The downloaded source packages are in
>   ‘/tmp/Rtmpv53Ix4/downloaded_packages’
> Using bundled GitHub PAT. Please add your own PAT to the env var `GITHUB_PAT`
> Error: Failed to install 'unknown package' from GitHub:
>   HTTP error 401.
>   Bad credentials
>   Rate limit remaining: 59/60
>   Rate limit reset at: 2021-08-06 01:37:46 UTC
>   
> Execution halted
> Error: Process completed with exit code 1.
> {code}
> https://github.com/apache/spark/runs/3257853825






[jira] [Assigned] (SPARK-36441) Downloading lintr dependencies fail on GA

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36441:


Assignee: (was: Apache Spark)

> Downloading lintr dependencies fail on GA
> -
>
> Key: SPARK-36441
> URL: https://issues.apache.org/jira/browse/SPARK-36441
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Priority: Major
>
> Downloading lintr dependencies on GA fails.
> I re-triggered the GA job but it still fails with the same error.
> {code}
>  * installing *source* package ‘devtools’ ...
> ** package ‘devtools’ successfully unpacked and MD5 sums checked
> ** using staged installation
> ** R
> ** inst
> ** byte-compile and prepare package for lazy loading
> ** help
> *** installing help indices
> *** copying figures
> ** building package indices
> ** installing vignettes
> ** testing if installed package can be loaded from temporary location
> ** testing if installed package can be loaded from final location
> ** testing if installed package keeps a record of temporary installation path
> * DONE (devtools)
> The downloaded source packages are in
>   ‘/tmp/Rtmpv53Ix4/downloaded_packages’
> Using bundled GitHub PAT. Please add your own PAT to the env var `GITHUB_PAT`
> Error: Failed to install 'unknown package' from GitHub:
>   HTTP error 401.
>   Bad credentials
>   Rate limit remaining: 59/60
>   Rate limit reset at: 2021-08-06 01:37:46 UTC
>   
> Execution halted
> Error: Process completed with exit code 1.
> {code}
> https://github.com/apache/spark/runs/3257853825






[jira] [Created] (SPARK-36441) Downloading lintr dependencies fail on GA

2021-08-05 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-36441:
--

 Summary: Downloading lintr dependencies fail on GA
 Key: SPARK-36441
 URL: https://issues.apache.org/jira/browse/SPARK-36441
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.3.0
Reporter: Kousuke Saruta


Downloading lintr dependencies on GA fails.
I re-triggered the GA job but it still fails with the same error.

{code}
 * installing *source* package ‘devtools’ ...
** package ‘devtools’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
*** copying figures
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (devtools)

The downloaded source packages are in
‘/tmp/Rtmpv53Ix4/downloaded_packages’
Using bundled GitHub PAT. Please add your own PAT to the env var `GITHUB_PAT`
Error: Failed to install 'unknown package' from GitHub:
  HTTP error 401.
  Bad credentials

  Rate limit remaining: 59/60
  Rate limit reset at: 2021-08-06 01:37:46 UTC

  
Execution halted
Error: Process completed with exit code 1.
{code}

https://github.com/apache/spark/runs/3257853825






[jira] [Updated] (SPARK-36440) Spark3 fails to read hive table with mixed format

2021-08-05 Thread Jason Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Xu updated SPARK-36440:
-
Description: 
Spark 3 fails to read a Hive table with mixed formats through the Hive serde; this 
is a regression compared to Spark 2.4.

Reproduction steps:
 1. In a Spark 3 (3.0 or 3.1) spark-shell:
{code:java}
scala> spark.sql("create table tmp.test_table (id int, name string) partitioned 
by (pt int) stored as rcfile")

scala> spark.sql("insert into tmp.test_table partition (pt = 1) values (1, 'Alice'), 
(2, 'Bob')")
{code}
2. Run a Hive command to change the table file format (from RCFile to Parquet):
{code:java}
hive (default)> alter table tmp.test_table set fileformat Parquet;
{code}
3. Try to read the partition (still in RCFile format) with the Hive serde from the Spark shell:
{code:java}
scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

scala> spark.sql("select * from tmp.test_table where pt=1").show{code}
Exception (file path anonymized):
{code:java}
Caused by: java.lang.RuntimeException: 
s3a:///data/part-0-22112178-5dd7-4065-89d7-2ee550296909-c000 is not a 
Parquet file. expected magic number at tail [80, 65, 82, 49] but found [5, 96, 
1, -33]
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:524)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:433)
  at 
org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:79)
  at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
  at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
  at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:286)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:285)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  {code}
 

 

 

  was:
Spark 3 fails to read a Hive table with mixed formats through the Hive serde; this 
is a regression compared to Spark 2.4.

Reproduction steps:
 1. In a Spark 3 (3.0 or 3.1) spark-shell:
{code:java}
scala> spark.sql("create table tmp.test_table (id int, name string) partitioned 
by (pt int) stored as rcfile")

scala> spark.sql("insert into tmp.test_table partition (pt = 1) values (1, 'Alice'), 
(2, 'Bob')")
{code}
2. Run a Hive command to change the table format (from RCFile to Parquet):
{code:java}
hive (default)> alter table tmp.test_table set fileformat Parquet;
{code}
3. Try to read the partition (still in RCFile format) with the Hive serde from the Spark shell:
{code:java}
scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

scala> spark.sql("select * from tmp.test_table where pt=1").show{code}
Exception (file path anonymized):
{code:java}
Caused by: java.lang.RuntimeException: 
s3a:///data/part-0-22112178-5dd7-4065-89d7-2ee550296909-c000 is not a 
Parquet file. expected magic number at tail [80, 65, 82, 49] but found [5, 96, 
1, -33]
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:524)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:433)
  at 
org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:79)
  at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
  at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
  at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:286)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:285)
  ...{code}

[jira] [Updated] (SPARK-36440) Spark3 fails to read hive table with mixed format

2021-08-05 Thread Jason Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Xu updated SPARK-36440:
-
Description: 
Spark 3 fails to read a Hive table with mixed formats using the Hive SerDe; this 
is a regression compared to Spark 2.4.

Reproduction steps:
 1. In a Spark 3 (3.0 or 3.1) spark-shell:
{code:java}
scala> spark.sql("create table tmp.test_table (id int, name string) partitioned 
by (pt int) stored as rcfile")

scala> spark.sql("insert into tmp.test_table (pt = 1) values (1, 'Alice'), (2, 
'Bob')")
{code}
2. Run a Hive command to change the table file format (from RCFile to Parquet).
{code:java}
hive (default)> alter table tmp.test_table set fileformat Parquet;
{code}
3. Try to read the partition (still in RCFile format) with the Hive SerDe using the Spark shell:
{code:java}
scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

scala> spark.sql("select * from tmp.test_table where pt=1").show{code}
Exception (file path anonymized):
{code:java}
Caused by: java.lang.RuntimeException: 
s3a:///data/part-0-22112178-5dd7-4065-89d7-2ee550296909-c000 is not a 
Parquet file. expected magic number at tail [80, 65, 82, 49] but found [5, 96, 
1, -33]
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:524)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:433)
  at 
org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:79)
  at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
  at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
  at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:286)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:285)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  {code}
 

 

 

  was:
Spark 3 fails to read a Hive table with mixed formats using the Hive SerDe; this 
is a regression compared to Spark 2.4.

Reproduction steps:
1. In a Spark 3 (3.0 or 3.1) spark-shell:
{code:java}
scala> spark.sql("create table tmp.test_table (id int, name string) partitioned 
by (pt int) stored as rcfile")

scala> spark.sql("insert into tmp.test_table (pt = 1) values (1, 'Alice'), (2, 
'Bob')"
{code}
2. Run a Hive command to change the table file format (from RCFile to Parquet).
{code:java}
hive (default)> alter table tmp.test_table set fileformat Parquet;
{code}
3. Try to read the partition (still in RCFile format) with the Hive SerDe using the Spark shell:
{code:java}
scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

scala> spark.sql("select * from tmp.test_table where pt=1").show{code}
Exception (file path anonymized):
{code:java}
Caused by: java.lang.RuntimeException: 
s3a:///data/part-0-22112178-5dd7-4065-89d7-2ee550296909-c000 is not a 
Parquet file. expected magic number at tail [80, 65, 82, 49] but found [5, 96, 
1, -33]
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:524)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:433)
  at 
org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:79)
  at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
  at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
  at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:286)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:285)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243)
  ...{code}

[jira] [Created] (SPARK-36440) Spark3 fails to read hive table with mixed format

2021-08-05 Thread Jason Xu (Jira)
Jason Xu created SPARK-36440:


 Summary: Spark3 fails to read hive table with mixed format
 Key: SPARK-36440
 URL: https://issues.apache.org/jira/browse/SPARK-36440
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2, 3.1.1, 3.0.0
Reporter: Jason Xu


Spark 3 fails to read a Hive table with mixed formats using the Hive SerDe; this 
is a regression compared to Spark 2.4.

Reproduction steps:
1. In a Spark 3 (3.0 or 3.1) spark-shell:
{code:java}
scala> spark.sql("create table tmp.test_table (id int, name string) partitioned 
by (pt int) stored as rcfile")

scala> spark.sql("insert into tmp.test_table (pt = 1) values (1, 'Alice'), (2, 
'Bob')"
{code}
2. Run a Hive command to change the table file format (from RCFile to Parquet).
{code:java}
hive (default)> alter table tmp.test_table set fileformat Parquet;
{code}
3. Try to read the partition (still in RCFile format) with the Hive SerDe using the Spark shell:
{code:java}
scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

scala> spark.sql("select * from tmp.test_table where pt=1").show{code}
Exception (file path anonymized):
{code:java}
Caused by: java.lang.RuntimeException: 
s3a:///data/part-0-22112178-5dd7-4065-89d7-2ee550296909-c000 is not a 
Parquet file. expected magic number at tail [80, 65, 82, 49] but found [5, 96, 
1, -33]
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:524)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
  at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:433)
  at 
org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:79)
  at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
  at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
  at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:286)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:285)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  {code}
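
(Illustrative addition, not part of the original report: Hive stores the input/output format per partition, which is why Hive can still read the pt=1 partition after the table-level ALTER; the partition metadata confirms the mixed state. A minimal sketch, assuming the table created in the steps above:)
{code:java}
// Exact row labels can vary across Spark/Hive versions.
scala> val desc = spark.sql("describe formatted tmp.test_table partition (pt = 1)")
scala> desc.filter("col_name in ('InputFormat', 'OutputFormat', 'Serde Library')").show(false)
{code}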
 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36439) Implement DataFrame.join on key column

2021-08-05 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36439:


 Summary: Implement DataFrame.join on key column
 Key: SPARK-36439
 URL: https://issues.apache.org/jira/browse/SPARK-36439
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36438) Support list-like Python objects for Series comparison

2021-08-05 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36438:


 Summary: Support list-like Python objects for Series comparison
 Key: SPARK-36438
 URL: https://issues.apache.org/jira/browse/SPARK-36438
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36437) Enable binary operations with list-like Python objects

2021-08-05 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36437:


 Summary: Enable binary operations with list-like Python objects
 Key: SPARK-36437
 URL: https://issues.apache.org/jira/browse/SPARK-36437
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36436) Implement 'weights' and 'axis' in sample at DataFrame and Series

2021-08-05 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36436:


 Summary: Implement 'weights' and 'axis' in sample at DataFrame and 
Series
 Key: SPARK-36436
 URL: https://issues.apache.org/jira/browse/SPARK-36436
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36435) Implement MultiIndex.equal_levels

2021-08-05 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36435:


 Summary: Implement MultiIndex.equal_levels
 Key: SPARK-36435
 URL: https://issues.apache.org/jira/browse/SPARK-36435
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36434) Implement DataFrame.lookup

2021-08-05 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36434:


 Summary: Implement DataFrame.lookup
 Key: SPARK-36434
 URL: https://issues.apache.org/jira/browse/SPARK-36434
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36433) Logs should show correct URL of where HistoryServer is started

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36433:


Assignee: (was: Apache Spark)

> Logs should show correct URL of where HistoryServer is started
> --
>
> Key: SPARK-36433
> URL: https://issues.apache.org/jira/browse/SPARK-36433
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Thejdeep Gudivada
>Priority: Major
>
> Due to a recent refactoring of the WebUI bind() code, the log message that 
> prints the bound host and port was moved, so the logged info is incorrect.
>  
> Example log - 21/08/05 10:47:38 INFO HistoryServer: Bound HistoryServer to 
> 0.0.0.0, and started at :-1
>  
> Note that the port above is incorrect.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36433) Logs should show correct URL of where HistoryServer is started

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36433:


Assignee: Apache Spark

> Logs should show correct URL of where HistoryServer is started
> --
>
> Key: SPARK-36433
> URL: https://issues.apache.org/jira/browse/SPARK-36433
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Thejdeep Gudivada
>Assignee: Apache Spark
>Priority: Major
>
> Due to a recent refactoring of the WebUI bind() code, the log message that 
> prints the bound host and port was moved, so the logged info is incorrect.
>  
> Example log - 21/08/05 10:47:38 INFO HistoryServer: Bound HistoryServer to 
> 0.0.0.0, and started at :-1
>  
> Note that the port above is incorrect.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36433) Logs should show correct URL of where HistoryServer is started

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394316#comment-17394316
 ] 

Apache Spark commented on SPARK-36433:
--

User 'thejdeep' has created a pull request for this issue:
https://github.com/apache/spark/pull/33659

> Logs should show correct URL of where HistoryServer is started
> --
>
> Key: SPARK-36433
> URL: https://issues.apache.org/jira/browse/SPARK-36433
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Thejdeep Gudivada
>Priority: Major
>
> Due to a recent refactoring of the WebUI bind() code, the log message that 
> prints the bound host and port was moved, so the logged info is incorrect.
>  
> Example log - 21/08/05 10:47:38 INFO HistoryServer: Bound HistoryServer to 
> 0.0.0.0, and started at :-1
>  
> Note that the port above is incorrect.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36433) Logs should show correct URL of where HistoryServer is started

2021-08-05 Thread Thejdeep Gudivada (Jira)
Thejdeep Gudivada created SPARK-36433:
-

 Summary: Logs should show correct URL of where HistoryServer is 
started
 Key: SPARK-36433
 URL: https://issues.apache.org/jira/browse/SPARK-36433
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.2.0
Reporter: Thejdeep Gudivada


Due to a recent refactoring of the WebUI bind() code, the log message that prints 
the bound host and port was moved, so the logged info is incorrect.

 

Example log - 21/08/05 10:47:38 INFO HistoryServer: Bound HistoryServer to 
0.0.0.0, and started at :-1

 

Note that the port above is incorrect.
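
(A hypothetical sketch of the ordering problem, not the actual Spark code: if the "started at" line is logged before the socket is bound, the port field still holds its -1 sentinel, which matches the log above.)
{code:java}
class WebUIStub {
  private var boundPort: Int = -1            // sentinel until bind() succeeds
  def bind(): Unit = { boundPort = 18080 }   // binding assigns the real port

  // Buggy ordering: logs before bind(), so it prints -1.
  def startBuggy(): Unit = {
    println(s"Bound HistoryServer to 0.0.0.0, and started at :$boundPort")
    bind()
  }

  // Fixed ordering: bind first, then log the real port.
  def startFixed(): Unit = {
    bind()
    println(s"Bound HistoryServer to 0.0.0.0, and started at :$boundPort")
  }
}
{code}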



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36393) Try to raise memory and parallelism again for GA

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394311#comment-17394311
 ] 

Apache Spark commented on SPARK-36393:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/33658

> Try to raise memory and parallelism again for GA
> 
>
> Key: SPARK-36393
> URL: https://issues.apache.org/jira/browse/SPARK-36393
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> According to the feedback from GitHub, the change causing the memory issue has 
> been rolled back. We can try to raise memory and parallelism again for GA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36393) Try to raise memory and parallelism again for GA

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394308#comment-17394308
 ] 

Apache Spark commented on SPARK-36393:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/33658

> Try to raise memory and parallelism again for GA
> 
>
> Key: SPARK-36393
> URL: https://issues.apache.org/jira/browse/SPARK-36393
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> According to the feedback from GitHub, the change causing the memory issue has 
> been rolled back. We can try to raise memory and parallelism again for GA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36393) Try to raise memory and parallelism again for GA

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394306#comment-17394306
 ] 

Apache Spark commented on SPARK-36393:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/33657

> Try to raise memory and parallelism again for GA
> 
>
> Key: SPARK-36393
> URL: https://issues.apache.org/jira/browse/SPARK-36393
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> According to the feedback from GitHub, the change causing the memory issue has 
> been rolled back. We can try to raise memory and parallelism again for GA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36393) Try to raise memory and parallelism again for GA

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394305#comment-17394305
 ] 

Apache Spark commented on SPARK-36393:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/33657

> Try to raise memory and parallelism again for GA
> 
>
> Key: SPARK-36393
> URL: https://issues.apache.org/jira/browse/SPARK-36393
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> According to the feedback from GitHub, the change causing the memory issue has 
> been rolled back. We can try to raise memory and parallelism again for GA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36432) Upgrade Jetty version to 9.4.43

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394290#comment-17394290
 ] 

Apache Spark commented on SPARK-36432:
--

User 'this' has created a pull request for this issue:
https://github.com/apache/spark/pull/33656

> Upgrade Jetty version to 9.4.43
> ---
>
> Key: SPARK-36432
> URL: https://issues.apache.org/jira/browse/SPARK-36432
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Sajith A
>Priority: Minor
>
> Upgrade Jetty version to 9.4.43.v20210629 in the current Spark master to fix 
> the vulnerability https://nvd.nist.gov/vuln/detail/CVE-2021-34429.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36432) Upgrade Jetty version to 9.4.43

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36432:


Assignee: Apache Spark

> Upgrade Jetty version to 9.4.43
> ---
>
> Key: SPARK-36432
> URL: https://issues.apache.org/jira/browse/SPARK-36432
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Sajith A
>Assignee: Apache Spark
>Priority: Minor
>
> Upgrade Jetty version to 9.4.43.v20210629 in the current Spark master to fix 
> the vulnerability https://nvd.nist.gov/vuln/detail/CVE-2021-34429.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36432) Upgrade Jetty version to 9.4.43

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394289#comment-17394289
 ] 

Apache Spark commented on SPARK-36432:
--

User 'this' has created a pull request for this issue:
https://github.com/apache/spark/pull/33656

> Upgrade Jetty version to 9.4.43
> ---
>
> Key: SPARK-36432
> URL: https://issues.apache.org/jira/browse/SPARK-36432
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Sajith A
>Priority: Minor
>
> Upgrade Jetty version to 9.4.43.v20210629 in the current Spark master to fix 
> the vulnerability https://nvd.nist.gov/vuln/detail/CVE-2021-34429.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36432) Upgrade Jetty version to 9.4.43

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36432:


Assignee: (was: Apache Spark)

> Upgrade Jetty version to 9.4.43
> ---
>
> Key: SPARK-36432
> URL: https://issues.apache.org/jira/browse/SPARK-36432
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Sajith A
>Priority: Minor
>
> Upgrade Jetty version to 9.4.43.v20210629 in the current Spark master to fix 
> the vulnerability https://nvd.nist.gov/vuln/detail/CVE-2021-34429.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36432) Upgrade Jetty version to 9.4.43

2021-08-05 Thread Sajith A (Jira)
Sajith A created SPARK-36432:


 Summary: Upgrade Jetty version to 9.4.43
 Key: SPARK-36432
 URL: https://issues.apache.org/jira/browse/SPARK-36432
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.3.0
Reporter: Sajith A


Upgrade Jetty version to 9.4.43.v20210629 in the current Spark master to fix the 
vulnerability https://nvd.nist.gov/vuln/detail/CVE-2021-34429.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation

2021-08-05 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394265#comment-17394265
 ] 

Holden Karau commented on SPARK-24815:
--

cc [~tdas] for thoughts?

> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in. It requests 
> more executors if the task backlog is a certain size, and removes executors 
> if they are idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when 
> Core's dynamic allocation is enabled
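
(For context, a minimal sketch using real Spark settings with illustrative values: this is how a job opts into dynamic allocation today, and a structured streaming query on such a session gets the batch backlog/idle-timeout algorithm described above.)
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("streaming-with-batch-dra")
  .config("spark.dynamicAllocation.enabled", "true")            // enables the batch algorithm
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s") // idle executors get removed
  .config("spark.shuffle.service.enabled", "true")              // needed so executors can be released safely
  .getOrCreate()
{code}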



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36431) Support comparison of ANSI intervals with different fields

2021-08-05 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394235#comment-17394235
 ] 

angerszhu commented on SPARK-36431:
---

Working on this 

> Support comparison of ANSI intervals with different fields
> --
>
> Key: SPARK-36431
> URL: https://issues.apache.org/jira/browse/SPARK-36431
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Support comparison of
> - a day-time interval with another day-time interval which has different 
> fields
> - a year-month interval with another year-month interval where fields are 
> different.
> The example below shows the issue:
> {code:sql}
> spark-sql> select interval '1' day > interval '1' hour;
> Error in query: cannot resolve '(INTERVAL '1' DAY > INTERVAL '01' HOUR)' due 
> to data type mismatch: differing types in '(INTERVAL '1' DAY > INTERVAL '01' 
> HOUR)' (interval day and interval hour).; line 1 pos 7;
> 'Project [unresolvedalias((INTERVAL '1' DAY > INTERVAL '01' HOUR), None)]
> +- OneRowRelation
> spark-sql> select interval '2' year > interval '11' month;
> Error in query: cannot resolve '(INTERVAL '2' YEAR > INTERVAL '11' MONTH)' 
> due to data type mismatch: differing types in '(INTERVAL '2' YEAR > INTERVAL 
> '11' MONTH)' (interval year and interval month).; line 1 pos 7;
> 'Project [unresolvedalias((INTERVAL '2' YEAR > INTERVAL '11' MONTH), None)]
> +- OneRowRelation
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36431) Support comparison of ANSI intervals with different fields

2021-08-05 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-36431:
-
Summary: Support comparison of ANSI intervals with different fields  (was: 
Support comparison of ANSI interval with different fields)

> Support comparison of ANSI intervals with different fields
> --
>
> Key: SPARK-36431
> URL: https://issues.apache.org/jira/browse/SPARK-36431
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Support comparison of
> - a day-time interval with another day-time interval which has different 
> fields
> - a year-month interval with another year-month interval where fields are 
> different.
> The example below shows the issue:
> {code:sql}
> spark-sql> select interval '1' day > interval '1' hour;
> Error in query: cannot resolve '(INTERVAL '1' DAY > INTERVAL '01' HOUR)' due 
> to data type mismatch: differing types in '(INTERVAL '1' DAY > INTERVAL '01' 
> HOUR)' (interval day and interval hour).; line 1 pos 7;
> 'Project [unresolvedalias((INTERVAL '1' DAY > INTERVAL '01' HOUR), None)]
> +- OneRowRelation
> spark-sql> select interval '2' year > interval '11' month;
> Error in query: cannot resolve '(INTERVAL '2' YEAR > INTERVAL '11' MONTH)' 
> due to data type mismatch: differing types in '(INTERVAL '2' YEAR > INTERVAL 
> '11' MONTH)' (interval year and interval month).; line 1 pos 7;
> 'Project [unresolvedalias((INTERVAL '2' YEAR > INTERVAL '11' MONTH), None)]
> +- OneRowRelation
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36431) Support comparison of ANSI interval with different fields

2021-08-05 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394230#comment-17394230
 ] 

Max Gekk commented on SPARK-36431:
--

FYI [~cloud_fan], [~sarutak], [~angerszhuuu], [~beliefer]. Please leave a 
comment here if you would like to work on this.

> Support comparison of ANSI interval with different fields
> -
>
> Key: SPARK-36431
> URL: https://issues.apache.org/jira/browse/SPARK-36431
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Support comparison of
> - a day-time interval with another day-time interval which has different 
> fields
> - a year-month interval with another year-month interval where fields are 
> different.
> The example below shows the issue:
> {code:sql}
> spark-sql> select interval '1' day > interval '1' hour;
> Error in query: cannot resolve '(INTERVAL '1' DAY > INTERVAL '01' HOUR)' due 
> to data type mismatch: differing types in '(INTERVAL '1' DAY > INTERVAL '01' 
> HOUR)' (interval day and interval hour).; line 1 pos 7;
> 'Project [unresolvedalias((INTERVAL '1' DAY > INTERVAL '01' HOUR), None)]
> +- OneRowRelation
> spark-sql> select interval '2' year > interval '11' month;
> Error in query: cannot resolve '(INTERVAL '2' YEAR > INTERVAL '11' MONTH)' 
> due to data type mismatch: differing types in '(INTERVAL '2' YEAR > INTERVAL 
> '11' MONTH)' (interval year and interval month).; line 1 pos 7;
> 'Project [unresolvedalias((INTERVAL '2' YEAR > INTERVAL '11' MONTH), None)]
> +- OneRowRelation
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36431) Support comparison of ANSI interval with different fields

2021-08-05 Thread Max Gekk (Jira)
Max Gekk created SPARK-36431:


 Summary: Support comparison of ANSI interval with different fields
 Key: SPARK-36431
 URL: https://issues.apache.org/jira/browse/SPARK-36431
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk


Support comparison of
- a day-time interval with another day-time interval which has different fields
- a year-month interval with another year-month interval where fields are 
different.

The example below shows the issue:

{code:sql}
spark-sql> select interval '1' day > interval '1' hour;
Error in query: cannot resolve '(INTERVAL '1' DAY > INTERVAL '01' HOUR)' due to 
data type mismatch: differing types in '(INTERVAL '1' DAY > INTERVAL '01' 
HOUR)' (interval day and interval hour).; line 1 pos 7;
'Project [unresolvedalias((INTERVAL '1' DAY > INTERVAL '01' HOUR), None)]
+- OneRowRelation

spark-sql> select interval '2' year > interval '11' month;
Error in query: cannot resolve '(INTERVAL '2' YEAR > INTERVAL '11' MONTH)' due 
to data type mismatch: differing types in '(INTERVAL '2' YEAR > INTERVAL '11' 
MONTH)' (interval year and interval month).; line 1 pos 7;
'Project [unresolvedalias((INTERVAL '2' YEAR > INTERVAL '11' MONTH), None)]
+- OneRowRelation
{code}
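
(Illustrative aside, not from the ticket: comparisons already resolve when both sides carry the same interval field; only mixed fields fail as shown above.)
{code:java}
scala> spark.sql("select interval '25' hour > interval '1' hour").show()  // resolves, returns true
scala> spark.sql("select interval '1' day > interval '1' hour").show()    // fails to resolve, as quoted above
{code}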




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36414) Disable timeout for BroadcastQueryStageExec in AQE

2021-08-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394140#comment-17394140
 ] 

Dongjoon Hyun commented on SPARK-36414:
---

Thank you, [~Qin Yao] and [~cloud_fan]. I'm collecting this under SPARK-33828 to 
give it more visibility.

> Disable timeout for BroadcastQueryStageExec in AQE
> --
>
> Key: SPARK-36414
> URL: https://issues.apache.org/jira/browse/SPARK-36414
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: image-2021-08-04-18-53-44-879.png
>
>
> This reverts SPARK-31475, as there are always more concurrent jobs running in 
> AQE mode, especially when running multiple queries at the same time. 
> Currently, the broadcast timeout does not measure only the 
> BroadcastQueryStageExec itself; it also counts the time spent waiting to be 
> scheduled. If all resources are occupied materializing other stages, the 
> stage times out without ever getting a chance to run.
>  
> !image-2021-08-04-18-53-44-879.png!
>  
> The default value is 300s, and it's hard to adjust the timeout for AQE mode. 
> Usually, you need an extremely large number for real-world cases. As you can 
> see in the example above, the timeout we used was 1800s, and it would 
> obviously need 3x more than that.
>  
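
(Aside, assuming the timeout under discussion is the config spark.sql.broadcastTimeout, whose default is 300s: raising it is the current workaround while a broadcast stage can sit queued behind other stages.)
{code:java}
scala> spark.conf.set("spark.sql.broadcastTimeout", "1800")  // the 1800s value mentioned above
{code}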



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36414) Disable timeout for BroadcastQueryStageExec in AQE

2021-08-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36414:
--
Parent: SPARK-33828
Issue Type: Sub-task  (was: Improvement)

> Disable timeout for BroadcastQueryStageExec in AQE
> --
>
> Key: SPARK-36414
> URL: https://issues.apache.org/jira/browse/SPARK-36414
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: image-2021-08-04-18-53-44-879.png
>
>
> This reverts SPARK-31475, as there are always more concurrent jobs running in 
> AQE mode, especially when running multiple queries at the same time. 
> Currently, the broadcast timeout does not measure only the 
> BroadcastQueryStageExec itself; it also counts the time spent waiting to be 
> scheduled. If all resources are occupied materializing other stages, the 
> stage times out without ever getting a chance to run.
>  
> !image-2021-08-04-18-53-44-879.png!
>  
> The default value is 300s, and it's hard to adjust the timeout for AQE mode. 
> Usually, you need an extremely large number for real-world cases. As you can 
> see in the example above, the timeout we used was 1800s, and it would 
> obviously need 3x more than that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36430) Adaptively calculate the target size when coalescing shuffle partitions in AQE

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36430:


Assignee: (was: Apache Spark)

> Adaptively calculate the target size when coalescing shuffle partitions in AQE
> --
>
> Key: SPARK-36430
> URL: https://issues.apache.org/jira/browse/SPARK-36430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36430) Adaptively calculate the target size when coalescing shuffle partitions in AQE

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394136#comment-17394136
 ] 

Apache Spark commented on SPARK-36430:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/33655

> Adaptively calculate the target size when coalescing shuffle partitions in AQE
> --
>
> Key: SPARK-36430
> URL: https://issues.apache.org/jira/browse/SPARK-36430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36430) Adaptively calculate the target size when coalescing shuffle partitions in AQE

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36430:


Assignee: Apache Spark

> Adaptively calculate the target size when coalescing shuffle partitions in AQE
> --
>
> Key: SPARK-36430
> URL: https://issues.apache.org/jira/browse/SPARK-36430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36430) Adaptively calculate the target size when coalescing shuffle partitions in AQE

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394134#comment-17394134
 ] 

Apache Spark commented on SPARK-36430:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/33655

> Adaptively calculate the target size when coalescing shuffle partitions in AQE
> --
>
> Key: SPARK-36430
> URL: https://issues.apache.org/jira/browse/SPARK-36430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36430) Adaptively calculate the target size when coalescing shuffle partitions in AQE

2021-08-05 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-36430:
---

 Summary: Adaptively calculate the target size when coalescing 
shuffle partitions in AQE
 Key: SPARK-36430
 URL: https://issues.apache.org/jira/browse/SPARK-36430
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wenchen Fan
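
(The ticket has no description yet; as a hedged sketch of the status quo it targets, these existing settings drive AQE partition coalescing toward a fixed advisory size today:)
{code:java}
scala> spark.conf.set("spark.sql.adaptive.enabled", "true")
scala> spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
// Static advisory target today; the ticket proposes computing it adaptively.
scala> spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
{code}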






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36426) Pin pyzmq to 2.22.0

2021-08-05 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-36426.

Resolution: Invalid

I re-triggered the failed GA job and it finally succeeded, so I'll close this 
issue.

> Pin pyzmq to 2.22.0
> ---
>
> Key: SPARK-36426
> URL: https://issues.apache.org/jira/browse/SPARK-36426
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> GA is failing, and the root cause seems to be the latest release of pyzmq, 
> which was released a few hours ago.
> https://github.com/apache/spark/runs/3250261989#step:11:414
> https://github.com/apache/spark/runs/3250252645?check_suite_focus=true#step:11:417
> https://pypi.org/project/pyzmq/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36414) Disable timeout for BroadcastQueryStageExec in AQE

2021-08-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36414.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33636
[https://github.com/apache/spark/pull/33636]

> Disable timeout for BroadcastQueryStageExec in AQE
> --
>
> Key: SPARK-36414
> URL: https://issues.apache.org/jira/browse/SPARK-36414
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: image-2021-08-04-18-53-44-879.png
>
>
> This reverts SPARK-31475, as there are always more concurrent jobs running in 
> AQE mode, especially when running multiple queries at the same time. 
> Currently, the broadcast timeout does not measure only the 
> BroadcastQueryStageExec itself; it also counts the time spent waiting to be 
> scheduled. If all resources are occupied materializing other stages, the 
> stage times out without ever getting a chance to run.
>  
> !image-2021-08-04-18-53-44-879.png!
>  
> The default value is 300s, and it's hard to adjust the timeout for AQE mode. 
> Usually, you need an extremely large number for real-world cases. As you can 
> see in the example above, the timeout we used was 1800s, and it would 
> obviously need 3x more than that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36414) Disable timeout for BroadcastQueryStageExec in AQE

2021-08-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36414:
---

Assignee: Kent Yao

> Disable timeout for BroadcastQueryStageExec in AQE
> --
>
> Key: SPARK-36414
> URL: https://issues.apache.org/jira/browse/SPARK-36414
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Attachments: image-2021-08-04-18-53-44-879.png
>
>
> This reverts SPARK-31475, as there are always more concurrent jobs running in 
> AQE mode, especially when running multiple queries at the same time. 
> Currently, the broadcast timeout does not measure only the 
> BroadcastQueryStageExec itself; it also counts the time spent waiting to be 
> scheduled. If all resources are occupied materializing other stages, the 
> stage times out without ever getting a chance to run.
>  
> !image-2021-08-04-18-53-44-879.png!
>  
> The default value is 300s, and it's hard to adjust the timeout for AQE mode. 
> Usually, you need an extremely large number for real-world cases. As you can 
> see in the example above, the timeout we used was 1800s, and it would 
> obviously need 3x more than that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36409) Splitting test cases from datetime.sql

2021-08-05 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-36409.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33640
[https://github.com/apache/spark/pull/33640]

> Splitting test cases from datetime.sql
> --
>
> Key: SPARK-36409
> URL: https://issues.apache.org/jira/browse/SPARK-36409
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.3.0
>
>
> Split the test cases related to timestamp_ntz or timestamp_ltz functions from 
> datetime.sql. This is to reduce the size of datetime.sql, which has around 
> 300 cases and will increase in the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36409) Splitting test cases from datetime.sql

2021-08-05 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-36409:
--

Assignee: Wenchen Fan  (was: Gengliang Wang)

> Splitting test cases from datetime.sql
> --
>
> Key: SPARK-36409
> URL: https://issues.apache.org/jira/browse/SPARK-36409
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 3.3.0
>
>
> Split the test cases related to timestamp_ntz or timestamp_ltz functions from 
> datetime.sql. This is to reduce the size of datetime.sql, which has around 
> 300 cases and will increase in the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36353) RemoveNoopOperators should keep output schema

2021-08-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36353:
---

Assignee: angerszhu

> RemoveNoopOperators should keep output schema
> -
>
> Key: SPARK-36353
> URL: https://issues.apache.org/jira/browse/SPARK-36353
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Attachments: image-2021-07-30-17-46-59-196.png
>
>
> !image-2021-07-30-17-46-59-196.png|width=539,height=220!
> [https://github.com/apache/spark/pull/33587]
>  
> Only first level?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36353) RemoveNoopOperators should keep output schema

2021-08-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36353.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33587
[https://github.com/apache/spark/pull/33587]

> RemoveNoopOperators should keep output schema
> -
>
> Key: SPARK-36353
> URL: https://issues.apache.org/jira/browse/SPARK-36353
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: image-2021-07-30-17-46-59-196.png
>
>
> !image-2021-07-30-17-46-59-196.png|width=539,height=220!
> [https://github.com/apache/spark/pull/33587]
>  
> Only first level?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36429) JacksonParser should throw exception when data type unsupported.

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36429:


Assignee: (was: Apache Spark)

> JacksonParser should throw exception when data type unsupported.
> 
>
> Key: SPARK-36429
> URL: https://issues.apache.org/jira/browse/SPARK-36429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, when spark.sql.timestampType=TIMESTAMP_NTZ is set, the behavior 
> differs between from_json and from_csv.
> {code:java}
> -- !query
> select from_json('{"t":"26/October/2015"}', 't Timestamp', 
> map('timestampFormat', 'dd/M/'))
> -- !query schema
> struct>
> -- !query output
> {"t":null}
> {code}
> {code:java}
> -- !query
> select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 
> 'dd/M/'))
> -- !query schema
> struct<>
> -- !query output
> java.lang.Exception
> Unsupported type: timestamp_ntz
> {code}
> We should make from_json throw an exception too.
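
(Illustrative repro from a Spark shell, assuming the quoted behavior; the timestampFormat literal is copied verbatim from the mail, where it appears truncated:)
{code:java}
scala> spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
scala> spark.sql("""select from_json('{"t":"26/October/2015"}', 't Timestamp', map('timestampFormat', 'dd/M/'))""").show()  // yields {"t": null}
scala> spark.sql("""select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 'dd/M/'))""").show()           // throws: Unsupported type: timestamp_ntz
{code}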



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36429) JacksonParser should throw exception when data type unsupported.

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36429:


Assignee: Apache Spark

> JacksonParser should throw exception when data type unsupported.
> 
>
> Key: SPARK-36429
> URL: https://issues.apache.org/jira/browse/SPARK-36429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, when spark.sql.timestampType=TIMESTAMP_NTZ is set, the behavior 
> differs between from_json and from_csv.
> {code:java}
> -- !query
> select from_json('{"t":"26/October/2015"}', 't Timestamp', 
> map('timestampFormat', 'dd/M/'))
> -- !query schema
> struct>
> -- !query output
> {"t":null}
> {code}
> {code:java}
> -- !query
> select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 
> 'dd/M/'))
> -- !query schema
> struct<>
> -- !query output
> java.lang.Exception
> Unsupported type: timestamp_ntz
> {code}
> We should make from_json throw an exception too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36429) JacksonParser should throw exception when data type unsupported.

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393891#comment-17393891
 ] 

Apache Spark commented on SPARK-36429:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/33654

> JacksonParser should throw exception when data type unsupported.
> 
>
> Key: SPARK-36429
> URL: https://issues.apache.org/jira/browse/SPARK-36429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, when spark.sql.timestampType=TIMESTAMP_NTZ is set, the behavior 
> differs between from_json and from_csv.
> {code:java}
> -- !query
> select from_json('{"t":"26/October/2015"}', 't Timestamp', 
> map('timestampFormat', 'dd/M/'))
> -- !query schema
> struct>
> -- !query output
> {"t":null}
> {code}
> {code:java}
> -- !query
> select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 
> 'dd/M/'))
> -- !query schema
> struct<>
> -- !query output
> java.lang.Exception
> Unsupported type: timestamp_ntz
> {code}
> We should make from_json throw an exception too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36429) JacksonParser should throw exception when data type unsupported.

2021-08-05 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393828#comment-17393828
 ] 

jiaan.geng commented on SPARK-36429:


I'm working on it.

> JacksonParser should throw exception when data type unsupported.
> 
>
> Key: SPARK-36429
> URL: https://issues.apache.org/jira/browse/SPARK-36429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, when spark.sql.timestampType=TIMESTAMP_NTZ is set, the behavior 
> differs between from_json and from_csv.
> {code:java}
> -- !query
> select from_json('{"t":"26/October/2015"}', 't Timestamp', 
> map('timestampFormat', 'dd/M/'))
> -- !query schema
> struct>
> -- !query output
> {"t":null}
> {code}
> {code:java}
> -- !query
> select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 
> 'dd/M/'))
> -- !query schema
> struct<>
> -- !query output
> java.lang.Exception
> Unsupported type: timestamp_ntz
> {code}
> We should make from_json throw an exception too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36429) JacksonParser should throw exception when data type unsupported.

2021-08-05 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-36429:
---
Description: 
Currently, when spark.sql.timestampType=TIMESTAMP_NTZ is set, the behavior 
differs between from_json and from_csv.

{code:java}
-- !query
select from_json('{"t":"26/October/2015"}', 't Timestamp', 
map('timestampFormat', 'dd/M/'))
-- !query schema
struct>
-- !query output
{"t":null}
{code}



-- !query
select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 
'dd/M/'))
-- !query schema
struct<>
-- !query output
java.lang.Exception
Unsupported type: timestamp_ntz

We should make from_json throw an exception too.

  was:
Currently, when spark.sql.timestampType is set to TIMESTAMP_NTZ, the behavior 
differs between from_json and from_csv.
-- !query
select from_json('{"t":"26/October/2015"}', 't Timestamp', 
map('timestampFormat', 'dd/M/'))
-- !query schema
struct>
-- !query output
{"t":null}


-- !query
select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 
'dd/M/'))
-- !query schema
struct<>
-- !query output
java.lang.Exception
Unsupported type: timestamp_ntz

We should make from_json throw an exception too.


> JacksonParser should throw exception when data type unsupported.
> 
>
> Key: SPARK-36429
> URL: https://issues.apache.org/jira/browse/SPARK-36429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, when spark.sql.timestampType is set to TIMESTAMP_NTZ, the behavior 
> differs between from_json and from_csv.
> {code:java}
> -- !query
> select from_json('{"t":"26/October/2015"}', 't Timestamp', 
> map('timestampFormat', 'dd/M/'))
> -- !query schema
> struct>
> -- !query output
> {"t":null}
> {code}
> -- !query
> select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 
> 'dd/M/'))
> -- !query schema
> struct<>
> -- !query output
> java.lang.Exception
> Unsupported type: timestamp_ntz
> We should make from_json throw an exception too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36429) JacksonParser should throw exception when data type unsupported.

2021-08-05 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-36429:
---
Description: 
Currently, when spark.sql.timestampType is set to TIMESTAMP_NTZ, the behavior 
differs between from_json and from_csv.

{code:java}
-- !query
select from_json('{"t":"26/October/2015"}', 't Timestamp', 
map('timestampFormat', 'dd/M/'))
-- !query schema
struct>
-- !query output
{"t":null}
{code}




{code:java}
-- !query
select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 
'dd/M/'))
-- !query schema
struct<>
-- !query output
java.lang.Exception
Unsupported type: timestamp_ntz
{code}


We should make from_json throw an exception too.

  was:
Currently, when spark.sql.timestampType is set to TIMESTAMP_NTZ, the behavior 
differs between from_json and from_csv.

{code:java}
-- !query
select from_json('{"t":"26/October/2015"}', 't Timestamp', 
map('timestampFormat', 'dd/M/'))
-- !query schema
struct>
-- !query output
{"t":null}
{code}



-- !query
select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 
'dd/M/'))
-- !query schema
struct<>
-- !query output
java.lang.Exception
Unsupported type: timestamp_ntz

We should make from_json throw an exception too.


> JacksonParser should throw exception when data type unsupported.
> 
>
> Key: SPARK-36429
> URL: https://issues.apache.org/jira/browse/SPARK-36429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, when spark.sql.timestampType is set to TIMESTAMP_NTZ, the behavior 
> differs between from_json and from_csv.
> {code:java}
> -- !query
> select from_json('{"t":"26/October/2015"}', 't Timestamp', 
> map('timestampFormat', 'dd/M/'))
> -- !query schema
> struct>
> -- !query output
> {"t":null}
> {code}
> {code:java}
> -- !query
> select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 
> 'dd/M/'))
> -- !query schema
> struct<>
> -- !query output
> java.lang.Exception
> Unsupported type: timestamp_ntz
> {code}
> We should make from_json throw an exception too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36429) JacksonParser should throw exception when data type unsupported.

2021-08-05 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-36429:
--

 Summary: JacksonParser should throw exception when data type 
unsupported.
 Key: SPARK-36429
 URL: https://issues.apache.org/jira/browse/SPARK-36429
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: jiaan.geng


Currently, when spark.sql.timestampType is set to TIMESTAMP_NTZ, the behavior 
differs between from_json and from_csv.
{code:java}
-- !query
select from_json('{"t":"26/October/2015"}', 't Timestamp', 
map('timestampFormat', 'dd/M/'))
-- !query schema
struct>
-- !query output
{"t":null}
{code}

{code:java}
-- !query
select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 
'dd/M/'))
-- !query schema
struct<>
-- !query output
java.lang.Exception
Unsupported type: timestamp_ntz
{code}

We should make from_json throw an exception too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36355) NamedExpression add method `withName(newName: String)`

2021-08-05 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu resolved SPARK-36355.
---
Resolution: Not A Problem

> NamedExpression add method `withName(newName: String)`
> --
>
> Key: SPARK-36355
> URL: https://issues.apache.org/jira/browse/SPARK-36355
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36428) the 'seconds' parameter of 'make_timestamp' should accept integer type

2021-08-05 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393818#comment-17393818
 ] 

jiaan.geng commented on SPARK-36428:


I will take a look.

> the 'seconds' parameter of 'make_timestamp' should accept integer type
> --
>
> Key: SPARK-36428
> URL: https://issues.apache.org/jira/browse/SPARK-36428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>
> With ANSI mode, {{SELECT make_timestamp(1, 1, 1, 1, 1, 1)}} fails because 
> the 'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be 
> implicitly cast to DECIMAL(8,6) under ANSI mode.
> We should update the function {{make_timestamp}} to allow integer type 
> 'seconds' parameter.
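
Until make_timestamp accepts an integer 'seconds' argument, an explicit cast should sidestep the rejected implicit conversion. A minimal sketch, assuming ANSI mode is toggled via spark.sql.ansi.enabled:

{code:java}
scala> spark.conf.set("spark.sql.ansi.enabled", "true")

scala> // Fails under ANSI mode: INT is not implicitly cast to DECIMAL(8,6).
scala> spark.sql("SELECT make_timestamp(1, 1, 1, 1, 1, 1)").show()

scala> // Workaround sketch: cast the 'seconds' argument explicitly.
scala> spark.sql("SELECT make_timestamp(1, 1, 1, 1, 1, CAST(1 AS DECIMAL(8,6)))").show()
{code}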



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36428) the 'second' parameter of 'make_timestamp' should accept integer type

2021-08-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-36428:

Description: 
With ANSI mode, {{SELECT make_timestamp(1, 1, 1, 1, 1, 1)}} fails because the 
'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be 
implicitly cast to DECIMAL(8,6) under ANSI mode.

We should update the function {{make_timestamp}} to allow integer type 'seconds'

> the 'second' parameter of 'make_timestamp' should accept integer type
> -
>
> Key: SPARK-36428
> URL: https://issues.apache.org/jira/browse/SPARK-36428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>
> With ANSI mode, {{SELECT make_timestamp(1, 1, 1, 1, 1, 1)}} fails because 
> the 'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be 
> implicitly cast to DECIMAL(8,6) under ANSI mode.
> We should update the function {{make_timestamp}} to allow integer type 
> 'seconds'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36428) the 'seconds' parameter of 'make_timestamp' should accept integer type

2021-08-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-36428:

Summary: the 'seconds' parameter of 'make_timestamp' should accept integer 
type  (was: the 'second' parameter of 'make_timestamp' should accept integer 
type)

> the 'seconds' parameter of 'make_timestamp' should accept integer type
> --
>
> Key: SPARK-36428
> URL: https://issues.apache.org/jira/browse/SPARK-36428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>
> With ANSI mode, {{SELECT make_timestamp(1, 1, 1, 1, 1, 1)}} fails because 
> the 'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be 
> implicitly cast to DECIMAL(8,6) under ANSI mode.
> We should update the function {{make_timestamp}} to allow integer type 
> 'seconds'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36428) the 'seconds' parameter of 'make_timestamp' should accept integer type

2021-08-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-36428:

Description: 
With ANSI mode, {{SELECT make_timestamp(1, 1, 1, 1, 1, 1)}} fails because the 
'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be 
implicitly cast to DECIMAL(8,6) under ANSI mode.

We should update the function {{make_timestamp}} to allow integer type 
'seconds' parameter.

  was:
With ANSI mode, {{SELECT make_timestamp(1, 1, 1, 1, 1, 1)}} fails because the 
'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be 
implicitly cast to DECIMAL(8,6) under ANSI mode.

We should update the function {{make_timestamp}} to allow integer type 'seconds'


> the 'seconds' parameter of 'make_timestamp' should accept integer type
> --
>
> Key: SPARK-36428
> URL: https://issues.apache.org/jira/browse/SPARK-36428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>
> With ANSI mode, {{SELECT make_timestamp(1, 1, 1, 1, 1, 1)}} fails because 
> the 'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be 
> implicitly cast to DECIMAL(8,6) under ANSI mode.
> We should update the function {{make_timestamp}} to allow integer type 
> 'seconds' parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36428) the 'second' parameter of 'make_timestamp' should accept integer type

2021-08-05 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-36428:
---

 Summary: the 'second' parameter of 'make_timestamp' should accept 
integer type
 Key: SPARK-36428
 URL: https://issues.apache.org/jira/browse/SPARK-36428
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36426) Pin pyzmq to 2.22.0

2021-08-05 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-36426:
---
Description: 
GA is failing, and the root cause seems to be the latest release of pyzmq, 
which was released a few hours ago.
https://github.com/apache/spark/runs/3250261989#step:11:414
https://github.com/apache/spark/runs/3250252645?check_suite_focus=true#step:11:417
https://pypi.org/project/pyzmq/

  was:
GA is failing, and the root cause seems to be the latest release of pyzmq, 
which was released a few hours ago.
https://pypi.org/project/pyzmq/


> Pin pyzmq to 2.22.0
> ---
>
> Key: SPARK-36426
> URL: https://issues.apache.org/jira/browse/SPARK-36426
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> GA is failing, and the root cause seems to be the latest release of pyzmq, 
> which was released a few hours ago.
> https://github.com/apache/spark/runs/3250261989#step:11:414
> https://github.com/apache/spark/runs/3250252645?check_suite_focus=true#step:11:417
> https://pypi.org/project/pyzmq/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36426) Pin pyzmq to 2.22.0

2021-08-05 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-36426:
---
Description: 
GA is failing, and the root cause seems to be the latest release of pyzmq, 
which was released a few hours ago.
https://pypi.org/project/pyzmq/

  was:
GA is failing, and the root cause seems to be the latest release of pyzmq.
https://pypi.org/project/pyzmq/


> Pin pyzmq to 2.22.0
> ---
>
> Key: SPARK-36426
> URL: https://issues.apache.org/jira/browse/SPARK-36426
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> GA is failing, and the root cause seems to be the latest release of pyzmq, 
> which was released a few hours ago.
> https://pypi.org/project/pyzmq/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36426) Pin pyzmq to 2.22.0

2021-08-05 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-36426:
---
Description: 
GA is failing, and the root cause seems to be the latest release of pyzmq.
https://pypi.org/project/pyzmq/

  was:
This is a hotfix PR to recover GA.
See https://github.com/apache/spark/runs/3250261989

The root cause seems to be the latest release of pyzmq.
https://pypi.org/project/pyzmq/


> Pin pyzmq to 2.22.0
> ---
>
> Key: SPARK-36426
> URL: https://issues.apache.org/jira/browse/SPARK-36426
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> GA is failing, and the root cause seems to be the latest release of pyzmq.
> https://pypi.org/project/pyzmq/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36426) Pin pyzmq to 2.22.0

2021-08-05 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-36426:
---
Affects Version/s: 3.2.0

> Pin pyzmq to 2.22.0
> ---
>
> Key: SPARK-36426
> URL: https://issues.apache.org/jira/browse/SPARK-36426
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> This is a hotfix PR to recover GA.
> See https://github.com/apache/spark/runs/3250261989
> The root cause seems to be the latest release of pyzmq.
> https://pypi.org/project/pyzmq/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36425) PySpark: support CrossValidatorModel get standard deviation of metrics for each paramMap

2021-08-05 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-36425:
--

Assignee: Weichen Xu

> PySpark: support CrossValidatorModel get standard deviation of metrics for 
> each paramMap 
> -
>
> Key: SPARK-36425
> URL: https://issues.apache.org/jira/browse/SPARK-36425
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> PySpark: support CrossValidatorModel getting the standard deviation of 
> metrics for each paramMap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36426) Pin pyzmq to 2.22.0

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393743#comment-17393743
 ] 

Apache Spark commented on SPARK-36426:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33653

> Pin pyzmq to 2.22.0
> ---
>
> Key: SPARK-36426
> URL: https://issues.apache.org/jira/browse/SPARK-36426
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> This is a hotfix PR to recover GA.
> See https://github.com/apache/spark/runs/3250261989
> The root cause seems to be the latest release of pyzmq.
> https://pypi.org/project/pyzmq/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36427) Scala API: support CrossValidatorModel get standard deviation of metrics for each paramMap

2021-08-05 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-36427:
--

 Summary: Scala API: support CrossValidatorModel get standard 
deviation of metrics for each paramMap
 Key: SPARK-36427
 URL: https://issues.apache.org/jira/browse/SPARK-36427
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 3.2.0
Reporter: Weichen Xu


This is the parity feature of https://issues.apache.org/jira/browse/SPARK-36425

Note:
We also need to update the PySpark CrossValidatorModel.to_java/from_java 
methods in this task.
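
For illustration only: the statistic in question is the sample standard deviation of the per-fold metrics for one paramMap, complementing the average already exposed as CrossValidatorModel.avgMetrics. A self-contained Scala sketch with made-up fold metrics (not the proposed API):

{code:java}
// Hypothetical per-fold metric values for a single paramMap (made-up numbers).
val foldMetrics = Array(0.81, 0.79, 0.84)

// Average across folds (what avgMetrics already reports) and the sample
// standard deviation the new API would additionally expose.
val mean = foldMetrics.sum / foldMetrics.length
val stdDev = math.sqrt(
  foldMetrics.map(m => math.pow(m - mean, 2)).sum / (foldMetrics.length - 1))
println(f"avg = $mean%.4f, std = $stdDev%.4f")
{code}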



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36426) Pin pyzmq to 2.22.0

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36426:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Pin pyzmq to 2.22.0
> ---
>
> Key: SPARK-36426
> URL: https://issues.apache.org/jira/browse/SPARK-36426
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> This is a hotfix PR to recover GA.
> See https://github.com/apache/spark/runs/3250261989
> The root cause seems to be the latest release of pyzmq.
> https://pypi.org/project/pyzmq/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36426) Pin pyzmq to 2.22.0

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36426:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Pin pyzmq to 2.22.0
> ---
>
> Key: SPARK-36426
> URL: https://issues.apache.org/jira/browse/SPARK-36426
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> This is a hotfix PR to recover GA.
> See https://github.com/apache/spark/runs/3250261989
> The root cause seems to be the latest release of pyzmq.
> https://pypi.org/project/pyzmq/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36426) Pin pyzmq to 2.22.0

2021-08-05 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-36426:
--

 Summary: Pin pyzmq to 2.22.0
 Key: SPARK-36426
 URL: https://issues.apache.org/jira/browse/SPARK-36426
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


This is a hotfix PR to recover GA.
See https://github.com/apache/spark/runs/3250261989

The root cause seems to be the latest release of pyzmq.
https://pypi.org/project/pyzmq/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36425) PySpark: support CrossValidatorModel get standard deviation of metrics for each paramMap

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393740#comment-17393740
 ] 

Apache Spark commented on SPARK-36425:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/33652

> PySpark: support CrossValidatorModel get standard deviation of metrics for 
> each paramMap 
> -
>
> Key: SPARK-36425
> URL: https://issues.apache.org/jira/browse/SPARK-36425
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: Weichen Xu
>Priority: Major
>
> PySpark: support CrossValidatorModel getting the standard deviation of 
> metrics for each paramMap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36425) PySpark: support CrossValidatorModel get standard deviation of metrics for each paramMap

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36425:


Assignee: (was: Apache Spark)

> PySpark: support CrossValidatorModel get standard deviation of metrics for 
> each paramMap 
> -
>
> Key: SPARK-36425
> URL: https://issues.apache.org/jira/browse/SPARK-36425
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: Weichen Xu
>Priority: Major
>
> PySpark: support CrossValidatorModel getting the standard deviation of 
> metrics for each paramMap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36425) PySpark: support CrossValidatorModel get standard deviation of metrics for each paramMap

2021-08-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36425:


Assignee: Apache Spark

> PySpark: support CrossValidatorModel get standard deviation of metrics for 
> each paramMap 
> -
>
> Key: SPARK-36425
> URL: https://issues.apache.org/jira/browse/SPARK-36425
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Major
>
> PySpark: support CrossValidatorModel getting the standard deviation of 
> metrics for each paramMap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36425) PySpark: support CrossValidatorModel get standard deviation of metrics for each paramMap

2021-08-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393739#comment-17393739
 ] 

Apache Spark commented on SPARK-36425:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/33652

> PySpark: support CrossValidatorModel get standard deviation of metrics for 
> each paramMap 
> -
>
> Key: SPARK-36425
> URL: https://issues.apache.org/jira/browse/SPARK-36425
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: Weichen Xu
>Priority: Major
>
> PySpark: support CrossValidatorModel getting the standard deviation of 
> metrics for each paramMap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


