[jira] [Updated] (SPARK-34985) Different execution plans under jdbc and hdfs

2021-04-07 Thread lianjunzhi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lianjunzhi updated SPARK-34985:
---
Description: 
Hive has two non-partitioned tables, trade_order and trade_order_goods. The 
trade_order table contains four fields: trade_id, company_id, is_delete, and 
trade_status; trade_order_goods contains four fields: trade_id, cost, 
is_delete, and sell_total. Run the following code snippets:

 
{quote}val df = spark.sql(
 """
 |select
 |b.company_id,
 |sum(a.cost) cost
 |FROM oms.trade_order_goods a
 |JOIN oms.trade_order b
 |ON a.trade_id = b.trade_id
 |WHERE a.is_delete = 0 AND b.is_delete = 0
 |GROUP BY
 |b.company_id
 |""".stripMargin){quote}

 
{quote}df.explain() //Physical Plan 1
{quote}
{quote}df.write.insertInto("oms.test") //Physical Plan 2
{quote}
{quote}df.write
 .format("jdbc")
 .option("url", "")
 .option("dbtable", "test")
 .option("user", "")
 .option("password", "")
 .option("driver", "com.mysql.jdbc.Driver")
 .option("truncate", value = true)
 .option("batchsize", 15000)
 .mode(SaveMode.Append)
 .save() //Physical Plan 3
{quote}
Physical Plan 1:
{quote}AdaptiveSparkPlan isFinalPlan=false
 +- HashAggregate(keys=[company_id#6L], functions=[sum(cost#2)])
 +- Exchange hashpartitioning(company_id#6L, 6), true, [id=#40]
 +- HashAggregate(keys=[company_id#6L], functions=[partial_sum(cost#2)])
 +- Project [cost#2, company_id#6L]
 +- SortMergeJoin [trade_id#1L], [trade_id#5L], Inner
 :- Sort [trade_id#1L ASC NULLS FIRST], false, 0
 : +- Exchange hashpartitioning(trade_id#1L, 6), true, [id=#32]
 : +- Project [trade_id#1L, cost#2]
 : +- Filter ((isnotnull(is_delete#3) AND (is_delete#3 = 0)) AND 
isnotnull(trade_id#1L))
 : +- FileScan parquet oms.trade_order_goods[trade_id#1L,cost#2,is_delete#3] 
Batched: false, DataFilters: [isnotnull(is_delete#3), (is_delete#3 = 0), 
isnotnull(trade_id#1L)], Format: Parquet, Location: 
InMemoryFileIndex[hdfs://nameservice1/user/hive/warehouse/oms.db/trade_order_goods],
 PartitionFilters: [], PushedFilters: [IsNotNull(is_delete), 
EqualTo(is_delete,0), IsNotNull(trade_id)], ReadSchema: 
struct
 +- Sort [trade_id#5L ASC NULLS FIRST], false, 0
 +- Exchange hashpartitioning(trade_id#5L, 6), true, [id=#33]
 +- Project [trade_id#5L, company_id#6L]
 +- Filter ((isnotnull(is_delete#7) AND (is_delete#7 = 0)) AND 
isnotnull(trade_id#5L))
 +- FileScan parquet oms.trade_order[trade_id#5L,company_id#6L,is_delete#7] 
Batched: false, DataFilters: [isnotnull(is_delete#7), (is_delete#7 = 0), 
isnotnull(trade_id#5L)], Format: Parquet, Location: 
InMemoryFileIndex[hdfs://nameservice1/user/hive/warehouse/oms.db/trade_order], 
PartitionFilters: [], PushedFilters: [IsNotNull(is_delete), 
EqualTo(is_delete,0), IsNotNull(trade_id)], ReadSchema: 
struct
{quote}
Physical Plan 2:
{quote}+- AdaptiveSparkPlan isFinalPlan=true
 +- *(6) HashAggregate(keys=[company_id#6L], functions=[sum(cost#2)], 
output=[company_id#6L, cost#28])
 +- CustomShuffleReader coalesced
 +- ShuffleQueryStage 2
 +- Exchange hashpartitioning(company_id#6L, 6), true, [id=#244]
 +- *(5) HashAggregate(keys=[company_id#6L], functions=[partial_sum(cost#2)], 
output=[company_id#6L, sum#21])
 +- *(5) Project [cost#2, company_id#6L]
 +- *(5) SortMergeJoin [trade_id#1L], [trade_id#5L], Inner
 :- *(3) Sort [trade_id#1L ASC NULLS FIRST], false, 0
 : +- CustomShuffleReader coalesced
 : +- ShuffleQueryStage 0
 : +- Exchange hashpartitioning(trade_id#1L, 6), true, [id=#119]
 : +- *(1) Project [trade_id#1L, cost#2]
 : +- *(1) Filter ((isnotnull(is_delete#3) AND (is_delete#3 = 0)) AND 
isnotnull(trade_id#1L))
 : +- FileScan parquet oms.trade_order_goods[trade_id#1L,cost#2,is_delete#3] 
Batched: false, DataFilters: [isnotnull(is_delete#3), (is_delete#3 = 0), 
isnotnull(trade_id#1L)], Format: Parquet, Location: 
InMemoryFileIndex[hdfs://nameservice1/user/hive/warehouse/oms.db/trade_order_goods],
 PartitionFilters: [], PushedFilters: [IsNotNull(is_delete), 
EqualTo(is_delete,0), IsNotNull(trade_id)], ReadSchema: 
struct
 +- *(4) Sort [trade_id#5L ASC NULLS FIRST], false, 0
 +- CustomShuffleReader coalesced
 +- ShuffleQueryStage 1
 +- Exchange hashpartitioning(trade_id#5L, 6), true, [id=#126]
 +- *(2) Project [trade_id#5L, company_id#6L]
 +- *(2) Filter ((isnotnull(is_delete#7) AND (is_delete#7 = 0)) AND 
isnotnull(trade_id#5L))
 +- FileScan parquet oms.trade_order[trade_id#5L,company_id#6L,is_delete#7] 
Batched: false, DataFilters: 

[jira] [Created] (SPARK-34985) Different execution plans under jdbc and hdfs

2021-04-07 Thread lianjunzhi (Jira)
lianjunzhi created SPARK-34985:
--

 Summary: Different execution plans under jdbc and hdfs
 Key: SPARK-34985
 URL: https://issues.apache.org/jira/browse/SPARK-34985
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
 Environment: spark 3.0.1

hive 2.1.1-cdh6.2.0

hadoop 3.0.0-cdh6.2.0

 
Reporter: lianjunzhi


Hive has two non-partitioned tables, trade_order and trade_order_goods. The 
trade_order table contains four fields: trade_id, company_id, is_delete, and 
trade_status; trade_order_goods contains four fields: trade_id, cost, 
is_delete, and sell_total. Run the following code snippets:
{quote}val df = spark.sql(
 """
 |select
 |b.company_id,
 |sum(a.cost) cost
 |FROM oms.trade_order_goods a
 | JOIN oms.trade_order b
 |ON a.trade_id = b.trade_id
 |WHERE a.is_delete = 0 AND b.is_delete = 0
 |GROUP BY
 |b.company_id
 |""".stripMargin){quote}
{quote}df.explain() //Physical Plan 1{quote}
{quote}df.write.insertInto("oms.test") //Physical Plan 2{quote}
{quote}df.write
 .format("jdbc")
 .option("url", "")
 .option("dbtable", "test")
 .option("user", "")
 .option("password", "")
 .option("driver", "com.mysql.jdbc.Driver")
 .option("truncate", value = true)
 .option("batchsize", 15000)
 .mode(SaveMode.Append)
 .save() //Physical Plan 3{quote}
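Not part of the original report: a minimal Scala sketch (assuming a SparkSession named spark, as in spark-shell) of one way to capture the plan that actually executes for each of the three actions above, so the plans can be compared from the same place instead of relying on df.explain() alone:
{code:scala}
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Print the executed physical plan for every action (explain, insertInto, jdbc save).
val planCaptureListener = new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"=== $funcName ===\n${qe.executedPlan}")
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"=== $funcName (failed) ===\n${qe.executedPlan}")
}
spark.listenerManager.register(planCaptureListener)
{code}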
Physical Plan 1:
{quote}AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[company_id#6L], functions=[sum(cost#2)])
 +- Exchange hashpartitioning(company_id#6L, 6), true, [id=#40]
 +- HashAggregate(keys=[company_id#6L], functions=[partial_sum(cost#2)])
 +- Project [cost#2, company_id#6L]
 +- SortMergeJoin [trade_id#1L], [trade_id#5L], Inner
 :- Sort [trade_id#1L ASC NULLS FIRST], false, 0
 : +- Exchange hashpartitioning(trade_id#1L, 6), true, [id=#32]
 : +- Project [trade_id#1L, cost#2]
 : +- Filter ((isnotnull(is_delete#3) AND (is_delete#3 = 0)) AND 
isnotnull(trade_id#1L))
 : +- FileScan parquet oms.trade_order_goods[trade_id#1L,cost#2,is_delete#3] 
Batched: false, DataFilters: [isnotnull(is_delete#3), (is_delete#3 = 0), 
isnotnull(trade_id#1L)], Format: Parquet, Location: 
InMemoryFileIndex[hdfs://nameservice1/user/hive/warehouse/oms.db/trade_order_goods],
 PartitionFilters: [], PushedFilters: [IsNotNull(is_delete), 
EqualTo(is_delete,0), IsNotNull(trade_id)], ReadSchema: 
struct
 +- Sort [trade_id#5L ASC NULLS FIRST], false, 0
 +- Exchange hashpartitioning(trade_id#5L, 6), true, [id=#33]
 +- Project [trade_id#5L, company_id#6L]
 +- Filter ((isnotnull(is_delete#7) AND (is_delete#7 = 0)) AND 
isnotnull(trade_id#5L))
 +- FileScan parquet oms.trade_order[trade_id#5L,company_id#6L,is_delete#7] 
Batched: false, DataFilters: [isnotnull(is_delete#7), (is_delete#7 = 0), 
isnotnull(trade_id#5L)], Format: Parquet, Location: 
InMemoryFileIndex[hdfs://nameservice1/user/hive/warehouse/oms.db/trade_order], 
PartitionFilters: [], PushedFilters: [IsNotNull(is_delete), 
EqualTo(is_delete,0), IsNotNull(trade_id)], ReadSchema: 
struct{quote}
Physical Plan 2:
{quote}+- AdaptiveSparkPlan isFinalPlan=true
 +- *(6) HashAggregate(keys=[company_id#6L], functions=[sum(cost#2)], 
output=[company_id#6L, cost#28])
 +- CustomShuffleReader coalesced
 +- ShuffleQueryStage 2
 +- Exchange hashpartitioning(company_id#6L, 6), true, [id=#244]
 +- *(5) HashAggregate(keys=[company_id#6L], functions=[partial_sum(cost#2)], 
output=[company_id#6L, sum#21])
 +- *(5) Project [cost#2, company_id#6L]
 +- *(5) SortMergeJoin [trade_id#1L], [trade_id#5L], Inner
 :- *(3) Sort [trade_id#1L ASC NULLS FIRST], false, 0
 : +- CustomShuffleReader coalesced
 : +- ShuffleQueryStage 0
 : +- Exchange hashpartitioning(trade_id#1L, 6), true, [id=#119]
 : +- *(1) Project [trade_id#1L, cost#2]
 : +- *(1) Filter ((isnotnull(is_delete#3) AND (is_delete#3 = 0)) AND 
isnotnull(trade_id#1L))
 : +- FileScan parquet oms.trade_order_goods[trade_id#1L,cost#2,is_delete#3] 
Batched: false, DataFilters: [isnotnull(is_delete#3), (is_delete#3 = 0), 
isnotnull(trade_id#1L)], Format: Parquet, Location: 
InMemoryFileIndex[hdfs://nameservice1/user/hive/warehouse/oms.db/trade_order_goods],
 PartitionFilters: [], PushedFilters: [IsNotNull(is_delete), 
EqualTo(is_delete,0), IsNotNull(trade_id)], ReadSchema: 
struct
 +- *(4) Sort [trade_id#5L ASC NULLS FIRST], false, 0
 +- CustomShuffleReader coalesced
 +- ShuffleQueryStage 1
 +- Exchange hashpartitioning(trade_id#5L, 6), true, [id=#126]
 +- *(2) Project [trade_id#5L, company_id#6L]
 +- *(2) Filter ((isnotnull(is_delete#7) AND (is_delete#7 = 0)) AND 
isnotnull(trade_id#5L))
 +- FileScan parquet oms.trade_order[trade_id#5L,company_id#6L,is_delete#7] 
Batched: false, DataFilters: [isnotnull(is_delete#7), (is_delete#7 = 0), 
isnotnull(trade_id#5L)], Format: Parquet, Location: 
InMemoryFileIndex[hdfs://nameservice1/user/hive/warehouse/oms.db/trade_order], 
PartitionFilters: [], PushedFilters: [IsNotNull(is_delete), 
EqualTo(is_delete,0), 

[jira] [Created] (SPARK-34984) ANSI intervals formatting in hive results

2021-04-07 Thread Max Gekk (Jira)
Max Gekk created SPARK-34984:


 Summary: ANSI intervals formatting in hive results
 Key: SPARK-34984
 URL: https://issues.apache.org/jira/browse/SPARK-34984
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk
Assignee: Max Gekk


Support year-month and day-time intervals in HiveResult. This will allow enabling 
the new interval types in *.sql tests and using the intervals from spark-sql, 
for instance.
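For context (not from the ticket), a hedged sketch of the kind of queries whose results would need this formatting; the ANSI interval literal syntax below is an assumption about the final form:
{code:scala}
// Assumed ANSI interval literals (year-month and day-time). Exact syntax and
// output formatting depend on how this sub-task is implemented.
spark.sql("SELECT INTERVAL '1-2' YEAR TO MONTH AS ym").show()
spark.sql("SELECT INTERVAL '3 04:05:06' DAY TO SECOND AS dt").show()
{code}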



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34970) Redact map-type options in the output of explain()

2021-04-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-34970:
---
Affects Version/s: (was: 3.2.0)
   3.0.0
   3.0.1
   3.0.2

> Redact map-type options in the output of explain()
> --
>
> Key: SPARK-34970
> URL: https://issues.apache.org/jira/browse/SPARK-34970
> Project: Spark
>  Issue Type: Task
>  Components: Security, SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.1.2
>
>
> The `explain()` method prints the arguments of tree nodes in logical/physical 
> plans. The arguments could contain a map-type option which contains sensitive 
> data.
> We should redact map-type options in the output of explain(); otherwise we will see 
> sensitive data in the explain output or the Spark UI.
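For illustration (not from the ticket), a minimal Scala sketch of how sensitive values supplied as data source options can end up in the arguments that explain() prints; the URL and option values are placeholders:
{code:scala}
// Credentials passed as options may appear in the printed plan arguments
// unless they are redacted.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://example-host/db")  // placeholder
  .option("dbtable", "t")
  .option("user", "admin")
  .option("password", "secret")                   // value that should be redacted
  .load()

df.explain(true)  // plan output should not echo the password option
{code}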



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34970) Redact map-type options in the output of explain()

2021-04-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-34970:
---
Fix Version/s: 3.0.3
   3.2.0

> Redact map-type options in the output of explain()
> --
>
> Key: SPARK-34970
> URL: https://issues.apache.org/jira/browse/SPARK-34970
> Project: Spark
>  Issue Type: Task
>  Components: Security, SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> The `explain()` method prints the arguments of tree nodes in logical/physical 
> plans. The arguments could contain a map-type option which contains sensitive 
> data.
> We should redact map-type options in the output of explain(); otherwise we will see 
> sensitive data in the explain output or the Spark UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34970) Redact map-type options in the output of explain()

2021-04-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316839#comment-17316839
 ] 

Apache Spark commented on SPARK-34970:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32085

> Redact map-type options in the output of explain()
> --
>
> Key: SPARK-34970
> URL: https://issues.apache.org/jira/browse/SPARK-34970
> Project: Spark
>  Issue Type: Task
>  Components: Security, SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.1.2
>
>
> The `explain()` method prints the arguments of tree nodes in logical/physical 
> plans. The arguments could contain a map-type option which contains sensitive 
> data.
> We should redact map-type options in the output of explain(); otherwise we will see 
> sensitive data in the explain output or the Spark UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34970) Redact map-type options in the output of explain()

2021-04-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316838#comment-17316838
 ] 

Apache Spark commented on SPARK-34970:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32085

> Redact map-type options in the output of explain()
> --
>
> Key: SPARK-34970
> URL: https://issues.apache.org/jira/browse/SPARK-34970
> Project: Spark
>  Issue Type: Task
>  Components: Security, SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.1.2
>
>
> The `explain()` method prints the arguments of tree nodes in logical/physical 
> plans. The arguments could contain a map-type option which contains sensitive 
> data.
> We should redact map-type options in the output of explain(); otherwise we will see 
> sensitive data in the explain output or the Spark UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34980) Support coalesce partition through union

2021-04-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316833#comment-17316833
 ] 

Apache Spark commented on SPARK-34980:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/32084

> Support coalesce partition through union
> 
>
> Key: SPARK-34980
> URL: https://issues.apache.org/jira/browse/SPARK-34980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> The rule `CoalesceShufflePartitions` can only coalesce partitions if
> * the leaf node is a ShuffleQueryStage
> * all shuffles have the same partition number
> With `Union`, this assumption can break. Let's say we have such a plan:
> {code:java}
> Union
>HashAggregate
>   ShuffleQueryStage
>FileScan
> {code}
> `CoalesceShufflePartitions` cannot optimize it, and the resulting partition 
> count would be `shuffle partitions + FileScan partitions`, which can be quite large.
> It's better to support partial optimization with `Union`.
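A hedged sketch (not from the ticket; table names and columns are examples) of a query with the shape described above, where only one side of the union comes from a shuffle:
{code:scala}
// One union child aggregates, so under AQE its leaf becomes a ShuffleQueryStage;
// the other child is a plain scan, so CoalesceShufflePartitions currently bails out.
val aggregated = spark.table("t1").groupBy("key").count()
val scanned    = spark.table("t2").select("key", "count")  // assumes t2 has matching columns
aggregated.union(scanned).explain()
{code}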



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34980) Support coalesce partition through union

2021-04-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34980:


Assignee: Apache Spark

> Support coalesce partition through union
> 
>
> Key: SPARK-34980
> URL: https://issues.apache.org/jira/browse/SPARK-34980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>
> The rule `CoalesceShufflePartitions` can only coalesce partitions if
> * the leaf node is a ShuffleQueryStage
> * all shuffles have the same partition number
> With `Union`, this assumption can break. Let's say we have such a plan:
> {code:java}
> Union
>HashAggregate
>   ShuffleQueryStage
>FileScan
> {code}
> `CoalesceShufflePartitions` cannot optimize it, and the resulting partition 
> count would be `shuffle partitions + FileScan partitions`, which can be quite large.
> It's better to support partial optimization with `Union`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34980) Support coalesce partition through union

2021-04-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34980:


Assignee: (was: Apache Spark)

> Support coalesce partition through union
> 
>
> Key: SPARK-34980
> URL: https://issues.apache.org/jira/browse/SPARK-34980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> The rule `CoalesceShufflePartitions` can only coalesce partitions if
> * the leaf node is a ShuffleQueryStage
> * all shuffles have the same partition number
> With `Union`, this assumption can break. Let's say we have such a plan:
> {code:java}
> Union
>HashAggregate
>   ShuffleQueryStage
>FileScan
> {code}
> `CoalesceShufflePartitions` cannot optimize it, and the resulting partition 
> count would be `shuffle partitions + FileScan partitions`, which can be quite large.
> It's better to support partial optimization with `Union`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34983) Renaming the package alias from pp to ps

2021-04-07 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-34983:
---

 Summary: Renaming the package alias from pp to ps
 Key: SPARK-34983
 URL: https://issues.apache.org/jira/browse/SPARK-34983
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Haejoon Lee


Since the package alias for `pyspark.pandas` is fixed to `ps`, we should 
rename it throughout the ported Koalas source code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34886) Port/integrate Koalas DataFrame unit test into PySpark

2021-04-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34886:


Assignee: Apache Spark

> Port/integrate Koalas DataFrame unit test into PySpark
> --
>
> Key: SPARK-34886
> URL: https://issues.apache.org/jira/browse/SPARK-34886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> This JIRA aims to port [Koalas DataFrame 
> test|https://github.com/databricks/koalas/tree/master/databricks/koalas/tests/test_dataframe.py]
>  appropriately to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34886) Port/integrate Koalas DataFrame unit test into PySpark

2021-04-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34886:


Assignee: (was: Apache Spark)

> Port/integrate Koalas DataFrame unit test into PySpark
> --
>
> Key: SPARK-34886
> URL: https://issues.apache.org/jira/browse/SPARK-34886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> This JIRA aims to port [Koalas DataFrame 
> test|https://github.com/databricks/koalas/tree/master/databricks/koalas/tests/test_dataframe.py]
>  appropriately to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34886) Port/integrate Koalas DataFrame unit test into PySpark

2021-04-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316824#comment-17316824
 ] 

Apache Spark commented on SPARK-34886:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/32083

> Port/integrate Koalas DataFrame unit test into PySpark
> --
>
> Key: SPARK-34886
> URL: https://issues.apache.org/jira/browse/SPARK-34886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> This JIRA aims to port [Koalas DataFrame 
> test|https://github.com/databricks/koalas/tree/master/databricks/koalas/tests/test_dataframe.py]
>  appropriately to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34886) Port/integrate Koalas DataFrame unit test into PySpark

2021-04-07 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-34886:
-
Description: 
This JIRA aims to port [Koalas DataFrame 
test|https://github.com/databricks/koalas/tree/master/databricks/koalas/tests/test_dataframe.py]
 appropriately to [PySpark 
tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].


  was:This JIRA aims to port [Koalas 
tests|https://github.com/databricks/koalas/tree/master/databricks/koalas/tests] 
appropriately to [PySpark 
tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].


> Port/integrate Koalas DataFrame unit test into PySpark
> --
>
> Key: SPARK-34886
> URL: https://issues.apache.org/jira/browse/SPARK-34886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> This JIRA aims to port [Koalas DataFrame 
> test|https://github.com/databricks/koalas/tree/master/databricks/koalas/tests/test_dataframe.py]
>  appropriately to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34886) Port/integrate Koalas DataFrame unit test into PySpark

2021-04-07 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-34886:
-
Summary: Port/integrate Koalas DataFrame unit test into PySpark  (was: 
Port/integrate Koalas unit tests into PySpark)

> Port/integrate Koalas DataFrame unit test into PySpark
> --
>
> Key: SPARK-34886
> URL: https://issues.apache.org/jira/browse/SPARK-34886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> This JIRA aims to port [Koalas 
> tests|https://github.com/databricks/koalas/tree/master/databricks/koalas/tests]
>  appropriately to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34922) Use better CBO cost function

2021-04-07 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34922:
-
Fix Version/s: 3.0.3
   3.1.2

> Use better CBO cost function
> 
>
> Key: SPARK-34922
> URL: https://issues.apache.org/jira/browse/SPARK-34922
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Assignee: Tanel Kiis
>Priority: Major
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> In SPARK-33935 we changed the CBO cost function such that it would be 
> symmetric: A.betterThan(B) implies !B.betterThan(A). Before, both could 
> have been true.
> That change introduced a performance regression in some queries.
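For illustration only (not Spark's actual cost classes), a minimal sketch of the property described above: a strict comparison on a single combined cost guarantees that a.betterThan(b) implies !b.betterThan(a):
{code:scala}
// Fields and weight are illustrative, not the real CBO cost model.
case class PlanCost(rows: BigInt, size: BigInt) {
  def combined(weight: Double = 0.7): Double =
    weight * rows.toDouble + (1 - weight) * size.toDouble
  def betterThan(other: PlanCost): Boolean = combined() < other.combined()
}
{code}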



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34981) Implement V2 function resolution and evaluation

2021-04-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316787#comment-17316787
 ] 

Apache Spark commented on SPARK-34981:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32082

> Implement V2 function resolution and evaluation 
> 
>
> Key: SPARK-34981
> URL: https://issues.apache.org/jira/browse/SPARK-34981
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> This is a follow-up of SPARK-27658. With the FunctionCatalog API done, this aims 
> to implement function resolution (in the analyzer) and evaluation by 
> wrapping the functions into corresponding expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34981) Implement V2 function resolution and evaluation

2021-04-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34981:


Assignee: (was: Apache Spark)

> Implement V2 function resolution and evaluation 
> 
>
> Key: SPARK-34981
> URL: https://issues.apache.org/jira/browse/SPARK-34981
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> This is a follow-up of SPARK-27658. With the FunctionCatalog API done, this aims 
> to implement function resolution (in the analyzer) and evaluation by 
> wrapping the functions into corresponding expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34981) Implement V2 function resolution and evaluation

2021-04-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34981:


Assignee: Apache Spark

> Implement V2 function resolution and evaluation 
> 
>
> Key: SPARK-34981
> URL: https://issues.apache.org/jira/browse/SPARK-34981
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> This is a follow-up of SPARK-27658. With the FunctionCatalog API done, this aims 
> to implement function resolution (in the analyzer) and evaluation by 
> wrapping the functions into corresponding expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34982) Pyspark asDict() returns wrong child field for nested dataframe

2021-04-07 Thread Kumaresh AK (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh AK updated SPARK-34982:

Environment: 
Tested with EMR 6.2.0. spark v3.0.2, python v3.8.5

Also Tested with local pyspark on windows 10. spark v3.0.1. python v3.8.5

  was:
Tested with EMR 6.2.0. python: 3.8.5

Also Tested with local pyspark on windows. v: 3.0.1. python: 3.8.5


> Pyspark asDict() returns wrong child field for nested dataframe
> ---
>
> Key: SPARK-34982
> URL: https://issues.apache.org/jira/browse/SPARK-34982
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.0.2
> Environment: Tested with EMR 6.2.0. spark v3.0.2, python v3.8.5
> Also Tested with local pyspark on windows 10. spark v3.0.1. python v3.8.5
>Reporter: Kumaresh AK
>Priority: Minor
> Attachments: SPARK-34982.py
>
>
> Hello! I upgraded a job to Spark 3.0.1 (from 2.4.4) and encountered this 
> issue. The job uses asDict(True) in pyspark. I reproduced the issue with a 
> concise schema and code. Consider this example schema:
> {code:java}
> root
>  |-- id: integer (nullable = false)
>  |-- struct_1: struct (nullable = true)
>  | |-- array_1_1: array (nullable = true)
>  | | |-- element: string (containsNull = false)
>  |-- struct_2: struct (nullable = true)
>  | |-- array_2_1: array (nullable = true)
>  | | |-- element: string (containsNull = false){code}
> I created 100 rows with the above schema, filled them with some numbers, and 
> checked row.asDict(True) against the input. For some rows
> {code:java}
> struct_1.array_1_1{code}
> is missing. Instead I get
> {code:java}
> struct_1.array_2_1{code}
> And I also observe this happens when array_1_1 is null. Example assert 
> failure:
> {code:java}
> AssertionError: {'id': 7, 'struct_1': {'array_2_1': None}, 'struct_2': 
> {'array_2_1': None}} != {'id': 7, 'struct_1': {'array_1_1': None}, 
> 'struct_2': {'array_2_1': None}}
> {code}
>  I have attached a minimal script that reproduces this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34928) CTE Execution fails for Sql Server

2021-04-07 Thread Supun De Silva (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314428#comment-17314428
 ] 

Supun De Silva edited comment on SPARK-34928 at 4/7/21, 11:02 PM:
--

[~hyukjin.kwon]

I added the driver information in case it becomes handy to isolate the issue. We 
have a similar setup for Oracle where CTEs execute with no issues, but it 
(the Oracle connection) uses a slightly different driver.


was (Author: supun.t.desilva):
[~hyukjin.kwon]

I added the driver information incase it becomes handy to isolate the issue. We 
have a similar setup for Oracle where CTEs ececute with no issues but it 
(oracle connection) uses slightly different driver.

> CTE Execution fails for Sql Server
> --
>
> Key: SPARK-34928
> URL: https://issues.apache.org/jira/browse/SPARK-34928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Supun De Silva
>Priority: Minor
>
> h2. Issue
> We have a simple Sql statement that we intend to execute on SQL Server. This 
> has a CTE component.
> Execution of this yields an error that looks as follows
> {code:java}
> java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code}
> We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver* (version 
> 1.3.1)
> This is a particularly annoying issue, and because of it we are having to write 
> inner queries that are a fair bit inefficient.
> h2. SQL statement
> (not the actual one but a simplified version with renamed parameters)
>  
> {code:sql}
> WITH OldChanges as (
>SELECT distinct
> SomeDate,
> Name
>FROM [dbo].[DateNameFoo] (nolock)
>WHERE SomeDate != '2021-03-30'
>AND convert(date, UpdateDateTime) = '2021-03-31'
> )
> SELECT * from OldChanges {code}
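For illustration (not from the report), a hedged sketch of submitting such a query through Spark's JDBC source; connection details are placeholders. One plausible explanation, stated as an assumption, is that the JDBC source wraps the supplied query in a subselect (roughly SELECT ... FROM (<query>) alias), and T-SQL does not accept WITH inside a subquery, which would match the reported error:
{code:scala}
// Placeholder connection details; the CTE mirrors the simplified query above.
val cteQuery =
  """WITH OldChanges AS (
    |  SELECT DISTINCT SomeDate, Name
    |  FROM [dbo].[DateNameFoo] (nolock)
    |  WHERE SomeDate != '2021-03-30'
    |    AND CONVERT(date, UpdateDateTime) = '2021-03-31'
    |)
    |SELECT * FROM OldChanges""".stripMargin

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:jtds:sqlserver://host:1433/somedb")   // placeholder
  .option("driver", "net.sourceforge.jtds.jdbc.Driver")
  .option("query", cteQuery)                                 // fails: Incorrect syntax near 'WITH'
  .load()
{code}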



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34982) Pyspark asDict() returns wrong child field for nested dataframe

2021-04-07 Thread Kumaresh AK (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh AK updated SPARK-34982:

Priority: Minor  (was: Major)

> Pyspark asDict() returns wrong child field for nested dataframe
> ---
>
> Key: SPARK-34982
> URL: https://issues.apache.org/jira/browse/SPARK-34982
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.0.2
> Environment: Tested with EMR 6.2.0. python: 3.8.5
> Also Tested with local pyspark on windows. v: 3.0.1. python: 3.8.5
>Reporter: Kumaresh AK
>Priority: Minor
> Attachments: SPARK-34982.py
>
>
> Hello! I upgraded a job to Spark 3.0.1 (from 2.4.4) and encountered this 
> issue. The job uses asDict(True) in pyspark. I reproduced the issue with a 
> concise schema and code. Consider this example schema:
> {code:java}
> root
>  |-- id: integer (nullable = false)
>  |-- struct_1: struct (nullable = true)
>  | |-- array_1_1: array (nullable = true)
>  | | |-- element: string (containsNull = false)
>  |-- struct_2: struct (nullable = true)
>  | |-- array_2_1: array (nullable = true)
>  | | |-- element: string (containsNull = false){code}
> I created 100 rows with the above schema, filled them with some numbers, and 
> checked row.asDict(True) against the input. For some rows
> {code:java}
> struct_1.array_1_1{code}
> is missing. Instead I get
> {code:java}
> struct_1.array_2_1{code}
> And I also observe this happens when array_1_1 is null. Example assert 
> failure:
> {code:java}
> AssertionError: {'id': 7, 'struct_1': {'array_2_1': None}, 'struct_2': 
> {'array_2_1': None}} != {'id': 7, 'struct_1': {'array_1_1': None}, 
> 'struct_2': {'array_2_1': None}}
> {code}
>  I have attached a minimal script that reproduces this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34982) Pyspark asDict() returns wrong child field for nested dataframe

2021-04-07 Thread Kumaresh AK (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh AK updated SPARK-34982:

Summary: Pyspark asDict() returns wrong child field for nested dataframe  
(was: Pyspark asDict() returns wrong fields for a nested dataframe)

> Pyspark asDict() returns wrong child field for nested dataframe
> ---
>
> Key: SPARK-34982
> URL: https://issues.apache.org/jira/browse/SPARK-34982
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.0.2
> Environment: Tested with EMR 6.2.0. python: 3.8.5
> Also Tested with local pyspark on windows. v: 3.0.1. python: 3.8.5
>Reporter: Kumaresh AK
>Priority: Major
> Attachments: SPARK-34982.py
>
>
> Hello! I upgraded a job to Spark 3.0.1 (from 2.4.4) and encountered this 
> issue. The job uses asDict(True) in pyspark. I reproduced the issue with a 
> concise schema and code. Consider this example schema:
> {code:java}
> root
>  |-- id: integer (nullable = false)
>  |-- struct_1: struct (nullable = true)
>  | |-- array_1_1: array (nullable = true)
>  | | |-- element: string (containsNull = false)
>  |-- struct_2: struct (nullable = true)
>  | |-- array_2_1: array (nullable = true)
>  | | |-- element: string (containsNull = false){code}
> I created 100 rows with the above schema, filled them with some numbers, and 
> checked row.asDict(True) against the input. For some rows
> {code:java}
> struct_1.array_1_1{code}
> is missing. Instead I get
> {code:java}
> struct_1.array_2_1{code}
> And I also observe this happens when array_1_1 is null. Example assert 
> failure:
> {code:java}
> AssertionError: {'id': 7, 'struct_1': {'array_2_1': None}, 'struct_2': 
> {'array_2_1': None}} != {'id': 7, 'struct_1': {'array_1_1': None}, 
> 'struct_2': {'array_2_1': None}}
> {code}
>  I have attached a minimal script that reproduces this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34982) Pyspark asDict() returns wrong fields for a nested dataframe

2021-04-07 Thread Kumaresh AK (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh AK updated SPARK-34982:

Description: 
Hello! I upgraded a job to Spark 3.0.1 (from 2.4.4) and encountered this issue. 
The job uses asDict(True) in pyspark. I reproduced the issue with a concise 
schema and code. Consider this example schema:
{code:java}
root
 |-- id: integer (nullable = false)
 |-- struct_1: struct (nullable = true)
 | |-- array_1_1: array (nullable = true)
 | | |-- element: string (containsNull = false)
 |-- struct_2: struct (nullable = true)
 | |-- array_2_1: array (nullable = true)
 | | |-- element: string (containsNull = false){code}
I created 100 rows with the above schema, filled them with some numbers, and 
checked row.asDict(True) against the input. For some rows
{code:java}
struct_1.array_1_1{code}
is missing. Instead I get
{code:java}
struct_1.array_2_1{code}
And I also observe this happens when array_1_1 is null. Example assert failure:
{code:java}
AssertionError: {'id': 7, 'struct_1': {'array_2_1': None}, 'struct_2': 
{'array_2_1': None}} != {'id': 7, 'struct_1': {'array_1_1': None}, 'struct_2': 
{'array_2_1': None}}

{code}
 I have attached a minimal script that reproduces this issue.

  was:
Hello! I upgraded a job to Spark 3.0.1 (from 2.4.4) and encountered this issue. 
The job uses asDict(True) in pyspark. I reproduced the issue with a concise 
schema and code. Consider this example schema:
{code:java}
root
 |-- id: integer (nullable = false)
 |-- struct_1: struct (nullable = true)
 | |-- array_1_1: array (nullable = true)
 | | |-- element: string (containsNull = false)
 |-- struct_2: struct (nullable = true)
 | |-- array_2_1: array (nullable = true)
 | | |-- element: string (containsNull = false){code}
I created 100 rows with the above schema, filled them with some numbers, and 
checked row.asDict(True) against the input. For some rows
{code:java}
struct_1.array_1_1{code}
is missing. Instead I get
{code:java}
struct_1.array_2_1{code}
And I also observe this happens when array_1_1 is null. Example assert failure:
{code:java}
AssertionError: {'id': 7, 'struct_1': {'array_2_1': None}, 'struct_2': 
{'array_2_1': None}} != {'id': 7, 'struct_1': {'array_1_1': None}, 'struct_2': 
{'array_2_1': None}}

{code}
 


> Pyspark asDict() returns wrong fields for a nested dataframe
> 
>
> Key: SPARK-34982
> URL: https://issues.apache.org/jira/browse/SPARK-34982
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.0.2
> Environment: Tested with EMR 6.2.0. python: 3.8.5
> Also Tested with local pyspark on windows. v: 3.0.1. python: 3.8.5
>Reporter: Kumaresh AK
>Priority: Major
> Attachments: SPARK-34982.py
>
>
> Hello! I upgraded a job to Spark 3.0.1 (from 2.4.4) and encountered this 
> issue. The job uses asDict(True) in pyspark. I reproduced the issue with a 
> concise schema and code. Consider this example schema:
> {code:java}
> root
>  |-- id: integer (nullable = false)
>  |-- struct_1: struct (nullable = true)
>  | |-- array_1_1: array (nullable = true)
>  | | |-- element: string (containsNull = false)
>  |-- struct_2: struct (nullable = true)
>  | |-- array_2_1: array (nullable = true)
>  | | |-- element: string (containsNull = false){code}
> I created 100 rows with the above schema, filled them with some numbers, and 
> checked row.asDict(True) against the input. For some rows
> {code:java}
> struct_1.array_1_1{code}
> is missing. Instead I get
> {code:java}
> struct_1.array_2_1{code}
> And I also observe this happens when array_1_1 is null. Example assert 
> failure:
> {code:java}
> AssertionError: {'id': 7, 'struct_1': {'array_2_1': None}, 'struct_2': 
> {'array_2_1': None}} != {'id': 7, 'struct_1': {'array_1_1': None}, 
> 'struct_2': {'array_2_1': None}}
> {code}
>  I have attached a minimal script that reproduces this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34982) Pyspark asDict() returns wrong fields for a nested dataframe

2021-04-07 Thread Kumaresh AK (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh AK updated SPARK-34982:

Attachment: SPARK-34982.py

> Pyspark asDict() returns wrong fields for a nested dataframe
> 
>
> Key: SPARK-34982
> URL: https://issues.apache.org/jira/browse/SPARK-34982
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.0.2
> Environment: Tested with EMR 6.2.0. python: 3.8.5
> Also Tested with local pyspark on windows. v: 3.0.1. python: 3.8.5
>Reporter: Kumaresh AK
>Priority: Major
> Attachments: SPARK-34982.py
>
>
> Hello! I upgraded a job to Spark 3.0.1 (from 2.4.4) and encountered this 
> issue. The job uses asDict(True) in pyspark. I reproduced the issue with a 
> concise schema and code. Consider this example schema:
> {code:java}
> root
>  |-- id: integer (nullable = false)
>  |-- struct_1: struct (nullable = true)
>  | |-- array_1_1: array (nullable = true)
>  | | |-- element: string (containsNull = false)
>  |-- struct_2: struct (nullable = true)
>  | |-- array_2_1: array (nullable = true)
>  | | |-- element: string (containsNull = false){code}
> I created 100 rows with the above schema, filled them with some numbers, and 
> checked row.asDict(True) against the input. For some rows
> {code:java}
> struct_1.array_1_1{code}
> is missing. Instead I get
> {code:java}
> struct_1.array_2_1{code}
> And I also observe this happens when array_1_1 is null. Example assert 
> failure:
> {code:java}
> AssertionError: {'id': 7, 'struct_1': {'array_2_1': None}, 'struct_2': 
> {'array_2_1': None}} != {'id': 7, 'struct_1': {'array_1_1': None}, 
> 'struct_2': {'array_2_1': None}}
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34982) Pyspark asDict() returns wrong fields for a nested dataframe

2021-04-07 Thread Kumaresh AK (Jira)
Kumaresh AK created SPARK-34982:
---

 Summary: Pyspark asDict() returns wrong fields for a nested 
dataframe
 Key: SPARK-34982
 URL: https://issues.apache.org/jira/browse/SPARK-34982
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.2, 3.0.1
 Environment: Tested with EMR 6.2.0. python: 3.8.5

Also Tested with local pyspark on windows. v: 3.0.1. python: 3.8.5
Reporter: Kumaresh AK


Hello! I upgraded a job to Spark 3.0.1 (from 2.4.4) and encountered this issue. 
The job uses asDict(True) in pyspark. I reproduced the issue with a concise 
schema and code. Consider this example schema:
{code:java}
root
 |-- id: integer (nullable = false)
 |-- struct_1: struct (nullable = true)
 | |-- array_1_1: array (nullable = true)
 | | |-- element: string (containsNull = false)
 |-- struct_2: struct (nullable = true)
 | |-- array_2_1: array (nullable = true)
 | | |-- element: string (containsNull = false){code}
I created 100 rows with the above schema, filled them with some numbers, and 
checked row.asDict(True) against the input. For some rows
{code:java}
struct_1.array_1_1{code}
is missing. Instead I get
{code:java}
struct_1.array_2_1{code}
And I also observe this happens when array_1_1 is null. Example assert failure:
{code:java}
AssertionError: {'id': 7, 'struct_1': {'array_2_1': None}, 'struct_2': 
{'array_2_1': None}} != {'id': 7, 'struct_1': {'array_1_1': None}, 'struct_2': 
{'array_2_1': None}}

{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-04-07 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316714#comment-17316714
 ] 

Chao Sun commented on SPARK-34780:
--

Hi [~mikechen] (and sorry for the late reply again), thanks for providing 
another very useful code snippet! I'm not sure this qualifies as a correctness 
issue, though, since it is (to me) more a matter of different interpretations 
of malformed columns in CSV?

My previous statement about {{SessionState}} is incorrect. It seems the conf in 
{{SessionState}} is always the most up-to-date one. The only solution I can 
think of is to take the conf into account when checking equality of 
{{HadoopFsRelation}} (and potentially others), which means we'd need to define 
equality for {{SQLConf}}.

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a dataframe is cached, the logical plan can contain copies of the Spark 
> session, meaning the SQLConfs are stored with it. Then, if a different 
> dataframe replaces parts of its logical plan with a cached logical plan, the 
> cached SQLConfs will be used when evaluating that cached logical plan. This is 
> because HadoopFsRelation ignores sparkSession for equality checks (introduced 
> in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
>     val tableDir = dir.getAbsoluteFile + "/table"
>     val df = Seq("a").toDF("key")
>     df.write.parquet(tableDir)
>     SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
>     val compression1Stats = spark.read.parquet(tableDir).select("key").
>       queryExecution.optimizedPlan.collect {
>         case l: LogicalRelation => l
>         case m: InMemoryRelation => m
>       }.map(_.computeStats())
>     SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
>     val df2 = spark.read.parquet(tableDir).select("key")
>     df2.cache()
>     val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>       case l: LogicalRelation => l
>       case m: InMemoryRelation => m
>     }.map(_.computeStats())
>     SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
>     val compression1StatsWithCache = spark.read.parquet(tableDir).select("key").
>       queryExecution.optimizedPlan.collect {
>         case l: LogicalRelation => l
>         case m: InMemoryRelation => m
>       }.map(_.computeStats())
>     // I expect these stats to be the same because the file compression factor
>     // is the same
>     assert(compression1Stats == compression1StatsWithCache)
>     // Instead, the file compression factor is cached and used along with the
>     // logical plan
>     assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  
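For reference, a rough PySpark version of the same reuse path (hedged sketch: 
it assumes an active SparkSession, a placeholder Parquet path, and that 
SQLConf.FILE_COMPRESSION_FACTOR maps to the conf key used below):
{code:python}
spark.conf.set("spark.sql.sources.fileCompressionFactor", "10")
cached = spark.read.parquet("/tmp/spark34780_table")   # placeholder path
cached.cache().count()                                 # materialize the cache entry

spark.conf.set("spark.sql.sources.fileCompressionFactor", "1")
again = spark.read.parquet("/tmp/spark34780_table")
# The new scan is swapped for the InMemoryRelation built above, so the plan
# (and its stats) still reflect fileCompressionFactor = 10.
again.explain()
{code}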



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-04-07 Thread Sergey Kotlov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316697#comment-17316697
 ] 

Sergey Kotlov commented on SPARK-34674:
---

The fix that I currently use:  [https://github.com/apache/spark/pull/32081]

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey Kotlov
>Priority: Major
>
> Hello!
>  I have run into a problem: if I don't call sparkContext.stop() explicitly, 
> then the Spark driver process doesn't terminate even after its main method 
> has completed. This behaviour is different from Spark on YARN, where stopping 
> the sparkContext manually is not required.
>  It looks like the problem is in the use of non-daemon threads, which prevent 
> the driver JVM process from terminating.
>  At least two non-daemon threads remain if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me whether it is possible to solve this problem?
> The Docker image from the official spark-3.1.1 hadoop3.2 release is used.
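Not the fix from the linked PR, just a sketch of the common workaround: stop 
the session explicitly in a finally block so the driver JVM can exit even when 
client libraries leave non-daemon threads behind (the app name and job body 
below are placeholders):
{code:python}
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("k8s-job").getOrCreate()
    try:
        spark.range(10).count()   # stand-in for the real job
    finally:
        spark.stop()              # without this the driver pod can hang on k8s

if __name__ == "__main__":
    main()
{code}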



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-04-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316694#comment-17316694
 ] 

Apache Spark commented on SPARK-34674:
--

User 'kotlovs' has created a pull request for this issue:
https://github.com/apache/spark/pull/32081

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey Kotlov
>Priority: Major
>
> Hello!
>  I have run into a problem: if I don't call sparkContext.stop() explicitly, 
> then the Spark driver process doesn't terminate even after its main method 
> has completed. This behaviour is different from Spark on YARN, where stopping 
> the sparkContext manually is not required.
>  It looks like the problem is in the use of non-daemon threads, which prevent 
> the driver JVM process from terminating.
>  At least two non-daemon threads remain if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me whether it is possible to solve this problem?
> The Docker image from the official spark-3.1.1 hadoop3.2 release is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-04-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316695#comment-17316695
 ] 

Apache Spark commented on SPARK-34674:
--

User 'kotlovs' has created a pull request for this issue:
https://github.com/apache/spark/pull/32081

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey Kotlov
>Priority: Major
>
> Hello!
>  I have run into a problem: if I don't call sparkContext.stop() explicitly, 
> then the Spark driver process doesn't terminate even after its main method 
> has completed. This behaviour is different from Spark on YARN, where stopping 
> the sparkContext manually is not required.
>  It looks like the problem is in the use of non-daemon threads, which prevent 
> the driver JVM process from terminating.
>  At least two non-daemon threads remain if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me whether it is possible to solve this problem?
> The Docker image from the official spark-3.1.1 hadoop3.2 release is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32634) Introduce sort-based fallback mechanism for shuffled hash join

2021-04-07 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316683#comment-17316683
 ] 

Cheng Su commented on SPARK-32634:
--

[~Thomas Liu] - To implement the fallback mechanism with whole-stage code-gen, 
we need to override the `doProduce()` method as well. A similar case is the 
sort-based fallback for hash aggregate (e.g. `doProduceWithKeys` - 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L679]).

As for progress: sorry, I have been busy with something else. I will work on 
this in the coming weeks, targeting the Spark 3.2.0 release. Thanks.

> Introduce sort-based fallback mechanism for shuffled hash join 
> ---
>
> Key: SPARK-32634
> URL: https://issues.apache.org/jira/browse/SPARK-32634
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> A major pain point that keeps Spark users away from shuffled hash join is the 
> out-of-memory issue. Shuffled hash join tends to hit OOM because it allocates 
> an in-memory hashed relation (`UnsafeHashedRelation` or `LongHashedRelation`) 
> for the build side, and there's no recovery (e.g. fallback/spill) once the 
> hashed relation grows and can no longer fit in memory. On the other hand, 
> shuffled hash join is more CPU and IO efficient than sort merge join when 
> joining one large table and a small table (where the small table is still too 
> large to be broadcast), as SHJ does not sort the large table while SMJ has to.
> To improve the reliability of shuffled hash join, a fallback mechanism can be 
> introduced to avoid the OOM issue completely; we already have a similar 
> fallback to sort-based aggregation for hash aggregate. The idea is:
> (1). Build the hashed relation as today, but monitor its size while inserting 
> each build-side row. If the size stays below a configurable threshold, go to 
> (2.1); otherwise go to (2.2).
> (2.1). Current shuffled hash join logic: read stream-side rows and probe the 
> hashed relation.
> (2.2). Fall back to sort merge join: sort the stream-side rows, and sort the 
> build-side rows (iterate the rows already in the hashed relation, e.g. through 
> `BytesToBytesMap.destructiveIterator`, then iterate the rest of the un-read 
> build-side rows). Then do a sort merge join over the stream and build side 
> rows.
>  
> Note:
> (1). The fallback is dynamic and happens per task, which means task 0 can 
> incur the fallback (e.g. if it has a big build side) while tasks 1 and 2 may 
> not, depending on the size of their hashed relations.
> (2). There is no major code change for SHJ and SMJ. The major change is around 
> HashedRelation, which gains some new methods, e.g. 
> `HashedRelation.destructiveValues()` to return an Iterator of the build-side 
> rows in the hashed relation while cleaning it up along the way.
> (3). We have run this feature by default in our internal fork for more than 2 
> years and benefit a lot from it: users can choose to use SHJ and we don't need 
> to worry about SHJ reliability (see 
> https://issues.apache.org/jira/browse/SPARK-21505 for the original proposal 
> from our side; I tweak it here to make it less intrusive and more acceptable, 
> e.g. not introducing a separate join operator but doing the fallback 
> automatically inside the SHJ operator itself).
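A plain-Python sketch of the flow in (1), (2.1) and (2.2) above, as an 
illustration only: the row count stands in for the memory-size check, and 
nothing here uses Spark's HashedRelation internals.
{code:python}
from collections import defaultdict

def sort_merge_join(build_rows, stream_rows, key):
    """Join two in-memory row lists by sorting both sides and merging."""
    build_sorted = sorted(build_rows, key=key)
    stream_sorted = sorted(stream_rows, key=key)
    out, j = [], 0
    for b in build_sorted:
        while j < len(stream_sorted) and key(stream_sorted[j]) < key(b):
            j += 1
        k = j
        while k < len(stream_sorted) and key(stream_sorted[k]) == key(b):
            out.append((stream_sorted[k], b))
            k += 1
    return out

def shuffled_hash_join_with_fallback(build_rows, stream_rows, key, threshold):
    """(1) Build the hash table while monitoring its size, then either
    (2.1) probe it, or (2.2) fall back to sort-merge once the threshold is hit."""
    hashed = defaultdict(list)
    build_iter = iter(build_rows)
    inserted = 0
    for row in build_iter:
        hashed[key(row)].append(row)
        inserted += 1
        if inserted > threshold:                       # (2.2) too big: fall back
            drained = [r for rs in hashed.values() for r in rs]
            return sort_merge_join(drained + list(build_iter), stream_rows, key)
    # (2.1) normal shuffled-hash-join path: probe the hash table per stream row.
    return [(s, b) for s in stream_rows for b in hashed.get(key(s), [])]
{code}
In Spark itself the monitoring would happen while inserting into the hashed 
relation, and the already-inserted rows would be drained through something like 
the destructive iterator mentioned above.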



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34955) ADD JAR command cannot add jar files which contains whitespaces in the path

2021-04-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34955.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32052
[https://github.com/apache/spark/pull/32052]

> ADD JAR command cannot add jar files which contains whitespaces in the path
> ---
>
> Key: SPARK-34955
> URL: https://issues.apache.org/jira/browse/SPARK-34955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.2, 3.2.0, 3.1.1
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> The ADD JAR command cannot add jar files that contain white spaces in their paths.
> If we have `/some/path/test file.jar` and execute the following command:
> {code}
> ADD JAR "/some/path/test file.jar";
> {code}
> The following exception is thrown.
> {code}
> 21/04/05 10:40:38 ERROR SparkSQLDriver: Failed in [add jar "/some/path/test 
> file.jar"]
> java.lang.IllegalArgumentException: Illegal character in path at index 9: 
> /some/path/test file.jar
>   at java.net.URI.create(URI.java:852)
>   at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:129)
>   at 
> org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:34)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
> {code}
> This is because `HiveSessionStateBuilder` and `SessionStateBuilder` don't 
> check whether the path is in URI form or a plain path; they always treat the 
> path as being in URI form.
> Whitespace would need to be encoded as `%20`, so `/some/path/test file.jar` is 
> rejected.
> We can resolve this part by checking whether the given path is in URI form or 
> not.
> Unfortunately, if we fix this part, another problem occurs.
> When we execute the `ADD JAR` command, Hive's `ADD JAR` command is executed in 
> `HiveClientImpl.addJar` and `AddResourceProcessor.run` is transitively 
> invoked.
> In `AddResourceProcessor.run`, the command line is simply split on `\\s+`, so 
> the path is split into `/some/path/test` and `file.jar` before being passed to 
> `ss.add_resources`.
> https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProcessor.java#L56-L75
> So, the command still fails.
> Even if we convert the form of the path to URI like 
> `file:/some/path/test%20file.jar` and execute the following command:
> {code}
> ADD JAR "file:/some/path/test%20file";
> {code}
> The following exception is thrown.
> {code}
> 21/04/05 10:40:53 ERROR SessionState: file:/some/path/test%20file.jar does 
> not exist
> java.lang.IllegalArgumentException: file:/some/path/test%20file.jar does not 
> exist
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:1168)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1289)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1278)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1378)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1336)
>   at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:74)
> {code}
> The reason is that `Utilities.realFile`, invoked in 
> `SessionState.validateFiles`, returns `null` because `fs.exists(path)` is 
> `false`.
> https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1052-L1064
> `fs.exists` checks the existence of the given path by comparing the string 
> representation of Hadoop's `Path`.
> The string representation of `Path` is similar to URI but it's actually 
> different.
> `Path` doesn't encode the given path.
> For example, the URI form of `/some/path/jar file.jar` is 
> `file:/some/path/jar%20file.jar` but the `Path` form of it is 
> `file:/some/path/jar file.jar`. So `fs.exists` returns false.
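A tiny standalone illustration (plain Python, hedged) of that string-form 
mismatch: the URI form percent-encodes the space while the `Path`-style string 
keeps it raw, so comparing the two strings can never match.
{code:python}
from urllib.parse import quote, unquote

raw = "/some/path/jar file.jar"
uri_form = "file:" + quote(raw)    # 'file:/some/path/jar%20file.jar'
path_form = "file:" + raw          # 'file:/some/path/jar file.jar'

assert uri_form != path_form            # the string comparison fails...
assert unquote(uri_form) == path_form   # ...even though both name the same file
{code}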



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34970) Redact map-type options in the output of explain()

2021-04-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-34970.

Fix Version/s: 3.1.2
   Resolution: Fixed

Issue resolved by pull request 32079
[https://github.com/apache/spark/pull/32079]

> Redact map-type options in the output of explain()
> --
>
> Key: SPARK-34970
> URL: https://issues.apache.org/jira/browse/SPARK-34970
> Project: Spark
>  Issue Type: Task
>  Components: Security, SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.1.2
>
>
> The `explain()` method prints the arguments of tree nodes in logical/physical 
> plans. The arguments can contain a map-type option which holds sensitive 
> data.
> We should redact map-type options in the output of explain(); otherwise we 
> will see sensitive data in the explain output or the Spark UI.
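A hedged example of where such options can surface (the URL and credentials are 
placeholders, and whether they are printed depends on the data source and Spark 
version): a read whose options include a password that explain() may echo as 
part of the node arguments unless it is redacted.
{code:python}
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://db.example.com/test")
      .option("dbtable", "test")
      .option("user", "user")
      .option("password", "secret")   # should appear redacted in plan output
      .load())
df.explain(True)
{code}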



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34981) Implement V2 function resolution and evaluation

2021-04-07 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316477#comment-17316477
 ] 

Chao Sun commented on SPARK-34981:
--

Will submit a PR soon.

> Implement V2 function resolution and evaluation 
> 
>
> Key: SPARK-34981
> URL: https://issues.apache.org/jira/browse/SPARK-34981
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> This is a follow-up of SPARK-27658. With FunctionCatalog API done, this aims 
> at implementing the function resolution (in analyzer) and evaluation by 
> wrapping them into corresponding expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34981) Implement V2 function resolution and evaluation

2021-04-07 Thread Chao Sun (Jira)
Chao Sun created SPARK-34981:


 Summary: Implement V2 function resolution and evaluation 
 Key: SPARK-34981
 URL: https://issues.apache.org/jira/browse/SPARK-34981
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


This is a follow-up of SPARK-27658. With FunctionCatalog API done, this aims at 
implementing the function resolution (in analyzer) and evaluation by wrapping 
them into corresponding expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34493) Create "TEXT Files" page for Data Source documents.

2021-04-07 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-34493:


Assignee: Haejoon Lee

> Create "TEXT Files" page for Data Source documents.
> ---
>
> Key: SPARK-34493
> URL: https://issues.apache.org/jira/browse/SPARK-34493
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Adding "TEXT Files" page to [Data Sources 
> documents|https://spark.apache.org/docs/latest/sql-data-sources.html#data-sources]
>  which is missing now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34493) Create "TEXT Files" page for Data Source documents.

2021-04-07 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-34493.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32053
[https://github.com/apache/spark/pull/32053]

> Create "TEXT Files" page for Data Source documents.
> ---
>
> Key: SPARK-34493
> URL: https://issues.apache.org/jira/browse/SPARK-34493
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> Adding "TEXT Files" page to [Data Sources 
> documents|https://spark.apache.org/docs/latest/sql-data-sources.html#data-sources]
>  which is missing now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34064) Broadcast job is not aborted even the SQL statement canceled

2021-04-07 Thread Aaron Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316361#comment-17316361
 ] 

Aaron Wang commented on SPARK-34064:


May I ask when this will be resolved?

> Broadcast job is not aborted even the SQL statement canceled
> 
>
> Key: SPARK-34064
> URL: https://issues.apache.org/jira/browse/SPARK-34064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.0, 3.1.1
>Reporter: Lantao Jin
>Priority: Minor
> Attachments: Screen Shot 2021-01-11 at 12.03.13 PM.png
>
>
> SPARK-27036 introduced a runId for BroadcastExchangeExec to resolve the 
> problem that a broadcast job is not aborted when a broadcast timeout happens. 
> Since the runId is a random UUID, when a SQL statement is cancelled, these 
> broadcast sub-jobs are still not cancelled as a whole.
>  !Screen Shot 2021-01-11 at 12.03.13 PM.png|width=100%! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34668) Support casting of day-time intervals to strings

2021-04-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34668:
---

Assignee: Max Gekk

> Support casting of day-time intervals to strings
> 
>
> Key: SPARK-34668
> URL: https://issues.apache.org/jira/browse/SPARK-34668
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Extend the Cast expression and support DayTimeIntervalType in casting to 
> StringType.
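A hedged example of the user-visible effect, assuming a Spark build that 
already includes this change (3.2.0+); the ANSI interval literal syntax and the 
exact output format are illustrative only.
{code:python}
spark.sql(
    "SELECT CAST(INTERVAL '1 02:03:04' DAY TO SECOND AS STRING) AS s"
).show(truncate=False)
{code}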



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34668) Support casting of day-time intervals to strings

2021-04-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34668.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32070
[https://github.com/apache/spark/pull/32070]

> Support casting of day-time intervals to strings
> 
>
> Key: SPARK-34668
> URL: https://issues.apache.org/jira/browse/SPARK-34668
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Extend the Cast expression and support DayTimeIntervalType in casting to 
> StringType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34976) Rename GroupingSet to GroupingAnalytic

2021-04-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34976:
---

Assignee: angerszhu

> Rename GroupingSet to GroupingAnalytic
> --
>
> Key: SPARK-34976
> URL: https://issues.apache.org/jira/browse/SPARK-34976
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> Rename GroupingSet to GroupingAnalytic



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34976) Rename GroupingSet to GroupingAnalytic

2021-04-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34976.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32073
[https://github.com/apache/spark/pull/32073]

> Rename GroupingSet to GroupingAnalytic
> --
>
> Key: SPARK-34976
> URL: https://issues.apache.org/jira/browse/SPARK-34976
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> Rename GroupingSet to GroupingAnalytic



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64

2021-04-07 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-34979:

Summary: Failed to install pyspark[sql] (due to pyarrow error) on aarch64  
(was: Failed to install pyspark[sql] (due to pyarrow error))

> Failed to install pyspark[sql] (due to pyarrow error) on aarch64
> 
>
> Key: SPARK-34979
> URL: https://issues.apache.org/jira/browse/SPARK-34979
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> ~^$ pip install pyspark[sql]^~
> ~^Collecting pyarrow>=1.0.0^~
>  ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~
>  ~^Installing build dependencies ... done^~
>  ~^Getting requirements to build wheel ... done^~
>  ~^Preparing wheel metadata ... done^~
>  ~^// ... ...^~
>  ~^Building wheels for collected packages: pyarrow^~
>  ~^Building wheel for pyarrow (PEP 517) ... error^~
>  ~^ERROR: Command errored out with exit status 1:^~
>  ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
> /tmp/tmpq0n5juib^~
>  ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^Complete output (183 lines):^~
> ~^– Running cmake for pyarrow^~
>  ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
> -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
> -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off 
> -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off 
> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
> /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^error: command 'cmake' failed with exit status 1^~
>  ~^^~
>  ~^ERROR: Failed building wheel for pyarrow^~
>  ~^Failed to build pyarrow^~
>  ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
> installed directly^~
>  
> The pip installation would be failed, due to the dependency pyarrow install 
> failed.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32634) Introduce sort-based fallback mechanism for shuffled hash join

2021-04-07 Thread Lietong Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316327#comment-17316327
 ] 

Lietong Liu commented on SPARK-32634:
-

[~chengsu] I have a question about implementing the fallback mechanism. Since 
the current `doConsume` of ShuffledHashJoinExec consumes one row from the 
stream side at a time, unlike SortMergeJoinExec, how do we sort the stream side 
when the fallback is triggered? Looking forward to your reply!

> Introduce sort-based fallback mechanism for shuffled hash join 
> ---
>
> Key: SPARK-32634
> URL: https://issues.apache.org/jira/browse/SPARK-32634
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> A major pain point that keeps Spark users away from shuffled hash join is the 
> out-of-memory issue. Shuffled hash join tends to hit OOM because it allocates 
> an in-memory hashed relation (`UnsafeHashedRelation` or `LongHashedRelation`) 
> for the build side, and there's no recovery (e.g. fallback/spill) once the 
> hashed relation grows and can no longer fit in memory. On the other hand, 
> shuffled hash join is more CPU and IO efficient than sort merge join when 
> joining one large table and a small table (where the small table is still too 
> large to be broadcast), as SHJ does not sort the large table while SMJ has to.
> To improve the reliability of shuffled hash join, a fallback mechanism can be 
> introduced to avoid the OOM issue completely; we already have a similar 
> fallback to sort-based aggregation for hash aggregate. The idea is:
> (1). Build the hashed relation as today, but monitor its size while inserting 
> each build-side row. If the size stays below a configurable threshold, go to 
> (2.1); otherwise go to (2.2).
> (2.1). Current shuffled hash join logic: read stream-side rows and probe the 
> hashed relation.
> (2.2). Fall back to sort merge join: sort the stream-side rows, and sort the 
> build-side rows (iterate the rows already in the hashed relation, e.g. through 
> `BytesToBytesMap.destructiveIterator`, then iterate the rest of the un-read 
> build-side rows). Then do a sort merge join over the stream and build side 
> rows.
>  
> Note:
> (1). The fallback is dynamic and happens per task, which means task 0 can 
> incur the fallback (e.g. if it has a big build side) while tasks 1 and 2 may 
> not, depending on the size of their hashed relations.
> (2). There is no major code change for SHJ and SMJ. The major change is around 
> HashedRelation, which gains some new methods, e.g. 
> `HashedRelation.destructiveValues()` to return an Iterator of the build-side 
> rows in the hashed relation while cleaning it up along the way.
> (3). We have run this feature by default in our internal fork for more than 2 
> years and benefit a lot from it: users can choose to use SHJ and we don't need 
> to worry about SHJ reliability (see 
> https://issues.apache.org/jira/browse/SPARK-21505 for the original proposal 
> from our side; I tweak it here to make it less intrusive and more acceptable, 
> e.g. not introducing a separate join operator but doing the fallback 
> automatically inside the SHJ operator itself).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-04-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34674:


Assignee: Apache Spark

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey Kotlov
>Assignee: Apache Spark
>Priority: Major
>
> Hello!
>  I have run into a problem: if I don't call sparkContext.stop() explicitly, 
> then the Spark driver process doesn't terminate even after its main method 
> has completed. This behaviour is different from Spark on YARN, where stopping 
> the sparkContext manually is not required.
>  It looks like the problem is in the use of non-daemon threads, which prevent 
> the driver JVM process from terminating.
>  At least two non-daemon threads remain if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me whether it is possible to solve this problem?
> The Docker image from the official spark-3.1.1 hadoop3.2 release is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-04-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34674:


Assignee: (was: Apache Spark)

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey Kotlov
>Priority: Major
>
> Hello!
>  I have run into a problem: if I don't call sparkContext.stop() explicitly, 
> then the Spark driver process doesn't terminate even after its main method 
> has completed. This behaviour is different from Spark on YARN, where stopping 
> the sparkContext manually is not required.
>  It looks like the problem is in the use of non-daemon threads, which prevent 
> the driver JVM process from terminating.
>  At least two non-daemon threads remain if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me whether it is possible to solve this problem?
> The Docker image from the official spark-3.1.1 hadoop3.2 release is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-04-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316319#comment-17316319
 ] 

Apache Spark commented on SPARK-34674:
--

User 'KarlManong' has created a pull request for this issue:
https://github.com/apache/spark/pull/32080

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey Kotlov
>Priority: Major
>
> Hello!
>  I have run into a problem: if I don't call sparkContext.stop() explicitly, 
> then the Spark driver process doesn't terminate even after its main method 
> has completed. This behaviour is different from Spark on YARN, where stopping 
> the sparkContext manually is not required.
>  It looks like the problem is in the use of non-daemon threads, which prevent 
> the driver JVM process from terminating.
>  At least two non-daemon threads remain if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me whether it is possible to solve this problem?
> The Docker image from the official spark-3.1.1 hadoop3.2 release is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-04-07 Thread KarlManong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316286#comment-17316286
 ] 

KarlManong commented on SPARK-34674:


I have the same problem, and I submitted a PR: 
https://github.com/apache/spark/pull/32080

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey Kotlov
>Priority: Major
>
> Hello!
>  I have run into a problem: if I don't call sparkContext.stop() explicitly, 
> then the Spark driver process doesn't terminate even after its main method 
> has completed. This behaviour is different from Spark on YARN, where stopping 
> the sparkContext manually is not required.
>  It looks like the problem is in the use of non-daemon threads, which prevent 
> the driver JVM process from terminating.
>  At least two non-daemon threads remain if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me whether it is possible to solve this problem?
> The Docker image from the official spark-3.1.1 hadoop3.2 release is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34972) Make doctests work in Spark.

2021-04-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34972.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32069
[https://github.com/apache/spark/pull/32069]

> Make doctests work in Spark.
> 
>
> Key: SPARK-34972
> URL: https://issues.apache.org/jira/browse/SPARK-34972
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34972) Make doctests work in Spark.

2021-04-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34972:


Assignee: Takuya Ueshin

> Make doctests work in Spark.
> 
>
> Key: SPARK-34972
> URL: https://issues.apache.org/jira/browse/SPARK-34972
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34970) Redact map-type options in the output of explain()

2021-04-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316218#comment-17316218
 ] 

Apache Spark commented on SPARK-34970:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32079

> Redact map-type options in the output of explain()
> --
>
> Key: SPARK-34970
> URL: https://issues.apache.org/jira/browse/SPARK-34970
> Project: Spark
>  Issue Type: Task
>  Components: Security, SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> The `explain()` method prints the arguments of tree nodes in logical/physical 
> plans. The arguments can contain a map-type option which holds sensitive 
> data.
> We should redact map-type options in the output of explain(); otherwise we 
> will see sensitive data in the explain output or the Spark UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34980) Support coalesce partition through union

2021-04-07 Thread ulysses you (Jira)
ulysses you created SPARK-34980:
---

 Summary: Support coalesce partition through union
 Key: SPARK-34980
 URL: https://issues.apache.org/jira/browse/SPARK-34980
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: ulysses you


The rule `CoalesceShufflePartitions` can only coalesce partitions if
* the leaf node is a ShuffleQueryStage
* all shuffles have the same partition number

With `Union`, these assumptions can be broken. Say we have a plan like this:
{code:java}
Union
  HashAggregate
    ShuffleQueryStage
  FileScan
{code}

`CoalesceShufflePartitions` cannot optimize it, and the resulting partition 
count would be `shuffle partitions + FileScan partitions`, which can be quite 
large.

It would be better to support partial optimization through `Union`.
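A rough PySpark shape of the plan above (paths and column names are 
placeholders): one branch of the union goes through a shuffle, the other is a 
plain file scan, so the combined output keeps both sets of partitions.
{code:python}
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

aggregated = spark.read.parquet("/tmp/t1").groupBy("key").count()         # shuffle branch
scanned = spark.read.parquet("/tmp/t2").selectExpr("key", "1L AS count")  # file-scan branch

unioned = aggregated.union(scanned)
# Roughly shuffle partitions + file-scan partitions, because the rule cannot
# coalesce through the Union.
print(unioned.rdd.getNumPartitions())
{code}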



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27658) Catalog API to load functions

2021-04-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27658:
---

Assignee: Ryan Blue

> Catalog API to load functions
> -
>
> Key: SPARK-27658
> URL: https://issues.apache.org/jira/browse/SPARK-27658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>
> SPARK-24252 added an API that catalog plugins can implement to expose table 
> operations. Catalogs should also be able to provide function implementations 
> to Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27658) Catalog API to load functions

2021-04-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27658.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 24559
[https://github.com/apache/spark/pull/24559]

> Catalog API to load functions
> -
>
> Key: SPARK-27658
> URL: https://issues.apache.org/jira/browse/SPARK-27658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.2.0
>
>
> SPARK-24252 added an API that catalog plugins can implement to expose table 
> operations. Catalogs should also be able to provide function implementations 
> to Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error)

2021-04-07 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-34979:

Description: 
~^$ pip install pyspark[sql]^~

~^Collecting pyarrow>=1.0.0^~
 ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~
 ~^Installing build dependencies ... done^~
 ~^Getting requirements to build wheel ... done^~
 ~^Preparing wheel metadata ... done^~
 ~^// ... ...^~
 ~^Building wheels for collected packages: pyarrow^~
 ~^Building wheel for pyarrow (PEP 517) ... error^~
 ~^ERROR: Command errored out with exit status 1:^~
 ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
/tmp/tmpq0n5juib^~
 ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~
 ~^Complete output (183 lines):^~

~^– Running cmake for pyarrow^~
 ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
-DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
-DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
-DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off -DPYARROW_BUILD_PARQUET=off 
-DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off -DPYARROW_BUILD_HDFS=off 
-DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off 
-DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off 
-DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on 
-DCMAKE_BUILD_TYPE=release /tmp/pip-install-sh0myu71/pyarrow^~
 ~^error: command 'cmake' failed with exit status 1^~
 ~^^~
 ~^ERROR: Failed building wheel for pyarrow^~
 ~^Failed to build pyarrow^~
 ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
installed directly^~

 

The pip installation would be failed, due to the dependency pyarrow install 
failed.

 

 

  was:
~^$ pip install pyspark[sql] --prefer-binary^~

~^Collecting pyarrow>=1.0.0^~
 ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~
 ~^Installing build dependencies ... done^~
 ~^Getting requirements to build wheel ... done^~
 ~^Preparing wheel metadata ... done^~
 ~^// ... ...^~
 ~^Building wheels for collected packages: pyarrow^~
 ~^Building wheel for pyarrow (PEP 517) ... error^~
 ~^ERROR: Command errored out with exit status 1:^~
 ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
/tmp/tmpq0n5juib^~
 ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~
 ~^Complete output (183 lines):^~

~^– Running cmake for pyarrow^~
 ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
-DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
-DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
-DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off -DPYARROW_BUILD_PARQUET=off 
-DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off -DPYARROW_BUILD_HDFS=off 
-DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off 
-DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off 
-DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on 
-DCMAKE_BUILD_TYPE=release /tmp/pip-install-sh0myu71/pyarrow^~
 ~^error: command 'cmake' failed with exit status 1^~
 ~^^~
 ~^ERROR: Failed building wheel for pyarrow^~
 ~^Failed to build pyarrow^~
 ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
installed directly^~

 

The pip installation would be failed, due to the dependency pyarrow install 
failed.

 

 


> Failed to install pyspark[sql] (due to pyarrow error)
> -
>
> Key: SPARK-34979
> URL: https://issues.apache.org/jira/browse/SPARK-34979
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> ~^$ pip install pyspark[sql]^~
> ~^Collecting pyarrow>=1.0.0^~
>  ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~
>  ~^Installing build dependencies ... done^~
>  ~^Getting requirements to build wheel ... done^~
>  ~^Preparing wheel metadata ... done^~
>  ~^// ... ...^~
>  ~^Building wheels for collected packages: pyarrow^~
>  ~^Building wheel for pyarrow (PEP 517) ... error^~
>  ~^ERROR: Command errored out with exit status 1:^~
>  ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
> /tmp/tmpq0n5juib^~
>  ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^Complete output (183 lines):^~
> ~^– Running cmake for pyarrow^~
>  ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
> -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
> -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off 
> -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off 
> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
> 

[jira] [Updated] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error)

2021-04-07 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-34979:

Description: 
~^$ pip install pyspark[sql] --prefer-binary^~

~^Collecting pyarrow>=1.0.0^~
 ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~
 ~^Installing build dependencies ... done^~
 ~^Getting requirements to build wheel ... done^~
 ~^Preparing wheel metadata ... done^~
 ~^// ... ...^~
 ~^Building wheels for collected packages: pyarrow^~
 ~^Building wheel for pyarrow (PEP 517) ... error^~
 ~^ERROR: Command errored out with exit status 1:^~
 ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
/tmp/tmpq0n5juib^~
 ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~
 ~^Complete output (183 lines):^~

~^– Running cmake for pyarrow^~
 ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
-DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
-DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
-DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off -DPYARROW_BUILD_PARQUET=off 
-DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off -DPYARROW_BUILD_HDFS=off 
-DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off 
-DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off 
-DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on 
-DCMAKE_BUILD_TYPE=release /tmp/pip-install-sh0myu71/pyarrow^~
 ~^error: command 'cmake' failed with exit status 1^~
 ~^^~
 ~^ERROR: Failed building wheel for pyarrow^~
 ~^Failed to build pyarrow^~
 ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
installed directly^~

 

The pip installation would be failed, due to the dependency pyarrow install 
failed.

 

 

  was:
$ pip install pyspark[sql] --prefer-binary

Collecting pyarrow>=1.0.0
 Using cached pyarrow-3.0.0.tar.gz (682 kB)
 Installing build dependencies ... done
 Getting requirements to build wheel ... done
 Preparing wheel metadata ... done
 // ... ...
 Building wheels for collected packages: pyarrow
 Building wheel for pyarrow (PEP 517) ... error
 ERROR: Command errored out with exit status 1:
 command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel /tmp/tmpq0n5juib
 cwd: /tmp/pip-install-sh0myu71/pyarrow
 Complete output (183 lines):

– Running cmake for pyarrow
 cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
-DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
-DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
-DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off -DPYARROW_BUILD_PARQUET=off 
-DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off -DPYARROW_BUILD_HDFS=off 
-DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off 
-DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off 
-DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on 
-DCMAKE_BUILD_TYPE=release /tmp/pip-install-sh0myu71/pyarrow
 error: command 'cmake' failed with exit status 1
 
 ERROR: Failed building wheel for pyarrow
 Failed to build pyarrow
 ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
installed directly

 

The pip installation would be failed, due to the dependency pyarrow install 
failed.

 

 


> Failed to install pyspark[sql] (due to pyarrow error)
> -
>
> Key: SPARK-34979
> URL: https://issues.apache.org/jira/browse/SPARK-34979
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> ~^$ pip install pyspark[sql] --prefer-binary^~
> ~^Collecting pyarrow>=1.0.0^~
>  ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~
>  ~^Installing build dependencies ... done^~
>  ~^Getting requirements to build wheel ... done^~
>  ~^Preparing wheel metadata ... done^~
>  ~^// ... ...^~
>  ~^Building wheels for collected packages: pyarrow^~
>  ~^Building wheel for pyarrow (PEP 517) ... error^~
>  ~^ERROR: Command errored out with exit status 1:^~
>  ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
> /tmp/tmpq0n5juib^~
>  ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^Complete output (183 lines):^~
> ~^– Running cmake for pyarrow^~
>  ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
> -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
> -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off 
> -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off 
> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
> /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^error: 

[jira] [Comment Edited] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error)

2021-04-07 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316131#comment-17316131
 ] 

Yikun Jiang edited comment on SPARK-34979 at 4/7/21, 8:56 AM:
--

The pyarrow aarch64 support has been merged into pyarrow master.

[https://github.com/apache/arrow/pull/9285]

So, as a temporary workaround, you can install the nightly arrow wheels first:

 pip install --extra-index-url [https://pypi.fury.io/arrow-nightlies/] --pre 
pyarrow --prefer-binary

The Arrow community will release the next version (maybe 4.0), which supports 
installing pyarrow on aarch64, in a couple of weeks. [1]

 


was (Author: yikunkero):
As the tmp workground, you could install nightly arrow wheels first, using:

 pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ --pre 
pyarrow --prefer-binary

And the arrow community will release next version to support install pyarrow in 
aarch64 a couple of weeks.[1]



 

> Failed to install pyspark[sql] (due to pyarrow error)
> -
>
> Key: SPARK-34979
> URL: https://issues.apache.org/jira/browse/SPARK-34979
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> $ pip install pyspark[sql] --prefer-binary
> Collecting pyarrow>=1.0.0
>  Using cached pyarrow-3.0.0.tar.gz (682 kB)
>  Installing build dependencies ... done
>  Getting requirements to build wheel ... done
>  Preparing wheel metadata ... done
>  // ... ...
>  Building wheels for collected packages: pyarrow
>  Building wheel for pyarrow (PEP 517) ... error
>  ERROR: Command errored out with exit status 1:
>  command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
> /tmp/tmpq0n5juib
>  cwd: /tmp/pip-install-sh0myu71/pyarrow
>  Complete output (183 lines):
> – Running cmake for pyarrow
>  cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
> -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
> -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off 
> -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off 
> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
> /tmp/pip-install-sh0myu71/pyarrow
>  error: command 'cmake' failed with exit status 1
>  
>  ERROR: Failed building wheel for pyarrow
>  Failed to build pyarrow
>  ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
> installed directly
>  
> The pip installation fails because the pyarrow dependency fails to install.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error)

2021-04-07 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-34979:

Description: 
$ pip install pyspark[sql] --prefer-binary

Collecting pyarrow>=1.0.0
 Using cached pyarrow-3.0.0.tar.gz (682 kB)
 Installing build dependencies ... done
 Getting requirements to build wheel ... done
 Preparing wheel metadata ... done
 // ... ...
 Building wheels for collected packages: pyarrow
 Building wheel for pyarrow (PEP 517) ... error
 ERROR: Command errored out with exit status 1:
 command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel /tmp/tmpq0n5juib
 cwd: /tmp/pip-install-sh0myu71/pyarrow
 Complete output (183 lines):

-- Running cmake for pyarrow
 cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
-DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
-DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
-DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off -DPYARROW_BUILD_PARQUET=off 
-DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off -DPYARROW_BUILD_HDFS=off 
-DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off 
-DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off 
-DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on 
-DCMAKE_BUILD_TYPE=release /tmp/pip-install-sh0myu71/pyarrow
 error: command 'cmake' failed with exit status 1
 
 ERROR: Failed building wheel for pyarrow
 Failed to build pyarrow
 ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
installed directly

 

The pip installation fails because the pyarrow dependency fails to install.

 

 

  was:
$ pip install pyspark[sql] --prefer-binary
// ... ...
Building wheels for collected packages: pyarrow
 Building wheel for pyarrow (PEP 517) ... error
 ERROR: Command errored out with exit status 1:
 command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel /tmp/tmpq0n5juib
 cwd: /tmp/pip-install-sh0myu71/pyarrow
 Complete output (183 lines):

-- Running cmake for pyarrow
 cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
-DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
-DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
-DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off -DPYARROW_BUILD_PARQUET=off 
-DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off -DPYARROW_BUILD_HDFS=off 
-DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off 
-DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off 
-DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on 
-DCMAKE_BUILD_TYPE=release /tmp/pip-install-sh0myu71/pyarrow
 error: command 'cmake' failed with exit status 1
 
 ERROR: Failed building wheel for pyarrow
Failed to build pyarrow
ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
installed directly

 

 


> Failed to install pyspark[sql] (due to pyarrow error)
> -
>
> Key: SPARK-34979
> URL: https://issues.apache.org/jira/browse/SPARK-34979
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> $ pip install pyspark[sql] --prefer-binary
> Collecting pyarrow>=1.0.0
>  Using cached pyarrow-3.0.0.tar.gz (682 kB)
>  Installing build dependencies ... done
>  Getting requirements to build wheel ... done
>  Preparing wheel metadata ... done
>  // ... ...
>  Building wheels for collected packages: pyarrow
>  Building wheel for pyarrow (PEP 517) ... error
>  ERROR: Command errored out with exit status 1:
>  command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
> /tmp/tmpq0n5juib
>  cwd: /tmp/pip-install-sh0myu71/pyarrow
>  Complete output (183 lines):
> -- Running cmake for pyarrow
>  cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
> -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
> -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off 
> -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off 
> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
> /tmp/pip-install-sh0myu71/pyarrow
>  error: command 'cmake' failed with exit status 1
>  
>  ERROR: Failed building wheel for pyarrow
>  Failed to build pyarrow
>  ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
> installed directly
>  
> The pip installation fails because the pyarrow dependency fails to install.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error)

2021-04-07 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316135#comment-17316135
 ] 

Yikun Jiang commented on SPARK-34979:
-

 Looks like we have two potential options in Spark once pyarrow gains aarch64 support:

*option 1. Do nothing*, and just let pip install the latest version; but installation (without a strict pyarrow version requirement) may still fail if the resolved pyarrow version is <= 3.0.

*option 2. Bump the required pyarrow version* in Spark to >= the next pyarrow release that supports aarch64. [2]

[1] https://github.com/apache/arrow/pull/9285

[2] https://github.com/apache/spark/blob/0aa2c284e4052cb57ebf7276ecc4867ea2d5f94f/python/setup.py#L259
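For option 2, the change would essentially raise the pyarrow floor declared in the `sql` extra of `python/setup.py` [2]. A hypothetical sketch of what that could look like with plain setuptools is below; the package name, variable names and version numbers are illustrative assumptions, not Spark's actual setup.py.

{code:python}
# Hypothetical sketch of option 2: raise the minimum pyarrow version in the
# "sql" extra so pip only resolves releases that ship aarch64 wheels.
# Package name, variable names and version numbers are illustrative only.
from setuptools import setup

_minimum_pandas_version = "0.23.2"  # assumed floor, for illustration
_minimum_pyarrow_version = "4.0.0"  # bumped to the first release with aarch64 wheels

setup(
    name="pyspark-example",  # placeholder, not the real package
    version="0.0.1",
    packages=[],
    extras_require={
        "sql": [
            "pandas>=%s" % _minimum_pandas_version,
            "pyarrow>=%s" % _minimum_pyarrow_version,
        ],
    },
)
{code}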

> Failed to install pyspark[sql] (due to pyarrow error)
> -
>
> Key: SPARK-34979
> URL: https://issues.apache.org/jira/browse/SPARK-34979
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> $ pip install pyspark[sql] --prefer-binary
> // ... ...
> Building wheels for collected packages: pyarrow
>  Building wheel for pyarrow (PEP 517) ... error
>  ERROR: Command errored out with exit status 1:
>  command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
> /tmp/tmpq0n5juib
>  cwd: /tmp/pip-install-sh0myu71/pyarrow
>  Complete output (183 lines):
> -- Running cmake for pyarrow
>  cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
> -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
> -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off 
> -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off 
> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
> /tmp/pip-install-sh0myu71/pyarrow
>  error: command 'cmake' failed with exit status 1
>  
>  ERROR: Failed building wheel for pyarrow
> Failed to build pyarrow
> ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
> installed directly
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error)

2021-04-07 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316131#comment-17316131
 ] 

Yikun Jiang commented on SPARK-34979:
-

As a temporary workaround, you could install the nightly arrow wheels first, using:

 pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ --pre pyarrow --prefer-binary

And the Arrow community will release the next version, which supports installing pyarrow on aarch64, in a couple of weeks. [1]



 

> Failed to install pyspark[sql] (due to pyarrow error)
> -
>
> Key: SPARK-34979
> URL: https://issues.apache.org/jira/browse/SPARK-34979
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> $ pip install pyspark[sql] --prefer-binary
> // ... ...
> Building wheels for collected packages: pyarrow
>  Building wheel for pyarrow (PEP 517) ... error
>  ERROR: Command errored out with exit status 1:
>  command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
> /tmp/tmpq0n5juib
>  cwd: /tmp/pip-install-sh0myu71/pyarrow
>  Complete output (183 lines):
> -- Running cmake for pyarrow
>  cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
> -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
> -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off 
> -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off 
> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
> /tmp/pip-install-sh0myu71/pyarrow
>  error: command 'cmake' failed with exit status 1
>  
>  ERROR: Failed building wheel for pyarrow
> Failed to build pyarrow
> ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
> installed directly
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error)

2021-04-07 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-34979:
---

 Summary: Failed to install pyspark[sql] (due to pyarrow error)
 Key: SPARK-34979
 URL: https://issues.apache.org/jira/browse/SPARK-34979
 Project: Spark
  Issue Type: Task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Yikun Jiang


$ pip install pyspark[sql] --prefer-binary
// ... ...
Building wheels for collected packages: pyarrow
 Building wheel for pyarrow (PEP 517) ... error
 ERROR: Command errored out with exit status 1:
 command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel /tmp/tmpq0n5juib
 cwd: /tmp/pip-install-sh0myu71/pyarrow
 Complete output (183 lines):

-- Running cmake for pyarrow
 cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
-DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
-DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
-DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off -DPYARROW_BUILD_PARQUET=off 
-DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off -DPYARROW_BUILD_HDFS=off 
-DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off 
-DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off 
-DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on 
-DCMAKE_BUILD_TYPE=release /tmp/pip-install-sh0myu71/pyarrow
 error: command 'cmake' failed with exit status 1
 
 ERROR: Failed building wheel for pyarrow
Failed to build pyarrow
ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
installed directly

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34975) skip.header.line.count does not work in hive partitioned table

2021-04-07 Thread Junqing Cai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315926#comment-17315926
 ] 

Junqing Cai edited comment on SPARK-34975 at 4/7/21, 7:53 AM:
--

I think SPARK-11374 fixes the problem for non-partitioned tables,

but partitioned tables still do not work.


was (Author: weiwei121723):
I think this issue fixes the problem for non-partitioned tables:

https://issues.apache.org/jira/browse/SPARK-11374

but partitioned tables still do not work.
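To make the partitioned case easy to check, here is a minimal PySpark reproduction sketch. The paths, table name and column layout are made up for illustration, and it assumes a CSV file under /tmp/csv_part/p=1/ whose first line is a header row.

{code:python}
# Minimal reproduction sketch for the partitioned case described above.
# Paths, table and column names are illustrative assumptions; it expects a CSV
# file at /tmp/csv_part/p=1/data.csv whose first line is a header row.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS test_skip_header (col1 STRING, col2 STRING)
    PARTITIONED BY (p STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/tmp/csv_part'
    TBLPROPERTIES ('skip.header.line.count' = '1')
""")

# Register the partition that points at the CSV directory.
spark.sql("""
    ALTER TABLE test_skip_header ADD IF NOT EXISTS
    PARTITION (p='1') LOCATION '/tmp/csv_part/p=1'
""")

# Per the report, Hive skips the header row here while Spark still returns it.
spark.sql("SELECT * FROM test_skip_header").show()
{code}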

> skip.header.line.count does not work in hive partitioned table
> --
>
> Key: SPARK-34975
> URL: https://issues.apache.org/jira/browse/SPARK-34975
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.4.0
>Reporter: Junqing Cai
>Priority: Major
>
> when I use hive2.4.0 to create an external hive partitioned table with a csv 
> file, and 
> {code:java}
> TBLPROPERTIES (
> 'skip.header.line.count' = '1'
> )
> LOCATION ''{code}
>  
> the result in Hive doesn't contain the header, but the result in Spark still 
> contains the header
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34969) Followup for Refactor TreeNode's children handling methods into specialized traits (SPARK-34906)

2021-04-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-34969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-34969.
---
Fix Version/s: 3.2.0
 Assignee: Ali Afroozeh
   Resolution: Fixed

> Followup for Refactor TreeNode's children handling methods into specialized 
> traits (SPARK-34906)
> 
>
> Key: SPARK-34969
> URL: https://issues.apache.org/jira/browse/SPARK-34969
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Major
> Fix For: 3.2.0
>
>
> This is a followup for https://issues.apache.org/jira/browse/SPARK-34906
> In this PR we:
>  * Introduce the QuaternaryLike trait for node types with 4 children.
>  * Specialize more node types
>  * Fix a number of style errors that were introduced in the original PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34975) skip.header.line.count does not work in hive partitioned table

2021-04-07 Thread Junqing Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junqing Cai updated SPARK-34975:

Summary: skip.header.line.count does not work in hive partitioned table  
(was: skip.header.line.count is not work in hive partitioned table)

> skip.header.line.count does not work in hive partitioned table
> --
>
> Key: SPARK-34975
> URL: https://issues.apache.org/jira/browse/SPARK-34975
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.4.0
>Reporter: Junqing Cai
>Priority: Major
>
> when I use spark2.4 to create an external hive partitioned table with a csv file, and 
> {code:java}
> TBLPROPERTIES (
> 'skip.header.line.count' = '1'
> )
> LOCATION ''{code}
>  
> the result in Hive doesn't contain the header, but the result in Spark still 
> contains the header
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34975) skip.header.line.count does not work in hive partitioned table

2021-04-07 Thread Junqing Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junqing Cai updated SPARK-34975:

Description: 
when I use hive2.4.0 to create an external hive partitioned table with a csv file, and 
{code:java}
TBLPROPERTIES (
'skip.header.line.count' = '1'
)
LOCATION ''{code}
 

the result in Hive doesn't contain the header, but the result in Spark still 
contains the header

 

 

  was:
when I use spark2.4 to create an external hive partitioned table with a csv file, and 
{code:java}
TBLPROPERTIES (
'skip.header.line.count' = '1'
)
LOCATION ''{code}
 

the result in Hive doesn't contain the header, but the result in Spark still 
contains the header

 

 


> skip.header.line.count does not work in hive partitioned table
> --
>
> Key: SPARK-34975
> URL: https://issues.apache.org/jira/browse/SPARK-34975
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.4.0
>Reporter: Junqing Cai
>Priority: Major
>
> when I use hive2.4.0 to create an external hive partitioned table with a csv 
> file, and 
> {code:java}
> TBLPROPERTIES (
> 'skip.header.line.count' = '1'
> )
> LOCATION ''{code}
>  
> the result in Hive doesn't contain the header, but the result in Spark still 
> contains the header
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34762) Many PR's Scala 2.13 build action failed

2021-04-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316078#comment-17316078
 ] 

Apache Spark commented on SPARK-34762:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32078

> Many PR's Scala 2.13 build action failed
> 
>
> Key: SPARK-34762
> URL: https://issues.apache.org/jira/browse/SPARK-34762
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> PRs with Scala 2.13 build failures include 
>  * [https://github.com/apache/spark/pull/31849]
>  * [https://github.com/apache/spark/pull/31848]
>  * [https://github.com/apache/spark/pull/31844]
>  * [https://github.com/apache/spark/pull/31843]
>  * https://github.com/apache/spark/pull/31841
> {code:java}
> [error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:26:1:
>   error: package org.apache.commons.cli does not exist
> 1278[error] import org.apache.commons.cli.GnuParser;
> 1279[error]  ^
> 1280[error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:176:1:
>   error: cannot find symbol
> 1281[error] private final Options options = new Options();
> 1282[error]   ^  symbol:   class Options
> 1283[error]   location: class ServerOptionsProcessor
> 1284[error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:177:1:
>   error: package org.apache.commons.cli does not exist
> 1285[error] private org.apache.commons.cli.CommandLine commandLine;
> 1286[error]   ^
> 1287[error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:255:1:
>   error: cannot find symbol
> 1288[error] HelpOptionExecutor(String serverName, Options options) {
> 1289[error]   ^  symbol:   class 
> Options
> 1290[error]   location: class HelpOptionExecutor
> 1291[error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:176:1:
>   error: cannot find symbol
> 1292[error] private final Options options = new Options();
> 1293[error] ^  symbol:   class Options
> 1294[error]   location: class ServerOptionsProcessor
> 1295[error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:185:1:
>   error: cannot find symbol
> 1296[error]   options.addOption(OptionBuilder
> 1297[error] ^  symbol:   variable OptionBuilder
> 1298[error]   location: class ServerOptionsProcessor
> 1299[error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:192:1:
>   error: cannot find symbol
> 1300[error]   options.addOption(new Option("H", "help", false, "Print 
> help information"));
> 1301[error] ^  symbol:   class Option
> 1302[error]   location: class ServerOptionsProcessor
> 1303[error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:197:1:
>   error: cannot find symbol
> 1304[error] commandLine = new GnuParser().parse(options, argv);
> 1305[error]   ^  symbol:   class GnuParser
> 1306[error]   location: class ServerOptionsProcessor
> 1307[error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:211:1:
>   error: cannot find symbol
> 1308[error]   } catch (ParseException e) {
> 1309[error]^  symbol:   class ParseException
> 1310[error]   location: class ServerOptionsProcessor
> 1311[error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:262:1:
>   error: cannot find symbol
> 1312[error]   new HelpFormatter().printHelp(serverName, options);
> 1313[error]   ^  symbol:   class HelpFormatter
> 1314[error]   location: class HelpOptionExecutor
> 1315[error] Note: Some input files use or override a deprecated API.
> 1316[error] Note: Recompile with -Xlint:deprecation for details.
> 1317[error] 16 errors
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, 

[jira] [Commented] (SPARK-34502) Remove unused parameters in join methods

2021-04-07 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-34502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316056#comment-17316056
 ] 

Kadir Selçuk commented on SPARK-34502:
--

Apache k with Jira problems

> Remove unused parameters in join methods
> 
>
> Key: SPARK-34502
> URL: https://issues.apache.org/jira/browse/SPARK-34502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Trivial
> Fix For: 3.2.0
>
>
> Remove unused parameters in some join methods



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34922) Use better CBO cost function

2021-04-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316050#comment-17316050
 ] 

Apache Spark commented on SPARK-34922:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/32076

> Use better CBO cost function
> 
>
> Key: SPARK-34922
> URL: https://issues.apache.org/jira/browse/SPARK-34922
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Assignee: Tanel Kiis
>Priority: Major
> Fix For: 3.2.0
>
>
> In SPARK-33935 we changed the CBO cost function such that it would be 
> symmetric - A.betterThan(B) implies that !B.betterThan(A). Before, both could 
> have been true.
> That change introduced a performance regression in some queries. 
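The property in question is easiest to see with a toy example. The sketch below is purely illustrative and is not Spark's actual CBO cost model: it contrasts a per-dimension "betterThan" rule, under which two plans can each look better than the other, with a single combined-scalar rule, which can never report both directions as better.

{code:python}
# Toy illustration of the symmetry property discussed above; this is NOT
# Spark's actual cost model, just two hypothetical betterThan predicates.
from dataclasses import dataclass


@dataclass
class Cost:
    rows: float
    size: float


def better_than_per_dimension(a: Cost, b: Cost) -> bool:
    """Problematic toy rule: 'better' if clearly ahead on either dimension."""
    return a.rows < 0.9 * b.rows or a.size < 0.9 * b.size


def better_than_combined(a: Cost, b: Cost, weight: float = 0.7) -> bool:
    """Symmetric rule: compare one combined scalar, so a < b excludes b < a."""
    def combined(c: Cost) -> float:
        return weight * c.rows + (1 - weight) * c.size
    return combined(a) < combined(b)


a, b = Cost(rows=10, size=1000), Cost(rows=1000, size=10)
print(better_than_per_dimension(a, b), better_than_per_dimension(b, a))  # True True
print(better_than_combined(a, b), better_than_combined(b, a))  # True False
{code}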



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33357) Support SparkLauncher in Kubernetes

2021-04-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316049#comment-17316049
 ] 

Apache Spark commented on SPARK-33357:
--

User 'grarkydev' has created a pull request for this issue:
https://github.com/apache/spark/pull/32077

> Support SparkLauncher in Kubernetes
> ---
>
> Key: SPARK-33357
> URL: https://issues.apache.org/jira/browse/SPARK-33357
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: hong dongdong
>Priority: Major
>
> Currently, SparkAppHandle cannot get state reports on Kubernetes; we can support it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34922) Use better CBO cost function

2021-04-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316051#comment-17316051
 ] 

Apache Spark commented on SPARK-34922:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/32075

> Use better CBO cost function
> 
>
> Key: SPARK-34922
> URL: https://issues.apache.org/jira/browse/SPARK-34922
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Assignee: Tanel Kiis
>Priority: Major
> Fix For: 3.2.0
>
>
> In SPARK-33935 we changed the CBO cost function such that it would be 
> symmetric - A.betterThan(B) implies that !B.betterThan(A). Before, both could 
> have been true.
> That change introduced a performance regression in some queries. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34978) Support time-travel SQL syntax

2021-04-07 Thread Yann Byron (Jira)
Yann Byron created SPARK-34978:
--

 Summary: Support time-travel SQL syntax
 Key: SPARK-34978
 URL: https://issues.apache.org/jira/browse/SPARK-34978
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.1
Reporter: Yann Byron


Some data sources, like Delta, have the ability to query an older snapshot.

The syntax may be like `TIMESTAMP AS OF '2018-10-18 22:15:12'` to query the snapshot closest to this time point, or `VERSION AS OF 12` to query the 12th snapshot.
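Assuming the proposed clauses were adopted, usage from PySpark might look like the sketch below; the table name and values are made up, and the SQL will not run on current Spark since the syntax is only a proposal here.

{code:python}
# Sketch of the proposed time-travel clauses; the table name and values are
# illustrative, and this SQL is the proposal above, not existing Spark syntax.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Snapshot closest to a point in time (proposed TIMESTAMP AS OF clause).
spark.sql("SELECT * FROM events TIMESTAMP AS OF '2018-10-18 22:15:12'").show()

# A specific snapshot by version number (proposed VERSION AS OF clause).
spark.sql("SELECT * FROM events VERSION AS OF 12").show()
{code}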



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34975) skip.header.line.count is not work in hive partitioned table

2021-04-07 Thread Junqing Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junqing Cai updated SPARK-34975:

Description: 
when I use spark2.4 to create an external hive partitioned table with a csv file, and 
{code:java}
TBLPROPERTIES (
'skip.header.line.count' = '1'
)
LOCATION ''{code}
 

the result in Hive doesn't contain the header, but the result in Spark still 
contains the header

 

 

  was:
when I use spark2.4 to create an external hive partitioned table with a csv file, and 
{code:java}
TBLPROPERTIES (
'skip.header.line.count' = '1'
)
LOCATION ''{code}
 

the result in Hive doesn't contain the header, but the result in Spark still 
contains the header

 

but if I create the table in Hive, the result in Spark is correct

 


> skip.header.line.count is not work in hive partitioned table
> 
>
> Key: SPARK-34975
> URL: https://issues.apache.org/jira/browse/SPARK-34975
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.4.0
>Reporter: Junqing Cai
>Priority: Major
>
> when I use spark2.4 to create an external hive partitioned table with a csv file, and 
> {code:java}
> TBLPROPERTIES (
> 'skip.header.line.count' = '1'
> )
> LOCATION ''{code}
>  
> the result in Hive doesn't contain the header, but the result in Spark still 
> contains the header
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34977) LIST FILES/JARS/ARCHIVES cannot handle multiple arguments properly when at least one path is quoted.

2021-04-07 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-34977:
---
Description: 
`LIST FILES/JARS/ARCHIVES path1 path2 ...` cannot list all paths if at least 
one path is quoted.
Here is an example:

{code}
ADD FILE /tmp/test1;
ADD FILE /tmp/test2;

LIST FILES /tmp/test1 /tmp/test2;
file:/tmp/test1
file:/tmp/test2

LIST FILES /tmp/test1 "/tmp/test2";
file:/tmp/test2
{code}

In this example, the second `LIST FILES` doesn't show `file:/tmp/test1`.


  was:
`LIST {FILES/JARS/ARCHIVES} path1, path2, ...` cannot list all paths if at 
least one path is quoted.
An example here.

{code}
ADD FILE /tmp/test1;
ADD FILE /tmp/test2;

LIST FILES /tmp/test1 /tmp/test2;
file:/tmp/test1
file:/tmp/test2

LIST FILES /tmp/test1 "/tmp/test2";
file:/tmp/test2
{code}

In this example, the second `LIST FILES` doesn't show `file:/tmp/test1`.



> LIST FILES/JARS/ARCHIVES cannot handle multiple arguments properly when at 
> least one path is quoted.
> 
>
> Key: SPARK-34977
> URL: https://issues.apache.org/jira/browse/SPARK-34977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> `LIST FILES/JARS/ARCHIVES path1 path2 ...` cannot list all paths if at least 
> one path is quoted.
> An example here.
> {code}
> ADD FILE /tmp/test1;
> ADD FILE /tmp/test2;
> LIST FILES /tmp/test1 /tmp/test2;
> file:/tmp/test1
> file:/tmp/test2
> LIST FILES /tmp/test1 "/tmp/test2";
> file:/tmp/test2
> {code}
> In this example, the second `LIST FILES` doesn't show `file:/tmp/test1`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org