[jira] [Resolved] (SPARK-32171) change file locations for use db and refresh table

2020-07-04 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-32171.

Resolution: Fixed

> change file locations for use db and refresh table
> --
>
> Key: SPARK-32171
> URL: https://issues.apache.org/jira/browse/SPARK-32171
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Trivial
>
> docs/sql-ref-syntax-aux-refresh-table.md -> 
> docs/sql-ref-syntax-aux-cache-refresh-table.md
> docs/sql-ref-syntax-qry-select-usedb.md -> docs/sql-ref-syntax-ddl-usedb.md






[jira] [Assigned] (SPARK-32172) Use createDirectory instead of mkdir

2020-07-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32172:


Assignee: (was: Apache Spark)

> Use createDirectory instead of mkdir
> 
>
> Key: SPARK-32172
> URL: https://issues.apache.org/jira/browse/SPARK-32172
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: Zuo Dao
>Priority: Major
>







[jira] [Commented] (SPARK-32172) Use createDirectory instead of mkdir

2020-07-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151455#comment-17151455
 ] 

Apache Spark commented on SPARK-32172:
--

User 'sidedoorleftroad' has created a pull request for this issue:
https://github.com/apache/spark/pull/28997

> Use createDirectory instead of mkdir
> 
>
> Key: SPARK-32172
> URL: https://issues.apache.org/jira/browse/SPARK-32172
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: Zuo Dao
>Priority: Major
>







[jira] [Assigned] (SPARK-32172) Use createDirectory instead of mkdir

2020-07-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32172:


Assignee: Apache Spark

> Use createDirectory instead of mkdir
> 
>
> Key: SPARK-32172
> URL: https://issues.apache.org/jira/browse/SPARK-32172
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: Zuo Dao
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-32172) Use createDirectory instead of mkdir

2020-07-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151454#comment-17151454
 ] 

Apache Spark commented on SPARK-32172:
--

User 'sidedoorleftroad' has created a pull request for this issue:
https://github.com/apache/spark/pull/28997

> Use createDirectory instead of mkdir
> 
>
> Key: SPARK-32172
> URL: https://issues.apache.org/jira/browse/SPARK-32172
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: Zuo Dao
>Priority: Major
>







[jira] [Created] (SPARK-32172) Use createDirectory instead of mkdir

2020-07-04 Thread Zuo Dao (Jira)
Zuo Dao created SPARK-32172:
---

 Summary: Use createDirectory instead of mkdir
 Key: SPARK-32172
 URL: https://issues.apache.org/jira/browse/SPARK-32172
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0, 2.4.0, 2.3.0
Reporter: Zuo Dao
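The ticket has no description, but the title suggests replacing `java.io.File.mkdir` (which only reports failure as a bare `false`) with a create-directory call that says why creation failed. A minimal sketch of the difference using only standard JDK APIs; the path and the handling below are illustrative, and the actual change is whatever the linked pull request does:

{code:java}
import java.io.File
import java.nio.file.{Files, Paths}

val dir = new File("/tmp/spark-example-dir")

// mkdir() just returns false on failure; the caller cannot tell whether the
// directory already exists, the parent is missing, or permissions are wrong.
if (!dir.mkdir()) {
  println(s"Could not create $dir (reason unknown)")
}

// Files.createDirectory() throws an exception that carries the reason.
try {
  Files.createDirectory(Paths.get("/tmp/spark-example-dir"))
} catch {
  case e: java.nio.file.FileAlreadyExistsException =>
    println(s"Already exists: ${e.getFile}")
  case e: java.io.IOException =>
    println(s"Failed to create directory: $e")
}
{code}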









[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-07-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151431#comment-17151431
 ] 

Apache Spark commented on SPARK-29358:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/28996
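For reference, the change under review adds an opt-in flag to unionByName so that missing columns are filled with nulls instead of failing. A sketch of the intended usage in spark-shell, assuming the parameter ends up being called `allowMissingColumns` (the name may differ in the final API):

{code:java}
// spark-shell (spark.implicits._ already in scope)
val df1 = Seq(1, 2, 3).toDF("x")
val df2 = Seq("a", "b", "c").toDF("y")

// Proposed behaviour: columns missing on either side are filled with nulls.
df1.unionByName(df2, allowMissingColumns = true).show()

// Today's workaround, as described in the issue:
import org.apache.spark.sql.functions.lit
df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))).show()
{code}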

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}






[jira] [Assigned] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-07-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29358:


Assignee: Apache Spark

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Mukul Murthy
>Assignee: Apache Spark
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}






[jira] [Assigned] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-07-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29358:


Assignee: (was: Apache Spark)

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}






[jira] [Comment Edited] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151392#comment-17151392
 ] 

Dongjoon Hyun edited comment on SPARK-32167 at 7/4/20, 6:20 PM:


After looking at the code, I realize that this is a very old code path from 
1.5.0 or older. This should be the following and 2.3.4 and older also have this 
bug.
{code:java}
scala> df.select($"arr".getField("i")).printSchema
root
 |-- arr.i: array (nullable = true)
 ||-- element: integer (containsNull = true){code}


was (Author: dongjoon):
After looking at the code, I realize that this is a very old code path from 
1.5.0. This should be the following and 2.3.4 and older also have this bug.
{code:java}
scala> df.select($"arr".getField("i")).printSchema
root
 |-- arr.i: array (nullable = true)
 ||-- element: integer (containsNull = true){code}

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
>
> The following should be `Array([WrappedArray(1, null)])` instead of 
> `Array([WrappedArray(1, 0)])`
> {code:java}
> import scala.collection.JavaConverters._
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{ArrayType, StructType}
> val innerStruct = new StructType().add("i", "int", nullable = true)
> val schema = new StructType().add("arr", ArrayType(innerStruct, containsNull 
> = false))
> val df = spark.createDataFrame(List(Row(Seq(Row(1), Row(null)))).asJava, schema)
> df.select($"arr".getField("i")).collect
> {code}
>  






[jira] [Comment Edited] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151392#comment-17151392
 ] 

Dongjoon Hyun edited comment on SPARK-32167 at 7/4/20, 6:20 PM:


After looking at the code, I realize that this is a very old code path from 
1.5.0 or older. This should be the following. But 2.3.4 and older also have 
this bug.
{code:java}
scala> df.select($"arr".getField("i")).printSchema
root
 |-- arr.i: array (nullable = true)
 ||-- element: integer (containsNull = true){code}


was (Author: dongjoon):
After looking at the code, I realize that this is a very old code path from 
1.5.0 or older. This should be the following and 2.3.4 and older also have this 
bug.
{code:java}
scala> df.select($"arr".getField("i")).printSchema
root
 |-- arr.i: array (nullable = true)
 ||-- element: integer (containsNull = true){code}

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
>
> The following should be `Array([WrappedArray(1, null)])` instead of 
> `Array([WrappedArray(1, 0)])`
> {code:java}
> import scala.collection.JavaConverters._
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{ArrayType, StructType}
> val innerStruct = new StructType().add("i", "int", nullable = true)
> val schema = new StructType().add("arr", ArrayType(innerStruct, containsNull 
> = false))
> val df = spark.createDataFrame(List(Row(Seq(Row(1), Row(null)))).asJava, schema)
> df.select($"arr".getField("i")).collect
> {code}
>  






[jira] [Updated] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32167:
--
Affects Version/s: 1.6.3

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
>
> The following should be `Array([WrappedArray(1, null)])` instead of 
> `Array([WrappedArray(1, 0)])`
> {code:java}
> import scala.collection.JavaConverters._
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{ArrayType, StructType}
> val innerStruct = new StructType().add("i", "int", nullable = true)
> val schema = new StructType().add("arr", ArrayType(innerStruct, containsNull 
> = false))
> val df = spark.createDataFrame(List(Row(Seq(Row(1), Row(null)))).asJava, schema)
> df.select($"arr".getField("i")).collect
> {code}
>  






[jira] [Updated] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32167:
--
Affects Version/s: (was: 2.4.5)
   (was: 2.4.4)
   (was: 2.4.3)
   (was: 2.4.2)
   (was: 2.4.1)
   (was: 2.4.0)
   2.0.2
   2.1.3
   2.2.3
   2.3.4

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
>
> The following should be `Array([WrappedArray(1, null)])` instead of 
> `Array([WrappedArray(1, 0)])`
> {code:java}
> import scala.collection.JavaConverters._
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{ArrayType, StructType}
> val innerStruct = new StructType().add("i", "int", nullable = true)
> val schema = new StructType().add("arr", ArrayType(innerStruct, containsNull 
> = false))
> val df = spark.createDataFrame(List(Row(Seq(Row(1), Row(null)))).asJava, schema)
> df.select($"arr".getField("i")).collect
> {code}
>  






[jira] [Updated] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32167:
--
Description: 
The following should be `Array([WrappedArray(1, null)])` instead of 
`Array([WrappedArray(1, 0)])`
{code:java}
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, StructType}

val innerStruct = new StructType().add("i", "int", nullable = true)
val schema = new StructType().add("arr", ArrayType(innerStruct, containsNull = 
false))
val df = spark.createDataFrame(List(Row(Seq(Row(1), Row(null)))).asJava, schema)
df.select($"arr".getField("i")).collect
{code}
 

  was:
The following should be `Array([WrappedArray(1, null)])` instead of 
`Array([WrappedArray(1, 0)])`
{code:java}
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, StructType}val innerStruct = new 
StructType().add("i", "int", nullable = true)

val schema = new StructType().add("arr", ArrayType(innerStruct, containsNull = 
false))
val df = spark.createDataFrame(List(Row(Seq(Row(1), Row(null)))).asJava, schema)
df.select($"arr".getField("i")).collect{code}
 


> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
>
> The following should be `Array([WrappedArray(1, null)])` instead of 
> `Array([WrappedArray(1, 0)])`
> {code:java}
> import scala.collection.JavaConverters._
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{ArrayType, StructType}
> val innerStruct = new StructType().add("i", "int", nullable = true)
> val schema = new StructType().add("arr", ArrayType(innerStruct, containsNull 
> = false))
> val df = spark.createDataFrame(List(Row(Seq(Row(1), Row(null)))).asJava, schema)
> df.select($"arr".getField("i")).collect
> {code}
>  






[jira] [Commented] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151392#comment-17151392
 ] 

Dongjoon Hyun commented on SPARK-32167:
---

After looking at the code, I realize that this is a very old code path from 
1.5.0. This should be the following and 2.3.4 and older also have this bug.
{code:java}
scala> df.select($"arr".getField("i")).printSchema
root
 |-- arr.i: array (nullable = true)
 ||-- element: integer (containsNull = true){code}
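In other words, even though the array column has `containsNull = false`, the struct field `i` is nullable, so the extracted `arr.i` can still contain nulls; the nullability of the extraction result has to take the field's nullability into account. A sketch of that rule (a hypothetical helper for illustration, not the actual Catalyst code):

{code:java}
// containsNull of the extracted array should be: array.containsNull || field.nullable
def extractedContainsNull(arrayContainsNull: Boolean, fieldNullable: Boolean): Boolean =
  arrayContainsNull || fieldNullable

// For the example in the description: containsNull = false, field "i" nullable = true
// => the result array must report containsNull = true, matching the schema above.
assert(extractedContainsNull(arrayContainsNull = false, fieldNullable = true))
{code}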

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
>
> The following should be `Array([WrappedArray(1, null)])` instead of 
> `Array([WrappedArray(1, 0)])`
> {code:java}
> import scala.collection.JavaConverters._
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{ArrayType, StructType}
> val innerStruct = new StructType().add("i", "int", nullable = true)
> val schema = new StructType().add("arr", ArrayType(innerStruct, containsNull 
> = false))
> val df = spark.createDataFrame(List(Row(Seq(Row(1), Row(null)))).asJava, schema)
> df.select($"arr".getField("i")).collect
> {code}
>  






[jira] [Comment Edited] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151391#comment-17151391
 ] 

Dongjoon Hyun edited comment on SPARK-32167 at 7/4/20, 6:06 PM:


With the above example, I verified 2.4.0 and 2.4.5 also have this bug while 
2.3.4 doesn't have this. I updated the `Affected Versions`.


was (Author: dongjoon):
I verified 2.4.0 and 2.4.5 also have this bug while 2.3.4 doesn't have this. I 
updated the `Affected Versions`.

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
>
> The following should be `Array([WrappedArray(1, null)])` instead of 
> `Array([WrappedArray(1, 0)])`
> {code:java}
> import scala.collection.JavaConverters._
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{ArrayType, StructType}val innerStruct = 
> new StructType().add("i", "int", nullable = true)
> val schema = new StructType().add("arr", ArrayType(innerStruct, containsNull 
> = false))
> val df = spark.createDataFrame(List(Row(Seq(Row(1), Row(null)))).asJava, schema)
> df.select($"arr".getField("i")).collect{code}
>  






[jira] [Commented] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151391#comment-17151391
 ] 

Dongjoon Hyun commented on SPARK-32167:
---

I verified 2.4.0 and 2.4.5 also have this bug while 2.3.4 doesn't have this. I 
updated the `Affected Versions`.

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
>
> The following should be `Array([WrappedArray(1, null)])` instead of 
> `Array([WrappedArray(1, 0)])`
> {code:java}
> import scala.collection.JavaConverters._
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{ArrayType, StructType}val innerStruct = 
> new StructType().add("i", "int", nullable = true)
> val schema = new StructType().add("arr", ArrayType(innerStruct, containsNull 
> = false))
> val df = spark.createDataFrame(List(Row(Seq(Row(1), Row(null)))).asJava, schema)
> df.select($"arr".getField("i")).collect{code}
>  






[jira] [Updated] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32167:
--
Affects Version/s: 2.4.0
   2.4.1
   2.4.2
   2.4.3
   2.4.4
   2.4.5

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
>
> The following should be `Array([WrappedArray(1, null)])` instead of 
> `Array([WrappedArray(1, 0)])`
> {code:java}
> import scala.collection.JavaConverters._
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{ArrayType, StructType}val innerStruct = 
> new StructType().add("i", "int", nullable = true)
> val schema = new StructType().add("arr", ArrayType(innerStruct, containsNull 
> = false))
> val df = spark.createDataFrame(List(Row(Seq(Row(1), Row(null)))).asJava, schema)
> df.select($"arr".getField("i")).collect{code}
>  






[jira] [Updated] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32167:
--
Description: 
The following should be `Array([WrappedArray(1, null)])` instead of 
`Array([WrappedArray(1, 0)])`
{code:java}
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, StructType}val innerStruct = new 
StructType().add("i", "int", nullable = true)

val schema = new StructType().add("arr", ArrayType(innerStruct, containsNull = 
false))
val df = spark.createDataFrame(List(Row(Seq(Row(1), Row(null)))).asJava, schema)
df.select($"arr".getField("i")).collect{code}
 

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
>
> The following should be `Array([WrappedArray(1, null)])` instead of 
> `Array([WrappedArray(1, 0)])`
> {code:java}
> import scala.collection.JavaConverters._
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{ArrayType, StructType}val innerStruct = 
> new StructType().add("i", "int", nullable = true)
> val schema = new StructType().add("arr", ArrayType(innerStruct, containsNull 
> = false))
> val df = spark.createDataFrame(List(Row(Seq(Row(1), Row(null)))).asJava, schema)
> df.select($"arr".getField("i")).collect{code}
>  






[jira] [Commented] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151385#comment-17151385
 ] 

Dongjoon Hyun commented on SPARK-32167:
---

Thank you. I raised the issue as a `Blocker` for `2.4.7/3.0.1` since this is a 
correctness issue.

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
>







[jira] [Updated] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32167:
--
Target Version/s: 2.4.7, 3.0.1

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
>







[jira] [Updated] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32167:
--
Priority: Blocker  (was: Critical)

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
>







[jira] [Closed] (SPARK-31666) Cannot map hostPath volumes to container

2020-07-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-31666.
-

> Cannot map hostPath volumes to container
> 
>
> Key: SPARK-31666
> URL: https://issues.apache.org/jira/browse/SPARK-31666
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.5
>Reporter: Stephen Hopper
>Priority: Major
>
> I'm trying to mount additional hostPath directories as seen in a couple of 
> places:
> [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/]
> [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space]
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]
>  
> However, whenever I try to submit my job, I run into this error:
> {code:java}
> Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │
>  io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. 
> Message: Pod "spark-pi-1588970477877-exec-1" is invalid: 
> spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
> unique. Received status: Status(apiVersion=v1, code=422, 
> details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath,
>  message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, 
> additionalProperties={})], group=null, kind=Pod, 
> name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=Pod 
> "spark-pi-1588970477877-exec-1" is invalid: 
> spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
> unique, metadata=ListMeta(_continue=null, remainingItemCount=null, 
> resourceVersion=null, selfLink=null, additionalProperties={}), 
> reason=Invalid, status=Failure, additionalProperties={}).{code}
>  
> This is my spark-submit command (note: I've used my own build of spark for 
> kubernetes as well as a few other images that I've seen floating around (such 
> as this one seedjeffwan/spark:v2.4.5) and they all have this same issue):
> {code:java}
> bin/spark-submit \
>  --master k8s://https://my-k8s-server:443 \
>  --deploy-mode cluster \
>  --name spark-pi \
>  --class org.apache.spark.examples.SparkPi \
>  --conf spark.executor.instances=2 \
>  --conf spark.kubernetes.container.image=my-spark-image:my-tag \
>  --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \
>  --conf spark.kubernetes.namespace=my-spark-ns \
>  --conf 
> spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 
> \
>  --conf 
> spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1
>  \
>  --conf spark.local.dir="/tmp1" \
>  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
>  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code}
> Any ideas on what's causing this?
>  






[jira] [Resolved] (SPARK-31666) Cannot map hostPath volumes to container

2020-07-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31666.
---
Resolution: Not A Problem

> Cannot map hostPath volumes to container
> 
>
> Key: SPARK-31666
> URL: https://issues.apache.org/jira/browse/SPARK-31666
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.5
>Reporter: Stephen Hopper
>Priority: Major
>
> I'm trying to mount additional hostPath directories as seen in a couple of 
> places:
> [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/]
> [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space]
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]
>  
> However, whenever I try to submit my job, I run into this error:
> {code:java}
> Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │
>  io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. 
> Message: Pod "spark-pi-1588970477877-exec-1" is invalid: 
> spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
> unique. Received status: Status(apiVersion=v1, code=422, 
> details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath,
>  message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, 
> additionalProperties={})], group=null, kind=Pod, 
> name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=Pod 
> "spark-pi-1588970477877-exec-1" is invalid: 
> spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
> unique, metadata=ListMeta(_continue=null, remainingItemCount=null, 
> resourceVersion=null, selfLink=null, additionalProperties={}), 
> reason=Invalid, status=Failure, additionalProperties={}).{code}
>  
> This is my spark-submit command (note: I've used my own build of spark for 
> kubernetes as well as a few other images that I've seen floating around (such 
> as this one seedjeffwan/spark:v2.4.5) and they all have this same issue):
> {code:java}
> bin/spark-submit \
>  --master k8s://https://my-k8s-server:443 \
>  --deploy-mode cluster \
>  --name spark-pi \
>  --class org.apache.spark.examples.SparkPi \
>  --conf spark.executor.instances=2 \
>  --conf spark.kubernetes.container.image=my-spark-image:my-tag \
>  --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \
>  --conf spark.kubernetes.namespace=my-spark-ns \
>  --conf 
> spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 
> \
>  --conf 
> spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1
>  \
>  --conf spark.local.dir="/tmp1" \
>  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
>  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code}
> Any ideas on what's causing this?
>  






[jira] [Comment Edited] (SPARK-31666) Cannot map hostPath volumes to container

2020-07-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151363#comment-17151363
 ] 

Dongjoon Hyun edited comment on SPARK-31666 at 7/4/20, 4:42 PM:


First of all, the following is Spark 2.4 behavior since 2.4.0. It's not a bug.
{quote}In Spark 2.4, the `LocalDirsFeatureStep` iterates through the list of 
paths in `spark.local.dir`. For each one, it creates a Kubernetes volume of 
mount type `emptyDir` with the name `spark-local-dir-${index}`.
{quote}
The following is a wrong use case because Spark 2.4.x features are not designed 
for that. 
{quote}The issue is that I need my Spark job to use paths from my host machine 
that are on a mount point that isn't part of the directory which Kubernetes 
uses to allocate space for `emptyDir` volumes. Therefore, I mount these paths 
as type `hostPath` and ask Spark to use them as local directory space.
{quote}
Please note that the error message comes from K8s itself. Apache Spark starts to support your use case in Apache Spark 3.0 by adding new features.

I think there is some confusion about issue types in open-source projects. Not only Apache Spark but all Apache projects distinguish `New Improvement` and `New Feature` from `Bug`; many improvements and new features add support for things that old versions were simply not designed for, and we cannot backport everything into old branches. All committers and developers are already moving on to Apache Spark 3.1.0.

For the following, there is another misunderstanding: we didn't kill 2.4.x the way 1.6.x was killed. Historically, 1.6 ended at 1.6.3, whereas for 2.4.x you can still use Apache Spark 2.4.6 and later. That is why the Apache Spark community declared 2.4 an LTS (long-term support) line: [https://spark.apache.org/versioning-policy.html]. We will maintain it with critical bug fixes and security fixes, per [https://spark.apache.org/security.html]. However, 2.4.7 (or 2.4.8) will be the same as 2.4.0~2.4.6 in terms of features. That is the community policy.

bq. I feel giving folks 6 months to migrate from one Spark release to the next 
is fair, especially now considering how mature Spark is as a project. What are 
your thoughts on this?


was (Author: dongjoon):
First of all, the following is Spark 2.4 behavior since 2.4.0. It's not a bug.
{quote}In Spark 2.4, the `LocalDirsFeatureStep` iterates through the list of 
paths in `spark.local.dir`. For each one, it creates a Kubernetes volume of 
mount type `emptyDir` with the name `spark-local-dir-${index}`.
{quote}
The following is a wrong use case because Spark 2.4.x features are not designed 
for that. 
{quote}The issue is that I need my Spark job to use paths from my host machine 
that are on a mount point that isn't part of the directory which Kubernetes 
uses to allocate space for `emptyDir` volumes. Therefore, I mount these paths 
as type `hostPath` and ask Spark to use them as local directory space.
{quote}
Please note that the error message comes from K8s itself. Apache Spark starts to support your use case in Apache Spark 3.0 by adding new features.

I think there is some confusion about issue types in open-source projects. Not only Apache Spark but all Apache projects distinguish `New Improvement` and `New Feature` from `Bug`; many improvements and new features add support for things that old versions were simply not designed for, and we cannot backport everything into old branches. All committers and developers are already moving on to Apache Spark 3.1.0.

For the following, there is another misunderstanding: we didn't kill 2.4.x the way 1.6.x was killed. Historically, 1.6 ended at 1.6.3, whereas for 2.4.x you can still use Apache Spark 2.4.6 and later. That is why the Apache Spark community declared 2.4 an LTS (long-term support) line: [https://spark.apache.org/versioning-policy.html]. We will maintain it with critical bug fixes and security fixes, per [https://spark.apache.org/security.html]. However, 2.4.7 (or 2.4.8) will be the same as 2.4.0~2.4.6 in terms of features. That is the community policy.

>  I feel giving folks 6 months to migrate from one Spark release to the next 
>is fair, especially now considering how mature Spark is as a project. What are 
>your thoughts on this?

> Cannot map hostPath volumes to container
> 
>
> Key: SPARK-31666
> URL: https://issues.apache.org/jira/browse/SPARK-31666
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.5
>Reporter: Stephen Hopper
>Priority: Major
>
> I'm trying to mount additional hostPath directories as seen in a couple of 
> places:
> [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/]
> [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-

[jira] [Commented] (SPARK-32171) change file locations for use db and refresh table

2020-07-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151362#comment-17151362
 ] 

Apache Spark commented on SPARK-32171:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28995

> change file locations for use db and refresh table
> --
>
> Key: SPARK-32171
> URL: https://issues.apache.org/jira/browse/SPARK-32171
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Trivial
>
> docs/sql-ref-syntax-aux-refresh-table.md -> 
> docs/sql-ref-syntax-aux-cache-refresh-table.md
> docs/sql-ref-syntax-qry-select-usedb.md -> docs/sql-ref-syntax-ddl-usedb.md






[jira] [Commented] (SPARK-31666) Cannot map hostPath volumes to container

2020-07-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151363#comment-17151363
 ] 

Dongjoon Hyun commented on SPARK-31666:
---

First of all, the following is Spark 2.4 behavior since 2.4.0. It's not a bug.
{quote}In Spark 2.4, the `LocalDirsFeatureStep` iterates through the list of 
paths in `spark.local.dir`. For each one, it creates a Kubernetes volume of 
mount type `emptyDir` with the name `spark-local-dir-${index}`.
{quote}
The following is a wrong use case because Spark 2.4.x features are not designed 
for that. 
{quote}The issue is that I need my Spark job to use paths from my host machine 
that are on a mount point that isn't part of the directory which Kubernetes 
uses to allocate space for `emptyDir` volumes. Therefore, I mount these paths 
as type `hostPath` and ask Spark to use them as local directory space.
{quote}
Please note that the error message comes from K8s itself. Apache Spark starts to support your use case in Apache Spark 3.0 by adding new features.

I think there is some confusion about issue types in open-source projects. Not only Apache Spark but all Apache projects distinguish `New Improvement` and `New Feature` from `Bug`; many improvements and new features add support for things that old versions were simply not designed for, and we cannot backport everything into old branches. All committers and developers are already moving on to Apache Spark 3.1.0.

For the following, there is another misunderstanding: we didn't kill 2.4.x the way 1.6.x was killed. Historically, 1.6 ended at 1.6.3, whereas for 2.4.x you can still use Apache Spark 2.4.6 and later. That is why the Apache Spark community declared 2.4 an LTS (long-term support) line: [https://spark.apache.org/versioning-policy.html]. We will maintain it with critical bug fixes and security fixes, per [https://spark.apache.org/security.html]. However, 2.4.7 (or 2.4.8) will be the same as 2.4.0~2.4.6 in terms of features. That is the community policy.

>  I feel giving folks 6 months to migrate from one Spark release to the next 
>is fair, especially now considering how mature Spark is as a project. What are 
>your thoughts on this?
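For reference, a sketch of what the Spark 3.0 approach looks like: mounted volumes whose name starts with the `spark-local-dir-` prefix are used as executor scratch space, so `spark.local.dir` does not need to point at the hostPath mount and no conflicting `emptyDir` is created. The flags below follow the Spark 3.0 running-on-kubernetes documentation and are an illustration, not a verified configuration:

{code:java}
bin/spark-submit \
  --master k8s://https://my-k8s-server:443 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=my-spark-image:my-tag \
  --conf spark.kubernetes.namespace=my-spark-ns \
  --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/tmp1 \
  --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/tmp1 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar
{code}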

> Cannot map hostPath volumes to container
> 
>
> Key: SPARK-31666
> URL: https://issues.apache.org/jira/browse/SPARK-31666
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.5
>Reporter: Stephen Hopper
>Priority: Major
>
> I'm trying to mount additional hostPath directories as seen in a couple of 
> places:
> [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/]
> [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space]
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]
>  
> However, whenever I try to submit my job, I run into this error:
> {code:java}
> Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │
>  io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. 
> Message: Pod "spark-pi-1588970477877-exec-1" is invalid: 
> spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
> unique. Received status: Status(apiVersion=v1, code=422, 
> details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath,
>  message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, 
> additionalProperties={})], group=null, kind=Pod, 
> name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=Pod 
> "spark-pi-1588970477877-exec-1" is invalid: 
> spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
> unique, metadata=ListMeta(_continue=null, remainingItemCount=null, 
> resourceVersion=null, selfLink=null, additionalProperties={}), 
> reason=Invalid, status=Failure, additionalProperties={}).{code}
>  
> This is my spark-submit command (note: I've used my own build of spark for 
> kubernetes as well as a few other images that I've seen floating around (such 
> as this one seedjeffwan/spark:v2.4.5) and they all have this same issue):
> {code:java}
> bin/spark-submit \
>  --master k8s://https://my-k8s-server:443 \
>  --deploy-mode cluster \
>  --name spark-pi \
>  --class org.apache.spark.examples.SparkPi \
>  --conf spark.executor.instances=2 \
>  --conf spark.kubernetes.container.image=my-spark-image:my-tag \
>  --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \
>  --conf spark.kubernetes.namespace=my-spark-ns \
>  --conf 
> spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 
> \
>  --conf 
> spark.kuber

[jira] [Assigned] (SPARK-32171) change file locations for use db and refresh table

2020-07-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32171:


Assignee: Apache Spark

> change file locations for use db and refresh table
> --
>
> Key: SPARK-32171
> URL: https://issues.apache.org/jira/browse/SPARK-32171
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Trivial
>
> docs/sql-ref-syntax-aux-refresh-table.md -> 
> docs/sql-ref-syntax-aux-cache-refresh-table.md
> docs/sql-ref-syntax-qry-select-usedb.md -> docs/sql-ref-syntax-ddl-usedb.md






[jira] [Assigned] (SPARK-32171) change file locations for use db and refresh table

2020-07-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32171:


Assignee: (was: Apache Spark)

> change file locations for use db and refresh table
> --
>
> Key: SPARK-32171
> URL: https://issues.apache.org/jira/browse/SPARK-32171
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Trivial
>
> docs/sql-ref-syntax-aux-refresh-table.md -> 
> docs/sql-ref-syntax-aux-cache-refresh-table.md
> docs/sql-ref-syntax-qry-select-usedb.md -> docs/sql-ref-syntax-ddl-usedb.md






[jira] [Commented] (SPARK-32171) change file locations for use db and refresh table

2020-07-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151361#comment-17151361
 ] 

Apache Spark commented on SPARK-32171:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28995

> change file locations for use db and refresh table
> --
>
> Key: SPARK-32171
> URL: https://issues.apache.org/jira/browse/SPARK-32171
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Trivial
>
> docs/sql-ref-syntax-aux-refresh-table.md -> 
> docs/sql-ref-syntax-aux-cache-refresh-table.md
> docs/sql-ref-syntax-qry-select-usedb.md -> docs/sql-ref-syntax-ddl-usedb.md






[jira] [Created] (SPARK-32171) change file locations for use db and refresh table

2020-07-04 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-32171:
--

 Summary: change file locations for use db and refresh table
 Key: SPARK-32171
 URL: https://issues.apache.org/jira/browse/SPARK-32171
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.1.0
Reporter: Huaxin Gao


docs/sql-ref-syntax-aux-refresh-table.md -> 
docs/sql-ref-syntax-aux-cache-refresh-table.md
docs/sql-ref-syntax-qry-select-usedb.md -> docs/sql-ref-syntax-ddl-usedb.md
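In other words, the sub-task is just two file renames under docs/ (plus updating any links that still point at the old names). For illustration, with git:

{code:java}
git mv docs/sql-ref-syntax-aux-refresh-table.md docs/sql-ref-syntax-aux-cache-refresh-table.md
git mv docs/sql-ref-syntax-qry-select-usedb.md docs/sql-ref-syntax-ddl-usedb.md
{code}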






[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-07-04 Thread Syedhamjath (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151349#comment-17151349
 ] 

Syedhamjath commented on SPARK-29358:
-

I also came across the same issue when one DataFrame does not have a column that the other has. It would make sense either to add the missing column filled with nulls, or to keep the calling DataFrame's schema fixed and ignore extra columns from the parameter DataFrame, controlled by an additional parameter.

The confusing part is that it currently throws an error; it should not fail just because a column is missing on one side or extra on the other.

I vote for this issue; if Spark doesn't solve it, I don't think anything else can.

 

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}






[jira] [Updated] (SPARK-32170) Improve the speculation for the inefficient tasks by the task metrics.

2020-07-04 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32170:

Fix Version/s: (was: 3.0.0)

>  Improve the speculation for the inefficient tasks by the task metrics.
> ---
>
> Key: SPARK-32170
> URL: https://issues.apache.org/jira/browse/SPARK-32170
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.0
>Reporter: weixiuli
>Priority: Major
>
> 1) Tasks are speculated once they meet certain conditions, whether or not they 
> are actually inefficient; this can be a huge waste of cluster resources.
> 2) In production, a speculative copy launched for an already-efficient task is 
> eventually killed, which is unnecessary and also wastes cluster resources.
> 3) So we should first evaluate whether a task is inefficient, using the metrics 
> of successful tasks, and only then decide whether to speculate it. Inefficient 
> tasks get speculated and efficient ones do not, which is better for cluster 
> resources.
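The description boils down to a heuristic: use the metrics of tasks that have already succeeded as a baseline, and only speculate a running task whose progress rate falls well below that baseline. A sketch of the idea; all names and the threshold are made up for illustration, and the actual implementation is in the pull request linked in the comments below:

{code:java}
// Hypothetical sketch of "speculate only inefficient tasks".
// successRates: records-per-second of tasks that already finished successfully.
// runningRate:  records-per-second of the candidate task, taken from its task metrics.
def shouldSpeculate(successRates: Seq[Double],
                    runningRate: Double,
                    efficiencyFactor: Double = 0.5): Boolean = {
  if (successRates.isEmpty) {
    true                                    // no baseline yet: keep the old behaviour
  } else {
    val median = successRates.sorted.apply(successRates.length / 2)
    runningRate < median * efficiencyFactor // speculate only clearly inefficient tasks
  }
}

// Median success rate is 1000 rec/s: a task at 300 rec/s is speculated,
// a task at 900 rec/s is left alone.
assert(shouldSpeculate(Seq(900.0, 1000.0, 1100.0), runningRate = 300.0))
assert(!shouldSpeculate(Seq(900.0, 1000.0, 1100.0), runningRate = 900.0))
{code}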






[jira] [Assigned] (SPARK-32170) Improve the speculation for the inefficient tasks by the task metrics.

2020-07-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32170:


Assignee: (was: Apache Spark)

>  Improve the speculation for the inefficient tasks by the task metrics.
> ---
>
> Key: SPARK-32170
> URL: https://issues.apache.org/jira/browse/SPARK-32170
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.0
>Reporter: weixiuli
>Priority: Major
> Fix For: 3.0.0
>
>
> 1) Tasks are speculated once they meet certain conditions, whether or not they 
> are actually inefficient; this can be a huge waste of cluster resources.
> 2) In production, a speculative copy launched for an already-efficient task is 
> eventually killed, which is unnecessary and also wastes cluster resources.
> 3) So we should first evaluate whether a task is inefficient, using the metrics 
> of successful tasks, and only then decide whether to speculate it. Inefficient 
> tasks get speculated and efficient ones do not, which is better for cluster 
> resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32170) Improve the speculation for the inefficient tasks by the task metrics.

2020-07-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151288#comment-17151288
 ] 

Apache Spark commented on SPARK-32170:
--

User 'weixiuli' has created a pull request for this issue:
https://github.com/apache/spark/pull/28994

>  Improve the speculation for the inefficient tasks by the task metrics.
> ---
>
> Key: SPARK-32170
> URL: https://issues.apache.org/jira/browse/SPARK-32170
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.0
>Reporter: weixiuli
>Priority: Major
> Fix For: 3.0.0
>
>
> 1) Today, tasks are speculated once they meet certain runtime conditions,
> regardless of whether they are actually inefficient, which can be a huge
> waste of cluster resources.
> 2) In production, a speculative copy launched for an already efficient task
> ends up being killed, which is unnecessary and wastes cluster resources.
> 3) So we should first evaluate whether a task is inefficient based on the
> metrics of the tasks that have already succeeded, and only then decide
> whether to speculate it. Inefficient tasks get speculated and efficient ones
> do not, which is better for cluster resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32170) Improve the speculation for the inefficient tasks by the task metrics.

2020-07-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32170:


Assignee: Apache Spark

>  Improve the speculation for the inefficient tasks by the task metrics.
> ---
>
> Key: SPARK-32170
> URL: https://issues.apache.org/jira/browse/SPARK-32170
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.0
>Reporter: weixiuli
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.0.0
>
>
> 1) Today, tasks are speculated once they meet certain runtime conditions,
> regardless of whether they are actually inefficient, which can be a huge
> waste of cluster resources.
> 2) In production, a speculative copy launched for an already efficient task
> ends up being killed, which is unnecessary and wastes cluster resources.
> 3) So we should first evaluate whether a task is inefficient based on the
> metrics of the tasks that have already succeeded, and only then decide
> whether to speculate it. Inefficient tasks get speculated and efficient ones
> do not, which is better for cluster resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32170) Improve the speculation for the inefficient tasks by the task metrics.

2020-07-04 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-32170:
-
Description: 
1) Today, tasks are speculated once they meet certain runtime conditions,
regardless of whether they are actually inefficient, which can be a huge waste
of cluster resources.
2) In production, a speculative copy launched for an already efficient task
ends up being killed, which is unnecessary and wastes cluster resources.
3) So we should first evaluate whether a task is inefficient based on the
metrics of the tasks that have already succeeded, and only then decide whether
to speculate it. Inefficient tasks get speculated and efficient ones do not,
which is better for cluster resources.


  was:
1) Tasks will be speculated when meet certain conditions no matter they are 
inefficient or not,this would be a huge waste of cluster resources.
2) In production,the speculation task from an efficient one  will be killed 
finally,which is unnecessary and will waste of cluster resources.
3) So, we should  evaluate whether the task is inefficient by success tasks 
metrics firstly, and then decide to speculate it or not. The  inefficient task 
will be speculated and efficient one will not, it is better for the cluster 
resources.



>  Improve the speculation for the inefficient tasks by the task metrics.
> ---
>
> Key: SPARK-32170
> URL: https://issues.apache.org/jira/browse/SPARK-32170
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.0
>Reporter: weixiuli
>Priority: Major
> Fix For: 3.0.0
>
>
> 1) Today, tasks are speculated once they meet certain runtime conditions,
> regardless of whether they are actually inefficient, which can be a huge
> waste of cluster resources.
> 2) In production, a speculative copy launched for an already efficient task
> ends up being killed, which is unnecessary and wastes cluster resources.
> 3) So we should first evaluate whether a task is inefficient based on the
> metrics of the tasks that have already succeeded, and only then decide
> whether to speculate it. Inefficient tasks get speculated and efficient ones
> do not, which is better for cluster resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32170) Improve the speculation for the inefficient tasks by the task metrics.

2020-07-04 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-32170:
-
Description: 
1) Today, tasks are speculated once they meet certain runtime conditions,
regardless of whether they are actually inefficient, which can be a huge waste
of cluster resources.
2) In production, a speculative copy launched for an already efficient task
ends up being killed, which is unnecessary and wastes cluster resources.
3) So we should first evaluate whether a task is inefficient based on the
metrics of the tasks that have already succeeded, and only then decide whether
to speculate it. Inefficient tasks get speculated and efficient ones do not,
which is better for cluster resources.


  was:
1) Tasks will be speculated when meet certain conditions no matter they are 
inefficient or not,this would be a huge waste of cluster resources.
2) In production,the speculation task from an efficient one  will be killed 
finally,which is unnecessary and will waste of cluster resources.
3) So, we should  evaluate whether the task is inefficient by success tasks 
metrics firstly, and then decide to speculate it or not. The  inefficient task 
will be speculated and efficient one will not, it better for the cluster 
resources.



>  Improve the speculation for the inefficient tasks by the task metrics.
> ---
>
> Key: SPARK-32170
> URL: https://issues.apache.org/jira/browse/SPARK-32170
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.0
>Reporter: weixiuli
>Priority: Major
> Fix For: 3.0.0
>
>
> 1) Today, tasks are speculated once they meet certain runtime conditions,
> regardless of whether they are actually inefficient, which can be a huge
> waste of cluster resources.
> 2) In production, a speculative copy launched for an already efficient task
> ends up being killed, which is unnecessary and wastes cluster resources.
> 3) So we should first evaluate whether a task is inefficient based on the
> metrics of the tasks that have already succeeded, and only then decide
> whether to speculate it. Inefficient tasks get speculated and efficient ones
> do not, which is better for cluster resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2020-07-04 Thread qingwu.fu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151133#comment-17151133
 ] 

qingwu.fu edited comment on SPARK-30602 at 7/4/20, 8:04 AM:


Hi [~mshen], could I get access to view the source code? Thanks.

Or, how can I contribute to this issue?


was (Author: qingwu.fu):
hi [~mshen], could I have the access auth to see the source code? Thanks.

Or when will this work open source?

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_2020_magnet_shuffle.pdf
>
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable Spark external shuffle service and store the intermediate 
> shuffle files on HDD. Because the number of blocks generated for a particular 
> shuffle grows quadratically compared to the size of shuffled data (# mappers 
> and reducers grows linearly with the size of shuffled data, but # blocks is # 
> mappers * # reducers), one general trend we have observed is that the more 
> data a Spark application processes, the smaller the block size becomes. In a 
> few production clusters we have seen, the average shuffle block size is only 
> 10s of KBs. Because of the inefficiency of performing random reads on HDD for
> small amounts of data, the overall efficiency of the Spark external shuffle
> services serving the shuffle blocks degrades as we see an increasing # of 
> Spark applications processing an increasing amount of data. In addition, 
> because Spark external shuffle service is a shared service in a multi-tenancy 
> cluster, the inefficiency with one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in
> the above-mentioned environments with push-based shuffle. With push-based
> shuffle, shuffle data is pushed out at the end of the map tasks, and blocks
> get pre-merged and moved towards the reducers. In our prototype implementation, we have seen
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark’s existing 
> shuffle netty protocol, and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle in 
> Spark without incurring the dependency or overhead of either specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html
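
The quadratic growth in block count that the description mentions is easy to
see with a back-of-the-envelope calculation. The numbers below are made up
purely for illustration and are not measurements from any cluster.

{code:scala}
// blocks = mappers * reducers grows quadratically, while the shuffled data
// only grows linearly with the number of tasks, so the average block shrinks.
def avgBlockSizeKb(shuffledGb: Double, mappers: Long, reducers: Long): Double =
  (shuffledGb * 1024 * 1024) / (mappers * reducers)

// ~1 GB of input per mapper/reducer in both cases:
println(avgBlockSizeKb(1024.0, 1000L, 1000L))        // 1 TB shuffle   -> ~1 MB blocks
println(avgBlockSizeKb(102400.0, 100000L, 100000L))  // 100 TB shuffle -> ~10 KB blocks
{code}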



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32170) Improve the speculation for the inefficient tasks by the task metrics.

2020-07-04 Thread weixiuli (Jira)
weixiuli created SPARK-32170:


 Summary:  Improve the speculation for the inefficient tasks by the 
task metrics.
 Key: SPARK-32170
 URL: https://issues.apache.org/jira/browse/SPARK-32170
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, Spark Core
Affects Versions: 3.0.0
Reporter: weixiuli
 Fix For: 3.0.0


1) Today, tasks are speculated once they meet certain runtime conditions,
regardless of whether they are actually inefficient, which can be a huge waste
of cluster resources.
2) In production, a speculative copy launched for an already efficient task
ends up being killed, which is unnecessary and wastes cluster resources.
3) So we should first evaluate whether a task is inefficient based on the
metrics of the tasks that have already succeeded, and only then decide whether
to speculate it. Inefficient tasks get speculated and efficient ones do not,
which is better for cluster resources.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org