[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181632#comment-17181632
 ] 

Jungtaek Lim commented on SPARK-32672:
--

Just FYI, he's a PMC member. And a correctness issue normally goes to blocker 
unless there's a strong reason not to address the issue right now.

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-----+-----+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-----+-----+
> scala> 
> {code}
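Given the note above that the problem disappears when cache compression is disabled, a hedged workaround sketch for affected jobs (the config is the standard in-memory columnar storage setting; this only sidesteps the bug, it does not fix it):

{code:scala}
// Workaround sketch: turn off in-memory columnar compression before caching,
// since the report above says the corruption goes away without compression.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")

val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
bad_order.cache()
// With compression disabled, these counts should match the pre-cache numbers.
bad_order.groupBy("b").count.show()
{code}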






[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-32672:
-
Priority: Blocker  (was: Critical)

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-----+-----+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-----+-----+
> scala> 
> {code}






[jira] [Resolved] (SPARK-32660) Show Avro related API in documentation

2020-08-20 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-32660.

Resolution: Fixed

The issue is resolved in https://github.com/apache/spark/pull/29476

> Show Avro related API in documentation
> --
>
> Key: SPARK-32660
> URL: https://issues.apache.org/jira/browse/SPARK-32660
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the Avro-related APIs are missing in the documentation 
> https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html . 
> This PR is to:
> 1. Mark internal Avro-related classes as private
> 2. Show the Avro-related APIs in the official Spark API documentation






[jira] [Updated] (SPARK-32660) Show Avro related API in documentation

2020-08-20 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-32660:
---
Description: 
Currently, the Avro-related APIs are missing in the documentation 
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html . 
This PR is to:
1. Mark internal Avro-related classes as private
2. Show the Avro-related APIs in the official Spark API documentation

> Show Avro related API in documentation
> --
>
> Key: SPARK-32660
> URL: https://issues.apache.org/jira/browse/SPARK-32660
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the Avro-related APIs are missing in the documentation 
> https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html . 
> This PR is to:
> 1. Mark internal Avro-related classes as private
> 2. Show the Avro-related APIs in the official Spark API documentation






[jira] [Commented] (SPARK-32660) Show Avro related API in documentation

2020-08-20 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181617#comment-17181617
 ] 

Gengliang Wang commented on SPARK-32660:


[~rohitmishr1484] sure

> Show Avro related API in documentation
> --
>
> Key: SPARK-32660
> URL: https://issues.apache.org/jira/browse/SPARK-32660
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the Avro-related APIs are missing in the documentation 
> https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html . 
> This PR is to:
> 1. Mark internal Avro-related classes as private
> 2. Show the Avro-related APIs in the official Spark API documentation






[jira] [Commented] (SPARK-32676) Fix double caching in KMeans/BiKMeans

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181561#comment-17181561
 ] 

Apache Spark commented on SPARK-32676:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/29501

> Fix double caching in KMeans/BiKMeans
> -
>
> Key: SPARK-32676
> URL: https://issues.apache.org/jira/browse/SPARK-32676
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> On the .mllib side, the storageLevel of the input {{data}} is always ignored 
> and the data is cached twice:
> {code:java}
> @Since("0.8.0")
> def run(data: RDD[Vector]): KMeansModel = {
>   val instances = data.map(point => (point, 1.0))
>   runWithWeight(instances, None)
> }
>  {code}
> {code:java}
> private[spark] def runWithWeight(
> data: RDD[(Vector, Double)],
> instr: Option[Instrumentation]): KMeansModel = {
>   // Compute squared norms and cache them.
>   val norms = data.map { case (v, _) =>
> Vectors.norm(v, 2.0)
>   }
>   val zippedData = data.zip(norms).map { case ((v, w), norm) =>
> new VectorWithNorm(v, norm, w)
>   }
>   if (data.getStorageLevel == StorageLevel.NONE) {
> zippedData.persist(StorageLevel.MEMORY_AND_DISK)
>   }
>   val model = runAlgorithmWithWeight(zippedData, instr)
>   zippedData.unpersist()
>   model
> } {code}
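The check in runWithWeight above runs on the freshly mapped RDD, so an input that the caller already cached is never detected and the zipped RDD is persisted unconditionally. A hedged sketch of one possible shape of a fix (the handlePersistence flag and the signature change are illustrative, not necessarily what the linked PR does):

{code:scala}
// Sketch only: let the caller report whether the original input is uncached,
// so the zipped RDD is persisted at most once.
@Since("0.8.0")
def run(data: RDD[Vector]): KMeansModel = {
  val instances = data.map(point => (point, 1.0))
  // Illustrative flag: only handle persistence when the user's input is not cached.
  val handlePersistence = data.getStorageLevel == StorageLevel.NONE
  runWithWeight(instances, handlePersistence, None)
}

private[spark] def runWithWeight(
    data: RDD[(Vector, Double)],
    handlePersistence: Boolean,
    instr: Option[Instrumentation]): KMeansModel = {
  // Compute squared norms and zip them with the weighted points, as before.
  val norms = data.map { case (v, _) => Vectors.norm(v, 2.0) }
  val zippedData = data.zip(norms).map { case ((v, w), norm) =>
    new VectorWithNorm(v, norm, w)
  }
  if (handlePersistence) {
    zippedData.persist(StorageLevel.MEMORY_AND_DISK)
  }
  val model = runAlgorithmWithWeight(zippedData, instr)
  if (handlePersistence) {
    zippedData.unpersist()
  }
  model
}
{code}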






[jira] [Assigned] (SPARK-32676) Fix double caching in KMeans/BiKMeans

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32676:


Assignee: Apache Spark

> Fix double caching in KMeans/BiKMeans
> -
>
> Key: SPARK-32676
> URL: https://issues.apache.org/jira/browse/SPARK-32676
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.1.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>
> On the .mllib side, the storageLevel of the input {{data}} is always ignored 
> and the data is cached twice:
> {code:java}
> @Since("0.8.0")
> def run(data: RDD[Vector]): KMeansModel = {
>   val instances = data.map(point => (point, 1.0))
>   runWithWeight(instances, None)
> }
>  {code}
> {code:java}
> private[spark] def runWithWeight(
> data: RDD[(Vector, Double)],
> instr: Option[Instrumentation]): KMeansModel = {
>   // Compute squared norms and cache them.
>   val norms = data.map { case (v, _) =>
> Vectors.norm(v, 2.0)
>   }
>   val zippedData = data.zip(norms).map { case ((v, w), norm) =>
> new VectorWithNorm(v, norm, w)
>   }
>   if (data.getStorageLevel == StorageLevel.NONE) {
> zippedData.persist(StorageLevel.MEMORY_AND_DISK)
>   }
>   val model = runAlgorithmWithWeight(zippedData, instr)
>   zippedData.unpersist()
>   model
> } {code}






[jira] [Assigned] (SPARK-32676) Fix double caching in KMeans/BiKMeans

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32676:


Assignee: (was: Apache Spark)

> Fix double caching in KMeans/BiKMeans
> -
>
> Key: SPARK-32676
> URL: https://issues.apache.org/jira/browse/SPARK-32676
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> On the .mllib side, the storageLevel of the input {{data}} is always ignored 
> and the data is cached twice:
> {code:java}
> @Since("0.8.0")
> def run(data: RDD[Vector]): KMeansModel = {
>   val instances = data.map(point => (point, 1.0))
>   runWithWeight(instances, None)
> }
>  {code}
> {code:java}
> private[spark] def runWithWeight(
> data: RDD[(Vector, Double)],
> instr: Option[Instrumentation]): KMeansModel = {
>   // Compute squared norms and cache them.
>   val norms = data.map { case (v, _) =>
> Vectors.norm(v, 2.0)
>   }
>   val zippedData = data.zip(norms).map { case ((v, w), norm) =>
> new VectorWithNorm(v, norm, w)
>   }
>   if (data.getStorageLevel == StorageLevel.NONE) {
> zippedData.persist(StorageLevel.MEMORY_AND_DISK)
>   }
>   val model = runAlgorithmWithWeight(zippedData, instr)
>   zippedData.unpersist()
>   model
> } {code}






[jira] [Commented] (SPARK-32676) Fix double caching in KMeans/BiKMeans

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181560#comment-17181560
 ] 

Apache Spark commented on SPARK-32676:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/29501

> Fix double caching in KMeans/BiKMeans
> -
>
> Key: SPARK-32676
> URL: https://issues.apache.org/jira/browse/SPARK-32676
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> On the .mllib side, the storageLevel of the input {{data}} is always ignored 
> and the data is cached twice:
> {code:java}
> @Since("0.8.0")
> def run(data: RDD[Vector]): KMeansModel = {
>   val instances = data.map(point => (point, 1.0))
>   runWithWeight(instances, None)
> }
>  {code}
> {code:java}
> private[spark] def runWithWeight(
> data: RDD[(Vector, Double)],
> instr: Option[Instrumentation]): KMeansModel = {
>   // Compute squared norms and cache them.
>   val norms = data.map { case (v, _) =>
> Vectors.norm(v, 2.0)
>   }
>   val zippedData = data.zip(norms).map { case ((v, w), norm) =>
> new VectorWithNorm(v, norm, w)
>   }
>   if (data.getStorageLevel == StorageLevel.NONE) {
> zippedData.persist(StorageLevel.MEMORY_AND_DISK)
>   }
>   val model = runAlgorithmWithWeight(zippedData, instr)
>   zippedData.unpersist()
>   model
> } {code}






[jira] [Commented] (SPARK-29967) KMeans support instance weighting

2020-08-20 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181558#comment-17181558
 ] 

zhengruifeng commented on SPARK-29967:
--

[~YuQiang Ye] I opened ticket SPARK-32676 for this issue and sent a PR: 
https://github.com/apache/spark/pull/29501

> KMeans support instance weighting
> -
>
> Key: SPARK-29967
> URL: https://issues.apache.org/jira/browse/SPARK-29967
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> Since https://issues.apache.org/jira/browse/SPARK-9610, we started to support 
> instance weighting in ML.
> However, Clustering and other implementations in features still do not support 
> instance weighting.
> I think we need to start supporting weighting in KMeans, like scikit-learn does.
> It will contain three parts:
> 1, move the impl from .mllib to .ml
> 2, make .mllib.KMeans as a wrapper of .ml.KMeans
> 3, support instance weighting in the .ml.KMeans
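For reference, a minimal usage sketch of what part 3 looks like on the .ml side (assumes a DataFrame {{df}} with a vector column "features" and a numeric column "weight"; the weight column plays the role of scikit-learn's sample_weight):

{code:scala}
// Minimal usage sketch of instance weighting in .ml KMeans.
// Assumes df has a Vector column "features" and a Double column "weight".
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans()
  .setK(3)
  .setFeaturesCol("features")
  .setWeightCol("weight")   // instance weights, analogous to scikit-learn's sample_weight
  .setSeed(1L)

val model = kmeans.fit(df)
model.clusterCenters.foreach(println)
{code}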






[jira] [Updated] (SPARK-32676) Fix double caching in KMeans/BiKMeans

2020-08-20 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-32676:
-
Description: 
On the .mllib side, the storageLevel of the input {{data}} is always ignored and 
the data is cached twice:
{code:java}
@Since("0.8.0")
def run(data: RDD[Vector]): KMeansModel = {
  val instances = data.map(point => (point, 1.0))
  runWithWeight(instances, None)
}
 {code}
{code:java}
private[spark] def runWithWeight(
data: RDD[(Vector, Double)],
instr: Option[Instrumentation]): KMeansModel = {

  // Compute squared norms and cache them.
  val norms = data.map { case (v, _) =>
Vectors.norm(v, 2.0)
  }

  val zippedData = data.zip(norms).map { case ((v, w), norm) =>
new VectorWithNorm(v, norm, w)
  }

  if (data.getStorageLevel == StorageLevel.NONE) {
zippedData.persist(StorageLevel.MEMORY_AND_DISK)
  }
  val model = runAlgorithmWithWeight(zippedData, instr)
  zippedData.unpersist()

  model
} {code}

  was:
On the .mllib side, the storageLevel of the input {{data}} is always ignored and 
the data is cached twice:
{code:java}
@Since("0.8.0")
def run(data: RDD[Vector]): KMeansModel = {
  val instances = data.map(point => (point, 1.0))
  runWithWeight(instances, None)
}{code}


> Fix double caching in KMeans/BiKMeans
> -
>
> Key: SPARK-32676
> URL: https://issues.apache.org/jira/browse/SPARK-32676
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> On the .mllib side, the storageLevel of the input {{data}} is always ignored 
> and the data is cached twice:
> {code:java}
> @Since("0.8.0")
> def run(data: RDD[Vector]): KMeansModel = {
>   val instances = data.map(point => (point, 1.0))
>   runWithWeight(instances, None)
> }
>  {code}
> {code:java}
> private[spark] def runWithWeight(
> data: RDD[(Vector, Double)],
> instr: Option[Instrumentation]): KMeansModel = {
>   // Compute squared norms and cache them.
>   val norms = data.map { case (v, _) =>
> Vectors.norm(v, 2.0)
>   }
>   val zippedData = data.zip(norms).map { case ((v, w), norm) =>
> new VectorWithNorm(v, norm, w)
>   }
>   if (data.getStorageLevel == StorageLevel.NONE) {
> zippedData.persist(StorageLevel.MEMORY_AND_DISK)
>   }
>   val model = runAlgorithmWithWeight(zippedData, instr)
>   zippedData.unpersist()
>   model
> } {code}






[jira] [Created] (SPARK-32676) Fix double caching in KMeans/BiKMeans

2020-08-20 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-32676:


 Summary: Fix double caching in KMeans/BiKMeans
 Key: SPARK-32676
 URL: https://issues.apache.org/jira/browse/SPARK-32676
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0, 3.1.0
Reporter: zhengruifeng


On the .mllib side, the storageLevel of the input {{data}} is always ignored and 
the data is cached twice:
{code:java}
@Since("0.8.0")
def run(data: RDD[Vector]): KMeansModel = {
  val instances = data.map(point => (point, 1.0))
  runWithWeight(instances, None)
}{code}






[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181553#comment-17181553
 ] 

Lantao Jin commented on SPARK-32672:


Changed to Critical; Blocker is reserved for committers.

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-----+-----+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-----+-----+
> scala> 
> {code}






[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32672:
---
Priority: Critical  (was: Blocker)

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-----+-----+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-----+-----+
> scala> 
> {code}






[jira] [Updated] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Farhan Khan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Farhan Khan updated SPARK-32675:

Description: 
The --py-files option is appended in a hardcoded manner when an application is 
submitted to a Mesos cluster in cluster mode using the REST Submission API. This 
causes a simple Java-based SparkPi job to fail.

This bug was introduced by SPARK-26466.

Here is the example job submission:
{code:bash}
curl -X POST http://localhost:7077/v1/submissions/create --header 
"Content-Type:application/json" --data '{
"action": "CreateSubmissionRequest",
"appResource": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
"clientSparkVersion": "3.0.0",
"appArgs": ["30"],
"environmentVariables": {},
"mainClass": "org.apache.spark.examples.SparkPi",
"sparkProperties": {
  "spark.jars": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
  "spark.driver.supervise": "false",
  "spark.executor.memory": "512m",
  "spark.driver.memory": "512m",
  "spark.submit.deployMode": "cluster",
  "spark.app.name": "SparkPi",
  "spark.master": "mesos://localhost:5050"
}}'
{code}
Expected Driver log would contain:
{code:bash}
20/08/20 20:19:57 WARN DependencyUtils: Local jar 
/var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
 does not exist, skipping.
Error: Failed to load class org.apache.spark.examples.SparkPi.
20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
{code}

  was:
The --py-files option is appended in a hardcoded manner when an application is 
submitted to a Mesos cluster in cluster mode using the REST Submission API. This 
causes a simple Java-based SparkPi job to fail.

This bug was introduced by SPARK-26466.

Here is the example job submission:
{code:bash}
curl -X POST http://localhost:7077/v1/submissions/create --header 
"Content-Type:application/json" --data '{
"action": "CreateSubmissionRequest",
"appResource": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
"clientSparkVersion": "3.0.0",
"appArgs": ["30"],
"environmentVariables": {},
"mainClass": "org.apache.spark.examples.SparkPi",
"sparkProperties": {
  "spark.jars": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
  "spark.driver.supervise": "false",
  "spark.executor.memory": "512m",
  "spark.driver.memory": "512m",
  "spark.submit.deployMode": "cluster",
  "spark.app.name": "SparkPi",
  "spark.master": "mesos://localhost:5050"
}}'
{code}

Expected Dispatcher output would contain:

{code:bash}
20/08/20 20:19:57 WARN DependencyUtils: Local jar 
/var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
 does not exist, skipping.
Error: Failed to load class org.apache.spark.examples.SparkPi.
20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
{code}


> --py-files option is appended without passing value for it
> --
>
> Key: SPARK-32675
> URL: https://issues.apache.org/jira/browse/SPARK-32675
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Farhan Khan
>Priority: Major
>
> The --py-files option is appended in a hardcoded manner when an application is 
> submitted to a Mesos cluster in cluster mode using the REST Submission API. 
> This causes a simple Java-based SparkPi job to fail.
> This bug was introduced by SPARK-26466.
> Here is the example job submission:
> {code:bash}
> curl -X POST http://localhost:7077/v1/submissions/create --header 
> "Content-Type:application/json" --data '{
> "action": "CreateSubmissionRequest",
> "appResource": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
> "clientSparkVersion": "3.0.0",
> "appArgs": ["30"],
> "environmentVariables": {},
> "mainClass": "org.apache.spark.examples.SparkPi",
> "sparkProperties": {
>   "spark.jars": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
>   "spark.driver.supervise": "false",
>   "spark.executor.memory": "512m",
>   "spark.driver.memory": "512m",
>   "spark.submit.deployMode": "cluster",
>   "spark.app.name": "SparkPi",
>   "spark.master": "mesos://localhost:5050"
> }}'
> {code}
> Expected Driver log would contain:
> {code:bash}
> 20/08/20 20:19:57 WARN DependencyUtils: Local jar 
> /var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
>  does not exist, 

[jira] [Commented] (SPARK-32673) Pyspark/cloudpickle.py - no module named 'wfdb'

2020-08-20 Thread Sandy Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181526#comment-17181526
 ] 

Sandy Su commented on SPARK-32673:
--

df_signals = df_record_names.repartition('record_name').select(
 df_record_names.record_id,
 extract_signals_udf(df_record_names.record_name).alias('signal_info'))

df_signals = df_signals.select(df_signals.record_id,
 df_signals.signal_info.patient_id.alias('patient_id'),
 df_signals.signal_info.comments.alias('comments'),
 df_signals.signal_info.signals.alias('signals'))

display(df_signals.drop('signals'))

> Pyspark/cloudpickle.py - no module named 'wfdb'
> ---
>
> Key: SPARK-32673
> URL: https://issues.apache.org/jira/browse/SPARK-32673
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Sandy Su
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Running Spark in a Databricks notebook.
>  
> Ran into this issue when executing a cell:
> SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 
> 4 times, most recent failure: Lost task 0.3 in stage 17.0 (TID 68, 
> 10.139.64.5, executor 0): org.apache.spark.api.python.PythonException:
> Traceback (most recent call last):
>   File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
>     return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
>     return pickle.loads(obj, encoding=encoding)
>   File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
>     __import__(name)
> ModuleNotFoundError: No module named 'wfdb'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/databricks/spark/python/pyspark/worker.py", line 644, in main
>     func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
>   File "/databricks/spark/python/pyspark/worker.py", line 463, in read_udfs
>     udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
>   File "/databricks/spark/python/pyspark/worker.py", line 254, in read_single_udf
>     f, return_type = read_command(pickleSer, infile)
>   File "/databricks/spark/python/pyspark/worker.py", line 74, in read_command
>     command = serializer._read_with_length(file)
>   File "/databricks/spark/python/pyspark/serializers.py", line 180, in _read_with_length
>     raise SerializationError("Caused by " + traceback.format_exc())
> pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
>   File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
>     return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
>     return pickle.loads(obj, encoding=encoding)
>   File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
>     __import__(name)
> ModuleNotFoundError: No module named 'wfdb'






[jira] [Commented] (SPARK-32667) Script transformation no-serde mode: when columns are fewer than the output length, use null fill

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181524#comment-17181524
 ] 

Apache Spark commented on SPARK-32667:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29500

> Script transformation no-serde mode: when columns are fewer than the output 
> length, use null fill
> ---
>
> Key: SPARK-32667
> URL: https://issues.apache.org/jira/browse/SPARK-32667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Script transform no-serde mode should pad null values to fill the missing columns
> {code:java}
> hive> SELECT TRANSFORM(a, b)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > USING 'cat' as (a string, b string, c string, d string)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > FROM (
> > select 1 as a, 2 as b
> > ) tmp ;
> OK
> 1    2    NULL    NULL
> Time taken: 24.626 seconds, Fetched: 1 row(s)
> {code}






[jira] [Assigned] (SPARK-32667) Script transformation no-serde mode: when columns are fewer than the output length, use null fill

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32667:


Assignee: (was: Apache Spark)

> Script transformation no-serde mode: when columns are fewer than the output 
> length, use null fill
> ---
>
> Key: SPARK-32667
> URL: https://issues.apache.org/jira/browse/SPARK-32667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Script transform no-serde mode should pad null values to fill the missing columns
> {code:java}
> hive> SELECT TRANSFORM(a, b)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > USING 'cat' as (a string, b string, c string, d string)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > FROM (
> > select 1 as a, 2 as b
> > ) tmp ;
> OK
> 1    2    NULL    NULL
> Time taken: 24.626 seconds, Fetched: 1 row(s)
> {code}






[jira] [Commented] (SPARK-32667) Script transformation no-serde mode: when columns are fewer than the output length, use null fill

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181523#comment-17181523
 ] 

Apache Spark commented on SPARK-32667:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29500

> Script transformation no-serde mode: when columns are fewer than the output 
> length, use null fill
> ---
>
> Key: SPARK-32667
> URL: https://issues.apache.org/jira/browse/SPARK-32667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Script transform no-serde mode should pad null values to fill the missing columns
> {code:java}
> hive> SELECT TRANSFORM(a, b)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > USING 'cat' as (a string, b string, c string, d string)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > FROM (
> > select 1 as a, 2 as b
> > ) tmp ;
> OK
> 1    2    NULL    NULL
> Time taken: 24.626 seconds, Fetched: 1 row(s)
> {code}






[jira] [Assigned] (SPARK-32667) Script transformation no-serde mode: when columns are fewer than the output length, use null fill

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32667:


Assignee: Apache Spark

> Script transformation no-serde mode: when columns are fewer than the output 
> length, use null fill
> ---
>
> Key: SPARK-32667
> URL: https://issues.apache.org/jira/browse/SPARK-32667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Script transform no-serde mode should pad null values to fill the missing columns
> {code:java}
> hive> SELECT TRANSFORM(a, b)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > USING 'cat' as (a string, b string, c string, d string)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > FROM (
> > select 1 as a, 2 as b
> > ) tmp ;
> OK
> 1    2    NULL    NULL
> Time taken: 24.626 seconds, Fetched: 1 row(s)
> {code}






[jira] [Updated] (SPARK-32667) Script transformation no-serde mode: when columns are fewer than the output length, use null fill

2020-08-20 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-32667:
--
Description: 
Script transform no-serde mode should pad null values to fill the missing columns
{code:java}
hive> SELECT TRANSFORM(a, b)
>   ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY '|'
>   LINES TERMINATED BY '\n'
>   NULL DEFINED AS 'NULL'
> USING 'cat' as (a string, b string, c string, d string)
>   ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY '|'
>   LINES TERMINATED BY '\n'
>   NULL DEFINED AS 'NULL'
> FROM (
> select 1 as a, 2 as b
> ) tmp ;
OK
1    2    NULL    NULL
Time taken: 24.626 seconds, Fetched: 1 row(s)

{code}
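For comparison, the same query can be issued through Spark SQL; whether Spark pads the missing c and d columns with NULL the way Hive does above is exactly what this ticket proposes. A hedged sketch (assumes a SparkSession built with enableHiveSupport() and a `cat` binary available on the executors):

{code:scala}
// Hedged sketch: run the equivalent TRANSFORM query through Spark SQL and compare
// its padding behavior with the Hive output above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("transform-null-padding-check")
  .enableHiveSupport()
  .getOrCreate()

spark.sql(
  """SELECT TRANSFORM(a, b)
    |USING 'cat' AS (a string, b string, c string, d string)
    |FROM (SELECT 1 AS a, 2 AS b) tmp
    |""".stripMargin).show()
{code}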

> Script transformation no-serde mode: when columns are fewer than the output 
> length, use null fill
> ---
>
> Key: SPARK-32667
> URL: https://issues.apache.org/jira/browse/SPARK-32667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Script transform no-serde mode should pad null values to fill the missing columns
> {code:java}
> hive> SELECT TRANSFORM(a, b)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > USING 'cat' as (a string, b string, c string, d string)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > FROM (
> > select 1 as a, 2 as b
> > ) tmp ;
> OK
> 1 2   NULLNULL
> Time taken: 24.626 seconds, Fetched: 1 row(s)
> {code}






[jira] [Resolved] (SPARK-32632) Bad partitioning in spark jdbc method with parameter lowerBound and upperBound

2020-08-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32632.
--
Resolution: Not A Problem

> Bad partitioning in spark jdbc method with parameter lowerBound and upperBound
> --
>
> Key: SPARK-32632
> URL: https://issues.apache.org/jira/browse/SPARK-32632
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liu Dinghua
>Priority: Major
>
> When I use the jdbc method
> {code:java}
> def jdbc( url: String, table: String, columnName: String, lowerBound: Long, 
> upperBound: Long, numPartitions: Int, connectionProperties: Properties)
> {code}
>  
>   I am confused by the partitions generated by this method, because the rows of 
> the first partition aren't limited by the lowerBound and those of the last 
> partition aren't limited by the upperBound. 
>   
>  For example, I use the method as follows:
>   
> {code:java}
> val data = spark.read.jdbc(url, table, "id", 2, 5, 3,buildProperties()) 
> .selectExpr("id","appkey","funnel_name")
> data.show(100, false)  
> {code}
>  
> The result partitions info is :
>  20/08/05 16:58:59 INFO JDBCRelation: Number of partitions: 3, WHERE clauses 
> of these partitions: `id` < 3 or `id` is null, `id` >= 3 AND `id` < 4, `id` 
> >= 4
> The returned data is:
> ||id|| appkey||funnel_name||
> |0|yanshi|test001|
> |1|yanshi|test002|
> |2|yanshi|test003|
> |3|xingkong|test_funnel|
> |4|xingkong|test_funnel2|
> |5|xingkong|test_funnel3|
> |6|donews|test_funnel4|
> |7|donews|test_funnel|
> |8|donews|test_funnel2|
> |9|dami|test_funnel3|
> |13|dami|test_funnel4|
> |15|xiaoai|test_funnel6|
>  
> Normally, the clause of the first partition should be " `id` >= 2 and `id` < 3 
> " because the lowerBound is 2, and the clause of the last partition should 
> be " `id` >= 4 and `id` < 5 ", but that is not what happens.
>  
>  
>   
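Worth noting for the resolution: lowerBound and upperBound only determine the partition stride for the partition column; they never filter rows, which is why the first and last partitions are deliberately open-ended. A hedged sketch of bounding the rows explicitly, reusing the names from the snippet above (whether the filter is pushed down depends on the JDBC source, so treat that as an assumption):

{code:scala}
// lowerBound/upperBound set the stride only; add an explicit filter to restrict rows.
val data = spark.read
  .jdbc(url, table, "id", 2L, 5L, 3, buildProperties())
  .where("id >= 2 AND id < 5")   // explicit range filter
  .selectExpr("id", "appkey", "funnel_name")

// Alternative: push the range into the table expression itself via a subquery.
val bounded = spark.read
  .jdbc(url, s"(SELECT * FROM $table WHERE id >= 2 AND id < 5) t",
    "id", 2L, 5L, 3, buildProperties())
{code}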






[jira] [Commented] (SPARK-32632) Bad partitioning in spark jdbc method with parameter lowerBound and upperBound

2020-08-20 Thread Liu Dinghua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181517#comment-17181517
 ] 

Liu Dinghua commented on SPARK-32632:
-

Thanks. When partitioning, what if we put the lowerBound in the WHERE clause 
of the first partition and the upperBound in the last partition? Would that 
result in anything worse?

> Bad partitioning in spark jdbc method with parameter lowerBound and upperBound
> --
>
> Key: SPARK-32632
> URL: https://issues.apache.org/jira/browse/SPARK-32632
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liu Dinghua
>Priority: Major
>
> When I use the jdbc method
> {code:java}
> def jdbc( url: String, table: String, columnName: String, lowerBound: Long, 
> upperBound: Long, numPartitions: Int, connectionProperties: Properties)
> {code}
>  
>   I am confused by the partitions generated by this method, because the rows of 
> the first partition aren't limited by the lowerBound and those of the last 
> partition aren't limited by the upperBound. 
>   
>  For example, I use the method as follows:
>   
> {code:java}
> val data = spark.read.jdbc(url, table, "id", 2, 5, 3,buildProperties()) 
> .selectExpr("id","appkey","funnel_name")
> data.show(100, false)  
> {code}
>  
> The result partitions info is :
>  20/08/05 16:58:59 INFO JDBCRelation: Number of partitions: 3, WHERE clauses 
> of these partitions: `id` < 3 or `id` is null, `id` >= 3 AND `id` < 4, `id` 
> >= 4
> The returned data is:
> ||id|| appkey||funnel_name||
> |0|yanshi|test001|
> |1|yanshi|test002|
> |2|yanshi|test003|
> |3|xingkong|test_funnel|
> |4|xingkong|test_funnel2|
> |5|xingkong|test_funnel3|
> |6|donews|test_funnel4|
> |7|donews|test_funnel|
> |8|donews|test_funnel2|
> |9|dami|test_funnel3|
> |13|dami|test_funnel4|
> |15|xiaoai|test_funnel6|
>  
> Normally, the clause of the first partition should be " `id` >= 2 and `id` < 3 
> " because the lowerBound is 2, and the clause of the last partition should 
> be " `id` >= 4 and `id` < 5 ", but that is not what happens.
>  
>  
>   






[jira] [Issue Comment Deleted] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Farhan Khan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Farhan Khan updated SPARK-32675:

Comment: was deleted

(was: Implementing PR: [https://github.com/apache/spark/pull/29499])

> --py-files option is appended without passing value for it
> --
>
> Key: SPARK-32675
> URL: https://issues.apache.org/jira/browse/SPARK-32675
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Farhan Khan
>Priority: Major
>
> The --py-files option is appended in a hardcoded manner when an application is 
> submitted to a Mesos cluster in cluster mode using the REST Submission API. 
> This causes a simple Java-based SparkPi job to fail.
> This bug was introduced by SPARK-26466.
> Here is the example job submission:
> {code:bash}
> curl -X POST http://localhost:7077/v1/submissions/create --header 
> "Content-Type:application/json" --data '{
> "action": "CreateSubmissionRequest",
> "appResource": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
> "clientSparkVersion": "3.0.0",
> "appArgs": ["30"],
> "environmentVariables": {},
> "mainClass": "org.apache.spark.examples.SparkPi",
> "sparkProperties": {
>   "spark.jars": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
>   "spark.driver.supervise": "false",
>   "spark.executor.memory": "512m",
>   "spark.driver.memory": "512m",
>   "spark.submit.deployMode": "cluster",
>   "spark.app.name": "SparkPi",
>   "spark.master": "mesos://localhost:5050"
> }}'
> {code}
> Expected Dispatcher output would contain:
> {code:bash}
> 20/08/20 20:19:57 WARN DependencyUtils: Local jar 
> /var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
>  does not exist, skipping.
> Error: Failed to load class org.apache.spark.examples.SparkPi.
> 20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
> {code}






[jira] [Updated] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Farhan Khan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Farhan Khan updated SPARK-32675:

Description: 
The --py-files option is appended in a hardcoded manner when an application is 
submitted to a Mesos cluster in cluster mode using the REST Submission API. This 
causes a simple Java-based SparkPi job to fail.

This bug was introduced by SPARK-26466.

Here is the example job submission:
{code:bash}
curl -X POST http://localhost:7077/v1/submissions/create --header 
"Content-Type:application/json" --data '{
"action": "CreateSubmissionRequest",
"appResource": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
"clientSparkVersion": "3.0.0",
"appArgs": ["30"],
"environmentVariables": {},
"mainClass": "org.apache.spark.examples.SparkPi",
"sparkProperties": {
  "spark.jars": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
  "spark.driver.supervise": "false",
  "spark.executor.memory": "512m",
  "spark.driver.memory": "512m",
  "spark.submit.deployMode": "cluster",
  "spark.app.name": "SparkPi",
  "spark.master": "mesos://localhost:5050"
}}'
{code}

Expected Dispatcher output would contain:

{code:bash}
20/08/20 20:19:57 WARN DependencyUtils: Local jar 
/var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
 does not exist, skipping.
Error: Failed to load class org.apache.spark.examples.SparkPi.
20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
{code}

  was:
The --py-files option is appended in a hardcoded manner when an application is 
submitted to a Mesos cluster in cluster mode using the REST Submission API. This 
causes a simple Java-based SparkPi job to fail.

This bug was introduced by SPARK-26466.

Here is the example job submission:
{code:bash}
curl -X POST http://localhost:7077/v1/submissions/create --header 
"Content-Type:application/json" --data '{
"action": "CreateSubmissionRequest",
"appResource": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
"clientSparkVersion": "3.0.0",
"appArgs": ["30"],
"environmentVariables": {},
"mainClass": "org.apache.spark.examples.SparkPi",
"sparkProperties": {
  "spark.jars": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
  "spark.driver.supervise": "false",
  "spark.executor.memory": "512m",
  "spark.driver.memory": "512m",
  "spark.submit.deployMode": "cluster",
  "spark.app.name": "SparkPi",
  "spark.master": "mesos://localhost:5050"
}}'
{code}
Expected Dispatcher output would contain:
{code}
20/08/20 20:19:57 WARN DependencyUtils: Local jar 
/var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
 does not exist, skipping.
Error: Failed to load class org.apache.spark.examples.SparkPi.
20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
{code}


> --py-files option is appended without passing value for it
> --
>
> Key: SPARK-32675
> URL: https://issues.apache.org/jira/browse/SPARK-32675
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Farhan Khan
>Priority: Major
>
> The --py-files option is appended in a hardcoded manner when an application is 
> submitted to a Mesos cluster in cluster mode using the REST Submission API. 
> This causes a simple Java-based SparkPi job to fail.
> This bug was introduced by SPARK-26466.
> Here is the example job submission:
> {code:bash}
> curl -X POST http://localhost:7077/v1/submissions/create --header 
> "Content-Type:application/json" --data '{
> "action": "CreateSubmissionRequest",
> "appResource": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
> "clientSparkVersion": "3.0.0",
> "appArgs": ["30"],
> "environmentVariables": {},
> "mainClass": "org.apache.spark.examples.SparkPi",
> "sparkProperties": {
>   "spark.jars": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
>   "spark.driver.supervise": "false",
>   "spark.executor.memory": "512m",
>   "spark.driver.memory": "512m",
>   "spark.submit.deployMode": "cluster",
>   "spark.app.name": "SparkPi",
>   "spark.master": "mesos://localhost:5050"
> }}'
> {code}
> Expected Dispatcher output 

[jira] [Commented] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Farhan Khan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181513#comment-17181513
 ] 

Farhan Khan commented on SPARK-32675:
-

Implementing PR: [https://github.com/apache/spark/pull/29499]

> --py-files option is appended without passing value for it
> --
>
> Key: SPARK-32675
> URL: https://issues.apache.org/jira/browse/SPARK-32675
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Farhan Khan
>Priority: Major
>
> The --py-files option is appended in a hardcoded manner when an application is 
> submitted to a Mesos cluster in cluster mode using the REST Submission API. 
> This causes a simple Java-based SparkPi job to fail.
> This bug was introduced by SPARK-26466.
> Here is the example job submission:
> {code:bash}
> curl -X POST http://localhost:7077/v1/submissions/create --header 
> "Content-Type:application/json" --data '{
> "action": "CreateSubmissionRequest",
> "appResource": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
> "clientSparkVersion": "3.0.0",
> "appArgs": ["30"],
> "environmentVariables": {},
> "mainClass": "org.apache.spark.examples.SparkPi",
> "sparkProperties": {
>   "spark.jars": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
>   "spark.driver.supervise": "false",
>   "spark.executor.memory": "512m",
>   "spark.driver.memory": "512m",
>   "spark.submit.deployMode": "cluster",
>   "spark.app.name": "SparkPi",
>   "spark.master": "mesos://localhost:5050"
> }}'
> {code}
> Expected Dispatcher output would contain:
> {code}
> 20/08/20 20:19:57 WARN DependencyUtils: Local jar 
> /var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
>  does not exist, skipping.
> Error: Failed to load class org.apache.spark.examples.SparkPi.
> 20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
> {code}






[jira] [Commented] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181514#comment-17181514
 ] 

Apache Spark commented on SPARK-32675:
--

User 'farhan5900' has created a pull request for this issue:
https://github.com/apache/spark/pull/29499

> --py-files option is appended without passing value for it
> --
>
> Key: SPARK-32675
> URL: https://issues.apache.org/jira/browse/SPARK-32675
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Farhan Khan
>Priority: Major
>
> The --py-files option is appended in a hardcoded manner when an application is 
> submitted to a Mesos cluster in cluster mode using the REST Submission API. 
> This causes a simple Java-based SparkPi job to fail.
> This bug was introduced by SPARK-26466.
> Here is the example job submission:
> {code:bash}
> curl -X POST http://localhost:7077/v1/submissions/create --header 
> "Content-Type:application/json" --data '{
> "action": "CreateSubmissionRequest",
> "appResource": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
> "clientSparkVersion": "3.0.0",
> "appArgs": ["30"],
> "environmentVariables": {},
> "mainClass": "org.apache.spark.examples.SparkPi",
> "sparkProperties": {
>   "spark.jars": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
>   "spark.driver.supervise": "false",
>   "spark.executor.memory": "512m",
>   "spark.driver.memory": "512m",
>   "spark.submit.deployMode": "cluster",
>   "spark.app.name": "SparkPi",
>   "spark.master": "mesos://localhost:5050"
> }}'
> {code}
> Expected Dispatcher output would contain:
> {code}
> 20/08/20 20:19:57 WARN DependencyUtils: Local jar 
> /var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
>  does not exist, skipping.
> Error: Failed to load class org.apache.spark.examples.SparkPi.
> 20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
> {code}






[jira] [Assigned] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32675:


Assignee: (was: Apache Spark)

> --py-files option is appended without passing value for it
> --
>
> Key: SPARK-32675
> URL: https://issues.apache.org/jira/browse/SPARK-32675
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Farhan Khan
>Priority: Major
>
> When an application is submitted to a Mesos cluster in cluster mode through 
> the REST Submission API, the --py-files option is appended in a hardcoded 
> manner without a value. This causes a simple Java-based SparkPi job to fail.
> This bug was introduced by SPARK-26466.
> Here is the example job submission:
> {code:bash}
> curl -X POST http://localhost:7077/v1/submissions/create --header 
> "Content-Type:application/json" --data '{
> "action": "CreateSubmissionRequest",
> "appResource": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
> "clientSparkVersion": "3.0.0",
> "appArgs": ["30"],
> "environmentVariables": {},
> "mainClass": "org.apache.spark.examples.SparkPi",
> "sparkProperties": {
>   "spark.jars": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
>   "spark.driver.supervise": "false",
>   "spark.executor.memory": "512m",
>   "spark.driver.memory": "512m",
>   "spark.submit.deployMode": "cluster",
>   "spark.app.name": "SparkPi",
>   "spark.master": "mesos://localhost:5050"
> }}'
> {code}
> Expected Dispatcher output would contain:
> {code}
> 20/08/20 20:19:57 WARN DependencyUtils: Local jar 
> /var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
>  does not exist, skipping.
> Error: Failed to load class org.apache.spark.examples.SparkPi.
> 20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32675:


Assignee: Apache Spark

> --py-files option is appended without passing value for it
> --
>
> Key: SPARK-32675
> URL: https://issues.apache.org/jira/browse/SPARK-32675
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Farhan Khan
>Assignee: Apache Spark
>Priority: Major
>
> When an application is submitted to a Mesos cluster in cluster mode through 
> the REST Submission API, the --py-files option is appended in a hardcoded 
> manner without a value. This causes a simple Java-based SparkPi job to fail.
> This bug was introduced by SPARK-26466.
> Here is the example job submission:
> {code:bash}
> curl -X POST http://localhost:7077/v1/submissions/create --header 
> "Content-Type:application/json" --data '{
> "action": "CreateSubmissionRequest",
> "appResource": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
> "clientSparkVersion": "3.0.0",
> "appArgs": ["30"],
> "environmentVariables": {},
> "mainClass": "org.apache.spark.examples.SparkPi",
> "sparkProperties": {
>   "spark.jars": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
>   "spark.driver.supervise": "false",
>   "spark.executor.memory": "512m",
>   "spark.driver.memory": "512m",
>   "spark.submit.deployMode": "cluster",
>   "spark.app.name": "SparkPi",
>   "spark.master": "mesos://localhost:5050"
> }}'
> {code}
> Expected Dispatcher output would contain:
> {code}
> 20/08/20 20:19:57 WARN DependencyUtils: Local jar 
> /var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
>  does not exist, skipping.
> Error: Failed to load class org.apache.spark.examples.SparkPi.
> 20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Farhan Khan (Jira)
Farhan Khan created SPARK-32675:
---

 Summary: --py-files option is appended without passing value for it
 Key: SPARK-32675
 URL: https://issues.apache.org/jira/browse/SPARK-32675
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 3.0.0
Reporter: Farhan Khan


When an application is submitted to a Mesos cluster in cluster mode through the 
REST Submission API, the --py-files option is appended in a hardcoded manner 
without a value. This causes a simple Java-based SparkPi job to fail.

This bug was introduced by SPARK-26466.

Here is the example job submission:
{code:bash}
curl -X POST http://localhost:7077/v1/submissions/create --header 
"Content-Type:application/json" --data '{
"action": "CreateSubmissionRequest",
"appResource": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
"clientSparkVersion": "3.0.0",
"appArgs": ["30"],
"environmentVariables": {},
"mainClass": "org.apache.spark.examples.SparkPi",
"sparkProperties": {
  "spark.jars": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
  "spark.driver.supervise": "false",
  "spark.executor.memory": "512m",
  "spark.driver.memory": "512m",
  "spark.submit.deployMode": "cluster",
  "spark.app.name": "SparkPi",
  "spark.master": "mesos://localhost:5050"
}}'
{code}
Expected Dispatcher output would contain:
{code}
20/08/20 20:19:57 WARN DependencyUtils: Local jar 
/var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
 does not exist, skipping.
Error: Failed to load class org.apache.spark.examples.SparkPi.
20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
{code}
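
For illustration, a minimal sketch of the fix direction (the helper below is 
invented for this example and is not the actual Mesos dispatcher code): only 
forward --py-files when a non-empty value was actually supplied, so the flag 
can never swallow the next argument, which appears to be what happens in the 
log above, where spark.driver.memory=512m ends up treated as a local jar path.

{code:scala}
// Hypothetical helper, not the real Spark code: append --py-files only when a
// non-empty value exists, so "--py-files" never consumes the following argument.
def pyFilesArgs(sparkProperties: Map[String, String]): Seq[String] =
  sparkProperties.get("spark.submit.pyFiles").filter(_.trim.nonEmpty) match {
    case Some(files) => Seq("--py-files", files)
    case None        => Seq.empty
  }

// With the submission above (no spark.submit.pyFiles set), nothing is appended:
pyFilesArgs(Map("spark.driver.memory" -> "512m"))       // Seq()
pyFilesArgs(Map("spark.submit.pyFiles" -> "deps.zip"))  // Seq("--py-files", "deps.zip")
{code}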



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32673) Pyspark/cloudpickle.py - no module named 'wfdb'

2020-08-20 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181504#comment-17181504
 ] 

Takeshi Yamamuro commented on SPARK-32673:
--

Could you please show us an example query to reproduce this?

> Pyspark/cloudpickle.py - no module named 'wfdb'
> ---
>
> Key: SPARK-32673
> URL: https://issues.apache.org/jira/browse/SPARK-32673
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Sandy Su
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Running Spark in a Databricks notebook.
>  
> Ran into this issue when executing a cell:
> (1) Spark Jobs
> SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 
> 4 times, most recent failure: Lost task 0.3 in stage 17.0 (TID 68, 
> 10.139.64.5, executor 0): org.apache.spark.api.python.PythonException: 
> Traceback (most recent call last): File 
> "/databricks/spark/python/pyspark/serializers.py", line 177, in 
> _read_with_length return self.loads(obj) File 
> "/databricks/spark/python/pyspark/serializers.py", line 466, in loads return 
> pickle.loads(obj, encoding=encoding) File 
> "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport 
> __import__(name) ModuleNotFoundError: No module named 'wfdb' During handling 
> of the above exception, another exception occurred: Traceback (most recent 
> call last): File "/databricks/spark/python/pyspark/worker.py", line 644, in 
> main func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, 
> eval_type) File "/databricks/spark/python/pyspark/worker.py", line 463, in 
> read_udfs udfs.append(read_single_udf(pickleSer, infile, eval_type, 
> runner_conf, udf_index=i)) File "/databricks/spark/python/pyspark/worker.py", 
> line 254, in read_single_udf f, return_type = read_command(pickleSer, infile) 
> File "/databricks/spark/python/pyspark/worker.py", line 74, in read_command 
> command = serializer._read_with_length(file) File 
> "/databricks/spark/python/pyspark/serializers.py", line 180, in 
> _read_with_length raise SerializationError("Caused by " + 
> traceback.format_exc()) pyspark.serializers.SerializationError: Caused by 
> Traceback (most recent call last): File 
> "/databricks/spark/python/pyspark/serializers.py", line 177, in 
> _read_with_length return self.loads(obj) File 
> "/databricks/spark/python/pyspark/serializers.py", line 466, in loads return 
> pickle.loads(obj, encoding=encoding) File 
> "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport 
> __import__(name) ModuleNotFoundError: No module named 'wfdb'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32673) Pyspark/cloudpickle.py - no module named 'wfdb'

2020-08-20 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32673:
-
Flags:   (was: Important)

> Pyspark/cloudpickle.py - no module named 'wfdb'
> ---
>
> Key: SPARK-32673
> URL: https://issues.apache.org/jira/browse/SPARK-32673
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Sandy Su
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Running Spark in a Databricks notebook.
>  
> Ran into this issue when executing a cell:
> (1) Spark Jobs
> SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 
> 4 times, most recent failure: Lost task 0.3 in stage 17.0 (TID 68, 
> 10.139.64.5, executor 0): org.apache.spark.api.python.PythonException: 
> Traceback (most recent call last): File 
> "/databricks/spark/python/pyspark/serializers.py", line 177, in 
> _read_with_length return self.loads(obj) File 
> "/databricks/spark/python/pyspark/serializers.py", line 466, in loads return 
> pickle.loads(obj, encoding=encoding) File 
> "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport 
> __import__(name) ModuleNotFoundError: No module named 'wfdb' During handling 
> of the above exception, another exception occurred: Traceback (most recent 
> call last): File "/databricks/spark/python/pyspark/worker.py", line 644, in 
> main func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, 
> eval_type) File "/databricks/spark/python/pyspark/worker.py", line 463, in 
> read_udfs udfs.append(read_single_udf(pickleSer, infile, eval_type, 
> runner_conf, udf_index=i)) File "/databricks/spark/python/pyspark/worker.py", 
> line 254, in read_single_udf f, return_type = read_command(pickleSer, infile) 
> File "/databricks/spark/python/pyspark/worker.py", line 74, in read_command 
> command = serializer._read_with_length(file) File 
> "/databricks/spark/python/pyspark/serializers.py", line 180, in 
> _read_with_length raise SerializationError("Caused by " + 
> traceback.format_exc()) pyspark.serializers.SerializationError: Caused by 
> Traceback (most recent call last): File 
> "/databricks/spark/python/pyspark/serializers.py", line 177, in 
> _read_with_length return self.loads(obj) File 
> "/databricks/spark/python/pyspark/serializers.py", line 466, in loads return 
> pickle.loads(obj, encoding=encoding) File 
> "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport 
> __import__(name) ModuleNotFoundError: No module named 'wfdb'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-32672:

Affects Version/s: 3.1.0

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results can 
> change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32674) Add suggestion for parallel directory listing in tuning doc

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32674:


Assignee: Apache Spark

> Add suggestion for parallel directory listing in tuning doc
> ---
>
> Key: SPARK-32674
> URL: https://issues.apache.org/jira/browse/SPARK-32674
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Minor
>
> Sometimes directory listing can become a bottleneck when user jobs have a 
> large number of input directories. This is especially true when running 
> against an object store like S3. 
> There are a few parameters to tune this. This proposes adding some info to 
> the tuning guide so that the knowledge can be better shared. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32674) Add suggestion for parallel directory listing in tuning doc

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32674:


Assignee: (was: Apache Spark)

> Add suggestion for parallel directory listing in tuning doc
> ---
>
> Key: SPARK-32674
> URL: https://issues.apache.org/jira/browse/SPARK-32674
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Minor
>
> Sometimes directory listing can become a bottleneck when user jobs have a 
> large number of input directories. This is especially true when running 
> against an object store like S3. 
> There are a few parameters to tune this. This proposes adding some info to 
> the tuning guide so that the knowledge can be better shared. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32674) Add suggestion for parallel directory listing in tuning doc

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181493#comment-17181493
 ] 

Apache Spark commented on SPARK-32674:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29498

> Add suggestion for parallel directory listing in tuning doc
> ---
>
> Key: SPARK-32674
> URL: https://issues.apache.org/jira/browse/SPARK-32674
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Minor
>
> Sometimes directory listing can become a bottleneck when user jobs have a 
> large number of input directories. This is especially true when running 
> against an object store like S3. 
> There are a few parameters to tune this. This proposes adding some info to 
> the tuning guide so that the knowledge can be better shared. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32646) ORC predicate pushdown should work with case-insensitive analysis

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181492#comment-17181492
 ] 

Apache Spark commented on SPARK-32646:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29498

> ORC predicate pushdown should work with case-insensitive analysis
> -
>
> Key: SPARK-32646
> URL: https://issues.apache.org/jira/browse/SPARK-32646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently ORC predicate pushdown doesn't work with case-insensitive analysis; 
> see SPARK-32622 for the test case.
> We should make ORC predicate pushdown work with case-insensitive analysis too.
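
A hedged repro sketch (the path and column names below are made up for 
illustration): with spark.sql.caseSensitive=false, a filter written in a 
different case than the physical ORC column should still appear under 
PushedFilters in the scan node of the plan.

{code:scala}
// Sketch only: write an ORC file whose physical column is "ID", then filter on
// "id" with case-insensitive analysis and inspect the plan for pushed filters.
import spark.implicits._

spark.conf.set("spark.sql.caseSensitive", "false")
Seq((1, "a"), (2, "b")).toDF("ID", "payload")
  .write.mode("overwrite").orc("/tmp/orc_case_test")

spark.read.orc("/tmp/orc_case_test").where("id < 2").explain()
// Check whether the FileScan line reports a LessThan filter under PushedFilters
// or leaves it empty.
{code}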



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32674) Add suggestion for parallel directory listing in tuning doc

2020-08-20 Thread Chao Sun (Jira)
Chao Sun created SPARK-32674:


 Summary: Add suggestion for parallel directory listing in tuning 
doc
 Key: SPARK-32674
 URL: https://issues.apache.org/jira/browse/SPARK-32674
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Chao Sun


Sometimes directory listing can become a bottleneck when user jobs have a large 
number of input directories. This is especially true when running against an 
object store like S3. 

There are a few parameters to tune this. This proposes adding some info to the 
tuning guide so that the knowledge can be better shared. 
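
For reference, a sketch of the knobs this likely refers to (the property names 
are taken from the Spark SQL and Hadoop configuration documentation; the values 
below are placeholders, not recommendations):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch: settings that control how aggressively input directories are listed
// in parallel. Tune the values for your own job and storage system.
val spark = SparkSession.builder()
  .appName("parallel-listing-tuning")
  // Switch to job-based parallel partition discovery once this many paths are involved.
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
  // Upper bound on the parallelism used for that discovery job.
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")
  // Threads used by Hadoop FileInputFormat when listing input paths (RDD-based inputs).
  .config("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "16")
  .getOrCreate()
{code}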



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181486#comment-17181486
 ] 

Robert Joseph Evans commented on SPARK-32672:
-

I added some debugging to the compression code and it looks like in the 8th 
CompressedBatch of 10,000 entries the number of nulls seen was different from 
the number expected.

619 expected and 618 seen.  I'll try to debug this a bit more tomorrow.
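
As a rough illustration of why an off-by-one null count matters (this is a toy 
model, not Spark's BooleanBitSet or null-handling code): if non-null booleans 
are bit-packed and nulls are re-inserted by position on decode, then losing one 
null makes that position read a packed bit and shifts everything after it, 
which can surface as a null turning into false.

{code:scala}
// Toy model only -- NOT Spark's implementation.
def encode(vs: Seq[Option[Boolean]]): (Seq[Boolean], Set[Int]) =
  (vs.flatten, vs.zipWithIndex.collect { case (None, i) => i }.toSet)

def decode(bits: Seq[Boolean], nulls: Set[Int], n: Int): Seq[Option[Boolean]] = {
  val it = bits.iterator
  (0 until n).map { i =>
    if (nulls.contains(i)) None
    else if (it.hasNext) Some(it.next())
    else Some(false)  // past the packed bits, unset bits read back as false
  }
}

val data: Seq[Option[Boolean]] = Seq(Some(true), None, Some(false), Some(true))
val (bits, nulls) = encode(data)

decode(bits, nulls, data.length)      // round-trips correctly
decode(bits, nulls - 1, data.length)  // drop one null: position 1 reads a packed
                                      // bit and the tail shifts, so a null comes
                                      // back as a concrete false
{code}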

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results can 
> change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181478#comment-17181478
 ] 

Robert Joseph Evans commented on SPARK-32672:
-

I did a little debugging and found that `BooleanBitSet$Encoder` is being used 
for compression.  There are other data orderings that use the same encoder and 
produce correct results though.
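
Since the description notes that the problem goes away with cache compression 
disabled, a possible stop-gap (assuming the standard SQL config below is what 
controls this code path) is to turn off in-memory columnar compression before 
caching:

{code:scala}
// Workaround sketch: disable compression for the in-memory columnar cache, then
// re-cache and re-check the counts. This trades memory for correctness.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")

val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
bad_order.unpersist()                // drop any previously cached, corrupted copy
bad_order.cache()
bad_order.groupBy("b").count.show()  // expected to match the uncached counts
{code}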

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results can 
> change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181468#comment-17181468
 ] 

Thomas Graves commented on SPARK-32672:
---

[~cloud_fan] [~ruifengz]

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results can 
> change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32672:
--
Affects Version/s: 3.0.1

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results can 
> change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32672:
--
Labels: correctness  (was: )

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results can 
> change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181466#comment-17181466
 ] 

Robert Joseph Evans commented on SPARK-32672:
-

I verified that this is still happening on 3.1.0-SNAPSHOT too.

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results can 
> change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32640:
-
Labels: correctness  (was: )

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness
> Fix For: 3.1.0
>
>
> I was testing Spark 3.1.0 and noticed that log(NaN) now returns null, whereas 
> in Spark 3.0 it returned NaN.  I'm not an expert in this, but I thought NaN 
> was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10
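
For completeness, a sketch of how a column like the one above can be built in 
spark-shell (this assumes a single FloatType column named "value"; the exact 
DataFrame the reporter used is not shown):

{code:scala}
// Build the same set of float values shown above and apply log1p to each.
import spark.implicits._

val df = Seq(Float.MinValue, Float.MaxValue, 0.0f, -0.0f, 1.0f, -1.0f, Float.NaN)
  .toDF("value")

df.selectExpr("value", "log1p(value)").show()
// Spark 3.0.0 prints NaN for the NaN input; the 3.1.0 snapshot described here
// printed null instead.
{code}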



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32640:
-
Labels: correction  (was: )

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correction
> Fix For: 3.1.0
>
>
> I was testing Spark 3.1.0 and noticed that log(NaN) now returns null, whereas 
> in Spark 3.0 it returned NaN.  I'm not an expert in this, but I thought NaN 
> was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32640:
-
Labels:   (was: correction)

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.0
>
>
> I was testing Spark 3.1.0 and noticed that log(NaN) now returns null, whereas 
> in Spark 3.0 it returned NaN.  I'm not an expert in this, but I thought NaN 
> was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181459#comment-17181459
 ] 

Robert Joseph Evans commented on SPARK-32672:
-

I verified that this is still happening on 3.0.2-SNAPSHOT.

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results can 
> change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-32672:

Affects Version/s: 2.4.6

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results can 
> change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32673) Pyspark/cloudpickle.py - no module named 'wfdb'

2020-08-20 Thread Sandy Su (Jira)
Sandy Su created SPARK-32673:


 Summary: Pyspark/cloudpickle.py - no module named 'wfdb'
 Key: SPARK-32673
 URL: https://issues.apache.org/jira/browse/SPARK-32673
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Sandy Su


Running Spark in a Databricks notebook.

 

Ran into this issue when executing a cell:



(1) Spark Jobs

SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 4 
times, most recent failure: Lost task 0.3 in stage 17.0 (TID 68, 10.139.64.5, 
executor 0): org.apache.spark.api.python.PythonException: Traceback (most 
recent call last): File "/databricks/spark/python/pyspark/serializers.py", line 
177, in _read_with_length return self.loads(obj) File 
"/databricks/spark/python/pyspark/serializers.py", line 466, in loads return 
pickle.loads(obj, encoding=encoding) File 
"/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport 
__import__(name) ModuleNotFoundError: No module named 'wfdb' During handling of 
the above exception, another exception occurred: Traceback (most recent call 
last): File "/databricks/spark/python/pyspark/worker.py", line 644, in main 
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, 
eval_type) File "/databricks/spark/python/pyspark/worker.py", line 463, in 
read_udfs udfs.append(read_single_udf(pickleSer, infile, eval_type, 
runner_conf, udf_index=i)) File "/databricks/spark/python/pyspark/worker.py", 
line 254, in read_single_udf f, return_type = read_command(pickleSer, infile) 
File "/databricks/spark/python/pyspark/worker.py", line 74, in read_command 
command = serializer._read_with_length(file) File 
"/databricks/spark/python/pyspark/serializers.py", line 180, in 
_read_with_length raise SerializationError("Caused by " + 
traceback.format_exc()) pyspark.serializers.SerializationError: Caused by 
Traceback (most recent call last): File 
"/databricks/spark/python/pyspark/serializers.py", line 177, in 
_read_with_length return self.loads(obj) File 
"/databricks/spark/python/pyspark/serializers.py", line 466, in loads return 
pickle.loads(obj, encoding=encoding) File 
"/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport 
__import__(name) ModuleNotFoundError: No module named 'wfdb'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-32672:

Summary: Data corruption in some cached compressed boolean columns  (was: 
Daat corruption in some cached compressed boolean columns)

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results can 
> change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-32672:

Attachment: bad_order.snappy.parquet

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results can 
> change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32672) Daat corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)
Robert Joseph Evans created SPARK-32672:
---

 Summary: Daat corruption in some cached compressed boolean columns
 Key: SPARK-32672
 URL: https://issues.apache.org/jira/browse/SPARK-32672
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Robert Joseph Evans
 Attachments: bad_order.snappy.parquet

I found that when sorting some boolean data into the cache, the results can 
change when the data is read back out.

It needs to be a non-trivial amount of data, and it is highly dependent on the 
order of the data.  If I disable compression in the cache the issue goes away.  
I was able to make this happen in 3.0.0.  I am going to try and reproduce it in 
other versions too.

I'll attach the parquet file with boolean data in an order that causes this to 
happen. As you can see, after the data is cached a single null value switches 
over to false.

{code}
scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
bad_order: org.apache.spark.sql.DataFrame = [b: boolean]

scala> bad_order.groupBy("b").count.show
+-+-+
|b|count|
+-+-+
| null| 7153|
| true|54334|
|false|54021|
+-+-+


scala> bad_order.cache()
res1: bad_order.type = [b: boolean]

scala> bad_order.groupBy("b").count.show
+-+-+
|b|count|
+-+-+
| null| 7152|
| true|54334|
|false|54022|
+-+-+


scala> 

{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181447#comment-17181447
 ] 

Apache Spark commented on SPARK-32670:
--

User 'anchovYu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29497

> Group exception messages in Catalyst Analyzer in one file
> -
>
> Key: SPARK-32670
> URL: https://issues.apache.org/jira/browse/SPARK-32670
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
>
> To standardize the error messages and ease their maintenance, we can try to 
> group the exception messages into a single file. 
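
A minimal sketch of what that grouping could look like (the object and method 
names below are invented for illustration; the actual file layout is up to the 
implementation):

{code:scala}
// Sketch: one object owns the analyzer's user-facing error text, so the wording
// stays consistent and can be audited or reworded in a single place.
object AnalyzerErrorMessages {
  def unresolvedColumn(name: String, candidates: Seq[String]): String =
    s"cannot resolve '$name'; did you mean one of: ${candidates.mkString(", ")}?"

  def ambiguousReference(name: String, count: Int): String =
    s"reference '$name' is ambiguous; it matches $count columns"
}
{code}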



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32670:


Assignee: Xiao Li  (was: Apache Spark)

> Group exception messages in Catalyst Analyzer in one file
> -
>
> Key: SPARK-32670
> URL: https://issues.apache.org/jira/browse/SPARK-32670
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
>
> To standardize the error messages and ease their maintenance, we can try to 
> group the exception messages into a single file. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181448#comment-17181448
 ] 

Apache Spark commented on SPARK-32670:
--

User 'anchovYu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29497

> Group exception messages in Catalyst Analyzer in one file
> -
>
> Key: SPARK-32670
> URL: https://issues.apache.org/jira/browse/SPARK-32670
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
>
> To standardize the error messages and ease their maintenance, we can try to 
> group the exception messages into a single file. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32670:


Assignee: Apache Spark  (was: Xiao Li)

> Group exception messages in Catalyst Analyzer in one file
> -
>
> Key: SPARK-32670
> URL: https://issues.apache.org/jira/browse/SPARK-32670
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Minor
>
> To standardize the error messages and ease their maintenance, we can try to 
> group the exception messages into a single file. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-31214) Upgrade Janino to 3.1.2

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-31214.
-

> Upgrade Janino to 3.1.2
> ---
>
> Key: SPARK-31214
> URL: https://issues.apache.org/jira/browse/SPARK-31214
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31214) Upgrade Janino to 3.1.2

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31214.
---
Resolution: Invalid

I'm closing this issue as `Invalid` because the upgrade was reverted due to the 
correctness issue reported in SPARK-32640.

> Upgrade Janino to 3.1.2
> ---
>
> Key: SPARK-31214
> URL: https://issues.apache.org/jira/browse/SPARK-31214
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-31214) Upgrade Janino to 3.1.2

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-31214:
---
  Assignee: (was: Jungtaek Lim)

This was reverted via SPARK-32640.

> Upgrade Janino to 3.1.2
> ---
>
> Key: SPARK-31214
> URL: https://issues.apache.org/jira/browse/SPARK-31214
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31214) Upgrade Janino to 3.1.2

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31214:
--
Fix Version/s: (was: 3.1.0)

> Upgrade Janino to 3.1.2
> ---
>
> Key: SPARK-31214
> URL: https://issues.apache.org/jira/browse/SPARK-31214
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31101) Upgrade Janino to 3.0.16

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31101:
--
Fix Version/s: 2.4.6
   3.0.0

> Upgrade Janino to 3.0.16
> 
>
> Key: SPARK-31101
> URL: https://issues.apache.org/jira/browse/SPARK-31101
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> We received reports of user queries failing because Janino throws an error when 
> compiling the generated code. The issue is tracked at janino-compiler/janino#113; it 
> contains the generated code, the symptom (error), and an analysis of 
> the bug, so please refer to that link for more details.
> Janino 3.0.16 contains the PR janino-compiler/janino#114, which enables 
> Janino to compile such queries properly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31101) Upgrade Janino to 3.0.16

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31101:
--
Fix Version/s: (was: 2.4.6)
   (was: 3.0.0)

> Upgrade Janino to 3.0.16
> 
>
> Key: SPARK-31101
> URL: https://issues.apache.org/jira/browse/SPARK-31101
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> We received reports of user queries failing because Janino throws an error when 
> compiling the generated code. The issue is tracked at janino-compiler/janino#113; it 
> contains the generated code, the symptom (error), and an analysis of 
> the bug, so please refer to that link for more details.
> Janino 3.0.16 contains the PR janino-compiler/janino#114, which enables 
> Janino to compile such queries properly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32640:
-

Assignee: Wenchen Fan

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Wenchen Fan
>Priority: Major
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10
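
For reference, a quick plain-JVM check (no Spark involved) confirms that java.lang.Math propagates NaN for these functions, consistent with the expectation that the result should be NaN rather than null:

{code}
// Plain-JVM sanity check: Math.log/log10/log1p all propagate NaN.
println(java.lang.Math.log(Double.NaN))    // NaN
println(java.lang.Math.log10(Double.NaN))  // NaN
println(java.lang.Math.log1p(Double.NaN))  // NaN
{code}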



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32640.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29495
[https://github.com/apache/spark/pull/29495]

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.0
>
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32671) Race condition in MapOutputTracker.getStatistics

2020-08-20 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved SPARK-32671.

Resolution: Invalid

I was mistaken about this issue.

> Race condition in MapOutputTracker.getStatistics
> 
>
> Key: SPARK-32671
> URL: https://issues.apache.org/jira/browse/SPARK-32671
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Andy Grove
>Priority: Major
>
> MapOutputTracker.getStatistics builds an array of partition sizes for a 
> shuffle id and in some cases uses multiple threads running in parallel to 
> update this array. This code is not thread-safe and the output is 
> non-deterministic when there are multiple MapStatus entries for the same 
> partition.
> We have unit tests such as the skewed join tests in AdaptiveQueryExecSuite 
> that depend on the output being deterministic, and intermittent failures in 
> these tests led me to track this bug down.
> The issue is trivial to fix by using an AtomicLong when building the array of 
> partition sizes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32660) Show Avro related API in documentation

2020-08-20 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181404#comment-17181404
 ] 

Rohit Mishra commented on SPARK-32660:
--

[~Gengliang.Wang], Can you please add a description?

> Show Avro related API in documentation
> --
>
> Key: SPARK-32660
> URL: https://issues.apache.org/jira/browse/SPARK-32660
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32668) HiveGenericUDTF initialize UDTF should use StructObjectInspector method

2020-08-20 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181403#comment-17181403
 ] 

Rohit Mishra commented on SPARK-32668:
--

[~ulysses], Can you please add a description?

> HiveGenericUDTF initialize UDTF should use StructObjectInspector method
> ---
>
> Key: SPARK-32668
> URL: https://issues.apache.org/jira/browse/SPARK-32668
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32669) test expression nullability when checking result

2020-08-20 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181401#comment-17181401
 ] 

Rohit Mishra commented on SPARK-32669:
--

[~cloud_fan], Can you please add a description?

> test expression nullability when checking result
> 
>
> Key: SPARK-32669
> URL: https://issues.apache.org/jira/browse/SPARK-32669
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-08-20 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181400#comment-17181400
 ] 

Rohit Mishra commented on SPARK-32670:
--

[~smilegator], Can you kindly add more to the description? It will be helpful 
for others. 

> Group exception messages in Catalyst Analyzer in one file
> -
>
> Key: SPARK-32670
> URL: https://issues.apache.org/jira/browse/SPARK-32670
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
>
> For standardization of error messages and their maintenance, we can try to 
> group the exception messages into a single file. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32667) Script transformation no-serde mode when column less than output length, use null fill

2020-08-20 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181393#comment-17181393
 ] 

Rohit Mishra commented on SPARK-32667:
--

[~angerszhuuu], Can you please add a description?

> Script transformation no-serde mode when column less than output length, use 
> null fill
> ---
>
> Key: SPARK-32667
> URL: https://issues.apache.org/jira/browse/SPARK-32667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24266) Spark client terminates while driver is still running

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181363#comment-17181363
 ] 

Apache Spark commented on SPARK-24266:
--

User 'jkleckner' has created a pull request for this issue:
https://github.com/apache/spark/pull/29496

> Spark client terminates while driver is still running
> -
>
> Key: SPARK-24266
> URL: https://issues.apache.org/jira/browse/SPARK-24266
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.3.0, 3.0.0
>Reporter: Chun Chen
>Priority: Major
> Fix For: 3.1.0
>
>
> {code}
> Warning: Ignoring non-spark config property: Default=system properties 
> included when running spark-submit.
> 18/05/11 14:50:12 WARN Config: Error reading service account token from: 
> [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
> 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
> Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
> 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: N/A
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
>container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>phase: Pending
>status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
> finish...
> 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, 

[jira] [Commented] (SPARK-24266) Spark client terminates while driver is still running

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181362#comment-17181362
 ] 

Apache Spark commented on SPARK-24266:
--

User 'jkleckner' has created a pull request for this issue:
https://github.com/apache/spark/pull/29496

> Spark client terminates while driver is still running
> -
>
> Key: SPARK-24266
> URL: https://issues.apache.org/jira/browse/SPARK-24266
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.3.0, 3.0.0
>Reporter: Chun Chen
>Priority: Major
> Fix For: 3.1.0
>
>
> {code}
> Warning: Ignoring non-spark config property: Default=system properties 
> included when running spark-submit.
> 18/05/11 14:50:12 WARN Config: Error reading service account token from: 
> [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
> 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
> Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
> 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: N/A
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
>container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>phase: Pending
>status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
> finish...
> 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, 

[jira] [Commented] (SPARK-31800) Unable to disable Kerberos when submitting jobs to Kubernetes

2020-08-20 Thread James Boylan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181349#comment-17181349
 ] 

James Boylan commented on SPARK-31800:
--

This issue definitely seems to persist. Note that this testing was done on the 
Spark 3.0 build pre-built for Hadoop 3.2; I have not tested the Spark 3.0 build 
pre-built for Hadoop 2.7.

[~jagadeesh.n] - Were you testing on the same version?

> Unable to disable Kerberos when submitting jobs to Kubernetes
> -
>
> Key: SPARK-31800
> URL: https://issues.apache.org/jira/browse/SPARK-31800
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: James Boylan
>Priority: Major
>
> When you attempt to submit a process to Kubernetes using spark-submit through 
> --master, it returns the exception:
> {code:java}
> 20/05/22 20:25:54 INFO KerberosConfDriverFeatureStep: You have not specified 
> a krb5.conf file locally or via a ConfigMap. Make sure that you have the 
> krb5.conf locally on the driver image.
> Exception in thread "main" org.apache.spark.SparkException: Please specify 
> spark.kubernetes.file.upload.path property.
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:290)
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:246)
> at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:245)
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:165)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:163)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60)
> at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
> at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
> at scala.collection.immutable.List.foldLeft(List.scala:89)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
> at 
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:98)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:221)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:215)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:215)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:188)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 20/05/22 20:25:54 INFO ShutdownHookManager: Shutdown hook called
> 20/05/22 20:25:54 INFO ShutdownHookManager: Deleting directory 
> /private/var/folders/p1/y24myg413wx1l1l52bsdn2hrgq/T/spark-c94db9c5-b8a8-414d-b01d-f6369d31c9b8
>  {code}
> No changes in settings appear to be able to disable Kerberos. This is when 
> running a simple execution of the SparkPi on our lab cluster. The command 
> being used is
> {code:java}
> ./bin/spark-submit --master k8s://https://{api_hostname} --deploy-mode 
> cluster 

[jira] [Updated] (SPARK-32621) "path" option is added again to input paths during infer()

2020-08-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-32621:

Fix Version/s: 3.0.1

> "path" option is added again to input paths during infer()
> --
>
> Key: SPARK-32621
> URL: https://issues.apache.org/jira/browse/SPARK-32621
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> When the "path" option is used when creating a DataFrame, it can cause issues 
> during infer().
> {code:java}
> class TestFileFilter extends PathFilter {
>   override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
> }
> val path = "/tmp"
> val df = spark.range(2)
> df.write.json(path + "/p=1")
> df.write.json(path + "/p=2")
> val extraOptions = Map(
>   "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
>   "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
> )
> // This works fine.
> assert(spark.read.options(extraOptions).json(path).count == 2)
> // The following with "path" option fails with the following:
> // assertion failed: Conflicting directory structures detected. Suspicious 
> paths
> //file:/tmp
> //file:/tmp/p=1
> assert(spark.read.options(extraOptions).format("json").option("path", 
> path).load.count() === 2)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181343#comment-17181343
 ] 

Apache Spark commented on SPARK-32640:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29495

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32640:


Assignee: (was: Apache Spark)

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181342#comment-17181342
 ] 

Apache Spark commented on SPARK-32640:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29495

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32640:


Assignee: Apache Spark

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Major
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32671) Race condition in MapOutputTracker.getStatistics

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32671:


Assignee: Apache Spark

> Race condition in MapOutputTracker.getStatistics
> 
>
> Key: SPARK-32671
> URL: https://issues.apache.org/jira/browse/SPARK-32671
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Andy Grove
>Assignee: Apache Spark
>Priority: Major
>
> MapOutputTracker.getStatistics builds an array of partition sizes for a 
> shuffle id and in some cases uses multiple threads running in parallel to 
> update this array. This code is not thread-safe and the output is 
> non-deterministic when there are multiple MapStatus entries for the same 
> partition.
> We have unit tests such as the skewed join tests in AdaptiveQueryExecSuite 
> that depend on the output being deterministic, and intermittent failures in 
> these tests led me to track this bug down.
> The issue is trivial to fix by using an AtomicLong when building the array of 
> partition sizes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32671) Race condition in MapOutputTracker.getStatistics

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181338#comment-17181338
 ] 

Apache Spark commented on SPARK-32671:
--

User 'andygrove' has created a pull request for this issue:
https://github.com/apache/spark/pull/29494

> Race condition in MapOutputTracker.getStatistics
> 
>
> Key: SPARK-32671
> URL: https://issues.apache.org/jira/browse/SPARK-32671
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Andy Grove
>Priority: Major
>
> MapOutputTracker.getStatistics builds an array of partition sizes for a 
> shuffle id and in some cases uses multiple threads running in parallel to 
> update this array. This code is not thread-safe and the output is 
> non-deterministic when there are multiple MapStatus entries for the same 
> partition.
> We have unit tests such as the skewed join tests in AdaptiveQueryExecSuite 
> that depend on the output being deterministic, and intermittent failures in 
> these tests led me to track this bug down.
> The issue is trivial to fix by using an AtomicLong when building the array of 
> partition sizes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32671) Race condition in MapOutputTracker.getStatistics

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32671:


Assignee: (was: Apache Spark)

> Race condition in MapOutputTracker.getStatistics
> 
>
> Key: SPARK-32671
> URL: https://issues.apache.org/jira/browse/SPARK-32671
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Andy Grove
>Priority: Major
>
> MapOutputTracker.getStatistics builds an array of partition sizes for a 
> shuffle id and in some cases uses multiple threads running in parallel to 
> update this array. This code is not thread-safe and the output is 
> non-deterministic when there are multiple MapStatus entries for the same 
> partition.
> We have unit tests such as the skewed join tests in AdaptiveQueryExecSuite 
> that depend on the output being deterministic, and intermittent failures in 
> these tests led me to track this bug down.
> The issue is trivial to fix by using an AtomicLong when building the array of 
> partition sizes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32671) Race condition in MapOutputTracker.getStatistics

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181337#comment-17181337
 ] 

Apache Spark commented on SPARK-32671:
--

User 'andygrove' has created a pull request for this issue:
https://github.com/apache/spark/pull/29494

> Race condition in MapOutputTracker.getStatistics
> 
>
> Key: SPARK-32671
> URL: https://issues.apache.org/jira/browse/SPARK-32671
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Andy Grove
>Priority: Major
>
> MapOutputTracker.getStatistics builds an array of partition sizes for a 
> shuffle id and in some cases uses multiple threads running in parallel to 
> update this array. This code is not thread-safe and the output is 
> non-deterministic when there are multiple MapStatus entries for the same 
> partition.
> We have unit tests such as the skewed join tests in AdaptiveQueryExecSuite 
> that depend on the output being deterministic, and intermittent failures in 
> these tests led me to track this bug down.
> The issue is trivial to fix by using an AtomicLong when building the array of 
> partition sizes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32671) Race condition in MapOutputTracker.getStatistics

2020-08-20 Thread Andy Grove (Jira)
Andy Grove created SPARK-32671:
--

 Summary: Race condition in MapOutputTracker.getStatistics
 Key: SPARK-32671
 URL: https://issues.apache.org/jira/browse/SPARK-32671
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0, 3.0.1
Reporter: Andy Grove


MapOutputTracker.getStatistics builds an array of partition sizes for a shuffle 
id and in some cases uses multiple threads running in parallel to update this 
array. This code is not thread-safe and the output is non-deterministic when 
there are multiple MapStatus entries for the same partition.

We have unit tests such as the skewed join tests in AdaptiveQueryExecSuite that 
depend on the output being deterministic, and intermittent failures in these 
tests led me to track this bug down.

The issue is trivial to fix by using an AtomicLong when building the array of 
partition sizes.
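
A minimal, self-contained sketch of the race (illustrative only, not the actual MapOutputTracker code): several threads doing unsynchronized updates on a shared Array[Long] can lose increments, while an AtomicLongArray (or per-slot AtomicLong) makes the accumulation deterministic.

{code}
import java.util.concurrent.atomic.AtomicLongArray

val numPartitions = 8
val unsafeSizes = new Array[Long](numPartitions)      // racy read-modify-write
val safeSizes   = new AtomicLongArray(numPartitions)  // atomic per-slot adds

val threads = (1 to 4).map { _ =>
  new Thread(() => {
    for (p <- 0 until numPartitions; _ <- 1 to 10000) {
      unsafeSizes(p) += 1L          // updates can be lost under contention
      safeSizes.addAndGet(p, 1L)    // always sums to 4 * 10000 per slot
    }
  })
}
threads.foreach(_.start())
threads.foreach(_.join())
{code}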



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-08-20 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32670:

Description: For standardization of error messages and their maintenance, we 
can try to group the exception messages into a single file. 

> Group exception messages in Catalyst Analyzer in one file
> -
>
> Key: SPARK-32670
> URL: https://issues.apache.org/jira/browse/SPARK-32670
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
>
> For standardization of error messages and their maintenance, we can try to 
> group the exception messages into a single file. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-08-20 Thread Xiao Li (Jira)
Xiao Li created SPARK-32670:
---

 Summary: Group exception messages in Catalyst Analyzer in one file
 Key: SPARK-32670
 URL: https://issues.apache.org/jira/browse/SPARK-32670
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Xiao Li
Assignee: Xiao Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32067) [K8S] Executor pod template of ongoing submission got inadvertently altered by subsequent submission

2020-08-20 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Summary: [K8S] Executor pod template of ongoing submission got 
inadvertently altered by subsequent submission  (was: [K8S] Executor pod 
template of subsequent submission inadvertently applies to ongoing submission)

> [K8S] Executor pod template of ongoing submission got inadvertently altered 
> by subsequent submission
> 
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submitting two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially, with app2 launching while app1 is still ramping up all of its 
> executor pods. The unwanted result is that some of app1's launched executor 
> pods end up with app2's executor pod template applied to them.
> The root cause appears to be that app1's podspec-configmap gets overwritten by 
> app2 during the overlapping launch periods, because the configmap names of 
> the two apps are the same. As a result, app1 executor pods that are ramped 
> up after app2 is launched are inadvertently created with app2's pod 
> template. The issue can be seen as follows:
> First, after submitting app1, you get these configmaps:
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   9m46s
> default  podspec-configmap  1   12m{code}
> Then submit app2 while app1 is still ramping up its executors. The 
> podspec-configmap is modified by app2.
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  podspec-configmap  1   13m57s{code}
>  
> PROPOSED SOLUTION:
> Properly prefix the podspec-configmap for each submitted app.
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   11m43s
> default  app1--podspec-configmap1   13m57s
> default  app2--driver-conf-map  1   10s 
> default  app2--podspec-configmap1   3m{code}
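
A minimal sketch of the proposed prefixing (assumed helper name, not Spark's actual code):

{code}
// Derive the executor pod template ConfigMap name from the per-app resource
// prefix instead of a fixed shared name, so overlapping submissions cannot
// overwrite each other's podspec-configmap.
def podSpecConfigMapName(resourceNamePrefix: String): String =
  s"$resourceNamePrefix-podspec-configmap"

// podSpecConfigMapName("app1-")  // "app1--podspec-configmap"
// podSpecConfigMapName("app2-")  // "app2--podspec-configmap"
{code}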



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32669) test expression nullability when checking result

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32669:


Assignee: Wenchen Fan  (was: Apache Spark)

> test expression nullability when checking result
> 
>
> Key: SPARK-32669
> URL: https://issues.apache.org/jira/browse/SPARK-32669
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32669) test expression nullability when checking result

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181260#comment-17181260
 ] 

Apache Spark commented on SPARK-32669:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29493

> test expression nullability when checking result
> 
>
> Key: SPARK-32669
> URL: https://issues.apache.org/jira/browse/SPARK-32669
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32669) test expression nullability when checking result

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32669:


Assignee: Apache Spark  (was: Wenchen Fan)

> test expression nullability when checking result
> 
>
> Key: SPARK-32669
> URL: https://issues.apache.org/jira/browse/SPARK-32669
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32669) test expression nullability when checking result

2020-08-20 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-32669:
---

 Summary: test expression nullability when checking result
 Key: SPARK-32669
 URL: https://issues.apache.org/jira/browse/SPARK-32669
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.1.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32650) Not able to Deserialize Mleap bundle with latest spark configuration

2020-08-20 Thread Vaishali Papneja (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181246#comment-17181246
 ] 

Vaishali Papneja commented on SPARK-32650:
--

Hi Takeshi - I have reported the issue to the MLeap community.

> Not able to Deserialize Mleap bundle with latest spark configuration
> 
>
> Key: SPARK-32650
> URL: https://issues.apache.org/jira/browse/SPARK-32650
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Vaishali Papneja
>Priority: Major
>
> Hi,
>  
> I am using a Databricks cluster with Spark 3.0.0 and Scala 2.12, and I want to 
> create machine learning pipelines using MLeap (latest version: 0.16.1).
> In my pipeline, I have created an ensemble model. I am able to serialize the 
> model to a bundle.
> But the issue comes while deserializing it. Below is the error I am facing:
> _java.lang.NoSuchMethodError: 
> org.apache.spark.mllib.tree.impurity.ImpurityCalculator$.getCalculator(Ljava/lang/String;[D)Lorg/apache/spark/mllib/tree/impurity/ImpurityCalculator;_
>  
> I tried using another cluster with Spark 2.4.5, Scala 2.11, and MLeap 0.16.0. 
> There it works fine.
> Please suggest.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32608) Script Transform DELIMIT value should be formatted

2020-08-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-32608:

Fix Version/s: 3.0.1

> Script Transform DELIMIT  value should be formatted
> ---
>
> Key: SPARK-32608
> URL: https://issues.apache.org/jira/browse/SPARK-32608
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> For SQL
>  
> {code:java}
> SELECT TRANSFORM(a, b, c)
>   ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
>   LINES TERMINATED BY '\n'
>   NULL DEFINED AS 'null'
>   USING 'cat' AS (a, b, c)
>   ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
>   LINES TERMINATED BY '\n'
>   NULL DEFINED AS 'NULL'
> FROM testData
> {code}
> The correct values are:
> TOK_TABLEROWFORMATFIELD should be , but is actually ','
> TOK_TABLEROWFORMATLINES should be \n but is actually '\n'
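
A hypothetical sketch of the kind of normalization the fix needs (illustrative only, not the actual parser code): the quoted, escaped delimiter literal from the SQL text should be reduced to the raw character before it is handed to the script transformation writer.

{code}
// Hypothetical helper: "','" -> "," and "'\\n'" -> "\n"
def unquoteDelimiter(value: String): String = {
  val unquoted =
    if (value.length >= 2 && value.startsWith("'") && value.endsWith("'"))
      value.substring(1, value.length - 1)
    else value
  unquoted.replace("\\t", "\t").replace("\\n", "\n")
}
{code}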



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32632) Bad partitioning in spark jdbc method with parameter lowerBound and upperBound

2020-08-20 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181193#comment-17181193
 ] 

Takeshi Yamamuro commented on SPARK-32632:
--

The original motivation for this feature was to use date/timestamp columns as 
partitioning columns: https://issues.apache.org/jira/browse/SPARK-22814. So it is 
not intended for filtering out rows in the first place. If you want to filter 
out these rows, I think you just have to use a predicate like this:
{code}
val data = spark.read.jdbc(url, table, "id", 2, 5, 3, buildProperties())
  .where("2 < id and id < 5")
{code}
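
For reference, a minimal sketch (not the actual JDBCRelation code) of the documented stride-based partitioning, which shows why the first and last partitions are left unbounded: lowerBound and upperBound only decide the stride, they never filter rows.

{code}
def partitionWhereClauses(column: String, lower: Long, upper: Long,
    numPartitions: Int): Seq[String] = {
  val stride = (upper - lower) / numPartitions           // (5 - 2) / 3 = 1
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    val hi = lo + stride
    if (i == 0) s"$column < $hi or $column is null"      // first: no lower bound
    else if (i == numPartitions - 1) s"$column >= $lo"   // last: no upper bound
    else s"$column >= $lo AND $column < $hi"
  }
}

// partitionWhereClauses("id", 2, 5, 3) produces:
//   "id < 3 or id is null", "id >= 3 AND id < 4", "id >= 4"
// matching the WHERE clauses logged by JDBCRelation in the report below.
{code}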

> Bad partitioning in spark jdbc method with parameter lowerBound and upperBound
> --
>
> Key: SPARK-32632
> URL: https://issues.apache.org/jira/browse/SPARK-32632
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liu Dinghua
>Priority: Major
>
> When I use the jdbc method
> {code:java}
> def jdbc( url: String, table: String, columnName: String, lowerBound: Long, 
> upperBound: Long, numPartitions: Int, connectionProperties: Properties)
> {code}
>  
>   I am confused by the partitions generated by this method, because the rows of the 
> first partition aren't limited by the lowerBound and the rows of the last 
> partition aren't limited by the upperBound. 
>   
>  For example, I use the method as follows:
>   
> {code:java}
> val data = spark.read.jdbc(url, table, "id", 2, 5, 3,buildProperties()) 
> .selectExpr("id","appkey","funnel_name")
> data.show(100, false)  
> {code}
>  
> The resulting partition info is:
>  20/08/05 16:58:59 INFO JDBCRelation: Number of partitions: 3, WHERE clauses 
> of these partitions: `id` < 3 or `id` is null, `id` >= 3 AND `id` < 4, `id` 
> >= 4
> The returned data is:
> ||id|| appkey||funnel_name||
> |0|yanshi|test001|
> |1|yanshi|test002|
> |2|yanshi|test003|
> |3|xingkong|test_funnel|
> |4|xingkong|test_funnel2|
> |5|xingkong|test_funnel3|
> |6|donews|test_funnel4|
> |7|donews|test_funnel|
> |8|donews|test_funnel2|
> |9|dami|test_funnel3|
> |13|dami|test_funnel4|
> |15|xiaoai|test_funnel6|
>  
> Normally, the clause of the first partition should be " `id` >=2 and `id` < 3 
> "  because the lowerBound is 2, and the clause of the last partition should 
> be " `id` >= 4 and `id` < 5 ",  but the facts are not.
>  
>  
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32663) TransportClient getting closed when there are outstanding requests to the server

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32663:


Assignee: (was: Apache Spark)

> TransportClient getting closed when there are outstanding requests to the 
> server
> 
>
> Key: SPARK-32663
> URL: https://issues.apache.org/jira/browse/SPARK-32663
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Priority: Major
>
> The implementation of {{removeBlocks}} and {{getHostLocalDirs}} in 
> {{ExternalBlockStoreClient}} closes the client after processing a response in 
> the callback. 
> This is a cached client which will be re-used for other responses. There 
> could be other outstanding requests to the shuffle service, so it should not 
> be closed after processing a response. 
> It seems this is a bug introduced with SPARK-27651 and SPARK-27677. 
> The older methods  {{registerWithShuffleServer}} and {{fetchBlocks}} didn't 
> close the client.
> cc [~attilapiros] [~vanzin] [~mridulm80]
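
A self-contained sketch of the problematic pattern (hypothetical types, not Spark's actual network classes): closing a pooled, shared client inside a response callback breaks any other request still in flight on that client.

{code}
trait RpcCallback { def onSuccess(resp: String): Unit; def onFailure(e: Throwable): Unit }

class PooledClient {                       // stands in for a cached TransportClient
  @volatile private var open = true
  def close(): Unit = open = false
  def sendRpc(msg: String, cb: RpcCallback): Unit =
    if (open) cb.onSuccess(s"ok: $msg")
    else cb.onFailure(new IllegalStateException("client already closed"))
}

val shared = new PooledClient              // one cached client reused for many requests
shared.sendRpc("removeBlocks", new RpcCallback {
  def onSuccess(resp: String): Unit = shared.close()   // the buggy pattern
  def onFailure(e: Throwable): Unit = ()
})
// A later request on the same cached client now fails, even though the pool
// still hands it out:
shared.sendRpc("getHostLocalDirs", new RpcCallback {
  def onSuccess(resp: String): Unit = println(resp)
  def onFailure(e: Throwable): Unit = println(s"unexpected: $e")
})
{code}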



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32663) TransportClient getting closed when there are outstanding requests to the server

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181180#comment-17181180
 ] 

Apache Spark commented on SPARK-32663:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/29492

> TransportClient getting closed when there are outstanding requests to the 
> server
> 
>
> Key: SPARK-32663
> URL: https://issues.apache.org/jira/browse/SPARK-32663
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Priority: Major
>
> The implementation of {{removeBlocks}} and {{getHostLocalDirs}} in 
> {{ExternalBlockStoreClient}} closes the client after processing a response in 
> the callback. 
> This is a cached client which will be re-used for other responses. There 
> could be other outstanding requests to the shuffle service, so it should not 
> be closed after processing a response. 
> It seems this is a bug introduced with SPARK-27651 and SPARK-27677. 
> The older methods  {{registerWithShuffleServer}} and {{fetchBlocks}} didn't 
> close the client.
> cc [~attilapiros] [~vanzin] [~mridulm80]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32663) TransportClient getting closed when there are outstanding requests to the server

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32663:


Assignee: Apache Spark

> TransportClient getting closed when there are outstanding requests to the 
> server
> 
>
> Key: SPARK-32663
> URL: https://issues.apache.org/jira/browse/SPARK-32663
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Assignee: Apache Spark
>Priority: Major
>
> The implementation of {{removeBlocks}} and {{getHostLocalDirs}} in 
> {{ExternalBlockStoreClient}} closes the client after processing a response in 
> the callback. 
> This is a cached client which will be re-used for other responses. There 
> could be other outstanding requests to the shuffle service, so it should not 
> be closed after processing a response. 
> It seems this is a bug introduced with SPARK-27651 and SPARK-27677. 
> The older methods  {{registerWithShuffleServer}} and {{fetchBlocks}} didn't 
> close the client.
> cc [~attilapiros] [~vanzin] [~mridulm80]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32663) TransportClient getting closed when there are outstanding requests to the server

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181181#comment-17181181
 ] 

Apache Spark commented on SPARK-32663:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/29492

> TransportClient getting closed when there are outstanding requests to the 
> server
> 
>
> Key: SPARK-32663
> URL: https://issues.apache.org/jira/browse/SPARK-32663
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Priority: Major
>
> The implementation of {{removeBlocks}} and {{getHostLocalDirs}} in 
> {{ExternalBlockStoreClient}} closes the client after processing a response in 
> the callback. 
> This is a cached client which is re-used for other requests. There 
> could be other outstanding requests to the shuffle service, so it should not 
> be closed after processing a response. 
> This appears to be a bug introduced with SPARK-27651 and SPARK-27677. 
> The older methods {{registerWithShuffleServer}} and {{fetchBlocks}} didn't 
> close the client.
> cc [~attilapiros] [~vanzin] [~mridulm80]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32665) Deletes orphan directories under a warehouse dir in SQLQueryTestSuite

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32665:
-

Assignee: Takeshi Yamamuro

> Deletes orphan directories under a warehouse dir in SQLQueryTestSuite
> -
>
> Key: SPARK-32665
> URL: https://issues.apache.org/jira/browse/SPARK-32665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> If a previous SQLQueryTestSuite run is killed, the next run fails with the 
> following error:
> {code}
> [info] org.apache.spark.sql.SQLQueryTestSuite *** ABORTED *** (17 seconds, 
> 483 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Can not create the managed 
> table('`testdata`'). The associated 
> location('file:/Users/maropu/Repositories/spark/spark-master/sql/core/spark-warehouse/org.apache.spark.sql.SQLQueryTestSuite/testdata')
>  already exists.;
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:355)
> [info]   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:170)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
> {code}
> This ticket adds code to delete orphan directories under the warehouse dir 
> in SQLQueryTestSuite before creating test tables.
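
A minimal, JDK-only sketch of the kind of cleanup described above; the object and method names are hypothetical and this is not the actual SQLQueryTestSuite change.

{code}
import java.io.File

object WarehouseCleanupSketch {
  // Recursively delete a file or directory tree using plain JDK I/O.
  private def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) {
      Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    }
    f.delete()
  }

  // Remove any table directories left under the test warehouse dir by a
  // previous (possibly killed) run, before the suite creates its test tables.
  def cleanOrphanTableDirs(warehouseDir: File): Unit = {
    Option(warehouseDir.listFiles())
      .getOrElse(Array.empty[File])
      .filter(_.isDirectory)
      .foreach(deleteRecursively)
  }
}
{code}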



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32665) Deletes orphan directories under a warehouse dir in SQLQueryTestSuite

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32665.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29488
[https://github.com/apache/spark/pull/29488]

> Deletes orphan directories under a warehouse dir in SQLQueryTestSuite
> -
>
> Key: SPARK-32665
> URL: https://issues.apache.org/jira/browse/SPARK-32665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.1.0
>
>
> If a previous SQLQueryTestSuite run is killed, the next run fails with the 
> following error:
> {code}
> [info] org.apache.spark.sql.SQLQueryTestSuite *** ABORTED *** (17 seconds, 
> 483 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Can not create the managed 
> table('`testdata`'). The associated 
> location('file:/Users/maropu/Repositories/spark/spark-master/sql/core/spark-warehouse/org.apache.spark.sql.SQLQueryTestSuite/testdata')
>  already exists.;
> [info]   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:355)
> [info]   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:170)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
> {code}
> This ticket adds code to delete orphan directories under the warehouse dir 
> in SQLQueryTestSuite before creating test tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


