[jira] [Created] (SPARK-32814) Metaclasses are broken for a few classes in Python 3

2020-09-07 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-32814:
--

 Summary: Metaclasses are broken for a few classes in Python 3
 Key: SPARK-32814
 URL: https://issues.apache.org/jira/browse/SPARK-32814
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark, SQL
Affects Versions: 3.0.0, 2.4.0, 3.1.0
Reporter: Maciej Szymkiewicz


As of Python 3, {{__metaclass__}} is no longer supported (see 
https://www.python.org/dev/peps/pep-3115/).

However, we have multiple classes which were never migrated to Python 3 
compatible syntax:

- A number of ML {{Params}} with {{__metaclass__ = ABCMeta}}
- Some of the SQL {{types}} with {{__metaclass__ = DataTypeSingleton}}


As a result, some functionality is broken in Python 3. For example:


{code:python}
>>> from pyspark.sql.types import BooleanType
>>> BooleanType() is BooleanType()
False
{code}

or

{code:python}
>>> import inspect
>>> from pyspark.ml import Estimator
>>> inspect.isabstract(Estimator)
False
{code}

In both cases we expect to see {{True}}.
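For reference, a minimal sketch of the migration (class and method names here are illustrative, not the actual PySpark definitions): the Python 2 {{__metaclass__}} attribute is silently ignored by Python 3, while the {{metaclass=}} keyword argument restores the intended behaviour.

{code:python}
from abc import ABCMeta, abstractmethod
import inspect

# Python 2 style: Python 3 treats __metaclass__ as an ordinary class attribute,
# so ABCMeta is never applied and the class is not actually abstract.
class OldStyleEstimator(object):
    __metaclass__ = ABCMeta

    @abstractmethod
    def _fit(self, dataset):
        ...

# Python 3 style: the metaclass keyword is honoured and the class is abstract.
class NewStyleEstimator(metaclass=ABCMeta):
    @abstractmethod
    def _fit(self, dataset):
        ...

print(inspect.isabstract(OldStyleEstimator))  # False
print(inspect.isabstract(NewStyleEstimator))  # True
{code}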






[jira] [Commented] (SPARK-31511) Make BytesToBytesMap iterator() thread-safe

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191954#comment-17191954
 ] 

Apache Spark commented on SPARK-31511:
--

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/29669

> Make BytesToBytesMap iterator() thread-safe
> ---
>
> Key: SPARK-31511
> URL: https://issues.apache.org/jira/browse/SPARK-31511
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 
> 3.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.7, 3.1.0, 3.0.2
>
>
> BytesToBytesMap currently has both a thread-safe and a non-thread-safe 
> iterator method, which is somewhat confusing. We should make iterator() 
> thread-safe and remove the safeIterator() function.






[jira] [Commented] (SPARK-31511) Make BytesToBytesMap iterator() thread-safe

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191953#comment-17191953
 ] 

Apache Spark commented on SPARK-31511:
--

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/29669

> Make BytesToBytesMap iterator() thread-safe
> ---
>
> Key: SPARK-31511
> URL: https://issues.apache.org/jira/browse/SPARK-31511
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 
> 3.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.7, 3.1.0, 3.0.2
>
>
> BytesToBytesMap currently has both a thread-safe and a non-thread-safe 
> iterator method, which is somewhat confusing. We should make iterator() 
> thread-safe and remove the safeIterator() function.






[jira] [Resolved] (SPARK-32736) Avoid caching the removed decommissioned executors in TaskSchedulerImpl

2020-09-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32736.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29579
[https://github.com/apache/spark/pull/29579]

> Avoid caching the removed decommissioned executors in TaskSchedulerImpl
> ---
>
> Key: SPARK-32736
> URL: https://issues.apache.org/jira/browse/SPARK-32736
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0
>
>
> We can save the host directly in the ExecutorDecommissionState. Then, when 
> the executor is lost, we can unregister the shuffle map status on that host, 
> so we no longer need to hold the cache and wait for a FetchFailureException 
> to do the unregistration.






[jira] [Updated] (SPARK-32764) compare of -0.0 < 0.0 return true

2020-09-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32764:
--
Fix Version/s: (was: 3.0.1)
   3.0.2

> compare of -0.0 < 0.0 return true
> -
>
> Key: SPARK-32764
> URL: https://issues.apache.org/jira/browse/SPARK-32764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Izek Greenfield
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness
> Fix For: 3.1.0, 3.0.2
>
> Attachments: 2.4_codegen.txt, 3.0_Codegen.txt
>
>
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.col
>
> val spark: SparkSession = SparkSession
>   .builder()
>   .master("local")
>   .appName("SparkByExamples.com")
>   .getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.sqlContext.implicits._
>
> val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
>   .withColumn("comp", col("neg") < col("pos"))
> df.show(false)
>
> // Output:
> +----+---+----+
> |neg |pos|comp|
> +----+---+----+
> |-0.0|0.0|true|
> +----+---+----+
> {code}
> I think that result should be false.
> **Apache Spark 2.4.6 RESULT**
> {code}
> scala> spark.version
> res0: String = 2.4.6
> scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < 
> col("pos")).show
> +----+---+-----+
> | neg|pos| comp|
> +----+---+-----+
> |-0.0|0.0|false|
> +----+---+-----+
> {code}
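For reference, IEEE-754 (and therefore plain Python) treats -0.0 and 0.0 as equal, which is why {{false}} is the expected result; a quick check outside Spark:

{code:python}
# Plain Python reference behaviour (no Spark involved): -0.0 and 0.0 compare
# equal under IEEE-754, so "<" is False, matching the Spark 2.4.6 output above.
print(-0.0 < 0.0)   # False
print(-0.0 == 0.0)  # True
{code}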






[jira] [Resolved] (SPARK-32764) compare of -0.0 < 0.0 return true

2020-09-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32764.
---
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29647
[https://github.com/apache/spark/pull/29647]

> compare of -0.0 < 0.0 return true
> -
>
> Key: SPARK-32764
> URL: https://issues.apache.org/jira/browse/SPARK-32764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Izek Greenfield
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.1, 3.1.0
>
> Attachments: 2.4_codegen.txt, 3.0_Codegen.txt
>
>
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.col
>
> val spark: SparkSession = SparkSession
>   .builder()
>   .master("local")
>   .appName("SparkByExamples.com")
>   .getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.sqlContext.implicits._
>
> val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
>   .withColumn("comp", col("neg") < col("pos"))
> df.show(false)
>
> // Output:
> +----+---+----+
> |neg |pos|comp|
> +----+---+----+
> |-0.0|0.0|true|
> +----+---+----+
> {code}
> I think that result should be false.
> **Apache Spark 2.4.6 RESULT**
> {code}
> scala> spark.version
> res0: String = 2.4.6
> scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < 
> col("pos")).show
> +----+---+-----+
> | neg|pos| comp|
> +----+---+-----+
> |-0.0|0.0|false|
> +----+---+-----+
> {code}






[jira] [Assigned] (SPARK-32764) compare of -0.0 < 0.0 return true

2020-09-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32764:
-

Assignee: Wenchen Fan

> compare of -0.0 < 0.0 return true
> -
>
> Key: SPARK-32764
> URL: https://issues.apache.org/jira/browse/SPARK-32764
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Izek Greenfield
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness
> Attachments: 2.4_codegen.txt, 3.0_Codegen.txt
>
>
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.col
>
> val spark: SparkSession = SparkSession
>   .builder()
>   .master("local")
>   .appName("SparkByExamples.com")
>   .getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.sqlContext.implicits._
>
> val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
>   .withColumn("comp", col("neg") < col("pos"))
> df.show(false)
>
> // Output:
> +----+---+----+
> |neg |pos|comp|
> +----+---+----+
> |-0.0|0.0|true|
> +----+---+----+
> {code}
> I think that result should be false.
> **Apache Spark 2.4.6 RESULT**
> {code}
> scala> spark.version
> res0: String = 2.4.6
> scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < 
> col("pos")).show
> +----+---+-----+
> | neg|pos| comp|
> +----+---+-----+
> |-0.0|0.0|false|
> +----+---+-----+
> {code}






[jira] [Resolved] (SPARK-32796) Make withField API support nested struct in array

2020-09-07 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-32796.
-
Resolution: Won't Fix

> Make withField API support nested struct in array
> -
>
> Key: SPARK-32796
> URL: https://issues.apache.org/jira/browse/SPARK-32796
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently {{Column.withField}} only supports {{StructType}}; it does not 
> support nested structs inside {{ArrayType}}. We can support {{ArrayType}} to 
> make the API more general and useful.
> It could significantly simplify how we add/replace deeply nested fields.
> For example,
> {code}
> private lazy val arrayType = ArrayType(
>   StructType(Seq(
>     StructField("a", IntegerType, nullable = false),
>     StructField("b", IntegerType, nullable = true),
>     StructField("c", IntegerType, nullable = false))),
>   containsNull = true)
>
> private lazy val arrayStructArrayLevel1: DataFrame = spark.createDataFrame(
>   sparkContext.parallelize(
>     Row(Array(Row(Array(Row(1, null, 3)), null, 3))) :: Nil),
>   StructType(Seq(
>     StructField("a", ArrayType(
>       StructType(Seq(
>         StructField("a", arrayType, nullable = false),
>         StructField("b", IntegerType, nullable = true),
>         StructField("c", IntegerType, nullable = false))),
>       containsNull = false)))))
> {code}
> The data looks like:
> {code}
> +---------------------------+
> |a                          |
> +---------------------------+
> |[{[{1, null, 3}], null, 3}]|
> +---------------------------+
> {code}
> In order to replace deeply nested b column, like:
> {code}
> +------------------------+
> |a                       |
> +------------------------+
> |[{[{1, 2, 3}], null, 3}]|
> +------------------------+
> {code}
> Currently by using transform + withField, we probably need:
> {code}
> arrayStructArrayLevel1.withColumn("a",
>   transform($"a", _.withField("a",
>     flatten(transform($"a.a", transform(_, _.withField("b", lit(2))))))))
> {code}
> Using modified withField, we can do it like:
> {code}
> arrayStructArrayLevel1.withColumn("a", $"a".withField("a.b", lit(2)))
> {code}






[jira] [Comment Edited] (SPARK-32354) Fix & re-enable failing R K8s tests

2020-09-07 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191929#comment-17191929
 ] 

Dongjoon Hyun edited comment on SPARK-32354 at 9/8/20, 3:29 AM:


As I reported during 3.0.1 RC, `dev/make-distribution.sh` doesn't work for me 
on my environment. It failed when I enabled R. So, I couldn't even run R IT 
test, [~hyukjin.kwon].


was (Author: dongjoon):
As I reported during 3.0.1 RC, `dev/make-distribution.sh` doesn't work for me 
on my environment. It failed when I enabled R. So, I couldn't even run R IT 
test.

> Fix & re-enable failing R K8s tests
> ---
>
> Key: SPARK-32354
> URL: https://issues.apache.org/jira/browse/SPARK-32354
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, R
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Blocker
>
> The R K8s integration tests have been failing for an extended period of 
> time. Part of the delay was waiting for the R version to be upgraded, but 
> now that it has been upgraded the tests are still failing.






[jira] [Commented] (SPARK-32354) Fix & re-enable failing R K8s tests

2020-09-07 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191929#comment-17191929
 ] 

Dongjoon Hyun commented on SPARK-32354:
---

As I reported during 3.0.1 RC, `dev/make-distribution.sh` doesn't work for me 
on my environment. It failed when I enabled R. So, I couldn't even run R IT 
test.

> Fix & re-enable failing R K8s tests
> ---
>
> Key: SPARK-32354
> URL: https://issues.apache.org/jira/browse/SPARK-32354
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, R
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Blocker
>
> The R K8s integration tests have been failing for an extended period of 
> time. Part of the delay was waiting for the R version to be upgraded, but 
> now that it has been upgraded the tests are still failing.






[jira] [Updated] (SPARK-31511) Make BytesToBytesMap iterator() thread-safe

2020-09-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31511:

Fix Version/s: 2.4.7

> Make BytesToBytesMap iterator() thread-safe
> ---
>
> Key: SPARK-31511
> URL: https://issues.apache.org/jira/browse/SPARK-31511
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 
> 3.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.7, 3.1.0, 3.0.2
>
>
> BytesToBytesMap currently has both a thread-safe and a non-thread-safe 
> iterator method, which is somewhat confusing. We should make iterator() 
> thread-safe and remove the safeIterator() function.






[jira] [Assigned] (SPARK-32812) Run tests script for Python fails in certain environments

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32812:


Assignee: Haejoon Lee

> Run tests script for Python fails in certain environments
> -
>
> Key: SPARK-32812
> URL: https://issues.apache.org/jira/browse/SPARK-32812
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> When running the PySpark tests locally with the "python/run-tests" command, 
> the following error can occur:
>  {code}
> Traceback (most recent call last):
>  File "", line 1, in 
> ...
> raise RuntimeError('''
> RuntimeError:
>  An attempt has been made to start a new process before the
>  current process has finished its bootstrapping phase.
> This probably means that you are not using fork to start your
>  child processes and you have forgotten to use the proper idiom
>  in the main module:
> if __name__ == '__main__':
>  freeze_support()
>  ...
> The "freeze_support()" line can be omitted if the program
>  is not going to be frozen to produce an executable.
> Traceback (most recent call last):
> ...
>  raise EOFError
> EOFError
>  {code}
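For context, the quoted error is CPython's standard multiprocessing complaint when new processes are started with the 'spawn' start method outside a {{__main__}} guard; a minimal, hypothetical illustration of the idiom the message refers to (not the actual run-tests code):

{code:python}
# Hypothetical standalone illustration of the guard the error message asks for.
# With the 'spawn' start method, the child re-imports this module, so any code
# that starts new processes must live under the __main__ guard.
import multiprocessing


def work(x):
    return x * x


if __name__ == '__main__':
    multiprocessing.set_start_method("spawn")  # the start method that triggers the error
    multiprocessing.freeze_support()           # only needed when frozen into an executable
    with multiprocessing.Pool(2) as pool:
        print(pool.map(work, range(4)))
{code}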






[jira] [Resolved] (SPARK-32812) Run tests script for Python fails in certain environments

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32812.
--
Fix Version/s: 2.4.8
   3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 29666
[https://github.com/apache/spark/pull/29666]

> Run tests script for Python fails in certain environments
> -
>
> Key: SPARK-32812
> URL: https://issues.apache.org/jira/browse/SPARK-32812
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.1.0, 3.0.2, 2.4.8
>
>
> When running the PySpark tests locally with the "python/run-tests" command, 
> the following error can occur:
>  {code}
> Traceback (most recent call last):
>  File "", line 1, in 
> ...
> raise RuntimeError('''
> RuntimeError:
>  An attempt has been made to start a new process before the
>  current process has finished its bootstrapping phase.
> This probably means that you are not using fork to start your
>  child processes and you have forgotten to use the proper idiom
>  in the main module:
> if __name__ == '__main__':
>  freeze_support()
>  ...
> The "freeze_support()" line can be omitted if the program
>  is not going to be frozen to produce an executable.
> Traceback (most recent call last):
> ...
>  raise EOFError
> EOFError
>  {code}






[jira] [Updated] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment

2020-09-07 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-32813:

Affects Version/s: 3.1.0

> Reading parquet rdd in non columnar mode fails in multithreaded environment
> ---
>
> Key: SPARK-32813
> URL: https://issues.apache.org/jira/browse/SPARK-32813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
> Environment: Spark 3.0.0, Scala 2.12.12
>Reporter: Vladimir Klyushnikov
>Priority: Major
>
> Reading a parquet RDD in non-columnar mode (i.e. with list fields) fails if 
> the Spark session was created in one thread and the RDD is read in another, 
> because the InheritableThreadLocal holding the active session is not 
> propagated. The code below worked perfectly in Spark 2.X but fails in Spark 3:
> {code:scala}
> import java.util.concurrent.Executors
> import org.apache.spark.sql.SparkSession
> import scala.concurrent.{Await, ExecutionContext, Future}
> import scala.concurrent.duration._
> object Main {
>   final case class Data(list: List[Int])
>   def main(args: Array[String]): Unit = {
> val executor1 = Executors.newSingleThreadExecutor()
> val executor2 = Executors.newSingleThreadExecutor()
> try {
>   val ds = Await.result(Future {
> val session = 
> SparkSession.builder().appName("test").master("local[*]").getOrCreate()
> import session.implicits._
> val path = "test.parquet"
> session.createDataset(Data(1 :: Nil) :: Nil).write.parquet(path)
> session.read.parquet(path).as[Data]
>   }(ExecutionContext.fromExecutorService(executor1)), 1.minute)
>   Await.result(Future {
> ds.rdd.collect().foreach(println(_))
>   }(ExecutionContext.fromExecutorService(executor2)), 1.minute)
> } finally {
>   executor1.shutdown()
>   executor2.shutdown()
> }
>   }
> }
> {code}
> This code fails with the following exception:
> {code}
> Exception in thread "main" java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at 
> org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
> org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) 
> at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at 
> org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196)
> {code}






[jira] [Assigned] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment

2020-09-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32813:


Assignee: Apache Spark

> Reading parquet rdd in non columnar mode fails in multithreaded environment
> ---
>
> Key: SPARK-32813
> URL: https://issues.apache.org/jira/browse/SPARK-32813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, Scala 2.12.12
>Reporter: Vladimir Klyushnikov
>Assignee: Apache Spark
>Priority: Major
>
> Reading a parquet RDD in non-columnar mode (i.e. with list fields) fails if 
> the Spark session was created in one thread and the RDD is read in another, 
> because the InheritableThreadLocal holding the active session is not 
> propagated. The code below worked perfectly in Spark 2.X but fails in Spark 3:
> {code:scala}
> import java.util.concurrent.Executors
> import org.apache.spark.sql.SparkSession
> import scala.concurrent.{Await, ExecutionContext, Future}
> import scala.concurrent.duration._
> object Main {
>   final case class Data(list: List[Int])
>   def main(args: Array[String]): Unit = {
> val executor1 = Executors.newSingleThreadExecutor()
> val executor2 = Executors.newSingleThreadExecutor()
> try {
>   val ds = Await.result(Future {
> val session = 
> SparkSession.builder().appName("test").master("local[*]").getOrCreate()
> import session.implicits._
> val path = "test.parquet"
> session.createDataset(Data(1 :: Nil) :: Nil).write.parquet(path)
> session.read.parquet(path).as[Data]
>   }(ExecutionContext.fromExecutorService(executor1)), 1.minute)
>   Await.result(Future {
> ds.rdd.collect().foreach(println(_))
>   }(ExecutionContext.fromExecutorService(executor2)), 1.minute)
> } finally {
>   executor1.shutdown()
>   executor2.shutdown()
> }
>   }
> }
> {code}
> This code fails with the following exception:
> {code}
> Exception in thread "main" java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at 
> org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
> org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) 
> at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at 
> org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196)
> {code}






[jira] [Commented] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191904#comment-17191904
 ] 

Apache Spark commented on SPARK-32813:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29667

> Reading parquet rdd in non columnar mode fails in multithreaded environment
> ---
>
> Key: SPARK-32813
> URL: https://issues.apache.org/jira/browse/SPARK-32813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, Scala 2.12.12
>Reporter: Vladimir Klyushnikov
>Priority: Major
>
> Reading a parquet RDD in non-columnar mode (i.e. with list fields) fails if 
> the Spark session was created in one thread and the RDD is read in another, 
> because the InheritableThreadLocal holding the active session is not 
> propagated. The code below worked perfectly in Spark 2.X but fails in Spark 3:
> {code:scala}
> import java.util.concurrent.Executors
> import org.apache.spark.sql.SparkSession
> import scala.concurrent.{Await, ExecutionContext, Future}
> import scala.concurrent.duration._
> object Main {
>   final case class Data(list: List[Int])
>   def main(args: Array[String]): Unit = {
> val executor1 = Executors.newSingleThreadExecutor()
> val executor2 = Executors.newSingleThreadExecutor()
> try {
>   val ds = Await.result(Future {
> val session = 
> SparkSession.builder().appName("test").master("local[*]").getOrCreate()
> import session.implicits._
> val path = "test.parquet"
> session.createDataset(Data(1 :: Nil) :: Nil).write.parquet(path)
> session.read.parquet(path).as[Data]
>   }(ExecutionContext.fromExecutorService(executor1)), 1.minute)
>   Await.result(Future {
> ds.rdd.collect().foreach(println(_))
>   }(ExecutionContext.fromExecutorService(executor2)), 1.minute)
> } finally {
>   executor1.shutdown()
>   executor2.shutdown()
> }
>   }
> }
> {code}
> This code fails with the following exception:
> {code}
> Exception in thread "main" java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at 
> org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
> org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) 
> at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at 
> org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196)
> {code}






[jira] [Assigned] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment

2020-09-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32813:


Assignee: (was: Apache Spark)

> Reading parquet rdd in non columnar mode fails in multithreaded environment
> ---
>
> Key: SPARK-32813
> URL: https://issues.apache.org/jira/browse/SPARK-32813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, Scala 2.12.12
>Reporter: Vladimir Klyushnikov
>Priority: Major
>
> Reading a parquet RDD in non-columnar mode (i.e. with list fields) fails if 
> the Spark session was created in one thread and the RDD is read in another, 
> because the InheritableThreadLocal holding the active session is not 
> propagated. The code below worked perfectly in Spark 2.X but fails in Spark 3:
> {code:scala}
> import java.util.concurrent.Executors
> import org.apache.spark.sql.SparkSession
> import scala.concurrent.{Await, ExecutionContext, Future}
> import scala.concurrent.duration._
> object Main {
>   final case class Data(list: List[Int])
>   def main(args: Array[String]): Unit = {
> val executor1 = Executors.newSingleThreadExecutor()
> val executor2 = Executors.newSingleThreadExecutor()
> try {
>   val ds = Await.result(Future {
> val session = 
> SparkSession.builder().appName("test").master("local[*]").getOrCreate()
> import session.implicits._
> val path = "test.parquet"
> session.createDataset(Data(1 :: Nil) :: Nil).write.parquet(path)
> session.read.parquet(path).as[Data]
>   }(ExecutionContext.fromExecutorService(executor1)), 1.minute)
>   Await.result(Future {
> ds.rdd.collect().foreach(println(_))
>   }(ExecutionContext.fromExecutorService(executor2)), 1.minute)
> } finally {
>   executor1.shutdown()
>   executor2.shutdown()
> }
>   }
> }
> {code}
> This code fails with the following exception:
> {code}
> Exception in thread "main" java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at 
> org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
> org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) 
> at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at 
> org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196)
> {code}






[jira] [Resolved] (SPARK-32186) Development - Debugging

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32186.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29639
[https://github.com/apache/spark/pull/29639]

> Development - Debugging
> ---
>
> Key: SPARK-32186
> URL: https://issues.apache.org/jira/browse/SPARK-32186
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> 1. Python Profiler: 
> https://spark.apache.org/docs/2.3.0/api/python/_modules/pyspark/profiler.html,
>  
> {code}
> >>> sc._conf.set("spark.python.profile", "true")
> >>> rdd = sc.parallelize(range(100)).map(str)
> >>> rdd.count()
> 100
> >>> sc.show_profiles()
> {code}
> 2. Python Debugger, e.g.) 
> https://www.jetbrains.com/help/pycharm/remote-debugging-with-product.html#remote-interpreter
>  \(?\)
> 3. Monitoring Python Workers: {{top}}, `{{ls}}`, etc. \(?\)
> 4. PyCharm setup \(?\) (at https://issues.apache.org/jira/browse/SPARK-32189)






[jira] [Updated] (SPARK-32812) Run tests script for Python fails in certain environments

2020-09-07 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-32812:

Summary: Run tests script for Python fails in certain environments  (was: 
Run tests script for Python fails on local environment)

> Run tests script for Python fails in certain environments
> -
>
> Key: SPARK-32812
> URL: https://issues.apache.org/jira/browse/SPARK-32812
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Priority: Major
>
> When running the PySpark tests locally with the "python/run-tests" command, 
> the following error can occur:
>  {code}
> Traceback (most recent call last):
>  File "", line 1, in 
> ...
> raise RuntimeError('''
> RuntimeError:
>  An attempt has been made to start a new process before the
>  current process has finished its bootstrapping phase.
> This probably means that you are not using fork to start your
>  child processes and you have forgotten to use the proper idiom
>  in the main module:
> if __name__ == '__main__':
>  freeze_support()
>  ...
> The "freeze_support()" line can be omitted if the program
>  is not going to be frozen to produce an executable.
> Traceback (most recent call last):
> ...
>  raise EOFError
> EOFError
>  {code}






[jira] [Commented] (SPARK-32812) Run tests script for Python fails on local environment

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191899#comment-17191899
 ] 

Apache Spark commented on SPARK-32812:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/29666

> Run tests script for Python fails on local environment
> --
>
> Key: SPARK-32812
> URL: https://issues.apache.org/jira/browse/SPARK-32812
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Priority: Major
>
> When running the PySpark tests locally with the "python/run-tests" command, 
> the following error can occur:
>  {code}
> Traceback (most recent call last):
>  File "", line 1, in 
> ...
> raise RuntimeError('''
> RuntimeError:
>  An attempt has been made to start a new process before the
>  current process has finished its bootstrapping phase.
> This probably means that you are not using fork to start your
>  child processes and you have forgotten to use the proper idiom
>  in the main module:
> if __name__ == '__main__':
>  freeze_support()
>  ...
> The "freeze_support()" line can be omitted if the program
>  is not going to be frozen to produce an executable.
> Traceback (most recent call last):
> ...
>  raise EOFError
> EOFError
>  {code}






[jira] [Assigned] (SPARK-32812) Run tests script for Python fails on local environment

2020-09-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32812:


Assignee: (was: Apache Spark)

> Run tests script for Python fails on local environment
> --
>
> Key: SPARK-32812
> URL: https://issues.apache.org/jira/browse/SPARK-32812
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Priority: Major
>
> When running the PySpark tests locally with the "python/run-tests" command, 
> the following error can occur:
>  {code}
> Traceback (most recent call last):
>  File "", line 1, in 
> ...
> raise RuntimeError('''
> RuntimeError:
>  An attempt has been made to start a new process before the
>  current process has finished its bootstrapping phase.
> This probably means that you are not using fork to start your
>  child processes and you have forgotten to use the proper idiom
>  in the main module:
> if __name__ == '__main__':
>  freeze_support()
>  ...
> The "freeze_support()" line can be omitted if the program
>  is not going to be frozen to produce an executable.
> Traceback (most recent call last):
> ...
>  raise EOFError
> EOFError
>  {code}






[jira] [Commented] (SPARK-32812) Run tests script for Python fails on local environment

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191898#comment-17191898
 ] 

Apache Spark commented on SPARK-32812:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/29666

> Run tests script for Python fails on local environment
> --
>
> Key: SPARK-32812
> URL: https://issues.apache.org/jira/browse/SPARK-32812
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Priority: Major
>
> When running the PySpark tests locally with the "python/run-tests" command, 
> the following error can occur:
>  {code}
> Traceback (most recent call last):
>  File "", line 1, in 
> ...
> raise RuntimeError('''
> RuntimeError:
>  An attempt has been made to start a new process before the
>  current process has finished its bootstrapping phase.
> This probably means that you are not using fork to start your
>  child processes and you have forgotten to use the proper idiom
>  in the main module:
> if __name__ == '__main__':
>  freeze_support()
>  ...
> The "freeze_support()" line can be omitted if the program
>  is not going to be frozen to produce an executable.
> Traceback (most recent call last):
> ...
>  raise EOFError
> EOFError
>  {code}






[jira] [Assigned] (SPARK-32812) Run tests script for Python fails on local environment

2020-09-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32812:


Assignee: Apache Spark

> Run tests script for Python fails on local environment
> --
>
> Key: SPARK-32812
> URL: https://issues.apache.org/jira/browse/SPARK-32812
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> When running the PySpark tests locally with the "python/run-tests" command, 
> the following error can occur:
>  {code}
> Traceback (most recent call last):
>  File "", line 1, in 
> ...
> raise RuntimeError('''
> RuntimeError:
>  An attempt has been made to start a new process before the
>  current process has finished its bootstrapping phase.
> This probably means that you are not using fork to start your
>  child processes and you have forgotten to use the proper idiom
>  in the main module:
> if __name__ == '__main__':
>  freeze_support()
>  ...
> The "freeze_support()" line can be omitted if the program
>  is not going to be frozen to produce an executable.
> Traceback (most recent call last):
> ...
>  raise EOFError
> EOFError
>  {code}






[jira] [Resolved] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32810.
--
Fix Version/s: 2.4.8
   3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 29659
[https://github.com/apache/spark/pull/29659]

> CSV/JSON data sources should avoid globbing paths when inferring schema
> ---
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0, 3.0.2, 2.4.8
>
>
> The problem is that when the user doesn't specify the schema while reading a 
> CSV table, the CSV file format and data source need to infer the schema, and 
> they do so by creating a base DataSource relation. There is a mismatch: 
> *FileFormat.inferSchema* expects actual file paths without glob patterns, 
> while *DataSource.paths* expects file paths in glob patterns.
> An example is demonstrated below:
> {code:java}
> ^
> | DataSource.resolveRelation    tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> | DataSource.apply              ^
> |   CSVDataSource.inferSchema   |
> | CSVFileFormat.inferSchema     |
> |   ...                         |
> |   DataSource.resolveRelation  globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> |   DataSource.apply            ^
> | DataFrameReader.load          |
> |                               input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's 
> LibSVM data source.
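A hypothetical user-level reproduction sketch, assuming a file literally named {{[abc].csv}} and the escaped input path from the diagram above; the exact failure behaviour may vary by version:

{code:python}
# Hypothetical sketch only: a literal file whose name contains glob characters.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

os.makedirs("/tmp/globdemo", exist_ok=True)
with open("/tmp/globdemo/[abc].csv", "w") as f:
    f.write("col1\n1\n")

# Works: schema given explicitly, so no inference step re-globs the resolved path.
spark.read.schema("col1 INT").csv("/tmp/globdemo/\\[abc\\].csv").show()

# May fail in affected versions: schema inference globs the already-resolved
# literal path "[abc].csv" a second time and no longer finds the file.
spark.read.csv("/tmp/globdemo/\\[abc\\].csv", header=True).show()
{code}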






[jira] [Resolved] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32798.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29657
[https://github.com/apache/spark/pull/29657]

> Make unionByName optionally fill missing columns with nulls in PySpark
> --
>
> Key: SPARK-32798
> URL: https://issues.apache.org/jira/browse/SPARK-32798
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: starter
> Fix For: 3.1.0
>
>
> It would be better to expose the {{unionByName}} parameter in the Python API 
> as well. Currently it is only exposed in the Scala/Java APIs (added in SPARK-29358).
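A sketch of the intended Python usage once the parameter is exposed, mirroring the {{allowMissingColumns}} flag that SPARK-29358 added to the Scala API (the Python parameter name is assumed to match):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df1 = spark.createDataFrame([(1, 2)], ["a", "b"])
df2 = spark.createDataFrame([(3, 4)], ["b", "c"])

# Columns missing on either side are filled with nulls instead of failing,
# matching the behaviour of the Scala allowMissingColumns flag.
df1.unionByName(df2, allowMissingColumns=True).show()
{code}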






[jira] [Assigned] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32798:


Assignee: Haejoon Lee

> Make unionByName optionally fill missing columns with nulls in PySpark
> --
>
> Key: SPARK-32798
> URL: https://issues.apache.org/jira/browse/SPARK-32798
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: starter
>
> It would be better to expose the {{unionByName}} parameter in the Python API 
> as well. Currently it is only exposed in the Scala/Java APIs (added in SPARK-29358).






[jira] [Updated] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment

2020-09-07 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-32813:

Priority: Major  (was: Blocker)

> Reading parquet rdd in non columnar mode fails in multithreaded environment
> ---
>
> Key: SPARK-32813
> URL: https://issues.apache.org/jira/browse/SPARK-32813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, Scala 2.12.12
>Reporter: Vladimir Klyushnikov
>Priority: Major
>
> Reading a parquet RDD in non-columnar mode (i.e. with list fields) fails if 
> the Spark session was created in one thread and the RDD is read in another, 
> because the InheritableThreadLocal holding the active session is not 
> propagated. The code below worked perfectly in Spark 2.X but fails in Spark 3:
> {code:scala}
> import java.util.concurrent.Executors
> import org.apache.spark.sql.SparkSession
> import scala.concurrent.{Await, ExecutionContext, Future}
> import scala.concurrent.duration._
> object Main {
>   final case class Data(list: List[Int])
>   def main(args: Array[String]): Unit = {
> val executor1 = Executors.newSingleThreadExecutor()
> val executor2 = Executors.newSingleThreadExecutor()
> try {
>   val ds = Await.result(Future {
> val session = 
> SparkSession.builder().appName("test").master("local[*]").getOrCreate()
> import session.implicits._
> val path = "test.parquet"
> session.createDataset(Data(1 :: Nil) :: Nil).write.parquet(path)
> session.read.parquet(path).as[Data]
>   }(ExecutionContext.fromExecutorService(executor1)), 1.minute)
>   Await.result(Future {
> ds.rdd.collect().foreach(println(_))
>   }(ExecutionContext.fromExecutorService(executor2)), 1.minute)
> } finally {
>   executor1.shutdown()
>   executor2.shutdown()
> }
>   }
> }
> {code}
> This code fails with the following exception:
> {code}
> Exception in thread "main" java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at 
> org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
> org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) 
> at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at 
> org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196)
> {code}






[jira] [Commented] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191882#comment-17191882
 ] 

Apache Spark commented on SPARK-32753:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29665

> Deduplicating and repartitioning the same column create duplicate rows with 
> AQE
> ---
>
> Key: SPARK-32753
> URL: https://issues.apache.org/jira/browse/SPARK-32753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Major
> Fix For: 3.1.0
>
>
> To reproduce:
> {code:java}
> spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
> val df = spark.sql("select id from v1 group by id distribute by id") 
> println(df.collect().toArray.mkString(","))
> println(df.queryExecution.executedPlan)
> // With AQE
> [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
> AdaptiveSparkPlan(isFinalPlan=true)
> +- CustomShuffleReader local
>+- ShuffleQueryStage 0
>   +- Exchange hashpartitioning(id#183L, 10), true
>  +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
> +- Union
>:- *(1) Range (0, 10, step=1, splits=2)
>+- *(2) Range (0, 10, step=1, splits=2)
> // Without AQE
> [4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
> *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
> +- Exchange hashpartitioning(id#206L, 10), true
>+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
>   +- Union
>  :- *(1) Range (0, 10, step=1, splits=2)
>  +- *(2) Range (0, 10, step=1, splits=2){code}






[jira] [Commented] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in codes and documentation

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191876#comment-17191876
 ] 

Apache Spark commented on SPARK-32138:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/29664

> Drop Python 2, 3.4 and 3.5 in codes and documentation
> -
>
> Key: SPARK-32138
> URL: https://issues.apache.org/jira/browse/SPARK-32138
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>







[jira] [Updated] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment

2020-09-07 Thread Vladimir Klyushnikov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Klyushnikov updated SPARK-32813:
-
Description: 
Reading parquet rdd in non columnar mode (i.e. with list fields)  if Spark 
session was  created in one thread and rdd is being read in another  - so 
InheritableThreadLocal  with active session is not propagated. Code below was 
working perfectly in Spark 2.X, but fails in Spark 3  
{code:scala}
import java.util.concurrent.Executors

import org.apache.spark.sql.SparkSession

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object Main {

  final case class Data(list: List[Int])

  def main(args: Array[String]): Unit = {

val executor1 = Executors.newSingleThreadExecutor()
val executor2 = Executors.newSingleThreadExecutor()
try {
  val ds = Await.result(Future {
val session = 
SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import session.implicits._

val path = "test.parquet"
session.createDataset(Data(1 :: Nil) :: Nil).write.parquet(path)
session.read.parquet(path).as[Data]
  }(ExecutionContext.fromExecutorService(executor1)), 1.minute)

  Await.result(Future {
ds.rdd.collect().foreach(println(_))
  }(ExecutionContext.fromExecutorService(executor2)), 1.minute)

} finally {
  executor1.shutdown()
  executor2.shutdown()
}
  }
}
{code}
This code fails with following exception:
{code}
Exception in thread "main" java.util.NoSuchElementException: None.getException 
in thread "main" java.util.NoSuchElementException: None.get at 
scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at 
org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178)
 at 
org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176)
 at 
org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
 at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) 
at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at 
org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196)
{code}


  was:
Reading a Parquet RDD in non-columnar mode (i.e. with list fields) fails if the 
Spark session was created in one thread and the RDD is read in another, because 
the InheritableThreadLocal holding the active session is not propagated. The 
code below worked perfectly in Spark 2.X but fails in Spark 3.
{code:scala}
import java.util.concurrent.Executors

import org.apache.spark.sql.SparkSession

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object Main {

  final case class Data(list: List[Int])

  def main(args: Array[String]): Unit = {

val executor1 = Executors.newSingleThreadExecutor()
val executor2 = Executors.newSingleThreadExecutor()
try {
  val ds = Await.result(Future {
val session = 
SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import session.implicits._

session.createDataset(Data(1 :: Nil) :: 
Nil).write.parquet("test.parquet")
session.read.parquet("test.parquet").as[Data]
  }(ExecutionContext.fromExecutorService(executor1)), 1.minute)

  Await.result(Future {
ds.rdd.collect().foreach(println(_))
  }(ExecutionContext.fromExecutorService(executor2)), 1.minute)

} finally {
  executor1.shutdown()
  executor2.shutdown()
}
  }
}
{code}
This code fails with following exception:
{code}
Exception in thread "main" java.util.NoSuchElementException: None.getException 
in thread "main" java.util.NoSuchElementException: None.get at 
scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at 

[jira] [Commented] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191822#comment-17191822
 ] 

Apache Spark commented on SPARK-32810:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29663

> CSV/JSON data sources should avoid globbing paths when inferring schema
> ---
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The problem is that when the user doesn't specify a schema when reading a 
> CSV table, the CSV file format and data source need to infer it, and they do 
> so by creating a base DataSource relation. There is a mismatch: 
> *FileFormat.inferSchema* expects actual file paths without glob patterns, 
> whereas *DataSource.paths* expects file paths as glob patterns.
> An example is shown below:
> {code:java}
> ^
> | DataSource.resolveRelation     tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> | DataSource.apply               ^
> |   CSVDataSource.inferSchema    |
> |     CSVFileFormat.inferSchema  |
> |       ...                      |
> |   DataSource.resolveRelation   globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> |   DataSource.apply             ^
> | DataFrameReader.load           |
> |   input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's 
> LibSVM data source.
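
A minimal sketch of the scenario (not from the ticket itself), assuming a 
spark-shell style session named spark and a local filesystem that permits 
brackets in file names:

{code:scala}
// Editor's sketch: a CSV file whose literal name contains glob metacharacters.
import java.nio.file.Files

val dir = Files.createTempDirectory("csv-glob-")
Files.write(dir.resolve("[abc].csv"), "1,a\n2,b\n".getBytes("UTF-8"))

// The reader expects the glob-escaped form; schema inference may then re-glob
// the already-resolved name "[abc].csv" and fail to locate the file.
val escaped = dir.toString + "/\\[abc\\].csv"
spark.read.csv(escaped).show()
{code}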



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org






[jira] [Commented] (SPARK-32603) CREATE/REPLACE TABLE AS SELECT not support multi-part identifiers

2020-09-07 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191813#comment-17191813
 ] 

Huaxin Gao commented on SPARK-32603:


I thought the CREATE TABLE syntax would be unified for Hive and data source 
tables? Anyway, I opened this ticket to track the problems I hit when testing 
V2 JDBC. If we decide in the future not to support this syntax, I will close 
this JIRA.

> CREATE/REPLACE TABLE AS SELECT not support multi-part identifiers
> -
>
> Key: SPARK-32603
> URL: https://issues.apache.org/jira/browse/SPARK-32603
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> {code:java}
> == SQL ==
> CREATE TABLE h2.test.abc AS SELECT * FROM h2.test.people
> ^^^
> org.apache.spark.sql.catalyst.parser.ParseException: 
> Operation not allowed: CREATE TABLE ... STORED AS ... does not support 
> multi-part identifiers
> {code}
> {code:java}
> == SQL ==
> CREATE OR REPLACE TABLE h2.test.abc AS SELECT 1 as col
> ^^^
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input 'AS' expecting {'(', 'USING'}
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment

2020-09-07 Thread Vladimir Klyushnikov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Klyushnikov updated SPARK-32813:
-
Environment: Spark 3.0.0, Scala 2.12.12

> Reading parquet rdd in non columnar mode fails in multithreaded environment
> ---
>
> Key: SPARK-32813
> URL: https://issues.apache.org/jira/browse/SPARK-32813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, Scala 2.12.12
>Reporter: Vladimir Klyushnikov
>Priority: Blocker
>
> Reading a Parquet RDD in non-columnar mode (i.e. with list fields) fails if 
> the Spark session was created in one thread and the RDD is read in another, 
> because the InheritableThreadLocal holding the active session is not 
> propagated. The code below worked perfectly in Spark 2.X but fails in Spark 3.
> {code:scala}
> import java.util.concurrent.Executors
> import org.apache.spark.sql.SparkSession
> import scala.concurrent.{Await, ExecutionContext, Future}
> import scala.concurrent.duration._
> object Main {
>   final case class Data(list: List[Int])
>   def main(args: Array[String]): Unit = {
> val executor1 = Executors.newSingleThreadExecutor()
> val executor2 = Executors.newSingleThreadExecutor()
> try {
>   val ds = Await.result(Future {
> val session = 
> SparkSession.builder().appName("test").master("local[*]").getOrCreate()
> import session.implicits._
> session.createDataset(Data(1 :: Nil) :: 
> Nil).write.parquet("test.parquet")
> session.read.parquet("test.parquet").as[Data]
>   }(ExecutionContext.fromExecutorService(executor1)), 1.minute)
>   Await.result(Future {
> ds.rdd.collect().foreach(println(_))
>   }(ExecutionContext.fromExecutorService(executor2)), 1.minute)
> } finally {
>   executor1.shutdown()
>   executor2.shutdown()
> }
>   }
> }
> {code}
> This code fails with following exception:
> {code}
> Exception in thread "main" java.util.NoSuchElementException: 
> None.getException in thread "main" java.util.NoSuchElementException: None.get 
> at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at 
> org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
> org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) 
> at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at 
> org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment

2020-09-07 Thread Vladimir Klyushnikov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Klyushnikov updated SPARK-32813:
-
Description: 
Reading a Parquet RDD in non-columnar mode (i.e. with list fields) fails if the 
Spark session was created in one thread and the RDD is read in another, because 
the InheritableThreadLocal holding the active session is not propagated. The 
code below worked perfectly in Spark 2.X but fails in Spark 3.
{code:scala}
import java.util.concurrent.Executors

import org.apache.spark.sql.SparkSession

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object Main {

  final case class Data(list: List[Int])

  def main(args: Array[String]): Unit = {

val executor1 = Executors.newSingleThreadExecutor()
val executor2 = Executors.newSingleThreadExecutor()
try {
  val ds = Await.result(Future {
val session = 
SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import session.implicits._

session.createDataset(Data(1 :: Nil) :: 
Nil).write.parquet("test.parquet")
session.read.parquet("test.parquet").as[Data]
  }(ExecutionContext.fromExecutorService(executor1)), 1.minute)

  Await.result(Future {
ds.rdd.collect().foreach(println(_))
  }(ExecutionContext.fromExecutorService(executor2)), 1.minute)

} finally {
  executor1.shutdown()
  executor2.shutdown()
}
  }
}
{code}
This code fails with following exception:
{noformat}
Exception in thread "main" java.util.NoSuchElementException: None.getException 
in thread "main" java.util.NoSuchElementException: None.get at 
scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at 
org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178)
 at 
org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176)
 at 
org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
 at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) 
at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at 
org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196)
{noformat}

  was:
Reading a Parquet RDD in non-columnar mode (i.e. with list fields) fails if the 
Spark session was created in one thread and the RDD is read in another, because 
the InheritableThreadLocal holding the active session is not propagated. The 
code below worked perfectly in Spark 2.X but fails in Spark 3.
{code:scala}
import java.util.concurrent.Executors

import org.apache.spark.sql.SparkSession

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object Main {

  final case class Data(list: List[Int])

  def main(args: Array[String]): Unit = {

val executor1 = Executors.newSingleThreadExecutor()
val executor2 = Executors.newSingleThreadExecutor()
try {
  val ds = Await.result(Future {
val session = 
SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import session.implicits._

session.createDataset(Data(1 :: Nil) :: 
Nil).write.parquet("test.parquet")
session.read.parquet("test.parquet").as[Data]
  }(ExecutionContext.fromExecutorService(executor1)), 1.minute)

  Await.result(Future {
ds.rdd.collect().foreach(println(_))
  }(ExecutionContext.fromExecutorService(executor2)), 1.minute)

} finally {
  executor1.shutdown()
  executor2.shutdown()
}
  }
}
{code}
This code fails with following exception:
{noformat}
Exception in thread "main" java.util.NoSuchElementException: None.getException 
in thread "main" java.util.NoSuchElementException: None.get at 
scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at 

[jira] [Updated] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment

2020-09-07 Thread Vladimir Klyushnikov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Klyushnikov updated SPARK-32813:
-
Description: 
Reading a Parquet RDD in non-columnar mode (i.e. with list fields) fails if the 
Spark session was created in one thread and the RDD is read in another, because 
the InheritableThreadLocal holding the active session is not propagated. The 
code below worked perfectly in Spark 2.X but fails in Spark 3.
{code:scala}
import java.util.concurrent.Executors

import org.apache.spark.sql.SparkSession

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object Main {

  final case class Data(list: List[Int])

  def main(args: Array[String]): Unit = {

val executor1 = Executors.newSingleThreadExecutor()
val executor2 = Executors.newSingleThreadExecutor()
try {
  val ds = Await.result(Future {
val session = 
SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import session.implicits._

session.createDataset(Data(1 :: Nil) :: 
Nil).write.parquet("test.parquet")
session.read.parquet("test.parquet").as[Data]
  }(ExecutionContext.fromExecutorService(executor1)), 1.minute)

  Await.result(Future {
ds.rdd.collect().foreach(println(_))
  }(ExecutionContext.fromExecutorService(executor2)), 1.minute)

} finally {
  executor1.shutdown()
  executor2.shutdown()
}
  }
}
{code}
This code fails with following exception:
{code}
Exception in thread "main" java.util.NoSuchElementException: None.getException 
in thread "main" java.util.NoSuchElementException: None.get at 
scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at 
org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178)
 at 
org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176)
 at 
org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
 at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) 
at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at 
org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196)
{code}


  was:
Reading a Parquet RDD in non-columnar mode (i.e. with list fields) fails if the 
Spark session was created in one thread and the RDD is read in another, because 
the InheritableThreadLocal holding the active session is not propagated. The 
code below worked perfectly in Spark 2.X but fails in Spark 3.
{code:scala}
import java.util.concurrent.Executors

import org.apache.spark.sql.SparkSession

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object Main {

  final case class Data(list: List[Int])

  def main(args: Array[String]): Unit = {

val executor1 = Executors.newSingleThreadExecutor()
val executor2 = Executors.newSingleThreadExecutor()
try {
  val ds = Await.result(Future {
val session = 
SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import session.implicits._

session.createDataset(Data(1 :: Nil) :: 
Nil).write.parquet("test.parquet")
session.read.parquet("test.parquet").as[Data]
  }(ExecutionContext.fromExecutorService(executor1)), 1.minute)

  Await.result(Future {
ds.rdd.collect().foreach(println(_))
  }(ExecutionContext.fromExecutorService(executor2)), 1.minute)

} finally {
  executor1.shutdown()
  executor2.shutdown()
}
  }
}
{code}
This code fails with following exception:
{noformat}
Exception in thread "main" java.util.NoSuchElementException: None.getException 
in thread "main" java.util.NoSuchElementException: None.get at 
scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at 

[jira] [Created] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment

2020-09-07 Thread Vladimir Klyushnikov (Jira)
Vladimir Klyushnikov created SPARK-32813:


 Summary: Reading parquet rdd in non columnar mode fails in 
multithreaded environment
 Key: SPARK-32813
 URL: https://issues.apache.org/jira/browse/SPARK-32813
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Vladimir Klyushnikov


Reading a Parquet RDD in non-columnar mode (i.e. with list fields) fails if the 
Spark session was created in one thread and the RDD is read in another, because 
the InheritableThreadLocal holding the active session is not propagated. The 
code below worked perfectly in Spark 2.X but fails in Spark 3.
{code:scala}
import java.util.concurrent.Executors

import org.apache.spark.sql.SparkSession

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object Main {

  final case class Data(list: List[Int])

  def main(args: Array[String]): Unit = {

val executor1 = Executors.newSingleThreadExecutor()
val executor2 = Executors.newSingleThreadExecutor()
try {
  val ds = Await.result(Future {
val session = 
SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import session.implicits._

session.createDataset(Data(1 :: Nil) :: 
Nil).write.parquet("test.parquet")
session.read.parquet("test.parquet").as[Data]
  }(ExecutionContext.fromExecutorService(executor1)), 1.minute)

  Await.result(Future {
ds.rdd.collect().foreach(println(_))
  }(ExecutionContext.fromExecutorService(executor2)), 1.minute)

} finally {
  executor1.shutdown()
  executor2.shutdown()
}
  }
}
{code}
This code fails with following exception:
{noformat}
Exception in thread "main" java.util.NoSuchElementException: None.getException 
in thread "main" java.util.NoSuchElementException: None.get at 
scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at 
org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178)
 at 
org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176)
 at 
org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
 at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) 
at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at 
org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196){noformat}
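
One possible workaround (an editor's sketch, not part of the original report): 
explicitly re-register the session that owns the Dataset as the active and 
default session in the thread that materializes the RDD, reusing ds and 
executor2 from the repro above. SparkSession.setActiveSession and 
SparkSession.setDefaultSession are assumed to be the Scala API as shipped in 
Spark 3.0.

{code:scala}
// Editor's sketch of a possible workaround, reusing ds / executor2 from the repro above.
// Re-registering the session populates the thread-local consulted during plan execution.
Await.result(Future {
  SparkSession.setActiveSession(ds.sparkSession)
  SparkSession.setDefaultSession(ds.sparkSession)
  ds.rdd.collect().foreach(println(_))
}(ExecutionContext.fromExecutorService(executor2)), 1.minute)
{code}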



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE

2020-09-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32753:
---

Assignee: Manu Zhang

> Deduplicating and repartitioning the same column create duplicate rows with 
> AQE
> ---
>
> Key: SPARK-32753
> URL: https://issues.apache.org/jira/browse/SPARK-32753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Major
>
> To reproduce:
> {code:java}
> spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
> val df = spark.sql("select id from v1 group by id distribute by id") 
> println(df.collect().toArray.mkString(","))
> println(df.queryExecution.executedPlan)
> // With AQE
> [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
> AdaptiveSparkPlan(isFinalPlan=true)
> +- CustomShuffleReader local
>+- ShuffleQueryStage 0
>   +- Exchange hashpartitioning(id#183L, 10), true
>  +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
> +- Union
>:- *(1) Range (0, 10, step=1, splits=2)
>+- *(2) Range (0, 10, step=1, splits=2)
> // Without AQE
> [4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
> *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
> +- Exchange hashpartitioning(id#206L, 10), true
>+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
>   +- Union
>  :- *(1) Range (0, 10, step=1, splits=2)
>  +- *(2) Range (0, 10, step=1, splits=2){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE

2020-09-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32753.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29593
[https://github.com/apache/spark/pull/29593]

> Deduplicating and repartitioning the same column create duplicate rows with 
> AQE
> ---
>
> Key: SPARK-32753
> URL: https://issues.apache.org/jira/browse/SPARK-32753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Major
> Fix For: 3.1.0
>
>
> To reproduce:
> {code:java}
> spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
> val df = spark.sql("select id from v1 group by id distribute by id") 
> println(df.collect().toArray.mkString(","))
> println(df.queryExecution.executedPlan)
> // With AQE
> [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
> AdaptiveSparkPlan(isFinalPlan=true)
> +- CustomShuffleReader local
>+- ShuffleQueryStage 0
>   +- Exchange hashpartitioning(id#183L, 10), true
>  +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
> +- Union
>:- *(1) Range (0, 10, step=1, splits=2)
>+- *(2) Range (0, 10, step=1, splits=2)
> // Without AQE
> [4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
> *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
> +- Exchange hashpartitioning(id#206L, 10), true
>+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
>   +- Union
>  :- *(1) Range (0, 10, step=1, splits=2)
>  +- *(2) Range (0, 10, step=1, splits=2){code}
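
A quick way to confirm the behaviour (editor's sketch, not from the ticket) is 
to rerun the query from the repro above with adaptive execution toggled via the 
standard spark.sql.adaptive.enabled switch and compare row counts:

{code:scala}
// Editor's sketch: reuse the v1 temp view from the repro above and compare results
// with adaptive query execution off and on.
spark.conf.set("spark.sql.adaptive.enabled", "false")
val withoutAqe = spark.sql("select id from v1 group by id distribute by id").collect()

spark.conf.set("spark.sql.adaptive.enabled", "true")
val withAqe = spark.sql("select id from v1 group by id distribute by id").collect()

println(s"without AQE: ${withoutAqe.length} rows, with AQE: ${withAqe.length} rows")
{code}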



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191742#comment-17191742
 ] 

Apache Spark commented on SPARK-32810:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29662

> CSV/JSON data sources should avoid globbing paths when inferring schema
> ---
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The problem is that when the user doesn't specify a schema when reading a 
> CSV table, the CSV file format and data source need to infer it, and they do 
> so by creating a base DataSource relation. There is a mismatch: 
> *FileFormat.inferSchema* expects actual file paths without glob patterns, 
> whereas *DataSource.paths* expects file paths as glob patterns.
> An example is shown below:
> {code:java}
> ^
> | DataSource.resolveRelation     tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> | DataSource.apply               ^
> |   CSVDataSource.inferSchema    |
> |     CSVFileFormat.inferSchema  |
> |       ...                      |
> |   DataSource.resolveRelation   globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> |   DataSource.apply             ^
> | DataFrameReader.load           |
> |   input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's 
> LibSVM data source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191741#comment-17191741
 ] 

Apache Spark commented on SPARK-32810:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29662

> CSV/JSON data sources should avoid globbing paths when inferring schema
> ---
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The problem is that when the user doesn't specify a schema when reading a 
> CSV table, the CSV file format and data source need to infer it, and they do 
> so by creating a base DataSource relation. There is a mismatch: 
> *FileFormat.inferSchema* expects actual file paths without glob patterns, 
> whereas *DataSource.paths* expects file paths as glob patterns.
> An example is shown below:
> {code:java}
> ^
> | DataSource.resolveRelation     tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> | DataSource.apply               ^
> |   CSVDataSource.inferSchema    |
> |     CSVFileFormat.inferSchema  |
> |       ...                      |
> |   DataSource.resolveRelation   globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> |   DataSource.apply             ^
> | DataFrameReader.load           |
> |   input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's 
> LibSVM data source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32812) Run tests script for Python fails on local environment

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32812:
-
Component/s: PySpark

> Run tests script for Python fails on local environment
> --
>
> Key: SPARK-32812
> URL: https://issues.apache.org/jira/browse/SPARK-32812
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Priority: Major
>
> When running the PySpark tests in a local environment with the 
> "python/run-tests" command, the following error can occur.
>  {code}
> Traceback (most recent call last):
>  File "", line 1, in 
> ...
> raise RuntimeError('''
> RuntimeError:
>  An attempt has been made to start a new process before the
>  current process has finished its bootstrapping phase.
> This probably means that you are not using fork to start your
>  child processes and you have forgotten to use the proper idiom
>  in the main module:
> if __name__ == '__main__':
>  freeze_support()
>  ...
> The "freeze_support()" line can be omitted if the program
>  is not going to be frozen to produce an executable.
> Traceback (most recent call last):
> ...
>  raise EOFError
> EOFError
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32812) Run tests script for Python fails on local environment

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32812:
-
Component/s: (was: PySpark)
 Tests

> Run tests script for Python fails on local environment
> --
>
> Key: SPARK-32812
> URL: https://issues.apache.org/jira/browse/SPARK-32812
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Priority: Major
>
> When running the PySpark tests in a local environment with the 
> "python/run-tests" command, the following error can occur.
>  {code}
> Traceback (most recent call last):
>  File "", line 1, in 
> ...
> raise RuntimeError('''
> RuntimeError:
>  An attempt has been made to start a new process before the
>  current process has finished its bootstrapping phase.
> This probably means that you are not using fork to start your
>  child processes and you have forgotten to use the proper idiom
>  in the main module:
> if __name__ == '__main__':
>  freeze_support()
>  ...
> The "freeze_support()" line can be omitted if the program
>  is not going to be frozen to produce an executable.
> Traceback (most recent call last):
> ...
>  raise EOFError
> EOFError
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32812) Run tests script for Python fails on local environment

2020-09-07 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-32812:
---

 Summary: Run tests script for Python fails on local environment
 Key: SPARK-32812
 URL: https://issues.apache.org/jira/browse/SPARK-32812
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.0
Reporter: Haejoon Lee


When running the PySpark tests in a local environment with the 
"python/run-tests" command, the following error can occur.

 {code}

Traceback (most recent call last):
 File "", line 1, in 
...

raise RuntimeError('''
RuntimeError:
 An attempt has been made to start a new process before the
 current process has finished its bootstrapping phase.

This probably means that you are not using fork to start your
 child processes and you have forgotten to use the proper idiom
 in the main module:

if __name__ == '__main__':
 freeze_support()
 ...

The "freeze_support()" line can be omitted if the program
 is not going to be frozen to produce an executable.
Traceback (most recent call last):
...
 raise EOFError
EOFError

 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32811) Replace IN predicate of continuous range with boundary checks

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191645#comment-17191645
 ] 

Apache Spark commented on SPARK-32811:
--

User 'hotienvu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29661

> Replace IN predicate of continuous range with boundary checks
> -
>
> Key: SPARK-32811
> URL: https://issues.apache.org/jira/browse/SPARK-32811
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Vu Ho
>Priority: Major
>
> This expression 
> {code:java}
> select a from t where a in (1, 2, 3, 3, 4){code}
> should be translated to 
> {code:java}
> select a from t where a >= 1 and a <= 4 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32811) Replace IN predicate of continuous range with boundary checks

2020-09-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32811:


Assignee: (was: Apache Spark)

> Replace IN predicate of continuous range with boundary checks
> -
>
> Key: SPARK-32811
> URL: https://issues.apache.org/jira/browse/SPARK-32811
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Vu Ho
>Priority: Major
>
> This expression 
> {code:java}
> select a from t where a in (1, 2, 3, 3, 4){code}
> should be translated to 
> {code:java}
> select a from t where a >= 1 and a <= 4 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32811) Replace IN predicate of continuous range with boundary checks

2020-09-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32811:


Assignee: Apache Spark

> Replace IN predicate of continuous range with boundary checks
> -
>
> Key: SPARK-32811
> URL: https://issues.apache.org/jira/browse/SPARK-32811
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Vu Ho
>Assignee: Apache Spark
>Priority: Major
>
> This expression 
> {code:java}
> select a from t where a in (1, 2, 3, 3, 4){code}
> should be translated to 
> {code:java}
> select a from t where a >= 1 and a <= 4 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32811) Replace IN predicate of continuous range with boundary checks

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191642#comment-17191642
 ] 

Apache Spark commented on SPARK-32811:
--

User 'hotienvu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29661

> Replace IN predicate of continuous range with boundary checks
> -
>
> Key: SPARK-32811
> URL: https://issues.apache.org/jira/browse/SPARK-32811
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Vu Ho
>Priority: Major
>
> This expression 
> {code:java}
> select a from t where a in (1, 2, 3, 3, 4){code}
> should be translated to 
> {code:java}
> select a from t where a >= 1 and a <= 4 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32808) Pass all `sql/core` module UTs in Scala 2.13

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191611#comment-17191611
 ] 

Apache Spark commented on SPARK-32808:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29660

> Pass all `sql/core` module UTs in Scala 2.13
> 
>
> Key: SPARK-32808
> URL: https://issues.apache.org/jira/browse/SPARK-32808
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> There are currently 319 failed tests, based on commit 
> `f5360e761ef161f7e04526b59a4baf53f1cf8cd5`
> {code:java}
> Run completed in 1 hour, 20 minutes, 25 seconds.
> Total number of tests run: 8485
> Suites: completed 357, aborted 0
> Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0
> *** 319 TESTS FAILED ***
> {code}
>  
> There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and 
> TPCDS_XXX_PlanStabilityWithStatsSuite:
>  * TPCDSV2_7_PlanStabilitySuite(33 FAILED)
>  * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED)
>  * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED)
>  * TPCDSV1_4_PlanStabilitySuite(92 FAILED)
>  * TPCDSModifiedPlanStabilitySuite(21 FAILED)
>  * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED)
>  
> The other 26 failed cases are as follows:
>  * StreamingAggregationSuite
>  ** count distinct - state format version 1 
>  ** count distinct - state format version 2 
>  * GeneratorFunctionSuite
>  ** explode and other columns
>  ** explode_outer and other columns
>  * UDFSuite
>  ** SPARK-26308: udf with complex types of decimal
>  ** SPARK-32459: UDF should not fail on WrappedArray
>  * SQLQueryTestSuite
>  ** decimalArithmeticOperations.sql
>  ** postgreSQL/aggregates_part2.sql
>  ** ansi/decimalArithmeticOperations.sql 
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF
>  * WholeStageCodegenSuite
>  ** SPARK-26680: Stream in groupBy does not cause StackOverflowError
>  * DataFrameSuite:
>  ** explode
>  ** SPARK-28067: Aggregate sum should not return wrong results for decimal 
> overflow
>  ** Star Expansion - ds.explode should fail with a meaningful message if it 
> takes a star
>  * DataStreamReaderWriterSuite
>  ** SPARK-18510: use user specified types for partition columns in file 
> sources
>  * OrcV1QuerySuite\OrcV2QuerySuite
>  ** Simple selection form ORC table * 2
>  * ExpressionsSchemaSuite
>  ** Check schemas for expression examples
>  * DataFrameStatSuite
>  ** SPARK-28818: Respect original column nullability in `freqItems`
>  * JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite
>  ** SPARK-4228 DataFrame to JSON * 3
>  ** backward compatibility * 3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32808) Pass all `sql/core` module UTs in Scala 2.13

2020-09-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32808:


Assignee: Apache Spark

> Pass all `sql/core` module UTs in Scala 2.13
> 
>
> Key: SPARK-32808
> URL: https://issues.apache.org/jira/browse/SPARK-32808
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> There are currently 319 failed tests, based on commit 
> `f5360e761ef161f7e04526b59a4baf53f1cf8cd5`
> {code:java}
> Run completed in 1 hour, 20 minutes, 25 seconds.
> Total number of tests run: 8485
> Suites: completed 357, aborted 0
> Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0
> *** 319 TESTS FAILED ***
> {code}
>  
> There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and 
> TPCDS_XXX_PlanStabilityWithStatsSuite:
>  * TPCDSV2_7_PlanStabilitySuite(33 FAILED)
>  * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED)
>  * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED)
>  * TPCDSV1_4_PlanStabilitySuite(92 FAILED)
>  * TPCDSModifiedPlanStabilitySuite(21 FAILED)
>  * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED)
>  
> The other 26 failed cases are as follows:
>  * StreamingAggregationSuite
>  ** count distinct - state format version 1 
>  ** count distinct - state format version 2 
>  * GeneratorFunctionSuite
>  ** explode and other columns
>  ** explode_outer and other columns
>  * UDFSuite
>  ** SPARK-26308: udf with complex types of decimal
>  ** SPARK-32459: UDF should not fail on WrappedArray
>  * SQLQueryTestSuite
>  ** decimalArithmeticOperations.sql
>  ** postgreSQL/aggregates_part2.sql
>  ** ansi/decimalArithmeticOperations.sql 
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF
>  * WholeStageCodegenSuite
>  ** SPARK-26680: Stream in groupBy does not cause StackOverflowError
>  * DataFrameSuite:
>  ** explode
>  ** SPARK-28067: Aggregate sum should not return wrong results for decimal 
> overflow
>  ** Star Expansion - ds.explode should fail with a meaningful message if it 
> takes a star
>  * DataStreamReaderWriterSuite
>  ** SPARK-18510: use user specified types for partition columns in file 
> sources
>  * OrcV1QuerySuite\OrcV2QuerySuite
>  ** Simple selection form ORC table * 2
>  * ExpressionsSchemaSuite
>  ** Check schemas for expression examples
>  * DataFrameStatSuite
>  ** SPARK-28818: Respect original column nullability in `freqItems`
>  * JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite
>  ** SPARK-4228 DataFrame to JSON * 3
>  ** backward compatibility * 3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32808) Pass all `sql/core` module UTs in Scala 2.13

2020-09-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32808:


Assignee: (was: Apache Spark)

> Pass all `sql/core` module UTs in Scala 2.13
> 
>
> Key: SPARK-32808
> URL: https://issues.apache.org/jira/browse/SPARK-32808
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> There are currently 319 failed tests, based on commit 
> `f5360e761ef161f7e04526b59a4baf53f1cf8cd5`
> {code:java}
> Run completed in 1 hour, 20 minutes, 25 seconds.
> Total number of tests run: 8485
> Suites: completed 357, aborted 0
> Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0
> *** 319 TESTS FAILED ***
> {code}
>  
> There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and 
> TPCDS_XXX_PlanStabilityWithStatsSuite:
>  * TPCDSV2_7_PlanStabilitySuite(33 FAILED)
>  * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED)
>  * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED)
>  * TPCDSV1_4_PlanStabilitySuite(92 FAILED)
>  * TPCDSModifiedPlanStabilitySuite(21 FAILED)
>  * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED)
>  
> The other 26 failed cases are as follows:
>  * StreamingAggregationSuite
>  ** count distinct - state format version 1 
>  ** count distinct - state format version 2 
>  * GeneratorFunctionSuite
>  ** explode and other columns
>  ** explode_outer and other columns
>  * UDFSuite
>  ** SPARK-26308: udf with complex types of decimal
>  ** SPARK-32459: UDF should not fail on WrappedArray
>  * SQLQueryTestSuite
>  ** decimalArithmeticOperations.sql
>  ** postgreSQL/aggregates_part2.sql
>  ** ansi/decimalArithmeticOperations.sql 
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF
>  * WholeStageCodegenSuite
>  ** SPARK-26680: Stream in groupBy does not cause StackOverflowError
>  * DataFrameSuite:
>  ** explode
>  ** SPARK-28067: Aggregate sum should not return wrong results for decimal 
> overflow
>  ** Star Expansion - ds.explode should fail with a meaningful message if it 
> takes a star
>  * DataStreamReaderWriterSuite
>  ** SPARK-18510: use user specified types for partition columns in file 
> sources
>  * OrcV1QuerySuite\OrcV2QuerySuite
>  ** Simple selection form ORC table * 2
>  * ExpressionsSchemaSuite
>  ** Check schemas for expression examples
>  * DataFrameStatSuite
>  ** SPARK-28818: Respect original column nullability in `freqItems`
>  * JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite
>  ** SPARK-4228 DataFrame to JSON * 3
>  ** backward compatibility * 3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32811) Replace IN predicate of continuous range with boundary checks

2020-09-07 Thread Vu Ho (Jira)
Vu Ho created SPARK-32811:
-

 Summary: Replace IN predicate of continuous range with boundary 
checks
 Key: SPARK-32811
 URL: https://issues.apache.org/jira/browse/SPARK-32811
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Vu Ho


This expression 
{code:java}
select a from t where a in (1, 2, 3, 3, 4){code}
should be translated to 
{code:java}
select a from t where a >= 1 and a <= 4 {code}
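
The rewrite is only safe when the deduplicated literals form a gap-free range; 
a small sketch of that precondition (editor's illustration, not from the 
ticket, with a hypothetical helper name):

{code:scala}
// Editor's illustration of the precondition behind the proposed rewrite:
// after removing duplicates, the integer literals must form a contiguous range
// for min/max boundary checks to be equivalent to the IN list.
def isContiguousRange(values: Seq[Int]): Boolean = {
  val sorted = values.distinct.sorted
  sorted.nonEmpty && (sorted.last - sorted.head + 1 == sorted.size)
}

assert(isContiguousRange(Seq(1, 2, 3, 3, 4)))  // a IN (1,2,3,3,4)  =>  a >= 1 AND a <= 4
assert(!isContiguousRange(Seq(1, 2, 4)))       // gap at 3: the rewrite would change results
{code}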



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32810:


Assignee: (was: Apache Spark)

> CSV/JSON data sources should avoid globbing paths when inferring schema
> ---
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The problem is that when the user doesn't specify a schema when reading a 
> CSV table, the CSV file format and data source need to infer it, and they do 
> so by creating a base DataSource relation. There is a mismatch: 
> *FileFormat.inferSchema* expects actual file paths without glob patterns, 
> whereas *DataSource.paths* expects file paths as glob patterns.
> An example is shown below:
> {code:java}
> ^
> | DataSource.resolveRelation     tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> | DataSource.apply               ^
> |   CSVDataSource.inferSchema    |
> |     CSVFileFormat.inferSchema  |
> |       ...                      |
> |   DataSource.resolveRelation   globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> |   DataSource.apply             ^
> | DataFrameReader.load           |
> |   input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's 
> LibSVM data source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191578#comment-17191578
 ] 

Apache Spark commented on SPARK-32810:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29659

> CSV/JSON data sources should avoid globbing paths when inferring schema
> ---
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The problem is that when the user doesn't specify a schema when reading a 
> CSV table, the CSV file format and data source need to infer it, and they do 
> so by creating a base DataSource relation. There is a mismatch: 
> *FileFormat.inferSchema* expects actual file paths without glob patterns, 
> whereas *DataSource.paths* expects file paths as glob patterns.
> An example is shown below:
> {code:java}
> ^
> | DataSource.resolveRelation     tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> | DataSource.apply               ^
> |   CSVDataSource.inferSchema    |
> |     CSVFileFormat.inferSchema  |
> |       ...                      |
> |   DataSource.resolveRelation   globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> |   DataSource.apply             ^
> | DataFrameReader.load           |
> |   input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's 
> LibSVM data source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32810:


Assignee: Apache Spark

> CSV/JSON data sources should avoid globbing paths when inferring schema
> ---
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The problem is that when the user doesn't specify a schema when reading a 
> CSV table, the CSV file format and data source need to infer it, and they do 
> so by creating a base DataSource relation. There is a mismatch: 
> *FileFormat.inferSchema* expects actual file paths without glob patterns, 
> whereas *DataSource.paths* expects file paths as glob patterns.
> An example is shown below:
> {code:java}
> ^
> | DataSource.resolveRelation     tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> | DataSource.apply               ^
> |   CSVDataSource.inferSchema    |
> |     CSVFileFormat.inferSchema  |
> |       ...                      |
> |   DataSource.resolveRelation   globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> |   DataSource.apply             ^
> | DataFrameReader.load           |
> |   input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's 
> LibSVM data source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32810:


Assignee: Apache Spark

> CSV/JSON data sources should avoid globbing paths when inferring schema
> ---
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The problem is that when the user doesn't specify a schema when reading a 
> CSV table, the CSV file format and data source need to infer it, and they do 
> so by creating a base DataSource relation. There is a mismatch: 
> *FileFormat.inferSchema* expects actual file paths without glob patterns, 
> whereas *DataSource.paths* expects file paths as glob patterns.
> An example is shown below:
> {code:java}
> ^
> | DataSource.resolveRelationtries to glob again (incorrectly) on 
> glob pattern """[abc].csv"""
> | DataSource.apply  ^
> |   CSVDataSource.inferSchema   |
> | CSVFileFormat.inferSchema |
> |   ... |
> |   DataSource.resolveRelation  globbed into """[abc].csv""", should 
> be treated as verbatim path, not as glob pattern
> |   DataSource.apply^
> | DataFrameReader.load  |
> |   input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's 
> LibSVM data source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32785) interval with dangling part should not result in null

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191513#comment-17191513
 ] 

Apache Spark commented on SPARK-32785:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29658

> interval with dangling part should not result in null
> ---
>
> Key: SPARK-32785
> URL: https://issues.apache.org/jira/browse/SPARK-32785
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.1.0
>
>
> bin/spark-sql -S -e "select interval '1', interval '+', interval '1 day -'"
> NULL    NULL    NULL
> We should fail these cases with a proper error instead of silently returning NULL.
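A quick way to see the reported behaviour from the Scala shell, as a hedged sketch (the literals are taken from the ticket; the outcome in the comments restates the ticket's report and is not re-verified here):

{code:scala}
// Dangling interval parts, as reported in this ticket:
spark.sql("select interval '1', interval '+', interval '1 day -'").show()
// Reported result today: a single row of NULLs.
// Expected after the fix: a parse/analysis error for the malformed literals.
{code}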



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32785) interval with dangling part should not result in null

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191511#comment-17191511
 ] 

Apache Spark commented on SPARK-32785:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29658

> interval with dangling part should not result in null
> ---
>
> Key: SPARK-32785
> URL: https://issues.apache.org/jira/browse/SPARK-32785
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.1.0
>
>
> bin/spark-sql -S -e "select interval '1', interval '+', interval '1 day -'"
> NULL    NULL    NULL
> We should fail these cases with a proper error instead of silently returning NULL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-07 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191510#comment-17191510
 ] 

Maxim Gekk commented on SPARK-32810:


I am working on this.

> CSV/JSON data sources should avoid globbing paths when inferring schema
> ---
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The problem is that when the user doesn't specify the schema when reading a 
> CSV table, the CSV file format and data source need to infer the schema, and 
> they do so by creating a base DataSource relation. There is a mismatch: 
> *FileFormat.inferSchema* expects actual file paths without glob patterns, but 
> *DataSource.paths* expects file paths in glob patterns.
>  An example is demonstrated below:
> {code:java}
> ^
> | DataSource.resolveRelation    tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> | DataSource.apply              ^
> |   CSVDataSource.inferSchema   |
> | CSVFileFormat.inferSchema     |
> |   ...                         |
> |   DataSource.resolveRelation  globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> |   DataSource.apply            ^
> | DataFrameReader.load          |
> |   input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's 
> LibSVM data source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-07 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-32810:
--

 Summary: CSV/JSON data sources should avoid globbing paths when 
inferring schema
 Key: SPARK-32810
 URL: https://issues.apache.org/jira/browse/SPARK-32810
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


The problem is that when the user doesn't specify the schema when reading a CSV 
table, the CSV file format and data source need to infer the schema, and they do 
so by creating a base DataSource relation. There is a mismatch: 
*FileFormat.inferSchema* expects actual file paths without glob patterns, but 
*DataSource.paths* expects file paths in glob patterns.
 An example is demonstrated below:
{code:java}
^
| DataSource.resolveRelation    tries to glob again (incorrectly) on glob pattern """[abc].csv"""
| DataSource.apply              ^
|   CSVDataSource.inferSchema   |
| CSVFileFormat.inferSchema     |
|   ...                         |
|   DataSource.resolveRelation  globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
|   DataSource.apply            ^
| DataFrameReader.load          |
|   input """\[abc\].csv"""
{code}
The same problem exists in the JSON data source as well. Ditto for MLlib's 
LibSVM data source.
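For illustration, a minimal Scala sketch of the kind of path that triggers the mismatch. The /tmp location and the DataFrame contents are invented for the example; the read is expected to misbehave exactly as the diagram above describes:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("glob-sketch").getOrCreate()
import spark.implicits._

// Write a CSV under a directory whose *literal* name contains glob metacharacters.
Seq((1, "a"), (2, "b")).toDF("id", "v").write.mode("overwrite").csv("/tmp/[abc].csv")

// The user escapes the glob characters, matching the diagram's input """\[abc\].csv""" ...
val df = spark.read.csv("/tmp/\\[abc\\].csv")

// ... DataFrameReader.load resolves this to the literal path "[abc].csv", but schema
// inference builds another DataSource with the already-resolved path, which is then
// (incorrectly) globbed a second time and no longer matches the file.
df.printSchema()
{code}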



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark

2020-09-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32798:


Assignee: Apache Spark

> Make unionByName optionally fill missing columns with nulls in PySpark
> --
>
> Key: SPARK-32798
> URL: https://issues.apache.org/jira/browse/SPARK-32798
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>  Labels: starter
>
> It would be better to expose {{unionByName}} parameter in Python APIs as 
> well. Currently this is only exposed in Scala/Java APIs (at SPARK-29358)
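For reference, a sketch of the Scala side added by SPARK-29358 that this ticket proposes to mirror in PySpark; the column names are invented for the example, and the Python call shown in the comment is the proposal, not a released API:

{code:scala}
import spark.implicits._

val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((3, 4)).toDF("a", "c")

// Scala/Java API (SPARK-29358): columns missing on either side are null-filled.
val merged = df1.unionByName(df2, allowMissingColumns = true)
merged.show()
// Columns: a, b, c -- rows: (1, 2, null) and (3, null, 4).

// Proposed PySpark counterpart: df1.unionByName(df2, allowMissingColumns=True)
{code}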



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark

2020-09-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191507#comment-17191507
 ] 

Apache Spark commented on SPARK-32798:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/29657

> Make unionByName optionally fill missing columns with nulls in PySpark
> --
>
> Key: SPARK-32798
> URL: https://issues.apache.org/jira/browse/SPARK-32798
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: starter
>
> It would be better to expose {{unionByName}} parameter in Python APIs as 
> well. Currently this is only exposed in Scala/Java APIs (at SPARK-29358)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark

2020-09-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32798:


Assignee: (was: Apache Spark)

> Make unionByName optionally fill missing columns with nulls in PySpark
> --
>
> Key: SPARK-32798
> URL: https://issues.apache.org/jira/browse/SPARK-32798
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: starter
>
> It would be better to expose {{unionByName}} parameter in Python APIs as 
> well. Currently this is only exposed in Scala/Java APIs (at SPARK-29358)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-32809) RDD different partitions cause different results

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-32809:
--

> RDD different partitions cause different results 
> --
>
> Key: SPARK-32809
> URL: https://issues.apache.org/jira/browse/SPARK-32809
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark2.2.0 ,scala 2.11.8 , hadoop-client2.6.0
>Reporter: zhangchenglong
>Priority: Major
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> {code}
> class Exec3 {
>   private val exec: SparkConf = new 
> SparkConf().setMaster("local[1]").setAppName("exec3")
>   private val context = new SparkContext(exec)
>   context.setCheckpointDir("checkPoint")
>  
>   /**
>* get total number by key 
>* in this project desired results are ("apple",25) ("huwei",20)
>* but in fact i get ("apple",150) ("huawei",20)
>*   when i change it to local[3] the result is correct
>*  i want to know   which cause it and how to slove it 
>*/
>   @Test
>   def testError(): Unit ={
> val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), 
> ("huawei", 20)))
> rdd.aggregateByKey(1.0)(
>   seqOp = (zero, price) => price * zero,
>   combOp = (curr, agg) => curr + agg).collect().foreach(println(_))
> context.stop()
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32809) RDD different partitions cause different results

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32809.
--
Resolution: Invalid

> RDD different partitions cause different results 
> --
>
> Key: SPARK-32809
> URL: https://issues.apache.org/jira/browse/SPARK-32809
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark2.2.0 ,scala 2.11.8 , hadoop-client2.6.0
>Reporter: zhangchenglong
>Priority: Major
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> {code}
> class Exec3 {
>   private val exec: SparkConf = new 
> SparkConf().setMaster("local[1]").setAppName("exec3")
>   private val context = new SparkContext(exec)
>   context.setCheckpointDir("checkPoint")
>  
>   /**
>* get total number by key 
>* in this project desired results are ("apple",25) ("huwei",20)
>* but in fact i get ("apple",150) ("huawei",20)
>*   when i change it to local[3] the result is correct
>*  i want to know   which cause it and how to slove it 
>*/
>   @Test
>   def testError(): Unit ={
> val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), 
> ("huawei", 20)))
> rdd.aggregateByKey(1.0)(
>   seqOp = (zero, price) => price * zero,
>   combOp = (curr, agg) => curr + agg).collect().foreach(println(_))
> context.stop()
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32809) RDD different partitions cause different results

2020-09-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191485#comment-17191485
 ] 

Hyukjin Kwon commented on SPARK-32809:
--

The result is correct. With {{local[1]}} there is a single partition, so 
{{seqOp}} is applied to both {{("apple", 10)}} and {{("apple", 15)}} within that 
partition. With {{local[3]}} there are three partitions, and {{combOp}} is used 
to combine the per-partition results.
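A small Scala sketch of that explanation, reusing the {{rdd}} from the ticket (illustrative only):

{code:scala}
// zeroValue = 1.0, seqOp = (acc, price) => price * acc, combOp = _ + _
//
// One partition (local[1]):
//   apple -> seqOp(seqOp(1.0, 10), 15) = (1.0 * 10) * 15 = 150.0
// Records landing in separate partitions (e.g. local[3]):
//   apple -> combOp(1.0 * 10, 1.0 * 15) = 10.0 + 15.0 = 25.0
//
// A partition-independent sum needs an additive zero and the same operation
// on both sides:
val summed = rdd.aggregateByKey(0.0)(
  seqOp = (acc, price) => acc + price,
  combOp = (a, b) => a + b)
summed.collect().foreach(println)   // ("apple", 25.0), ("huawei", 20.0)

// or simply: rdd.mapValues(_.toDouble).reduceByKey(_ + _)
{code}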

> RDD different partitions cause different results 
> --
>
> Key: SPARK-32809
> URL: https://issues.apache.org/jira/browse/SPARK-32809
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark2.2.0 ,scala 2.11.8 , hadoop-client2.6.0
>Reporter: zhangchenglong
>Priority: Major
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> {code}
> class Exec3 {
>   private val exec: SparkConf = new 
> SparkConf().setMaster("local[1]").setAppName("exec3")
>   private val context = new SparkContext(exec)
>   context.setCheckpointDir("checkPoint")
>  
>   /**
>* get total number by key 
>* in this project desired results are ("apple",25) ("huwei",20)
>* but in fact i get ("apple",150) ("huawei",20)
>*   when i change it to local[3] the result is correct
>*  i want to know   which cause it and how to slove it 
>*/
>   @Test
>   def testError(): Unit ={
> val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), 
> ("huawei", 20)))
> rdd.aggregateByKey(1.0)(
>   seqOp = (zero, price) => price * zero,
>   combOp = (curr, agg) => curr + agg).collect().foreach(println(_))
> context.stop()
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32809) RDD different partitions cause different results

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32809:
-
Description: 
{code}
class Exec3 {
  private val exec: SparkConf = new 
SparkConf().setMaster("local[1]").setAppName("exec3")
  private val context = new SparkContext(exec)
  context.setCheckpointDir("checkPoint")
 
  /**
   * get total number by key 
   * in this project desired results are ("apple",25) ("huwei",20)
   * but in fact i get ("apple",150) ("huawei",20)
   *   when i change it to local[3] the result is correct
   *  i want to know   which cause it and how to slove it 
   */
  @Test
  def testError(): Unit ={
val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", 
20)))
rdd.aggregateByKey(1.0)(
  seqOp = (zero, price) => price * zero,
  combOp = (curr, agg) => curr + agg).collect().foreach(println(_))
context.stop()
  }
}
{code}


  was:
class Exec3 {

private val exec: SparkConf = new 
SparkConf().setMaster("local[1]").setAppName("exec3")
 private val context = new SparkContext(exec)
 context.setCheckpointDir("checkPoint")
 
 /**
 * get total number by key 
 * in this project desired results are ("apple",25) ("huwei",20)
 * but in fact i get ("apple",150) ("huawei",20)
 *   when i change it to local[3] the result is correct
*  i want to know   which cause it and how to slove it 
 */
 @Test
 def testError(): Unit ={
 val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", 
20)))
 rdd.aggregateByKey(1.0)(
 seqOp = (zero, price) => price * zero,
 combOp = (curr, agg) => curr + agg
 ).collect().foreach(println(_))
 context.stop()
 }
 }


> RDD different partitions cause different results 
> --
>
> Key: SPARK-32809
> URL: https://issues.apache.org/jira/browse/SPARK-32809
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark2.2.0 ,scala 2.11.8 , hadoop-client2.6.0
>Reporter: zhangchenglong
>Priority: Major
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> {code}
> class Exec3 {
>   private val exec: SparkConf = new 
> SparkConf().setMaster("local[1]").setAppName("exec3")
>   private val context = new SparkContext(exec)
>   context.setCheckpointDir("checkPoint")
>  
>   /**
>* get total number by key 
>* in this project desired results are ("apple",25) ("huwei",20)
>* but in fact i get ("apple",150) ("huawei",20)
>*   when i change it to local[3] the result is correct
>*  i want to know   which cause it and how to slove it 
>*/
>   @Test
>   def testError(): Unit ={
> val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), 
> ("huawei", 20)))
> rdd.aggregateByKey(1.0)(
>   seqOp = (zero, price) => price * zero,
>   combOp = (curr, agg) => curr + agg).collect().foreach(println(_))
> context.stop()
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32808) Pass all `sql/core` module UTs in Scala 2.13

2020-09-07 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-32808:
-
Description: 
Now there are  319 TESTS FAILED based on commit 
`f5360e761ef161f7e04526b59a4baf53f1cf8cd5`
{code:java}
Run completed in 1 hour, 20 minutes, 25 seconds.
Total number of tests run: 8485
Suites: completed 357, aborted 0
Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0
*** 319 TESTS FAILED ***
{code}
 

There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and 
TPCDS_XXX_PlanStabilityWithStatsSuite:
 * TPCDSV2_7_PlanStabilitySuite(33 FAILED)
 * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED)
 * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED)
 * TPCDSV1_4_PlanStabilitySuite(92 FAILED)
 * TPCDSModifiedPlanStabilitySuite(21 FAILED)
 * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED)

 

Other 26 FAILED cases are as follows:
 * StreamingAggregationSuite
 ** count distinct - state format version 1 
 ** count distinct - state format version 2 
 * GeneratorFunctionSuite
 ** explode and other columns
 ** explode_outer and other columns
 * UDFSuite
 ** SPARK-26308: udf with complex types of decimal
 ** SPARK-32459: UDF should not fail on WrappedArray
 * SQLQueryTestSuite
 ** decimalArithmeticOperations.sql
 ** postgreSQL/aggregates_part2.sql
 ** ansi/decimalArithmeticOperations.sql 
 ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF
 ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF
 * WholeStageCodegenSuite
 ** SPARK-26680: Stream in groupBy does not cause StackOverflowError
 * DataFrameSuite:
 ** explode
 ** SPARK-28067: Aggregate sum should not return wrong results for decimal 
overflow
 ** Star Expansion - ds.explode should fail with a meaningful message if it 
takes a star
 * DataStreamReaderWriterSuite
 ** SPARK-18510: use user specified types for partition columns in file sources
 * OrcV1QuerySuite\OrcV2QuerySuite
 ** Simple selection form ORC table * 2
 * ExpressionsSchemaSuite
 ** Check schemas for expression examples
 * DataFrameStatSuite
 ** SPARK-28818: Respect original column nullability in `freqItems`
 * JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite
 ** SPARK-4228 DataFrame to JSON * 3
 ** backward compatibility * 3

  was:
Now there are  319 TESTS FAILED based on commit 
`f5360e761ef161f7e04526b59a4baf53f1cf8cd5`
{code:java}
Run completed in 1 hour, 20 minutes, 25 seconds.
Total number of tests run: 8485
Suites: completed 357, aborted 0
Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0
*** 319 TESTS FAILED ***
{code}


> Pass all `sql/core` module UTs in Scala 2.13
> 
>
> Key: SPARK-32808
> URL: https://issues.apache.org/jira/browse/SPARK-32808
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> Now there are  319 TESTS FAILED based on commit 
> `f5360e761ef161f7e04526b59a4baf53f1cf8cd5`
> {code:java}
> Run completed in 1 hour, 20 minutes, 25 seconds.
> Total number of tests run: 8485
> Suites: completed 357, aborted 0
> Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0
> *** 319 TESTS FAILED ***
> {code}
>  
> There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and 
> TPCDS_XXX_PlanStabilityWithStatsSuite:
>  * TPCDSV2_7_PlanStabilitySuite(33 FAILED)
>  * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED)
>  * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED)
>  * TPCDSV1_4_PlanStabilitySuite(92 FAILED)
>  * TPCDSModifiedPlanStabilitySuite(21 FAILED)
>  * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED)
>  
> Other 26 FAILED cases are as follows:
>  * StreamingAggregationSuite
>  ** count distinct - state format version 1 
>  ** count distinct - state format version 2 
>  * GeneratorFunctionSuite
>  ** explode and other columns
>  ** explode_outer and other columns
>  * UDFSuite
>  ** SPARK-26308: udf with complex types of decimal
>  ** SPARK-32459: UDF should not fail on WrappedArray
>  * SQLQueryTestSuite
>  ** decimalArithmeticOperations.sql
>  ** postgreSQL/aggregates_part2.sql
>  ** ansi/decimalArithmeticOperations.sql 
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF
>  * WholeStageCodegenSuite
>  ** SPARK-26680: Stream in groupBy does not cause StackOverflowError
>  * DataFrameSuite:
>  ** explode
>  ** SPARK-28067: Aggregate sum should not return wrong results for decimal 
> overflow
>  ** Star Expansion - ds.explode should fail with a meaningful message if it 
> takes a star
>  * DataStreamReaderWriterSuite
>  ** SPARK-18510: use user specified types for partition columns in file 
> sources
>  * OrcV1QuerySuite\OrcV2QuerySuite
>  ** Simple selection form ORC table * 2
>  * ExpressionsSchemaSuite
>  ** Check schemas for expression examples

[jira] [Updated] (SPARK-32809) RDD different partitions cause different results

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32809:
-
Priority: Major  (was: Blocker)

> RDD different partitions cause different results 
> --
>
> Key: SPARK-32809
> URL: https://issues.apache.org/jira/browse/SPARK-32809
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark2.2.0 ,scala 2.11.8 , hadoop-client2.6.0
>Reporter: zhangchenglong
>Priority: Major
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> class Exec3 {
> private val exec: SparkConf = new 
> SparkConf().setMaster("local[1]").setAppName("exec3")
>  private val context = new SparkContext(exec)
>  context.setCheckpointDir("checkPoint")
>  
>  /**
>  * get total number by key 
>  * in this project desired results are ("apple",25) ("huwei",20)
>  * but in fact i get ("apple",150) ("huawei",20)
>  *   when i change it to local[3] the result is correct
> *  i want to know   which cause it and how to slove it 
>  */
>  @Test
>  def testError(): Unit ={
>  val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", 
> 20)))
>  rdd.aggregateByKey(1.0)(
>  seqOp = (zero, price) => price * zero,
>  combOp = (curr, agg) => curr + agg
>  ).collect().foreach(println(_))
>  context.stop()
>  }
>  }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32809) RDD different partitions cause different results

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32809:
-
Issue Type: Bug  (was: Wish)

> RDD different partitions cause different results 
> --
>
> Key: SPARK-32809
> URL: https://issues.apache.org/jira/browse/SPARK-32809
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark2.2.0 ,scala 2.11.8 , hadoop-client2.6.0
>Reporter: zhangchenglong
>Priority: Major
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> class Exec3 {
> private val exec: SparkConf = new 
> SparkConf().setMaster("local[1]").setAppName("exec3")
>  private val context = new SparkContext(exec)
>  context.setCheckpointDir("checkPoint")
>  
>  /**
>  * get total number by key 
>  * in this project desired results are ("apple",25) ("huwei",20)
>  * but in fact i get ("apple",150) ("huawei",20)
>  *   when i change it to local[3] the result is correct
> *  i want to know   which cause it and how to slove it 
>  */
>  @Test
>  def testError(): Unit ={
>  val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", 
> 20)))
>  rdd.aggregateByKey(1.0)(
>  seqOp = (zero, price) => price * zero,
>  combOp = (curr, agg) => curr + agg
>  ).collect().foreach(println(_))
>  context.stop()
>  }
>  }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32809) RDD different partitions cause different results

2020-09-07 Thread zhangchenglong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangchenglong updated SPARK-32809:
---
 Flags: Important
   Environment: spark2.2.0 ,scala 2.11.8 , hadoop-client2.6.0
Issue Type: Wish  (was: Bug)
  Priority: Blocker  (was: Major)
   Summary: RDD different partitions cause different results   
(was: RDD分区数对于计算结果的影响)
Remaining Estimate: 12h
 Original Estimate: 12h

the desired result is  ("apple",25) ("huawei",20)
but  i get ("apple",150) ("huawei",20)

> RDD different partitions cause different results 
> --
>
> Key: SPARK-32809
> URL: https://issues.apache.org/jira/browse/SPARK-32809
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark2.2.0 ,scala 2.11.8 , hadoop-client2.6.0
>Reporter: zhangchenglong
>Priority: Blocker
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> class Exec3 {
> private val exec: SparkConf = new 
> SparkConf().setMaster("local[1]").setAppName("exec3")
>  private val context = new SparkContext(exec)
>  context.setCheckpointDir("checkPoint")
>  
>  /**
>  * get total number by key 
>  * in this project desired results are ("apple",25) ("huwei",20)
>  * but in fact i get ("apple",150) ("huawei",20)
>  *   when i change it to local[3] the result is correct
> *  i want to know   which cause it and how to slove it 
>  */
>  @Test
>  def testError(): Unit ={
>  val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", 
> 20)))
>  rdd.aggregateByKey(1.0)(
>  seqOp = (zero, price) => price * zero,
>  combOp = (curr, agg) => curr + agg
>  ).collect().foreach(println(_))
>  context.stop()
>  }
>  }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32809) RDD分区数对于计算结果的影响

2020-09-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191475#comment-17191475
 ] 

Hyukjin Kwon commented on SPARK-32809:
--

Please also fix the title. 
What are the expected and actual results?
Also, Spark 2.2 is EOL and no longer has maintenance releases. It would be 
great if this can still be reproduced in Spark 2.4+.

> RDD分区数对于计算结果的影响
> ---
>
> Key: SPARK-32809
> URL: https://issues.apache.org/jira/browse/SPARK-32809
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: zhangchenglong
>Priority: Major
>
> class Exec3 {
> private val exec: SparkConf = new 
> SparkConf().setMaster("local[1]").setAppName("exec3")
>  private val context = new SparkContext(exec)
>  context.setCheckpointDir("checkPoint")
>  
>  /**
>  * get total number by key 
>  * in this project desired results are ("apple",25) ("huwei",20)
>  * but in fact i get ("apple",150) ("huawei",20)
>  *   when i change it to local[3] the result is correct
> *  i want to know   which cause it and how to slove it 
>  */
>  @Test
>  def testError(): Unit ={
>  val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", 
> 20)))
>  rdd.aggregateByKey(1.0)(
>  seqOp = (zero, price) => price * zero,
>  combOp = (curr, agg) => curr + agg
>  ).collect().foreach(println(_))
>  context.stop()
>  }
>  }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32809) RDD分区数对于计算结果的影响

2020-09-07 Thread zhangchenglong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangchenglong updated SPARK-32809:
---
Description: 
class Exec3 {

private val exec: SparkConf = new 
SparkConf().setMaster("local[1]").setAppName("exec3")
 private val context = new SparkContext(exec)
 context.setCheckpointDir("checkPoint")
 
 /**
 * get total number by key 
 * in this project desired results are ("apple",25) ("huwei",20)
 * but in fact i get ("apple",150) ("huawei",20)
 *   when i change it to local[3] the result is correct
*  i want to know   which cause it and how to slove it 
 */
 @Test
 def testError(): Unit ={
 val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", 
20)))
 rdd.aggregateByKey(1.0)(
 seqOp = (zero, price) => price * zero,
 combOp = (curr, agg) => curr + agg
 ).collect().foreach(println(_))
 context.stop()
 }
 }

  was:
class Exec3 {

private val exec: SparkConf = new 
SparkConf().setMaster("local[1]").setAppName("exec3")
 private val context = new SparkContext(exec)
 context.setCheckpointDir("checkPoint")
 
 /**
 * get total number by key 
 * in this project desired results are ("苹果",25) ("华为",20)
 * but in fact i get ("苹果",150) ("华为",20)
 *   when i change it to local[3] the result is correct
*  i want to know   which cause it and how to slove it 
 */
 @Test
 def testError(): Unit ={
 val rdd = context.parallelize(Seq(("苹果", 10), ("苹果", 15), ("华为", 20)))
 rdd.aggregateByKey(1.0)(
 seqOp = (zero, price) => price * zero,
 combOp = (curr, agg) => curr + agg
 ).collect().foreach(println(_))
 context.stop()
 }
 }


> RDD分区数对于计算结果的影响
> ---
>
> Key: SPARK-32809
> URL: https://issues.apache.org/jira/browse/SPARK-32809
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: zhangchenglong
>Priority: Major
>
> class Exec3 {
> private val exec: SparkConf = new 
> SparkConf().setMaster("local[1]").setAppName("exec3")
>  private val context = new SparkContext(exec)
>  context.setCheckpointDir("checkPoint")
>  
>  /**
>  * get total number by key 
>  * in this project desired results are ("apple",25) ("huwei",20)
>  * but in fact i get ("apple",150) ("huawei",20)
>  *   when i change it to local[3] the result is correct
> *  i want to know   which cause it and how to slove it 
>  */
>  @Test
>  def testError(): Unit ={
>  val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", 
> 20)))
>  rdd.aggregateByKey(1.0)(
>  seqOp = (zero, price) => price * zero,
>  combOp = (curr, agg) => curr + agg
>  ).collect().foreach(println(_))
>  context.stop()
>  }
>  }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec

2020-09-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32748:
---

Assignee: Zhenhua Wang

> Support local property propagation in SubqueryBroadcastExec
> ---
>
> Key: SPARK-32748
> URL: https://issues.apache.org/jira/browse/SPARK-32748
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
>
> Since SPARK-22590, local property propagation is supported through 
> `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and 
> `SubqueryExec` when computing `relationFuture`.
> The propagation is missed in `SubqueryBroadcastExec`.
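Roughly what local property propagation means here, as a hedged sketch; the property key below is made up, and the capture call inside SubqueryBroadcastExec is what the fix adds rather than something quoted from the patch:

{code:scala}
// Local properties are per-thread settings on the driver:
spark.sparkContext.setLocalProperty("my.custom.tag", "query-42")   // made-up key

// BroadcastExchangeExec and SubqueryExec compute relationFuture on a separate
// thread pool and wrap that work in SQLExecution.withThreadLocalCaptured so the
// driver thread's local properties (and the active SparkSession) are visible
// there. Without the same capture, code running under SubqueryBroadcastExec's
// relationFuture would read null for such properties.
{code}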



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec

2020-09-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32748.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29589
[https://github.com/apache/spark/pull/29589]

> Support local property propagation in SubqueryBroadcastExec
> ---
>
> Key: SPARK-32748
> URL: https://issues.apache.org/jira/browse/SPARK-32748
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Since SPARK-22590, local property propagation is supported through 
> `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and 
> `SubqueryExec` when computing `relationFuture`.
> The propagation is missed in `SubqueryBroadcastExec`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32778) Accidental Data Deletion on calling saveAsTable

2020-09-07 Thread Aman Rastogi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191472#comment-17191472
 ] 

Aman Rastogi commented on SPARK-32778:
--

[~hyukjin.kwon] Sure, will reproduce with 2.4

> Accidental Data Deletion on calling saveAsTable
> ---
>
> Key: SPARK-32778
> URL: https://issues.apache.org/jira/browse/SPARK-32778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Aman Rastogi
>Priority: Major
>
> {code:java}
> df.write.option("path", 
> "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table)
> {code}
> The code above deleted the data present in the path "/already/existing/path". 
> This happened because the table was not present in the Hive metastore even 
> though the path contained data. When the table is not present in the Hive 
> metastore, the SaveMode is internally changed to SaveMode.Overwrite regardless 
> of what the user provided, which leads to the data deletion. This change was 
> introduced as part of https://issues.apache.org/jira/browse/SPARK-19583. 
> Now suppose the user is not using an external Hive metastore (the metastore is 
> tied to a cluster), and the cluster goes down or the user has to migrate to a 
> new cluster for some reason. Once the user tries to save data with the code 
> above on the new cluster, it will first delete the existing data. That could 
> be production data, and the user is completely unaware of the deletion because 
> they specified SaveMode.Append or ErrorIfExists. This is accidental data 
> deletion.
>  
> Repro Steps:
>  
>  # Save data through a hive table as mentioned in above code
>  # create another cluster and save data in new table in new cluster by giving 
> same path
>  
> Proposed Fix:
> Instead of modifying SaveMode to Overwrite, we should modify it to 
> ErrorIfExists in class CreateDataSourceTableAsSelectCommand.
> Change (line 154)
>  
> {code:java}
> val result = saveDataIntoTable(
>  sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = 
> false)
>  
> {code}
> to
>  
> {code:java}
> val result = saveDataIntoTable(
>  sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, 
> tableExists = false){code}
> This should not break CTAS. Even in case of CTAS, user may not want to delete 
> data if already exists as it could be accidental.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32809) RDD分区数对于计算结果的影响

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32809.
--
Resolution: Incomplete

Please write in English, which is what the community uses to communicate.

> RDD分区数对于计算结果的影响
> ---
>
> Key: SPARK-32809
> URL: https://issues.apache.org/jira/browse/SPARK-32809
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: zhangchenglong
>Priority: Major
>
> class Exec3 {
> private val exec: SparkConf = new 
> SparkConf().setMaster("local[1]").setAppName("exec3")
>  private val context = new SparkContext(exec)
>  context.setCheckpointDir("checkPoint")
>  
>  /**
>  * get total number by key 
>  * in this project desired results are ("苹果",25) ("华为",20)
>  * but in fact i get ("苹果",150) ("华为",20)
>  *   when i change it to local[3] the result is correct
> *  i want to know   which cause it and how to slove it 
>  */
>  @Test
>  def testError(): Unit ={
>  val rdd = context.parallelize(Seq(("苹果", 10), ("苹果", 15), ("华为", 20)))
>  rdd.aggregateByKey(1.0)(
>  seqOp = (zero, price) => price * zero,
>  combOp = (curr, agg) => curr + agg
>  ).collect().foreach(println(_))
>  context.stop()
>  }
>  }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32778) Accidental Data Deletion on calling saveAsTable

2020-09-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191468#comment-17191468
 ] 

Hyukjin Kwon commented on SPARK-32778:
--

Spark 2.3 is EOL too. Can you reproduce this in Spark 2.4 or 3.0?

> Accidental Data Deletion on calling saveAsTable
> ---
>
> Key: SPARK-32778
> URL: https://issues.apache.org/jira/browse/SPARK-32778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Aman Rastogi
>Priority: Major
>
> {code:java}
> df.write.option("path", 
> "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table)
> {code}
> The code above deleted the data present in the path "/already/existing/path". 
> This happened because the table was not present in the Hive metastore even 
> though the path contained data. When the table is not present in the Hive 
> metastore, the SaveMode is internally changed to SaveMode.Overwrite regardless 
> of what the user provided, which leads to the data deletion. This change was 
> introduced as part of https://issues.apache.org/jira/browse/SPARK-19583. 
> Now suppose the user is not using an external Hive metastore (the metastore is 
> tied to a cluster), and the cluster goes down or the user has to migrate to a 
> new cluster for some reason. Once the user tries to save data with the code 
> above on the new cluster, it will first delete the existing data. That could 
> be production data, and the user is completely unaware of the deletion because 
> they specified SaveMode.Append or ErrorIfExists. This is accidental data 
> deletion.
>  
> Repro Steps:
>  
>  # Save data through a hive table as mentioned in above code
>  # create another cluster and save data in new table in new cluster by giving 
> same path
>  
> Proposed Fix:
> Instead of modifying SaveMode to Overwrite, we should modify it to 
> ErrorIfExists in class CreateDataSourceTableAsSelectCommand.
> Change (line 154)
>  
> {code:java}
> val result = saveDataIntoTable(
>  sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = 
> false)
>  
> {code}
> to
>  
> {code:java}
> val result = saveDataIntoTable(
>  sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, 
> tableExists = false){code}
> This should not break CTAS. Even in case of CTAS, user may not want to delete 
> data if already exists as it could be accidental.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32807) Spark ThriftServer multisession mode set DB use direct API

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32807:
-
Issue Type: Improvement  (was: Bug)

> Spark ThriftServer multisession mode set DB use direct API
> --
>
> Key: SPARK-32807
> URL: https://issues.apache.org/jira/browse/SPARK-32807
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> When the Spark Thrift Server runs in multi-session mode and an initial default 
> database is configured, each session runs `use db` to set it; under high 
> concurrency this is slow.
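A rough sketch of the idea in the title, i.e. setting the database through the catalog API instead of executing a SQL statement; the database name is invented, and whether this is the exact call site the change targets is an assumption:

{code:scala}
// Going through the SQL path parses and executes a command per session:
spark.sql("USE init_db")

// The direct-API alternative skips the SQL layer and sets the current
// database on the session catalog:
spark.catalog.setCurrentDatabase("init_db")
{code}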



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32807) Spark ThriftServer multisession mode set DB use direct API

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32807:
-
Priority: Trivial  (was: Major)

> Spark ThriftServer multisession mode set DB use direct API
> --
>
> Key: SPARK-32807
> URL: https://issues.apache.org/jira/browse/SPARK-32807
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Trivial
>
> When the Spark Thrift Server runs in multi-session mode and an initial default 
> database is configured, each session runs `use db` to set it; under high 
> concurrency this is slow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32779.
--
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29649
[https://github.com/apache/spark/pull/29649]

> Spark/Hive3 interaction potentially causes deadlock
> ---
>
> Key: SPARK-32779
> URL: https://issues.apache.org/jira/browse/SPARK-32779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Assignee: Sandeep Katta
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> This is an issue for applications that share a Spark Session across multiple 
> threads.
> sessionCatalog.loadPartition (after checking that the table exists) grabs 
> locks in this order:
>  - HiveExternalCatalog
>  - HiveSessionCatalog (in Shim_v3_0)
> Other operations (e.g., sessionCatalog.tableExists), grab locks in this order:
>  - HiveSessionCatalog
>  - HiveExternalCatalog
> [This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332]
>  appears to be the culprit. Maybe db name should be defaulted _before_ the 
> call to HiveClient so that Shim_v3_0 doesn't have to call back into 
> SessionCatalog. Or possibly this is not needed at all, since loadPartition in 
> Shim_v2_1 doesn't worry about the default db name, but that might be because 
> of differences between Hive client libraries.
> Reproduction case:
>  - You need to have a running Hive 3.x HMS instance and the appropriate 
> hive-site.xml for your Spark instance
>  - Adjust your spark.sql.hive.metastore.version accordingly
>  - It might take more than one try to hit the deadlock
> Launch Spark:
> {noformat}
> bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" 
> --conf spark.sql.hive.metastore.version=3.1
> {noformat}
> Then use the following code:
> {noformat}
> import scala.collection.mutable.ArrayBuffer
> import scala.util.Random
> val tableCount = 4
> for (i <- 0 until tableCount) {
>   val tableName = s"partitioned${i+1}"
>   sql(s"drop table if exists $tableName")
>   sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored 
> as orc")
> }
> val threads = new ArrayBuffer[Thread]
> for (i <- 0 until tableCount) {
>   threads.append(new Thread( new Runnable {
> override def run: Unit = {
>   val tableName = s"partitioned${i + 1}"
>   val rand = Random
>   val df = spark.range(0, 2).toDF("a")
>   val location = s"/tmp/${rand.nextLong.abs}"
>   df.write.mode("overwrite").orc(location)
>   sql(
> s"""
> LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition 
> (b=$i)""")
> }
>   }, s"worker$i"))
>   threads(i).start()
> }
> for (i <- 0 until tableCount) {
>   println(s"Joining with thread $i")
>   threads(i).join()
> }
> println("All done")
> {noformat}
> The job often gets stuck after one or two "Joining..." lines.
> {{kill -3}} shows something like this:
> {noformat}
> Found one Java-level deadlock:
> =
> "worker3":
>   waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a 
> org.apache.spark.sql.hive.HiveSessionCatalog),
>   which is held by "worker0"
> "worker0":
>   waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a 
> org.apache.spark.sql.hive.HiveExternalCatalog),
>   which is held by "worker3"
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32779:


Assignee: Sandeep Katta

> Spark/Hive3 interaction potentially causes deadlock
> ---
>
> Key: SPARK-32779
> URL: https://issues.apache.org/jira/browse/SPARK-32779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Assignee: Sandeep Katta
>Priority: Major
>
> This is an issue for applications that share a Spark Session across multiple 
> threads.
> sessionCatalog.loadPartition (after checking that the table exists) grabs 
> locks in this order:
>  - HiveExternalCatalog
>  - HiveSessionCatalog (in Shim_v3_0)
> Other operations (e.g., sessionCatalog.tableExists), grab locks in this order:
>  - HiveSessionCatalog
>  - HiveExternalCatalog
> [This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332]
>  appears to be the culprit. Maybe db name should be defaulted _before_ the 
> call to HiveClient so that Shim_v3_0 doesn't have to call back into 
> SessionCatalog. Or possibly this is not needed at all, since loadPartition in 
> Shim_v2_1 doesn't worry about the default db name, but that might be because 
> of differences between Hive client libraries.
> Reproduction case:
>  - You need to have a running Hive 3.x HMS instance and the appropriate 
> hive-site.xml for your Spark instance
>  - Adjust your spark.sql.hive.metastore.version accordingly
>  - It might take more than one try to hit the deadlock
> Launch Spark:
> {noformat}
> bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" 
> --conf spark.sql.hive.metastore.version=3.1
> {noformat}
> Then use the following code:
> {noformat}
> import scala.collection.mutable.ArrayBuffer
> import scala.util.Random
> val tableCount = 4
> for (i <- 0 until tableCount) {
>   val tableName = s"partitioned${i+1}"
>   sql(s"drop table if exists $tableName")
>   sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored 
> as orc")
> }
> val threads = new ArrayBuffer[Thread]
> for (i <- 0 until tableCount) {
>   threads.append(new Thread( new Runnable {
> override def run: Unit = {
>   val tableName = s"partitioned${i + 1}"
>   val rand = Random
>   val df = spark.range(0, 2).toDF("a")
>   val location = s"/tmp/${rand.nextLong.abs}"
>   df.write.mode("overwrite").orc(location)
>   sql(
> s"""
> LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition 
> (b=$i)""")
> }
>   }, s"worker$i"))
>   threads(i).start()
> }
> for (i <- 0 until tableCount) {
>   println(s"Joining with thread $i")
>   threads(i).join()
> }
> println("All done")
> {noformat}
> The job often gets stuck after one or two "Joining..." lines.
> {{kill -3}} shows something like this:
> {noformat}
> Found one Java-level deadlock:
> =
> "worker3":
>   waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a 
> org.apache.spark.sql.hive.HiveSessionCatalog),
>   which is held by "worker0"
> "worker0":
>   waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a 
> org.apache.spark.sql.hive.HiveExternalCatalog),
>   which is held by "worker3"
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock

2020-09-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32779:
-
Fix Version/s: (was: 3.0.1)
   3.0.2

> Spark/Hive3 interaction potentially causes deadlock
> ---
>
> Key: SPARK-32779
> URL: https://issues.apache.org/jira/browse/SPARK-32779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Assignee: Sandeep Katta
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> This is an issue for applications that share a Spark Session across multiple 
> threads.
> sessionCatalog.loadPartition (after checking that the table exists) grabs 
> locks in this order:
>  - HiveExternalCatalog
>  - HiveSessionCatalog (in Shim_v3_0)
> Other operations (e.g., sessionCatalog.tableExists), grab locks in this order:
>  - HiveSessionCatalog
>  - HiveExternalCatalog
> [This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332]
>  appears to be the culprit. Maybe db name should be defaulted _before_ the 
> call to HiveClient so that Shim_v3_0 doesn't have to call back into 
> SessionCatalog. Or possibly this is not needed at all, since loadPartition in 
> Shim_v2_1 doesn't worry about the default db name, but that might be because 
> of differences between Hive client libraries.
> Reproduction case:
>  - You need to have a running Hive 3.x HMS instance and the appropriate 
> hive-site.xml for your Spark instance
>  - Adjust your spark.sql.hive.metastore.version accordingly
>  - It might take more than one try to hit the deadlock
> Launch Spark:
> {noformat}
> bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" 
> --conf spark.sql.hive.metastore.version=3.1
> {noformat}
> Then use the following code:
> {noformat}
> import scala.collection.mutable.ArrayBuffer
> import scala.util.Random
> val tableCount = 4
> for (i <- 0 until tableCount) {
>   val tableName = s"partitioned${i+1}"
>   sql(s"drop table if exists $tableName")
>   sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored 
> as orc")
> }
> val threads = new ArrayBuffer[Thread]
> for (i <- 0 until tableCount) {
>   threads.append(new Thread( new Runnable {
> override def run: Unit = {
>   val tableName = s"partitioned${i + 1}"
>   val rand = Random
>   val df = spark.range(0, 2).toDF("a")
>   val location = s"/tmp/${rand.nextLong.abs}"
>   df.write.mode("overwrite").orc(location)
>   sql(
> s"""
> LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition 
> (b=$i)""")
> }
>   }, s"worker$i"))
>   threads(i).start()
> }
> for (i <- 0 until tableCount) {
>   println(s"Joining with thread $i")
>   threads(i).join()
> }
> println("All done")
> {noformat}
> The job often gets stuck after one or two "Joining..." lines.
> {{kill -3}} shows something like this:
> {noformat}
> Found one Java-level deadlock:
> =
> "worker3":
>   waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a 
> org.apache.spark.sql.hive.HiveSessionCatalog),
>   which is held by "worker0"
> "worker0":
>   waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a 
> org.apache.spark.sql.hive.HiveExternalCatalog),
>   which is held by "worker3"
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32809) RDD分区数对于计算结果的影响

2020-09-07 Thread zhangchenglong (Jira)
zhangchenglong created SPARK-32809:
--

 Summary: RDD分区数对于计算结果的影响
 Key: SPARK-32809
 URL: https://issues.apache.org/jira/browse/SPARK-32809
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: zhangchenglong


class Exec3 {

private val exec: SparkConf = new 
SparkConf().setMaster("local[1]").setAppName("exec3")
 private val context = new SparkContext(exec)
 context.setCheckpointDir("checkPoint")
 
 /**
 * get total number by key 
 * in this project desired results are ("苹果",25) ("华为",20)
 * but in fact i get ("苹果",150) ("华为",20)
 *   when i change it to local[3] the result is correct
*  i want to know   which cause it and how to slove it 
 */
 @Test
 def testError(): Unit ={
 val rdd = context.parallelize(Seq(("苹果", 10), ("苹果", 15), ("华为", 20)))
 rdd.aggregateByKey(1.0)(
 seqOp = (zero, price) => price * zero,
 combOp = (curr, agg) => curr + agg
 ).collect().foreach(println(_))
 context.stop()
 }
 }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32793) Expose assert_true in Python/Scala APIs and add error message parameter

2020-09-07 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191457#comment-17191457
 ] 

Takeshi Yamamuro commented on SPARK-32793:
--

Are you not adding it to SQL as well? I think BigQuery has a similar feature: 
[https://cloud.google.com/bigquery/docs/reference/standard-sql/debugging-statements]

> Expose assert_true in Python/Scala APIs and add error message parameter
> ---
>
> Key: SPARK-32793
> URL: https://issues.apache.org/jira/browse/SPARK-32793
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karen Feng
>Priority: Minor
>
> # Add RAISEERROR() (or RAISE_ERROR()) to the API
>  # Add Scala/Python/R version of API for ASSERT_TRUE()
>  # Add an extra parameter to ASSERT_TRUE() as (cond, message), and for which 
> the `message` parameter is only lazily evaluated when the condition is not 
> true
>  # Change the implementation of ASSERT_TRUE() to be rewritten during 
> optimization to IF() instead.
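For context, a sketch of how assert_true is reachable today versus the proposed typed API; the Scala/Python signature in the comment is the ticket's proposal, not a released function:

{code:scala}
// Today the function is available through SQL (and expr()):
spark.sql("SELECT assert_true(1 < 2)").show()      // passes, yields NULL
// spark.sql("SELECT assert_true(1 > 2)").show()   // would fail at runtime

// Proposed here: a typed assert_true(cond, message) in the Scala/Python APIs,
// with the message argument evaluated lazily only when the condition is false,
// plus a raise_error() function.
{code}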



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32677) Load function resource before create

2020-09-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32677:
---

Assignee: ulysses you

> Load function resource before create
> 
>
> Key: SPARK-32677
> URL: https://issues.apache.org/jira/browse/SPARK-32677
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
>
> Change the `CreateFunctionCommand` code so that it checks the function's class 
> (loading its resources) before creating the function and adding it to the 
> function registry.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32677) Load function resource before create

2020-09-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32677.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29502
[https://github.com/apache/spark/pull/29502]

> Load function resource before create
> 
>
> Key: SPARK-32677
> URL: https://issues.apache.org/jira/browse/SPARK-32677
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.1.0
>
>
> Change the `CreateFunctionCommand` code so that it checks the function's class 
> (loading its resources) before creating the function and adding it to the 
> function registry.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org