[jira] [Commented] (SPARK-26819) ArrayIndexOutOfBoundsException while loading a CSV to a Dataset with dependencies spark-core_2.12 and spark-sql_2.12 (with spark-core_2.11 and spark-sql_2.11 : working fine)

2019-02-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759617#comment-16759617
 ] 

Hyukjin Kwon commented on SPARK-26819:
--

It works fine in Spark. Can you clarify what

{quote}
 if Spark 2.4.0 is used with the dependencies spark-core_2.12 and 
spark-sql_2.12, but works fine with spark-core_2.11 and spark-sql_2.11.
{quote}

means?
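
For reference, a minimal sketch (in Scala, not the reporter's attached Java code, and with a hypothetical schema) of the kind of CSV-to-typed-Dataset read the description talks about:

{code:java}
// Minimal sketch only: read a CSV file into a typed Dataset.
// The file name comes from the attachment list; the case class below is hypothetical.
import org.apache.spark.sql.SparkSession

case class CompteResultat(commune: String, montant: Double)

object CsvReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val ds = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("comptes-communes-Entr'Allier.csv")
      .as[CompteResultat]

    ds.show()
    spark.stop()
  }
}
{code}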

> ArrayIndexOutOfBoundsException while loading a CSV to a Dataset with 
> dependencies spark-core_2.12 and spark-sql_2.12 (with spark-core_2.11 and 
> spark-sql_2.11 : working fine)
> -
>
> Key: SPARK-26819
> URL: https://issues.apache.org/jira/browse/SPARK-26819
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
> Environment: Java 8, Windows 7.
>Reporter: M. Le Bihan
>Priority: Major
> Attachments: CompteResultatCSV.java, ComptesResultatsIT.java, 
> comptes-communes-Entr'Allier.csv
>
>
> A simple CSV read into a Dataset fails if Spark 2.4.0 is used with the 
> dependencies spark-core_2.12 and spark-sql_2.12, but works fine with 
> spark-core_2.11 and spark-sql_2.11.
>  
> With _2.12, I encounter this stacktrace:
>  
> {{java.lang.ArrayIndexOutOfBoundsException: 10582}}
> {{ at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563)}}
> {{ at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338)}}
> {{ at com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103)}}
> {{ at com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90)}}
> {{ at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:44)}}
> {{ at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:58)}}
> {{ at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:58)}}
> {{ at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:240)}}
> {{ at scala.collection.Iterator.foreach(Iterator.scala:937)}}
> {{ at scala.collection.Iterator.foreach$(Iterator.scala:937)}}
> {{ at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)}}
> {{ at scala.collection.IterableLike.foreach(IterableLike.scala:70)}}
> {{ at scala.collection.IterableLike.foreach$(IterableLike.scala:69)}}
> {{ at scala.collection.AbstractIterable.foreach(Iterable.scala:54)}}
> {{ at scala.collection.TraversableLike.flatMap(TraversableLike.scala:240)}}
> {{ at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:237)}}
> {{ at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)}}
> {{ at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:58)}}
> {{ at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:176)}}
> {{ at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233)}}
> {{ at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:32)}}
> {{ at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:29)}}
> {{ at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:194)}}
> {{ at scala.collection.TraversableLike.map(TraversableLike.scala:233)}}
> {{ at scala.collection.TraversableLike.map$(TraversableLike.scala:226)}}
> {{ at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:194)}}
> {{ at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:170)}}
> {{ at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:169)}}
> {{ at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:240)}}
> {{ at scala.collection.immutable.List.foreach(List.scala:388)}}
> {{ at scala.collection.TraversableLike.flatMap(TraversableLike.scala:240)}}
> {{ at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:237)}}
> {{ at scala.collection.immutable.List.flatMap(List.scala:351)}}
> {{ at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:169)}}
> {{ at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:21)}}
> {{ at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:29)}}
> {{ at 

[jira] [Assigned] (SPARK-26603) Update minikube backend in K8s integration tests

2019-02-03 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-26603:


Assignee: Stavros Kontopoulos

> Update minikube backend in K8s integration tests
> 
>
> Key: SPARK-26603
> URL: https://issues.apache.org/jira/browse/SPARK-26603
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Major
>
> The output of the minikube status command has changed 
> ([https://github.com/kubernetes/minikube/commit/cb3624dd089e7ab0c03fbfb81f20c2bde43a60f3#diff-bd0534bbb0703b4170d467d074373788])
>  in releases newer than 0.30.
> Old output:
> {quote}minikube status
>  There is a newer version of minikube available (v0.31.0). Download it here:
>  [https://github.com/kubernetes/minikube/releases/tag/v0.31.0]
> To disable this notification, run the following:
>  minikube config set WantUpdateNotification false
>  minikube: 
>  cluster: 
>  kubectl: 
> {quote}
> new output:
> {quote}minikube status
>  host: Running
>  kubelet: Running
>  apiserver: Running
>  kubectl: Correctly Configured: pointing to minikube-vm at 172.31.34.77
> {quote}
> That means users with the latest version of minikube will not be able to run 
> the integration tests.
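
As an illustration only (this is a sketch, not the actual integration-test helper), the status check could tolerate both output formats. Note that the values of the old format are stripped in the quote above, so the assumption that it reported e.g. "minikube: Running" is mine:

{code:java}
// Sketch: decide whether minikube is up from `minikube status` output,
// accepting both the old ("minikube: ...") and the new (>= v0.31, "host: ...") formats.
object MinikubeStatusSketch {
  def isRunning(statusOutput: String): Boolean = {
    val lines = statusOutput.split("\n").map(_.trim)
    def value(key: String): Option[String] =
      lines.find(_.startsWith(key + ":")).map(_.stripPrefix(key + ":").trim)
    // Newer minikube reports a "host" line, older versions report "minikube".
    value("host").orElse(value("minikube")).contains("Running")
  }

  def main(args: Array[String]): Unit = {
    val newOutput =
      """host: Running
        |kubelet: Running
        |apiserver: Running
        |kubectl: Correctly Configured: pointing to minikube-vm at 172.31.34.77""".stripMargin
    println(isRunning(newOutput)) // true
  }
}
{code}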



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26603) Update minikube backend in K8s integration tests

2019-02-03 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-26603.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

> Update minikube backend in K8s integration tests
> 
>
> Key: SPARK-26603
> URL: https://issues.apache.org/jira/browse/SPARK-26603
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> The output of the minikube status command has changed 
> ([https://github.com/kubernetes/minikube/commit/cb3624dd089e7ab0c03fbfb81f20c2bde43a60f3#diff-bd0534bbb0703b4170d467d074373788])
>  in releases newer than 0.30.
> Old output:
> {quote}minikube status
>  There is a newer version of minikube available (v0.31.0). Download it here:
>  [https://github.com/kubernetes/minikube/releases/tag/v0.31.0]
> To disable this notification, run the following:
>  minikube config set WantUpdateNotification false
>  minikube: 
>  cluster: 
>  kubectl: 
> {quote}
> new output:
> {quote}minikube status
>  host: Running
>  kubelet: Running
>  apiserver: Running
>  kubectl: Correctly Configured: pointing to minikube-vm at 172.31.34.77
> {quote}
> That means users with the latest version of minikube will not be able to run 
> the integration tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26810) Fixing SPARK-25072 broke existing code and fails to show error message

2019-02-03 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26810.
--
Resolution: Duplicate

SPARK-23299 (I think)

> Fixing SPARK-25072 broke existing code and fails to show error message
> --
>
> Key: SPARK-26810
> URL: https://issues.apache.org/jira/browse/SPARK-26810
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Arttu Voutilainen
>Priority: Minor
>
> Hey,
> We upgraded Spark recently, and 
> https://issues.apache.org/jira/browse/SPARK-25072 caused our pipeline to fail 
> after the upgrade. Annoyingly, the error message formatting also threw an 
> exception itself, thus hiding the message we should have seen.
> Repro using gettyimages/docker-spark, on 2.4.0:
> {code}
> from pyspark.sql import Row
> r = Row(['a','b'])
> r('1', '2')
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__
> "but got %s" % (self, len(self), args))
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__
> return "<Row(%s)>" % ", ".join(self)
> TypeError: sequence item 0: expected str instance, list found
> {code}
> On 2.3.1, and also showing how this was used:
> {code}
> from pyspark.sql import Row, types as T
> r = Row(['a','b'])
> df = spark.createDataFrame([Row(col='doesntmatter')])
> rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')])
> spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()), 
> T.StructField('b', T.StringType())])).collect()
> {code}
> {code}
> [Row(a='a1', b='b2'), Row(a='a1', b='b2')]
> {code}
> While I do think the code we had was quite horrible, it used to work. The 
> unexpected error came from __repr__, as it assumes that the arguments given to 
> the Row constructor are strings. That sounds like a reasonable assumption; 
> should the Row constructor maybe validate that it holds? (I guess that might be 
> another potentially breaking change though, if someone has code as weird as 
> this one...)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26810) Fixing SPARK-25072 broke existing code and fails to show error message

2019-02-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759571#comment-16759571
 ] 

Hyukjin Kwon commented on SPARK-26810:
--

I'm leaving this as a duplicate of SPARK-23299. Thanks for reporting this with 
detailed info.

> Fixing SPARK-25072 broke existing code and fails to show error message
> --
>
> Key: SPARK-26810
> URL: https://issues.apache.org/jira/browse/SPARK-26810
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Arttu Voutilainen
>Priority: Minor
>
> Hey,
> We upgraded Spark recently, and 
> https://issues.apache.org/jira/browse/SPARK-25072 caused our pipeline to fail 
> after the upgrade. Annoyingly, the error message formatting also threw an 
> exception itself, thus hiding the message we should have seen.
> Repro using gettyimages/docker-spark, on 2.4.0:
> {code}
> from pyspark.sql import Row
> r = Row(['a','b'])
> r('1', '2')
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__
> "but got %s" % (self, len(self), args))
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__
> return "<Row(%s)>" % ", ".join(self)
> TypeError: sequence item 0: expected str instance, list found
> {code}
> On 2.3.1, and also showing how this was used:
> {code}
> from pyspark.sql import Row, types as T
> r = Row(['a','b'])
> df = spark.createDataFrame([Row(col='doesntmatter')])
> rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')])
> spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()), 
> T.StructField('b', T.StringType())])).collect()
> {code}
> {code}
> [Row(a='a1', b='b2'), Row(a='a1', b='b2')]
> {code}
> While I do think the code we had was quite horrible, it used to work. The 
> unexpected error came from __repr__, as it assumes that the arguments given to 
> the Row constructor are strings. That sounds like a reasonable assumption; 
> should the Row constructor maybe validate that it holds? (I guess that might be 
> another potentially breaking change though, if someone has code as weird as 
> this one...)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-26810) Fixing SPARK-25072 broke existing code and fails to show error message

2019-02-03 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26810:
-
Comment: was deleted

(was: SPARK-23299 (I think))

> Fixing SPARK-25072 broke existing code and fails to show error message
> --
>
> Key: SPARK-26810
> URL: https://issues.apache.org/jira/browse/SPARK-26810
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Arttu Voutilainen
>Priority: Minor
>
> Hey,
> We upgraded Spark recently, and 
> https://issues.apache.org/jira/browse/SPARK-25072 caused our pipeline to fail 
> after the upgrade. Annoyingly, the error message formatting also threw an 
> exception itself, thus hiding the message we should have seen.
> Repro using gettyimages/docker-spark, on 2.4.0:
> {code}
> from pyspark.sql import Row
> r = Row(['a','b'])
> r('1', '2')
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__
> "but got %s" % (self, len(self), args))
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__
> return "<Row(%s)>" % ", ".join(self)
> TypeError: sequence item 0: expected str instance, list found
> {code}
> On 2.3.1, and also showing how this was used:
> {code}
> from pyspark.sql import Row, types as T
> r = Row(['a','b'])
> df = spark.createDataFrame([Row(col='doesntmatter')])
> rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')])
> spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()), 
> T.StructField('b', T.StringType())])).collect()
> {code}
> {code}
> [Row(a='a1', b='b2'), Row(a='a1', b='b2')]
> {code}
> While I do think the code we had was quite horrible, it used to work. The 
> unexpected error came from __repr__, as it assumes that the arguments given to 
> the Row constructor are strings. That sounds like a reasonable assumption; 
> should the Row constructor maybe validate that it holds? (I guess that might be 
> another potentially breaking change though, if someone has code as weird as 
> this one...)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26810) Fixing SPARK-25072 broke existing code and fails to show error message

2019-02-03 Thread Arttu Voutilainen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759549#comment-16759549
 ] 

Arttu Voutilainen commented on SPARK-26810:
---

Yup, no worries. Now that you understand the case, I'll leave it up to you to 
either close this as a duplicate of SPARK-23299 (I think that's the only thing 
here that should be fixed some day), or keep it open if you want to discuss 
the SPARK-25072 fix or something.

> Fixing SPARK-25072 broke existing code and fails to show error message
> --
>
> Key: SPARK-26810
> URL: https://issues.apache.org/jira/browse/SPARK-26810
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Arttu Voutilainen
>Priority: Minor
>
> Hey,
> We upgraded Spark recently, and 
> https://issues.apache.org/jira/browse/SPARK-25072 caused our pipeline to fail 
> after the upgrade. Annoyingly, the error message formatting also threw an 
> exception itself, thus hiding the message we should have seen.
> Repro using gettyimages/docker-spark, on 2.4.0:
> {code}
> from pyspark.sql import Row
> r = Row(['a','b'])
> r('1', '2')
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__
> "but got %s" % (self, len(self), args))
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__
> return "<Row(%s)>" % ", ".join(self)
> TypeError: sequence item 0: expected str instance, list found
> {code}
> On 2.3.1, and also showing how this was used:
> {code}
> from pyspark.sql import Row, types as T
> r = Row(['a','b'])
> df = spark.createDataFrame([Row(col='doesntmatter')])
> rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')])
> spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()), 
> T.StructField('b', T.StringType())])).collect()
> {code}
> {code}
> [Row(a='a1', b='b2'), Row(a='a1', b='b2')]
> {code}
> While I do think the code we had was quite horrible, it used to work. The 
> unexpected error came from __repr__, as it assumes that the arguments given to 
> the Row constructor are strings. That sounds like a reasonable assumption; 
> should the Row constructor maybe validate that it holds? (I guess that might be 
> another potentially breaking change though, if someone has code as weird as 
> this one...)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26572) Join on distinct column with monotonically_increasing_id produces wrong output

2019-02-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26572:


Assignee: Apache Spark

> Join on distinct column with monotonically_increasing_id produces wrong output
> --
>
> Key: SPARK-26572
> URL: https://issues.apache.org/jira/browse/SPARK-26572
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.4.0
> Environment: Running on Ubuntu 18.04LTS and Intellij 2018.2.5
>Reporter: Sören Reichardt
>Assignee: Apache Spark
>Priority: Major
>
> When a table that has a projected monotonically_increasing_id column (added 
> after calling distinct) is joined with another table, the operators are not 
> executed in the right order. 
> Here is a minimal example:
> {code:java}
> import org.apache.spark.sql.{DataFrame, SparkSession, functions}
> object JoinBug extends App {
>   // Spark session setup
>   val session =  SparkSession.builder().master("local[*]").getOrCreate()
>   import session.sqlContext.implicits._
>   session.sparkContext.setLogLevel("error")
>   // Bug in Spark: "monotonically_increasing_id" is pushed down when it 
> shouldn't be. Push down only happens when the
>   // DF containing the "monotonically_increasing_id" expression is on the 
> left side of the join.
>   val baseTable = Seq((1), (1)).toDF("idx")
>   val distinctWithId = baseTable.distinct.withColumn("id", 
> functions.monotonically_increasing_id())
>   val monotonicallyOnRight: DataFrame = baseTable.join(distinctWithId, "idx")
>   val monotonicallyOnLeft: DataFrame = distinctWithId.join(baseTable, "idx")
>   monotonicallyOnLeft.show // Wrong
>   monotonicallyOnRight.show // Ok in Spark 2.2.2 - also wrong in Spark 2.4.0
> }
> {code}
> It produces the following output:
> {code:java}
> Wrong:
> +---+------------+
> |idx|          id|
> +---+------------+
> |  1|369367187456|
> |  1|369367187457|
> +---+------------+
> Right:
> +---+------------+
> |idx|          id|
> +---+------------+
> |  1|369367187456|
> |  1|369367187456|
> +---+------------+
> {code}
> We assume that the join operator triggers a pushdown of the expression 
> (monotonically_increasing_id in this case), so that it is executed before 
> distinct. This produces non-distinct rows with unique ids. However, it seems 
> this behavior only appears when the table with the projected expression is on 
> the left side of the join in Spark 2.2.2 (in version 2.4.0 it fails on both 
> joins).
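
Not a fix, just a sketch of one possible mitigation while the pushdown behavior is investigated: materialize the DataFrame that carries the non-deterministic column before joining, so the id is assigned once on the already-distinct rows. This is an untested suggestion, not something taken from the ticket:

{code:java}
// Sketch of a possible (unverified) mitigation: cache and materialize the
// id-carrying DataFrame before the join, so monotonically_increasing_id is
// evaluated once, on the already-distinct rows.
import org.apache.spark.sql.{SparkSession, functions}

object JoinBugWorkaroundSketch extends App {
  val session = SparkSession.builder().master("local[*]").getOrCreate()
  import session.sqlContext.implicits._

  val baseTable = Seq((1), (1)).toDF("idx")

  val distinctWithId = baseTable.distinct
    .withColumn("id", functions.monotonically_increasing_id())
    .cache()
  distinctWithId.count() // force evaluation so later plans reuse the cached rows

  distinctWithId.join(baseTable, "idx").show()
  baseTable.join(distinctWithId, "idx").show()
}
{code}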



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26572) Join on distinct column with monotonically_increasing_id produces wrong output

2019-02-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26572:


Assignee: (was: Apache Spark)

> Join on distinct column with monotonically_increasing_id produces wrong output
> --
>
> Key: SPARK-26572
> URL: https://issues.apache.org/jira/browse/SPARK-26572
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.4.0
> Environment: Running on Ubuntu 18.04LTS and Intellij 2018.2.5
>Reporter: Sören Reichardt
>Priority: Major
>
> When a table that has a projected monotonically_increasing_id column (added 
> after calling distinct) is joined with another table, the operators are not 
> executed in the right order. 
> Here is a minimal example:
> {code:java}
> import org.apache.spark.sql.{DataFrame, SparkSession, functions}
> object JoinBug extends App {
>   // Spark session setup
>   val session =  SparkSession.builder().master("local[*]").getOrCreate()
>   import session.sqlContext.implicits._
>   session.sparkContext.setLogLevel("error")
>   // Bug in Spark: "monotonically_increasing_id" is pushed down when it 
> shouldn't be. Push down only happens when the
>   // DF containing the "monotonically_increasing_id" expression is on the 
> left side of the join.
>   val baseTable = Seq((1), (1)).toDF("idx")
>   val distinctWithId = baseTable.distinct.withColumn("id", 
> functions.monotonically_increasing_id())
>   val monotonicallyOnRight: DataFrame = baseTable.join(distinctWithId, "idx")
>   val monotonicallyOnLeft: DataFrame = distinctWithId.join(baseTable, "idx")
>   monotonicallyOnLeft.show // Wrong
>   monotonicallyOnRight.show // Ok in Spark 2.2.2 - also wrong in Spark 2.4.0
> }
> {code}
> It produces the following output:
> {code:java}
> Wrong:
> +---+------------+
> |idx|          id|
> +---+------------+
> |  1|369367187456|
> |  1|369367187457|
> +---+------------+
> Right:
> +---+------------+
> |idx|          id|
> +---+------------+
> |  1|369367187456|
> |  1|369367187456|
> +---+------------+
> {code}
> We assume that the join operator triggers a pushdown of the expression 
> (monotonically_increasing_id in this case), so that it is executed before 
> distinct. This produces non-distinct rows with unique ids. However, it seems 
> this behavior only appears when the table with the projected expression is on 
> the left side of the join in Spark 2.2.2 (in version 2.4.0 it fails on both 
> joins).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26810) Fixing SPARK-25072 broke existing code and fails to show error message

2019-02-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759468#comment-16759468
 ] 

Hyukjin Kwon commented on SPARK-26810:
--

Ah, gotcha. Yeah, it looks like the cause indeed. Sorry that I rushed through reading it.

BTW, I think we should clearly define what we support and what we don't. 
Given my experience so far, and due to the nature of Python, there are many 
holes. It would be nicer if we could whitelist what we support (what we have 
documented).

> Fixing SPARK-25072 broke existing code and fails to show error message
> --
>
> Key: SPARK-26810
> URL: https://issues.apache.org/jira/browse/SPARK-26810
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Arttu Voutilainen
>Priority: Minor
>
> Hey,
> We upgraded Spark recently, and 
> https://issues.apache.org/jira/browse/SPARK-25072 caused our pipeline to fail 
> after the upgrade. Annoyingly, the error message formatting also threw an 
> exception itself, thus hiding the message we should have seen.
> Repro using gettyimages/docker-spark, on 2.4.0:
> {code}
> from pyspark.sql import Row
> r = Row(['a','b'])
> r('1', '2')
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__
> "but got %s" % (self, len(self), args))
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__
> return "<Row(%s)>" % ", ".join(self)
> TypeError: sequence item 0: expected str instance, list found
> {code}
> On 2.3.1, and also showing how this was used:
> {code}
> from pyspark.sql import Row, types as T
> r = Row(['a','b'])
> df = spark.createDataFrame([Row(col='doesntmatter')])
> rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')])
> spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()), 
> T.StructField('b', T.StringType())])).collect()
> {code}
> {code}
> [Row(a='a1', b='b2'), Row(a='a1', b='b2')]
> {code}
> While I do think the code we had was quite horrible, it used to work. The 
> unexpected error came from __repr__, as it assumes that the arguments given to 
> the Row constructor are strings. That sounds like a reasonable assumption; 
> should the Row constructor maybe validate that it holds? (I guess that might be 
> another potentially breaking change though, if someone has code as weird as 
> this one...)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26822) Upgrade the deprecated module 'optparse'

2019-02-03 Thread Neil Chien (JIRA)
Neil Chien created SPARK-26822:
--

 Summary: Upgrade the deprecated module 'optparse'
 Key: SPARK-26822
 URL: https://issues.apache.org/jira/browse/SPARK-26822
 Project: Spark
  Issue Type: Task
  Components: Tests
Affects Versions: 2.4.0
Reporter: Neil Chien


Follow the [official 
document|https://docs.python.org/2/library/argparse.html#upgrading-optparse-code]
 to upgrade the deprecated module 'optparse' to 'argparse'.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26822) Upgrade the deprecated module 'optparse'

2019-02-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26822:


Assignee: Apache Spark

> Upgrade the deprecated module 'optparse'
> 
>
> Key: SPARK-26822
> URL: https://issues.apache.org/jira/browse/SPARK-26822
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Neil Chien
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available, test
>
> Follow the [official 
> document|https://docs.python.org/2/library/argparse.html#upgrading-optparse-code]
>  to upgrade the deprecated module 'optparse' to 'argparse'.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26822) Upgrade the deprecated module 'optparse'

2019-02-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26822:


Assignee: (was: Apache Spark)

> Upgrade the deprecated module 'optparse'
> 
>
> Key: SPARK-26822
> URL: https://issues.apache.org/jira/browse/SPARK-26822
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Neil Chien
>Priority: Minor
>  Labels: pull-request-available, test
>
> Follow the [official 
> document|https://docs.python.org/2/library/argparse.html#upgrading-optparse-code]
>  to upgrade the deprecated module 'optparse' to 'argparse'.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26822) Upgrade the deprecated module 'optparse'

2019-02-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759441#comment-16759441
 ] 

Apache Spark commented on SPARK-26822:
--

User 'cchung100m' has created a pull request for this issue:
https://github.com/apache/spark/pull/23730

> Upgrade the deprecated module 'optparse'
> 
>
> Key: SPARK-26822
> URL: https://issues.apache.org/jira/browse/SPARK-26822
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Neil Chien
>Priority: Minor
>  Labels: pull-request-available, test
>
> Follow the [official 
> document|https://docs.python.org/2/library/argparse.html#upgrading-optparse-code]
>  to upgrade the deprecated module 'optparse' to 'argparse'.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26751) HiveSessionImpl might have memory leak since Operation do not close properly

2019-02-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26751:
--
Priority: Minor  (was: Major)

> HiveSessionImpl might have memory leak since Operation do not close properly
> 
>
> Key: SPARK-26751
> URL: https://issues.apache.org/jira/browse/SPARK-26751
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Minor
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
> Attachments: 26751.png
>
>
> When we run in the background and get an exception that is not a HiveSQLException,
> we may encounter a memory leak, since the handleToOperation entry will not be 
> removed correctly.
> The reason is below:
> 1. when calling operation.run we throw an exception which is not a 
> HiveSQLException
> 2. then opHandleSet will not add the opHandle, and 
> operationManager.closeOperation(opHandle) will not be called
> {code:java}
>  private OperationHandle executeStatementInternal(String statement, 
> Map confOverlay, boolean runAsync) throws HiveSQLException {
> this.acquire(true);
> OperationManager operationManager = this.getOperationManager();
> ExecuteStatementOperation operation = 
> operationManager.newExecuteStatementOperation(this.getSession(), statement, 
> confOverlay, runAsync);
> OperationHandle opHandle = operation.getHandle();
> OperationHandle e;
> try {
> operation.run();
> this.opHandleSet.add(opHandle);
> e = opHandle;
> } catch (HiveSQLException var11) {
> operationManager.closeOperation(opHandle);
> throw var11;
> } finally {
> this.release(true);
> }
> return e;
> }
>   try {
> // This submit blocks if no background threads are available to run 
> this operation
> val backgroundHandle =
>   
> parentSession.getSessionManager().submitBackgroundOperation(backgroundOperation)
> setBackgroundHandle(backgroundHandle)
>   } catch {
> case rejected: RejectedExecutionException =>
>   setState(OperationState.ERROR)
>   throw new HiveSQLException("The background threadpool cannot 
> accept" +
> " new task for execution, please retry the operation", rejected)
> case NonFatal(e) =>
>   logError(s"Error executing query in background", e)
>   setState(OperationState.ERROR)
>   throw e
>   }
> }
> {code}
> 3. when we close the session, operationManager.closeOperation(opHandle) will 
> not be called for this handle either, since we did not add this opHandle 
> into the opHandleSet.
> {code}
> public void close() throws HiveSQLException {
> try {
> this.acquire(true);
> Iterator ioe = this.opHandleSet.iterator();
> while(ioe.hasNext()) {
> OperationHandle opHandle = (OperationHandle)ioe.next();
> this.operationManager.closeOperation(opHandle);
> }
> this.opHandleSet.clear();
> this.cleanupSessionLogDir();
> this.cleanupPipeoutFile();
> HiveHistory ioe1 = this.sessionState.getHiveHistory();
> if(null != ioe1) {
> ioe1.closeStream();
> }
> try {
> this.sessionState.close();
> } finally {
> this.sessionState = null;
> }
> } catch (IOException var17) {
> throw new HiveSQLException("Failure to close", var17);
> } finally {
> if(this.sessionState != null) {
> try {
> this.sessionState.close();
> } catch (Throwable var15) {
> LOG.warn("Error closing session", var15);
> }
> this.sessionState = null;
> }
> this.release(true);
> }
> }
> {code}
> 4. however, the opHandle is added into handleToOperation for each statement, so that entry is never removed
> {code}
> val handleToOperation = ReflectionUtils
> .getSuperField[JMap[OperationHandle, Operation]](this, 
> "handleToOperation")
>   val sessionToActivePool = new ConcurrentHashMap[SessionHandle, String]()
>   val sessionToContexts = new ConcurrentHashMap[SessionHandle, SQLContext]()
>   override def newExecuteStatementOperation(
>   parentSession: HiveSession,
>   statement: String,
>   confOverlay: JMap[String, String],
>   async: Boolean): ExecuteStatementOperation = synchronized {
> val sqlContext = sessionToContexts.get(parentSession.getSessionHandle)
> require(sqlContext != null, s"Session handle: 
> ${parentSession.getSessionHandle} has not been" +
>   

[jira] [Assigned] (SPARK-26751) HiveSessionImpl might have memory leak since Operation do not close properly

2019-02-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26751:
-

Assignee: zhoukang

> HiveSessionImpl might have memory leak since Operation do not close properly
> 
>
> Key: SPARK-26751
> URL: https://issues.apache.org/jira/browse/SPARK-26751
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Attachments: 26751.png
>
>
> When we run in the background and get an exception that is not a HiveSQLException,
> we may encounter a memory leak, since the handleToOperation entry will not be 
> removed correctly.
> The reason is below:
> 1. when calling operation.run we throw an exception which is not a 
> HiveSQLException
> 2. then opHandleSet will not add the opHandle, and 
> operationManager.closeOperation(opHandle) will not be called
> {code:java}
>  private OperationHandle executeStatementInternal(String statement, 
> Map confOverlay, boolean runAsync) throws HiveSQLException {
> this.acquire(true);
> OperationManager operationManager = this.getOperationManager();
> ExecuteStatementOperation operation = 
> operationManager.newExecuteStatementOperation(this.getSession(), statement, 
> confOverlay, runAsync);
> OperationHandle opHandle = operation.getHandle();
> OperationHandle e;
> try {
> operation.run();
> this.opHandleSet.add(opHandle);
> e = opHandle;
> } catch (HiveSQLException var11) {
> operationManager.closeOperation(opHandle);
> throw var11;
> } finally {
> this.release(true);
> }
> return e;
> }
>   try {
> // This submit blocks if no background threads are available to run 
> this operation
> val backgroundHandle =
>   
> parentSession.getSessionManager().submitBackgroundOperation(backgroundOperation)
> setBackgroundHandle(backgroundHandle)
>   } catch {
> case rejected: RejectedExecutionException =>
>   setState(OperationState.ERROR)
>   throw new HiveSQLException("The background threadpool cannot 
> accept" +
> " new task for execution, please retry the operation", rejected)
> case NonFatal(e) =>
>   logError(s"Error executing query in background", e)
>   setState(OperationState.ERROR)
>   throw e
>   }
> }
> {code}
> 3. when we close the session, operationManager.closeOperation(opHandle) will 
> not be called for this handle either, since we did not add this opHandle 
> into the opHandleSet.
> {code}
> public void close() throws HiveSQLException {
> try {
> this.acquire(true);
> Iterator ioe = this.opHandleSet.iterator();
> while(ioe.hasNext()) {
> OperationHandle opHandle = (OperationHandle)ioe.next();
> this.operationManager.closeOperation(opHandle);
> }
> this.opHandleSet.clear();
> this.cleanupSessionLogDir();
> this.cleanupPipeoutFile();
> HiveHistory ioe1 = this.sessionState.getHiveHistory();
> if(null != ioe1) {
> ioe1.closeStream();
> }
> try {
> this.sessionState.close();
> } finally {
> this.sessionState = null;
> }
> } catch (IOException var17) {
> throw new HiveSQLException("Failure to close", var17);
> } finally {
> if(this.sessionState != null) {
> try {
> this.sessionState.close();
> } catch (Throwable var15) {
> LOG.warn("Error closing session", var15);
> }
> this.sessionState = null;
> }
> this.release(true);
> }
> }
> {code}
> 4. however, the opHandle is added into handleToOperation for each statement, so that entry is never removed
> {code}
> val handleToOperation = ReflectionUtils
> .getSuperField[JMap[OperationHandle, Operation]](this, 
> "handleToOperation")
>   val sessionToActivePool = new ConcurrentHashMap[SessionHandle, String]()
>   val sessionToContexts = new ConcurrentHashMap[SessionHandle, SQLContext]()
>   override def newExecuteStatementOperation(
>   parentSession: HiveSession,
>   statement: String,
>   confOverlay: JMap[String, String],
>   async: Boolean): ExecuteStatementOperation = synchronized {
> val sqlContext = sessionToContexts.get(parentSession.getSessionHandle)
> require(sqlContext != null, s"Session handle: 
> ${parentSession.getSessionHandle} has not been" +
>   s" initialized or had already closed.")
> val 

[jira] [Resolved] (SPARK-26751) HiveSessionImpl might have memory leak since Operation do not close properly

2019-02-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26751.
---
   Resolution: Fixed
Fix Version/s: 2.3.4
   2.4.1
   3.0.0

Issue resolved by pull request 23673
[https://github.com/apache/spark/pull/23673]

> HiveSessionImpl might have memory leak since Operation do not close properly
> 
>
> Key: SPARK-26751
> URL: https://issues.apache.org/jira/browse/SPARK-26751
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Fix For: 3.0.0, 2.4.1, 2.3.4
>
> Attachments: 26751.png
>
>
> When we run in the background and get an exception that is not a HiveSQLException,
> we may encounter a memory leak, since the handleToOperation entry will not be 
> removed correctly.
> The reason is below:
> 1. when calling operation.run we throw an exception which is not a 
> HiveSQLException
> 2. then opHandleSet will not add the opHandle, and 
> operationManager.closeOperation(opHandle) will not be called
> {code:java}
>  private OperationHandle executeStatementInternal(String statement, 
> Map confOverlay, boolean runAsync) throws HiveSQLException {
> this.acquire(true);
> OperationManager operationManager = this.getOperationManager();
> ExecuteStatementOperation operation = 
> operationManager.newExecuteStatementOperation(this.getSession(), statement, 
> confOverlay, runAsync);
> OperationHandle opHandle = operation.getHandle();
> OperationHandle e;
> try {
> operation.run();
> this.opHandleSet.add(opHandle);
> e = opHandle;
> } catch (HiveSQLException var11) {
> operationManager.closeOperation(opHandle);
> throw var11;
> } finally {
> this.release(true);
> }
> return e;
> }
>   try {
> // This submit blocks if no background threads are available to run 
> this operation
> val backgroundHandle =
>   
> parentSession.getSessionManager().submitBackgroundOperation(backgroundOperation)
> setBackgroundHandle(backgroundHandle)
>   } catch {
> case rejected: RejectedExecutionException =>
>   setState(OperationState.ERROR)
>   throw new HiveSQLException("The background threadpool cannot 
> accept" +
> " new task for execution, please retry the operation", rejected)
> case NonFatal(e) =>
>   logError(s"Error executing query in background", e)
>   setState(OperationState.ERROR)
>   throw e
>   }
> }
> {code}
> 3. when we close the session, operationManager.closeOperation(opHandle) will 
> not be called for this handle either, since we did not add this opHandle 
> into the opHandleSet.
> {code}
> public void close() throws HiveSQLException {
> try {
> this.acquire(true);
> Iterator ioe = this.opHandleSet.iterator();
> while(ioe.hasNext()) {
> OperationHandle opHandle = (OperationHandle)ioe.next();
> this.operationManager.closeOperation(opHandle);
> }
> this.opHandleSet.clear();
> this.cleanupSessionLogDir();
> this.cleanupPipeoutFile();
> HiveHistory ioe1 = this.sessionState.getHiveHistory();
> if(null != ioe1) {
> ioe1.closeStream();
> }
> try {
> this.sessionState.close();
> } finally {
> this.sessionState = null;
> }
> } catch (IOException var17) {
> throw new HiveSQLException("Failure to close", var17);
> } finally {
> if(this.sessionState != null) {
> try {
> this.sessionState.close();
> } catch (Throwable var15) {
> LOG.warn("Error closing session", var15);
> }
> this.sessionState = null;
> }
> this.release(true);
> }
> }
> {code}
> 4. however, the opHandle is added into handleToOperation for each statement, so that entry is never removed
> {code}
> val handleToOperation = ReflectionUtils
> .getSuperField[JMap[OperationHandle, Operation]](this, 
> "handleToOperation")
>   val sessionToActivePool = new ConcurrentHashMap[SessionHandle, String]()
>   val sessionToContexts = new ConcurrentHashMap[SessionHandle, SQLContext]()
>   override def newExecuteStatementOperation(
>   parentSession: HiveSession,
>   statement: String,
>   confOverlay: JMap[String, String],
>   async: Boolean): ExecuteStatementOperation = synchronized {
> val sqlContext = 

[jira] [Commented] (SPARK-26810) Fixing SPARK-25072 broke existing code and fails to show error message

2019-02-03 Thread Arttu Voutilainen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759423#comment-16759423
 ] 

Arttu Voutilainen commented on SPARK-26810:
---

[~hyukjin.kwon] thanks for checking this! You don't see the error message of 
SPARK-25072 exactly because of that other issue - while formatting the error 
message it throws because of SPARK-23299. 

I have no clue why someone had written the Row(['a', 'b']) originally, I don't 
think it is documented. Still, it used to work before the SPARK-25072 fix, and 
SPARK-23299 means it wasn't easy to understand why (as it hides the real error 
message).

> Fixing SPARK-25072 broke existing code and fails to show error message
> --
>
> Key: SPARK-26810
> URL: https://issues.apache.org/jira/browse/SPARK-26810
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Arttu Voutilainen
>Priority: Minor
>
> Hey,
> We upgraded Spark recently, and 
> https://issues.apache.org/jira/browse/SPARK-25072 caused our pipeline to fail 
> after the upgrade. Annoyingly, the error message formatting also threw an 
> exception itself, thus hiding the message we should have seen.
> Repro using gettyimages/docker-spark, on 2.4.0:
> {code}
> from pyspark.sql import Row
> r = Row(['a','b'])
> r('1', '2')
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__
> "but got %s" % (self, len(self), args))
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__
> return "<Row(%s)>" % ", ".join(self)
> TypeError: sequence item 0: expected str instance, list found
> {code}
> On 2.3.1, and also showing how this was used:
> {code}
> from pyspark.sql import Row, types as T
> r = Row(['a','b'])
> df = spark.createDataFrame([Row(col='doesntmatter')])
> rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')])
> spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()), 
> T.StructField('b', T.StringType())])).collect()
> {code}
> {code}
> [Row(a='a1', b='b2'), Row(a='a1', b='b2')]
> {code}
> While I do think the code we had was quite horrible, it used to work. The 
> unexpected error came from __repr__, as it assumes that the arguments given to 
> the Row constructor are strings. That sounds like a reasonable assumption; 
> should the Row constructor maybe validate that it holds? (I guess that might be 
> another potentially breaking change though, if someone has code as weird as 
> this one...)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26818) Make MLEvents JSON ser/de safe

2019-02-03 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26818.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23728
[https://github.com/apache/spark/pull/23728]

> Make MLEvents JSON ser/de safe
> --
>
> Key: SPARK-26818
> URL: https://issues.apache.org/jira/browse/SPARK-26818
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> It looks like ML events are not JSON serializable. We can make them serializable like:
> {code}
> @DeveloperApi
> case class SparkListenerSQLExecutionEnd(executionId: Long, time: Long)
>   extends SparkListenerEvent {
>   // The name of the execution, e.g. `df.collect` will trigger a SQL 
> execution with name "collect".
>   @JsonIgnore private[sql] var executionName: Option[String] = None
>   // The following 3 fields are only accessed when `executionName` is defined.
>   // The duration of the SQL execution, in nanoseconds.
>   @JsonIgnore private[sql] var duration: Long = 0L
>   // The `QueryExecution` instance that represents the SQL execution
>   @JsonIgnore private[sql] var qe: QueryExecution = null
>   // The exception object that caused this execution to fail. None if the 
> execution doesn't fail.
>   @JsonIgnore private[sql] var executionFailure: Option[Exception] = None
> }
> {code}.
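
Following that pattern, a sketch of what an ML event could look like. The event and field names here are made up for illustration; only the @JsonIgnore approach comes from the snippet above:

{code:java}
// Sketch with hypothetical names: an ML listener event that keeps its
// non-JSON-serializable members out of the serialized form via @JsonIgnore.
import com.fasterxml.jackson.annotation.JsonIgnore
import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.ml.Model
import org.apache.spark.scheduler.SparkListenerEvent

@DeveloperApi
case class FitEndEventSketch(estimatorName: String, time: Long) extends SparkListenerEvent {
  // The fitted model is only accessed in-process and is excluded from the JSON form.
  @JsonIgnore var model: Model[_] = _
}
{code}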



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26821) filters not working with char datatype when querying against hive table

2019-02-03 Thread Sujith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759381#comment-16759381
 ] 

Sujith commented on SPARK-26821:


As per the initial analysis, this phenomenon is happening because the actual 
char data type length is 5, whereas we are trying to insert data of length 
2. Since it is a char data type, the system will pad the remaining part of the 
value with spaces. Now when we try to apply a filter, the system will 
try to compare the predicate value with the actual table data, which contains 
the padding, like 'ds' == 'ds   ', which leads to a wrong result.

 

I am analyzing this issue further; please let me know if you have any 
suggestions or guidance. Thanks.
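
To make the padding theory concrete, a small sketch (assuming a Hive-enabled session, since CHAR(5) comes from the Hive DDL in the report). This is only a check and a possible workaround via trimming, not the fix being discussed:

{code:java}
// Sketch: reproduce the padding mismatch and show that trimming sidesteps it.
// Assumes a Hive-enabled SparkSession so CHAR(5) columns behave as in the report.
import org.apache.spark.sql.SparkSession

object CharPaddingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE TABLE IF NOT EXISTS jj (id INT, name CHAR(5))")
    spark.sql("INSERT INTO jj VALUES (232, 'ds')")

    // Likely empty if the stored value was padded to 'ds   ':
    spark.sql("SELECT * FROM jj WHERE name = 'ds'").show()
    // Trimming the padded column before comparing avoids the mismatch:
    spark.sql("SELECT * FROM jj WHERE rtrim(name) = 'ds'").show()
  }
}
{code}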

 

> filters not working with char datatype when querying against hive table
> ---
>
> Key: SPARK-26821
> URL: https://issues.apache.org/jira/browse/SPARK-26821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sujith
>Priority: Major
>
> Create a table with a char type field. While inserting data into the char data 
> type column, if the data string length is less than the specified datatype 
> length, Spark 2.x will not process the filter query properly, leading to an 
> incorrect result.
> 0: jdbc:hive2://10.19.89.222:22550/default> create table jj(id int, name 
> char(5));
> +---------+
> | Result  |
> +---------+
> +---------+
> No rows selected (0.894 seconds)
> 0: jdbc:hive2://10.19.89.222:22550/default> insert into table jj 
> values(232,'ds');
> +---------+
> | Result  |
> +---------+
> +---------+
> No rows selected (1.815 seconds)
> 0: jdbc:hive2://10.19.89.222:22550/default> select * from jj where name='ds';
> +-----+-------+
> | id  | name  |
> +-----+-------+
> +-----+-------+
> 
> The above query will not give any result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26821) filters not working with char datatype when querying against hive table

2019-02-03 Thread Sujith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759382#comment-16759382
 ] 

Sujith commented on SPARK-26821:


cc [~dongjoon]

> filters not working with char datatype when querying against hive table
> ---
>
> Key: SPARK-26821
> URL: https://issues.apache.org/jira/browse/SPARK-26821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sujith
>Priority: Major
>
> Create a table with a char type field. While inserting data into the char data 
> type column, if the data string length is less than the specified datatype 
> length, Spark 2.x will not process the filter query properly, leading to an 
> incorrect result.
> 0: jdbc:hive2://10.19.89.222:22550/default> create table jj(id int, name 
> char(5));
> +---------+
> | Result  |
> +---------+
> +---------+
> No rows selected (0.894 seconds)
> 0: jdbc:hive2://10.19.89.222:22550/default> insert into table jj 
> values(232,'ds');
> +---------+
> | Result  |
> +---------+
> +---------+
> No rows selected (1.815 seconds)
> 0: jdbc:hive2://10.19.89.222:22550/default> select * from jj where name='ds';
> +-----+-------+
> | id  | name  |
> +-----+-------+
> +-----+-------+
> 
> The above query will not give any result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26821) filters not working with char datatype when querying against hive table

2019-02-03 Thread Sujith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759382#comment-16759382
 ] 

Sujith edited comment on SPARK-26821 at 2/3/19 11:44 AM:
-

cc [~dongjoon] [~vinodkc]


was (Author: s71955):
cc [~dongjoon]

> filters not working with char datatype when querying against hive table
> ---
>
> Key: SPARK-26821
> URL: https://issues.apache.org/jira/browse/SPARK-26821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sujith
>Priority: Major
>
> Create a table with a char type field. While inserting data into the char data 
> type column, if the data string length is less than the specified datatype 
> length, Spark 2.x will not process the filter query properly, leading to an 
> incorrect result.
> 0: jdbc:hive2://10.19.89.222:22550/default> create table jj(id int, name 
> char(5));
> +---------+
> | Result  |
> +---------+
> +---------+
> No rows selected (0.894 seconds)
> 0: jdbc:hive2://10.19.89.222:22550/default> insert into table jj 
> values(232,'ds');
> +---------+
> | Result  |
> +---------+
> +---------+
> No rows selected (1.815 seconds)
> 0: jdbc:hive2://10.19.89.222:22550/default> select * from jj where name='ds';
> +-----+-------+
> | id  | name  |
> +-----+-------+
> +-----+-------+
> 
> The above query will not give any result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26821) filters not working with char datatype when querying against hive table

2019-02-03 Thread Sujith (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujith updated SPARK-26821:
---
Description: 
Create a table with a char type field. While inserting data into the char data 
type column, if the data string length is less than the specified datatype 
length, Spark 2.x will not process the filter query properly, leading to an 
incorrect result.

0: jdbc:hive2://10.19.89.222:22550/default> create table jj(id int, name 
char(5));
+---------+
| Result  |
+---------+
+---------+
No rows selected (0.894 seconds)
0: jdbc:hive2://10.19.89.222:22550/default> insert into table jj 
values(232,'ds');
+---------+
| Result  |
+---------+
+---------+
No rows selected (1.815 seconds)
0: jdbc:hive2://10.19.89.222:22550/default> select * from jj where name='ds';
+-----+-------+
| id  | name  |
+-----+-------+
+-----+-------+

The above query will not give any result.

  was:
Create a table with a char type field. While inserting data into the char data 
type column, if the data string length is less than the specified datatype 
length, Spark 2.x will not process the filter query properly, leading to an 
incorrect result.

0: jdbc:hive2://10.19.89.222:22550/default> create table jj(id int, name 
char(5));
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (0.894 seconds)
0: jdbc:hive2://10.19.89.222:22550/default> insert into table jj 
values(232,'ds');
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (1.815 seconds)
0: jdbc:hive2://10.19.89.222:22550/default> select * from jj where name='ds';
+-+---+--+
| id  | name  |
+-+---+--+
+-+---+--+


> filters not working with char datatype when querying against hive table
> ---
>
> Key: SPARK-26821
> URL: https://issues.apache.org/jira/browse/SPARK-26821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sujith
>Priority: Major
>
> Create a table with a char type field. While inserting data into the char data 
> type column, if the data string length is less than the specified datatype 
> length, Spark 2.x will not process the filter query properly, leading to an 
> incorrect result.
> 0: jdbc:hive2://10.19.89.222:22550/default> create table jj(id int, name 
> char(5));
> +---------+
> | Result  |
> +---------+
> +---------+
> No rows selected (0.894 seconds)
> 0: jdbc:hive2://10.19.89.222:22550/default> insert into table jj 
> values(232,'ds');
> +---------+
> | Result  |
> +---------+
> +---------+
> No rows selected (1.815 seconds)
> 0: jdbc:hive2://10.19.89.222:22550/default> select * from jj where name='ds';
> +-----+-------+
> | id  | name  |
> +-----+-------+
> +-----+-------+
> 
> The above query will not give any result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26821) filters not working with char datatype when querying against hive table

2019-02-03 Thread Sujith (JIRA)
Sujith created SPARK-26821:
--

 Summary: filters not working with char datatype when querying 
against hive table
 Key: SPARK-26821
 URL: https://issues.apache.org/jira/browse/SPARK-26821
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Sujith


Create a table with a char type field. While inserting data into the char data 
type column, if the data string length is less than the specified datatype 
length, Spark 2.x will not process the filter query properly, leading to an 
incorrect result.

0: jdbc:hive2://10.19.89.222:22550/default> create table jj(id int, name 
char(5));
+---------+
| Result  |
+---------+
+---------+
No rows selected (0.894 seconds)
0: jdbc:hive2://10.19.89.222:22550/default> insert into table jj 
values(232,'ds');
+---------+
| Result  |
+---------+
+---------+
No rows selected (1.815 seconds)
0: jdbc:hive2://10.19.89.222:22550/default> select * from jj where name='ds';
+-----+-------+
| id  | name  |
+-----+-------+
+-----+-------+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26820) Issue Error/Warning when Hint is not applicable

2019-02-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-26820:
---

 Summary: Issue Error/Warning when Hint is not applicable
 Key: SPARK-26820
 URL: https://issues.apache.org/jira/browse/SPARK-26820
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiao Li


We should issue an error or a warning when the HINT is not applicable. This 
should be configurable. 
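
For example (a sketch, not taken from the ticket): a hint that names a relation not present in the query cannot take effect and is currently dropped without any message, which is the kind of case this proposal would surface:

{code:java}
// Sketch: the BROADCAST hint below names "t3", which is not part of the query,
// so the hint cannot apply. The proposal is to emit a configurable error or
// warning in such cases instead of dropping the hint silently.
import org.apache.spark.sql.SparkSession

object InapplicableHintSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    Seq((1, "a")).toDF("id", "v").createOrReplaceTempView("t1")
    Seq((1, "b")).toDF("id", "w").createOrReplaceTempView("t2")

    spark.sql("SELECT /*+ BROADCAST(t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id").show()
  }
}
{code}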



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org