[jira] [Created] (SPARK-32724) java.io.IOException: Stream is corrupted when tried to inner join 4 huge tables. Currently using pyspark version 2.4.0-cdh6.3.1

2020-08-28 Thread Kannan (Jira)
Kannan created SPARK-32724:
--

 Summary: java.io.IOException: Stream is corrupted when tried to 
inner join 4 huge tables. Currently using  pyspark version 2.4.0-cdh6.3.1 
 Key: SPARK-32724
 URL: https://issues.apache.org/jira/browse/SPARK-32724
 Project: Spark
  Issue Type: Question
  Components: PySpark, Spark Core
Affects Versions: 2.4.0
Reporter: Kannan


When I try to inner join the 4 tables (about 1M rows of data), I get the error below.

{noformat}
Py4JJavaError: An error occurred while calling o453.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Aborting TaskSet 27.0 because task 9 (partition 9) cannot run anywhere due to node and executor blacklist. Most recent failure: Lost task 9.1 in stage 27.0 (TID 267, si-159l.de.des.com, executor 17): java.io.IOException: Stream is corrupted
	at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
	at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
	at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
	at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Blacklisting behavior can be configured via spark.blacklist.*.
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:929)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:929)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:929)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2111)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2049)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:740)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2081)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2102)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2121)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2146)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2830)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2829)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
	at org.apache.spark.sql.Dataset.count(Dataset.scala:2829)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
{noformat}
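The failure surfaces while LZ4-decompressing shuffle blocks and is then amplified by task blacklisting. A diagnostic sketch (mine, not a fix from this report; shown in Scala, the equivalent PySpark or spark-submit settings are analogous) is to switch the shuffle compression codec and temporarily disable blacklisting so the underlying fetch failures stay visible:

{code:java}
import org.apache.spark.sql.SparkSession

// Hypothetical diagnostic settings, not a confirmed fix: spark.io.compression.codec
// defaults to lz4 in Spark 2.4, and the error text above points at spark.blacklist.*.
val spark = SparkSession.builder()
  .appName("stream-corrupted-diagnosis")
  .config("spark.io.compression.codec", "snappy")
  .config("spark.blacklist.enabled", "false")
  .getOrCreate()
{code}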



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (SPARK-32725) Getting jar vulnerability for kryo-shaded for 4.0 version

2020-08-28 Thread Wasi (Jira)
Wasi created SPARK-32725:


 Summary: Getting jar vulnerability for kryo-shaded for 4.0 version
 Key: SPARK-32725
 URL: https://issues.apache.org/jira/browse/SPARK-32725
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Wasi


We are getting a jar vulnerability report for kryo-shaded 4.0. The report 
suggests upgrading to 5.x, but 5.x is not backward compatible. Upgrading would 
likely require changes to org.apache.spark.serializer.KryoSerializer and the 
code that instantiates it.
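For context (an assumption about where the dependency is exercised, not something stated in this ticket): kryo-shaded is pulled in through Spark's KryoSerializer, so a 4.x to 5.x bump would affect jobs configured like the sketch below.

{code:java}
import org.apache.spark.sql.SparkSession

// Standard Kryo serializer configuration; the class name and settings are Spark's documented ones.
val spark = SparkSession.builder()
  .appName("kryo-example")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrationRequired", "false")
  .getOrCreate()
{code}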



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32723) Security Vulnerability due to JQuery version in Spark Master/Worker UI

2020-08-28 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186403#comment-17186403
 ] 

Rohit Mishra commented on SPARK-32723:
--

[~ashish23aks], Please refrain from marking the target version. This is 
reserved for committers. 

> Security Vulnerability due to JQuery version in Spark Master/Worker UI
> --
>
> Key: SPARK-32723
> URL: https://issues.apache.org/jira/browse/SPARK-32723
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Ashish Kumar Singh
>Priority: Major
>  Labels: Security
>
> Spark 3.0 and Spark 2.4.x use a jQuery version < 3.5, which has known security 
> vulnerabilities affecting the Spark Master UI and Spark Worker UI.
> Can we please upgrade jQuery to 3.5 or above?
>  [https://www.tenable.com/plugins/nessus/136929]
> ??According to the self-reported version in the script, the version of JQuery 
> hosted on the remote web server is greater than or equal to 1.2 and prior to 
> 3.5.0. It is, therefore, affected by multiple cross site scripting 
> vulnerabilities.??
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32726) Filter by column alias in where clause

2020-08-28 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-32726:
---

 Summary: Filter by column alias in where clause
 Key: SPARK-32726
 URL: https://issues.apache.org/jira/browse/SPARK-32726
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


The {{group by}} and {{order by}} clauses can reference a column alias, but 
{{where}} cannot:

{noformat}
spark-sql> select id + 1 as new_id from range(5) group by new_id order by 
new_id;
1
2
3
4
5
spark-sql> select id + 1 as new_id from range(5) where new_id > 2;
Error in query: cannot resolve '`new_id`' given input columns: [id]; line 1 pos 
44;
'Project [('id + 1) AS new_id#5]
+- 'Filter ('new_id > 2)
   +- Range (0, 5, step=1, splits=None
{noformat}
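Until the improvement lands, a common workaround (my sketch, not part of the ticket) is to produce the alias in a subquery so the outer WHERE clause can resolve it:

{code:java}
// Scala sketch; `spark` is an existing SparkSession.
spark.sql("""
  SELECT new_id
  FROM (SELECT id + 1 AS new_id FROM range(5)) t
  WHERE new_id > 2
""").show()
{code}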




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32723) Security Vulnerability due to JQuery version in Spark Master/Worker UI

2020-08-28 Thread Rohit Mishra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohit Mishra updated SPARK-32723:
-
Target Version/s:   (was: 3.1.0)

> Security Vulnerability due to JQuery version in Spark Master/Worker UI
> --
>
> Key: SPARK-32723
> URL: https://issues.apache.org/jira/browse/SPARK-32723
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Ashish Kumar Singh
>Priority: Major
>  Labels: Security
>
> Spark 3.0 and Spark 2.4.x use a jQuery version < 3.5, which has known security 
> vulnerabilities affecting the Spark Master UI and Spark Worker UI.
> Can we please upgrade jQuery to 3.5 or above?
>  [https://www.tenable.com/plugins/nessus/136929]
> ??According to the self-reported version in the script, the version of JQuery 
> hosted on the remote web server is greater than or equal to 1.2 and prior to 
> 3.5.0. It is, therefore, affected by multiple cross site scripting 
> vulnerabilities.??
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32727) replace CaseWhen with If when there is only one case

2020-08-28 Thread Tanel Kiis (Jira)
Tanel Kiis created SPARK-32727:
--

 Summary: replace CaseWhen with If when there is only one case
 Key: SPARK-32727
 URL: https://issues.apache.org/jira/browse/SPARK-32727
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Tanel Kiis
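
The ticket has no description yet. As an illustration of the summary (my example, with hypothetical column values), a single-branch CASE WHEN is semantically equivalent to IF, which is a simpler expression for later optimizer rules to reason about:

{code:java}
// Scala sketch; `spark` is an existing SparkSession. Both queries return the same rows.
spark.sql("SELECT CASE WHEN id > 2 THEN 'big' ELSE 'small' END AS label FROM range(5)").show()
spark.sql("SELECT IF(id > 2, 'big', 'small') AS label FROM range(5)").show()
{code}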






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32727) replace CaseWhen with If when there is only one case

2020-08-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32727:


Assignee: (was: Apache Spark)

> replace CaseWhen with If when there is only one case
> 
>
> Key: SPARK-32727
> URL: https://issues.apache.org/jira/browse/SPARK-32727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32727) replace CaseWhen with If when there is only one case

2020-08-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186446#comment-17186446
 ] 

Apache Spark commented on SPARK-32727:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29570

> replace CaseWhen with If when there is only one case
> 
>
> Key: SPARK-32727
> URL: https://issues.apache.org/jira/browse/SPARK-32727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32726) Filter by column alias in where clause

2020-08-28 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32726:

Description: 
{{group by}} and {{order by}} clause support it. but {{where}} does not support 
it:

{noformat}
spark-sql> select id + 1 as new_id from range(5) group by new_id order by 
new_id;
1
2
3
4
5
spark-sql> select id + 1 as new_id from range(5) where new_id > 2;
Error in query: cannot resolve '`new_id`' given input columns: [id]; line 1 pos 
44;
'Project [('id + 1) AS new_id#5]
+- 'Filter ('new_id > 2)
   +- Range (0, 5, step=1, splits=None
spark-sql> select id + 1 as new_id, new_id + 1 as new_new_id from range(5);
Error in query: cannot resolve '`new_id`' given input columns: [id]; line 1 pos 
25;
'Project [(id#12L + cast(1 as bigint)) AS new_id#10L, ('new_id + 1) AS 
new_new_id#11]
+- Range (0, 5, step=1, splits=None)
{noformat}


  was:
{{group by}} and {{order by}} clause support it. but {{where}} does not support 
it:

{noformat}
spark-sql> select id + 1 as new_id from range(5) group by new_id order by 
new_id;
1
2
3
4
5
spark-sql> select id + 1 as new_id from range(5) where new_id > 2;
Error in query: cannot resolve '`new_id`' given input columns: [id]; line 1 pos 
44;
'Project [('id + 1) AS new_id#5]
+- 'Filter ('new_id > 2)
   +- Range (0, 5, step=1, splits=None
{noformat}



> Filter by column alias in where clause
> --
>
> Key: SPARK-32726
> URL: https://issues.apache.org/jira/browse/SPARK-32726
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {{group by}} and {{order by}} clause support it. but {{where}} does not 
> support it:
> {noformat}
> spark-sql> select id + 1 as new_id from range(5) group by new_id order by 
> new_id;
> 1
> 2
> 3
> 4
> 5
> spark-sql> select id + 1 as new_id from range(5) where new_id > 2;
> Error in query: cannot resolve '`new_id`' given input columns: [id]; line 1 
> pos 44;
> 'Project [('id + 1) AS new_id#5]
> +- 'Filter ('new_id > 2)
>+- Range (0, 5, step=1, splits=None
> spark-sql> select id + 1 as new_id, new_id + 1 as new_new_id from range(5);
> Error in query: cannot resolve '`new_id`' given input columns: [id]; line 1 
> pos 25;
> 'Project [(id#12L + cast(1 as bigint)) AS new_id#10L, ('new_id + 1) AS 
> new_new_id#11]
> +- Range (0, 5, step=1, splits=None)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32727) replace CaseWhen with If when there is only one case

2020-08-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32727:


Assignee: Apache Spark

> replace CaseWhen with If when there is only one case
> 
>
> Key: SPARK-32727
> URL: https://issues.apache.org/jira/browse/SPARK-32727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32717) Add a AQEOptimizer for AdaptiveSparkPlanExec

2020-08-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32717.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29559
[https://github.com/apache/spark/pull/29559]

> Add a AQEOptimizer for AdaptiveSparkPlanExec
> 
>
> Key: SPARK-32717
> URL: https://issues.apache.org/jira/browse/SPARK-32717
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, AdaptiveSparkPlanExec implements an anonymous RuleExecutor to apply 
> the AQE-specific rules to the plan. However, an anonymous class is inconvenient 
> to maintain and extend over the long term.
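
For readers unfamiliar with the internals, the change is roughly in the direction of the sketch below (a simplified assumption on my part, not the actual AQEOptimizer code): a named RuleExecutor subclass whose batches can be maintained and extended in one place, instead of an anonymous instance inside AdaptiveSparkPlanExec.

{code:java}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.RuleExecutor

// Simplified sketch: the batch name and contents here are placeholders, not the real ones.
class AQEOptimizer extends RuleExecutor[LogicalPlan] {
  override protected def batches: Seq[Batch] = Seq(
    Batch("AQE optimizer rules", Once /* AQE-specific logical rules would be listed here */)
  )
}
{code}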



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32717) Add a AQEOptimizer for AdaptiveSparkPlanExec

2020-08-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32717:


Assignee: wuyi

> Add a AQEOptimizer for AdaptiveSparkPlanExec
> 
>
> Key: SPARK-32717
> URL: https://issues.apache.org/jira/browse/SPARK-32717
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> Currently, AdaptiveSparkPlanExec implements an anonymous RuleExecutor to apply 
> the AQE-specific rules to the plan. However, an anonymous class is inconvenient 
> to maintain and extend over the long term.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property

2020-08-28 Thread david bernuau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186517#comment-17186517
 ] 

david bernuau commented on SPARK-32693:
---

thanks

> Compare two dataframes with same schema except nullable property
> 
>
> Key: SPARK-32693
> URL: https://issues.apache.org/jira/browse/SPARK-32693
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.0
>Reporter: david bernuau
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.1.0
>
>
> My aim is to compare two dataframes with very close schemas: the same number of 
> fields, with the same names, types, and metadata. The only difference is that a 
> given field might be nullable in one dataframe and not in the other.
> Here is the code that I used:
> {code:java}
> // Note: the imports for the SQL types were missing from the original snippet, and the
> // nested-schema definitions had unbalanced parentheses; they are reconstructed here.
> import java.sql.Timestamp
> import scala.collection.JavaConverters._
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types._
>
> val session = SparkSession.builder().getOrCreate()
>
> case class A(g: Timestamp, h: Option[Timestamp], i: Int)
> case class B(e: Int, f: Seq[A])
> case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int])
> case class D(e: Option[Int], f: Seq[C])
>
> // Flat schemas: non-nullable on one side, nullable on the other.
> val schema1 = StructType(Array(StructField("a", IntegerType, false),
>   StructField("b", IntegerType, false), StructField("c", IntegerType, false)))
> val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2))
> val df1 = session.createDataFrame(rowSeq1.asJava, schema1)
> df1.printSchema()
>
> val schema2 = StructType(Array(StructField("a", IntegerType),
>   StructField("b", IntegerType), StructField("c", IntegerType)))
> val rowSeq2: List[Row] = List(Row(10, 1, 1))
> val df2 = session.createDataFrame(rowSeq2.asJava, schema2)
> df2.printSchema()
>
> println(s"Number of records for first case : ${df1.except(df2).count()}")
>
> // Nested schema with non-nullable leaf fields.
> val schema3 = StructType(Array(
>   StructField("a", IntegerType, false),
>   StructField("b", IntegerType, false),
>   StructField("c", IntegerType, false),
>   StructField("d", ArrayType(StructType(Array(
>     StructField("e", IntegerType, false),
>     StructField("f", ArrayType(StructType(Array(
>       StructField("g", TimestampType),
>       StructField("h", TimestampType),
>       StructField("i", IntegerType, false)))))
>   ))))
> ))
> val date1 = new Timestamp(1597589638L)
> val date2 = new Timestamp(1597599638L)
> val rowSeq3: List[Row] = List(
>   Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, 1))))),
>   Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2))))))
> val df3 = session.createDataFrame(rowSeq3.asJava, schema3)
> df3.printSchema()
>
> // Same nested schema but with nullable leaf fields.
> val schema4 = StructType(Array(
>   StructField("a", IntegerType),
>   StructField("b", IntegerType),
>   StructField("c", IntegerType),
>   StructField("d", ArrayType(StructType(Array(
>     StructField("e", IntegerType),
>     StructField("f", ArrayType(StructType(Array(
>       StructField("g", TimestampType),
>       StructField("h", TimestampType),
>       StructField("i", IntegerType)))))
>   ))))
> ))
> val rowSeq4: List[Row] = List(Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, None, Some(1)))))))
> val df4 = session.createDataFrame(rowSeq4.asJava, schema3)  // schema3 here, as in the original report
> df4.printSchema()
>
> println(s"Number of records for second case : ${df3.except(df4).count()}")
> {code}
> The preceding code shows what seems to me to be a bug in Spark:
>  * If you consider two dataframes (df1 and df2) with exactly the same schema, 
> except that the fields are non-nullable in the first dataframe and nullable in 
> the second, then df1.except(df2).count() works well.
>  * Now consider two other dataframes (df3 and df4) with the same schema (fields 
> nullable on one side and not on the other). If these two dataframes contain 
> nested fields, then this time the action df3.except(df4).count gives the 
> following exception: 
> java.lang.IllegalArgumentException: requirement failed: Join keys from two 
> sides should have same types
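
A workaround sketch for the nullability mismatch (mine, not from the ticket; it reuses session, df3, and df4 from the snippet above and assumes the data contains no nulls where the stricter schema forbids them) is to force both sides onto the same schema object, including nested nullability, before calling except():

{code:java}
// Rebuild df4 against df3's exact schema, including nested nullability flags.
val df4Aligned = session.createDataFrame(df4.rdd, df3.schema)
println(s"Number of records for second case : ${df3.except(df4Aligned).count()}")
{code}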



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32728) Using groupby with rand creates different values when joining table with itself

2020-08-28 Thread Joachim Bargsten (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joachim Bargsten updated SPARK-32728:
-
Description: 
When running following query on a cluster with *multiple workers (>1)*, the 
result is not 0.0, even though I would expect it to be.
{code:java}
import pyspark.sql.functions as F
sdf = spark.range(100)
sdf = (
sdf.withColumn("a", F.col("id") + 1)
.withColumn("b", F.col("id") + 2)
.withColumn("c", F.col("id") + 3)
.withColumn("d", F.col("id") + 4)
.withColumn("e", F.col("id") + 5)
)
sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e"))
sdf = sdf.withColumn("x", F.rand() * F.col("e"))
sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - 
F.col("xx"))).agg(F.sum("delta_x"))
sum_delta_x = sdf2.head()[0]
print(f"{sum_delta_x} should be 0.0")
assert abs(sum_delta_x) < 0.001
{code}
If the groupby statement is commented out, the code is working as expected.

  was:
When running following query on a cluster with *multiple workers (>1)*, the 
result is not 0.0, even though I would expect it to be.
{code:java}
import pyspark.sql.functions as F
sdf = spark.range(100)
sdf = (
sdf.withColumn("a", F.col("id") + 1)
.withColumn("b", F.col("id") + 2)
.withColumn("c", F.col("id") + 3)
.withColumn("d", F.col("id") + 4)
.withColumn("e", F.col("id") + 5)
)
sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e"))
sdf = sdf.withColumn("x", F.rand() * F.col("e"))
sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - 
F.col("xx"))).agg(F.sum("delta_x"))
sum_delta_x = sdf2.head()[0]
print(f"{sum_delta_x} should be 0.0")
assert abs(sum_delta_x) < 0.001
{code}
{{}}If the groupby statement is commented out, the code is working as expected.


> Using groupby with rand creates different values when joining table with 
> itself
> ---
>
> Key: SPARK-32728
> URL: https://issues.apache.org/jira/browse/SPARK-32728
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
> Environment: I tested it with  environment,Azure Databricks 7.2 
> (includes Apache Spark 3.0.0, Scala 2.12)
> Worker type: Standard_DS3_v2 (2 workers)
>  
>Reporter: Joachim Bargsten
>Priority: Minor
>
> When running following query on a cluster with *multiple workers (>1)*, the 
> result is not 0.0, even though I would expect it to be.
> {code:java}
> import pyspark.sql.functions as F
> sdf = spark.range(100)
> sdf = (
> sdf.withColumn("a", F.col("id") + 1)
> .withColumn("b", F.col("id") + 2)
> .withColumn("c", F.col("id") + 3)
> .withColumn("d", F.col("id") + 4)
> .withColumn("e", F.col("id") + 5)
> )
> sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e"))
> sdf = sdf.withColumn("x", F.rand() * F.col("e"))
> sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
> sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - 
> F.col("xx"))).agg(F.sum("delta_x"))
> sum_delta_x = sdf2.head()[0]
> print(f"{sum_delta_x} should be 0.0")
> assert abs(sum_delta_x) < 0.001
> {code}
> If the groupby statement is commented out, the code is working as expected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32728) Using groupby with rand creates different values when joining table with itself

2020-08-28 Thread Joachim Bargsten (Jira)
Joachim Bargsten created SPARK-32728:


 Summary: Using groupby with rand creates different values when 
joining table with itself
 Key: SPARK-32728
 URL: https://issues.apache.org/jira/browse/SPARK-32728
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 2.4.5
 Environment: I tested it with  environment,Azure Databricks 7.2 
(includes Apache Spark 3.0.0, Scala 2.12)

Worker type: Standard_DS3_v2 (2 workers)

 
Reporter: Joachim Bargsten


When running following query on a cluster with *multiple workers (>1)*, the 
result is not 0.0, even though I would expect it to be.
{code:java}
import pyspark.sql.functions as F
sdf = spark.range(100)
sdf = (
sdf.withColumn("a", F.col("id") + 1)
.withColumn("b", F.col("id") + 2)
.withColumn("c", F.col("id") + 3)
.withColumn("d", F.col("id") + 4)
.withColumn("e", F.col("id") + 5)
)
sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e"))
sdf = sdf.withColumn("x", F.rand() * F.col("e"))
sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - 
F.col("xx"))).agg(F.sum("delta_x"))
sum_delta_x = sdf2.head()[0]
print(f"{sum_delta_x} should be 0.0")
assert abs(sum_delta_x) < 0.001
{code}
If the groupby statement is commented out, the code works as expected.
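
A mitigation sketch (mine, not from the report; shown in Scala, and cache() only helps while the cached partitions are not recomputed, so checkpoint() is the stronger option) is to materialize the rand() column once before the self-join:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, col, rand, sum}

val spark = SparkSession.builder().appName("rand-self-join").master("local[2]").getOrCreate()

// Same construction as the PySpark snippet above.
val base = spark.range(100)
  .withColumn("a", col("id") + 1).withColumn("b", col("id") + 2)
  .withColumn("c", col("id") + 3).withColumn("d", col("id") + 4)
  .withColumn("e", col("id") + 5)
  .groupBy("a", "b", "c", "d").agg(sum("e").alias("e"))

// Materialize the non-deterministic column once, then self-join on the cached result.
val withRand = base.withColumn("x", rand() * col("e")).cache()
withRand.count()
val joined = withRand.join(withRand.withColumnRenamed("x", "xx"), Seq("a", "b", "c", "d"))
joined.select(sum(abs(col("x") - col("xx")))).show()  // expected: 0.0
{code}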



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32728) Using groupby with rand creates different values when joining table with itself

2020-08-28 Thread Joachim Bargsten (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joachim Bargsten updated SPARK-32728:
-
Environment: 
I tested it with  environment,Azure Databricks 7.2 (& 6.6) (includes Apache 
Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11))

Worker type: Standard_DS3_v2 (2 workers)

 

  was:
I tested it with  environment,Azure Databricks 7.2 (includes Apache Spark 
3.0.0, Scala 2.12)

Worker type: Standard_DS3_v2 (2 workers)

 


> Using groupby with rand creates different values when joining table with 
> itself
> ---
>
> Key: SPARK-32728
> URL: https://issues.apache.org/jira/browse/SPARK-32728
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
> Environment: I tested it with  environment,Azure Databricks 7.2 (& 
> 6.6) (includes Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11))
> Worker type: Standard_DS3_v2 (2 workers)
>  
>Reporter: Joachim Bargsten
>Priority: Minor
>
> When running following query on a cluster with *multiple workers (>1)*, the 
> result is not 0.0, even though I would expect it to be.
> {code:java}
> import pyspark.sql.functions as F
> sdf = spark.range(100)
> sdf = (
> sdf.withColumn("a", F.col("id") + 1)
> .withColumn("b", F.col("id") + 2)
> .withColumn("c", F.col("id") + 3)
> .withColumn("d", F.col("id") + 4)
> .withColumn("e", F.col("id") + 5)
> )
> sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e"))
> sdf = sdf.withColumn("x", F.rand() * F.col("e"))
> sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
> sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - 
> F.col("xx"))).agg(F.sum("delta_x"))
> sum_delta_x = sdf2.head()[0]
> print(f"{sum_delta_x} should be 0.0")
> assert abs(sum_delta_x) < 0.001
> {code}
> If the groupby statement is commented out, the code is working as expected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32728) Using groupby with rand creates different values when joining table with itself

2020-08-28 Thread Joachim Bargsten (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joachim Bargsten updated SPARK-32728:
-
Environment: 
I tested it with Azure Databricks 7.2 (& 6.6) (includes Apache Spark 3.0.0 (& 
2.4.5), Scala 2.12 (& 2.11))

Worker type: Standard_DS3_v2 (2 workers)

 

  was:
I tested it with  environment,Azure Databricks 7.2 (& 6.6) (includes Apache 
Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11))

Worker type: Standard_DS3_v2 (2 workers)

 


> Using groupby with rand creates different values when joining table with 
> itself
> ---
>
> Key: SPARK-32728
> URL: https://issues.apache.org/jira/browse/SPARK-32728
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
> Environment: I tested it with Azure Databricks 7.2 (& 6.6) (includes 
> Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11))
> Worker type: Standard_DS3_v2 (2 workers)
>  
>Reporter: Joachim Bargsten
>Priority: Minor
>
> When running following query on a cluster with *multiple workers (>1)*, the 
> result is not 0.0, even though I would expect it to be.
> {code:java}
> import pyspark.sql.functions as F
> sdf = spark.range(100)
> sdf = (
> sdf.withColumn("a", F.col("id") + 1)
> .withColumn("b", F.col("id") + 2)
> .withColumn("c", F.col("id") + 3)
> .withColumn("d", F.col("id") + 4)
> .withColumn("e", F.col("id") + 5)
> )
> sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e"))
> sdf = sdf.withColumn("x", F.rand() * F.col("e"))
> sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
> sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - 
> F.col("xx"))).agg(F.sum("delta_x"))
> sum_delta_x = sdf2.head()[0]
> print(f"{sum_delta_x} should be 0.0")
> assert abs(sum_delta_x) < 0.001
> {code}
> If the groupby statement is commented out, the code is working as expected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32728) Using groupby with rand creates different values when joining table with itself

2020-08-28 Thread Joachim Bargsten (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joachim Bargsten updated SPARK-32728:
-
Description: 
When running following query in a python3 notebook on a cluster with *multiple 
workers (>1)*, the result is not 0.0, even though I would expect it to be.
{code:java}
import pyspark.sql.functions as F
sdf = spark.range(100)
sdf = (
sdf.withColumn("a", F.col("id") + 1)
.withColumn("b", F.col("id") + 2)
.withColumn("c", F.col("id") + 3)
.withColumn("d", F.col("id") + 4)
.withColumn("e", F.col("id") + 5)
)
sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e"))
sdf = sdf.withColumn("x", F.rand() * F.col("e"))
sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - 
F.col("xx"))).agg(F.sum("delta_x"))
sum_delta_x = sdf2.head()[0]
print(f"{sum_delta_x} should be 0.0")
assert abs(sum_delta_x) < 0.001
{code}
If the groupby statement is commented out, the code is working as expected.

  was:
When running following query on a cluster with *multiple workers (>1)*, the 
result is not 0.0, even though I would expect it to be.
{code:java}
import pyspark.sql.functions as F
sdf = spark.range(100)
sdf = (
sdf.withColumn("a", F.col("id") + 1)
.withColumn("b", F.col("id") + 2)
.withColumn("c", F.col("id") + 3)
.withColumn("d", F.col("id") + 4)
.withColumn("e", F.col("id") + 5)
)
sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e"))
sdf = sdf.withColumn("x", F.rand() * F.col("e"))
sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - 
F.col("xx"))).agg(F.sum("delta_x"))
sum_delta_x = sdf2.head()[0]
print(f"{sum_delta_x} should be 0.0")
assert abs(sum_delta_x) < 0.001
{code}
If the groupby statement is commented out, the code is working as expected.


> Using groupby with rand creates different values when joining table with 
> itself
> ---
>
> Key: SPARK-32728
> URL: https://issues.apache.org/jira/browse/SPARK-32728
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
> Environment: I tested it with Azure Databricks 7.2 (& 6.6) (includes 
> Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11))
> Worker type: Standard_DS3_v2 (2 workers)
>  
>Reporter: Joachim Bargsten
>Priority: Minor
>
> When running following query in a python3 notebook on a cluster with 
> *multiple workers (>1)*, the result is not 0.0, even though I would expect it 
> to be.
> {code:java}
> import pyspark.sql.functions as F
> sdf = spark.range(100)
> sdf = (
> sdf.withColumn("a", F.col("id") + 1)
> .withColumn("b", F.col("id") + 2)
> .withColumn("c", F.col("id") + 3)
> .withColumn("d", F.col("id") + 4)
> .withColumn("e", F.col("id") + 5)
> )
> sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e"))
> sdf = sdf.withColumn("x", F.rand() * F.col("e"))
> sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
> sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - 
> F.col("xx"))).agg(F.sum("delta_x"))
> sum_delta_x = sdf2.head()[0]
> print(f"{sum_delta_x} should be 0.0")
> assert abs(sum_delta_x) < 0.001
> {code}
> If the groupby statement is commented out, the code is working as expected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32729) Fill up missing since version for math expressions

2020-08-28 Thread Kent Yao (Jira)
Kent Yao created SPARK-32729:


 Summary: Fill up missing since version for math expressions
 Key: SPARK-32729
 URL: https://issues.apache.org/jira/browse/SPARK-32729
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.1.0
Reporter: Kent Yao


The since-version mark is absent for more than 40 math expressions, which were 
added in different versions (v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc.).

We should fill them in properly and may later visit other groups of functions 
for the same.
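
One way the gap is visible to end users (an assumption on my part about where the `since` metadata surfaces, not something stated in the ticket) is in the extended function description:

{code:java}
// Scala sketch; `spark` is an existing SparkSession. Functions whose
// ExpressionDescription carries a `since` value show it in the extended output.
spark.sql("DESCRIBE FUNCTION EXTENDED log2").show(truncate = false)
{code}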



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32729) Fill up missing since version for math expressions

2020-08-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32729:


Assignee: (was: Apache Spark)

> Fill up missing since version for math expressions
> --
>
> Key: SPARK-32729
> URL: https://issues.apache.org/jira/browse/SPARK-32729
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> the mark of since version is absent for more than 40 math expressions,  which 
> were added in different versions, v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc
> we should fill them up properly and may visit other groups of functions 
> latter for the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32729) Fill up missing since version for math expressions

2020-08-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32729:


Assignee: Apache Spark

> Fill up missing since version for math expressions
> --
>
> Key: SPARK-32729
> URL: https://issues.apache.org/jira/browse/SPARK-32729
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Minor
>
> the mark of since version is absent for more than 40 math expressions,  which 
> were added in different versions, v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc
> we should fill them up properly and may visit other groups of functions 
> latter for the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32729) Fill up missing since version for math expressions

2020-08-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186589#comment-17186589
 ] 

Apache Spark commented on SPARK-32729:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29571

> Fill up missing since version for math expressions
> --
>
> Key: SPARK-32729
> URL: https://issues.apache.org/jira/browse/SPARK-32729
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> the mark of since version is absent for more than 40 math expressions,  which 
> were added in different versions, v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc
> we should fill them up properly and may visit other groups of functions 
> latter for the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32729) Fill up missing since version for math expressions

2020-08-28 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-32729:
-
Description: 
the mark of since version is absent for more than 40 math expressions,  which 
were added in different versions, v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc

we should fill them up properly and may visit other groups of functions later 
for the same.

  was:
the mark of since version is absent for more than 40 math expressions,  which 
were added in different versions, v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc

we should fill them up properly and may visit other groups of functions latter 
for the same.


> Fill up missing since version for math expressions
> --
>
> Key: SPARK-32729
> URL: https://issues.apache.org/jira/browse/SPARK-32729
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> the mark of since version is absent for more than 40 math expressions,  which 
> were added in different versions, v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc
> we should fill them up properly and may visit other groups of functions later 
> for the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32729) Fill up missing since version for math expressions

2020-08-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32729.
--
Fix Version/s: 3.1.0
 Assignee: Kent Yao
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/29571

> Fill up missing since version for math expressions
> --
>
> Key: SPARK-32729
> URL: https://issues.apache.org/jira/browse/SPARK-32729
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.1.0
>
>
> the mark of since version is absent for more than 40 math expressions,  which 
> were added in different versions, v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc
> we should fill them up properly and may visit other groups of functions later 
> for the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32721) Simplify if clauses with null and boolean

2020-08-28 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-32721:
-
Description: 
The following if clause:
{code:sql}
if(p, null, false) 
{code}
can be simplified to:
{code:sql}
and(p, null)
{code}

And similarly, the following clause:
{code:sql}
if(p, null, true)
{code}
can be simplified to:
{code:sql}
or(not(p), null)
{code}

iff predicate {{p}} is deterministic, i.e., can be evaluated to either true or 
false, but not null.

  was:
The following if clause:
{code:sql}
if(p, null, false) 
{code}
can be simplified to:
{code:sql}
and(p, null)
{code}

And similarly, the following clause:
{code:sql}
if(p, null, true)
{code}
can be simplified to:
{code:sql}
or(p, null)
{code}


> Simplify if clauses with null and boolean
> -
>
> Key: SPARK-32721
> URL: https://issues.apache.org/jira/browse/SPARK-32721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Major
>
> The following if clause:
> {code:sql}
> if(p, null, false) 
> {code}
> can be simplified to:
> {code:sql}
> and(p, null)
> {code}
> And similarly, the following clause:
> {code:sql}
> if(p, null, true)
> {code}
> can be simplified to:
> {code:sql}
> or(not(p), null)
> {code}
> iff predicate {{p}} is deterministic, i.e., can be evaluated to either true 
> or false, but not null.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32721) Simplify if clauses with null and boolean

2020-08-28 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-32721:
-
Description: 
The following if clause:
{code:sql}
if(p, null, false) 
{code}
can be simplified to:
{code:sql}
and(p, null)
{code}

And similarly, the following clause:
{code:sql}
if(p, null, true)
{code}
can be simplified to:
{code:sql}
or(not(p), null)
{code}

iff predicate {{p}} is deterministic, i.e., can be evaluated to either true or 
false, but not null.

{{and}} and {{or}} clauses are more optimization friendly. For instance, by 
converting {{if(col > 42, null, false)}} to {{and(col > 42, null)}}, we can 
potentially push the filter {{col > 42}} down to data sources to avoid 
unnecessary IO.

  was:
The following if clause:
{code:sql}
if(p, null, false) 
{code}
can be simplified to:
{code:sql}
and(p, null)
{code}

And similarly, the following clause:
{code:sql}
if(p, null, true)
{code}
can be simplified to:
{code:sql}
or(not(p), null)
{code}

iff predicate {{p}} is deterministic, i.e., can be evaluated to either true or 
false, but not null.


> Simplify if clauses with null and boolean
> -
>
> Key: SPARK-32721
> URL: https://issues.apache.org/jira/browse/SPARK-32721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Major
>
> The following if clause:
> {code:sql}
> if(p, null, false) 
> {code}
> can be simplified to:
> {code:sql}
> and(p, null)
> {code}
> And similarly, the following clause:
> {code:sql}
> if(p, null, true)
> {code}
> can be simplified to:
> {code:sql}
> or(not(p), null)
> {code}
> iff predicate {{p}} is deterministic, i.e., can be evaluated to either true 
> or false, but not null.
> {{and}} and {{or}} clauses are more optimization friendly. For instance, by 
> converting {{if(col > 42, null, false)}} to {{and(col > 42, null)}}, we can 
> potentially push the filter {{col > 42}} down to data sources to avoid 
> unnecessary IO.
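
A quick way to sanity-check the proposed rewrites (my sketch, assuming a deterministic predicate that never evaluates to null, as stated above):

{code:java}
// Scala sketch; `spark` is an existing SparkSession.
spark.sql("""
  SELECT p,
         if(p, null, false) AS if_false, p AND NULL      AS and_form,
         if(p, null, true)  AS if_true,  (NOT p) OR NULL AS or_form
  FROM VALUES (true), (false) AS t(p)
""").show()
// if_false matches and_form and if_true matches or_form on both rows.
{code}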



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32704) Logging plan changes for execution

2020-08-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32704:
---

Assignee: Takeshi Yamamuro

> Logging plan changes for execution
> --
>
> Key: SPARK-32704
> URL: https://issues.apache.org/jira/browse/SPARK-32704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> Since we only log plan changes for analyzer/optimizer now, this ticket 
> targets adding code to log plan changes in the preparation phase in 
> QueryExecution for execution.
> {code}
> scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN")
> scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan
> ...
> 20/08/26 09:32:36 WARN PlanChangeLogger: 
> === Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages ===
> !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, 
> count#23L])  *(1) HashAggregate(keys=[id#19L], 
> functions=[count(1)], output=[id#19L, count#23L])
> !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], 
> output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
> functions=[partial_count(1)], output=[id#19L, count#27L])
> !   +- Range (0, 10, step=1, splits=4)
>   +- *(1) Range (0, 10, step=1, splits=4)
>  
> 20/08/26 09:32:36 WARN PlanChangeLogger: 
> === Result of Batch Preparations ===
> !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, 
> count#23L])  *(1) HashAggregate(keys=[id#19L], 
> functions=[count(1)], output=[id#19L, count#23L])
> !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], 
> output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
> functions=[partial_count(1)], output=[id#19L, count#27L])
> !   +- Range (0, 10, step=1, splits=4)  
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32704) Logging plan changes for execution

2020-08-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32704.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29544
[https://github.com/apache/spark/pull/29544]

> Logging plan changes for execution
> --
>
> Key: SPARK-32704
> URL: https://issues.apache.org/jira/browse/SPARK-32704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.1.0
>
>
> Since we only log plan changes for analyzer/optimizer now, this ticket 
> targets adding code to log plan changes in the preparation phase in 
> QueryExecution for execution.
> {code}
> scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN")
> scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan
> ...
> 20/08/26 09:32:36 WARN PlanChangeLogger: 
> === Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages ===
> !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, 
> count#23L])  *(1) HashAggregate(keys=[id#19L], 
> functions=[count(1)], output=[id#19L, count#23L])
> !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], 
> output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
> functions=[partial_count(1)], output=[id#19L, count#27L])
> !   +- Range (0, 10, step=1, splits=4)
>   +- *(1) Range (0, 10, step=1, splits=4)
>  
> 20/08/26 09:32:36 WARN PlanChangeLogger: 
> === Result of Batch Preparations ===
> !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, 
> count#23L])  *(1) HashAggregate(keys=[id#19L], 
> functions=[count(1)], output=[id#19L, count#23L])
> !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], 
> output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
> functions=[partial_count(1)], output=[id#19L, count#27L])
> !   +- Range (0, 10, step=1, splits=4)  
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32639) Support GroupType parquet mapkey field

2020-08-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32639.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29451
[https://github.com/apache/spark/pull/29451]

> Support GroupType parquet mapkey field
> --
>
> Key: SPARK-32639
> URL: https://issues.apache.org/jira/browse/SPARK-32639
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Chen Zhang
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: 000.snappy.parquet
>
>
> I have a parquet file, and the MessageType recorded in the file is:
> {code:java}
> message parquet_schema {
>   optional group value (MAP) {
> repeated group key_value {
>   required group key {
> optional binary first (UTF8);
> optional binary middle (UTF8);
> optional binary last (UTF8);
>   }
>   optional binary value (UTF8);
> }
>   }
> }{code}
>  
> Use +spark.read.parquet("000.snappy.parquet")+ to read the file. Spark will 
> throw an exception when converting Parquet MessageType to Spark SQL 
> StructType:
> {code:java}
> AssertionError(Map key type is expected to be a primitive type, but found...)
> {code}
>  
> Use +spark.read.schema("value MAP<STRUCT<first: STRING, middle: STRING, 
> last: STRING>, STRING>").parquet("000.snappy.parquet")+ to read the file, and 
> Spark returns the correct result.
> According to the parquet project document 
> (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), 
> the mapKey in the parquet format does not need to be a primitive type.
>  
> Note: This parquet file is not written by spark, because spark will write 
> additional sparkSchema string information in the parquet file. When Spark 
> reads, it will directly use the additional sparkSchema information in the 
> file instead of converting Parquet MessageType to Spark SQL StructType.
> I will submit a PR later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32639) Support GroupType parquet mapkey field

2020-08-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32639:
---

Assignee: Chen Zhang

> Support GroupType parquet mapkey field
> --
>
> Key: SPARK-32639
> URL: https://issues.apache.org/jira/browse/SPARK-32639
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: 000.snappy.parquet
>
>
> I have a parquet file, and the MessageType recorded in the file is:
> {code:java}
> message parquet_schema {
>   optional group value (MAP) {
> repeated group key_value {
>   required group key {
> optional binary first (UTF8);
> optional binary middle (UTF8);
> optional binary last (UTF8);
>   }
>   optional binary value (UTF8);
> }
>   }
> }{code}
>  
> Use +spark.read.parquet("000.snappy.parquet")+ to read the file. Spark will 
> throw an exception when converting Parquet MessageType to Spark SQL 
> StructType:
> {code:java}
> AssertionError(Map key type is expected to be a primitive type, but found...)
> {code}
>  
> Use +spark.read.schema("value MAP<STRUCT<first: STRING, middle: STRING, 
> last: STRING>, STRING>").parquet("000.snappy.parquet")+ to read the file, and 
> Spark returns the correct result.
> According to the parquet project document 
> (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), 
> the mapKey in the parquet format does not need to be a primitive type.
>  
> Note: This parquet file is not written by spark, because spark will write 
> additional sparkSchema string information in the parquet file. When Spark 
> reads, it will directly use the additional sparkSchema information in the 
> file instead of converting Parquet MessageType to Spark SQL StructType.
> I will submit a PR later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32730) Improve LeftSemi SortMergeJoin right side buffering

2020-08-28 Thread Peter Toth (Jira)
Peter Toth created SPARK-32730:
--

 Summary: Improve LeftSemi SortMergeJoin right side buffering
 Key: SPARK-32730
 URL: https://issues.apache.org/jira/browse/SPARK-32730
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Peter Toth


LeftSemi SortMergeJoin should not buffer all matching right side rows when 
bound condition is empty.
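
An example of the query shape involved (my illustration, not from the ticket): a left semi sort-merge join with only equi-join keys, i.e. no extra bound condition:

{code:java}
// Scala sketch; `spark` is an existing SparkSession. The MERGE hint requests a sort-merge join.
spark.sql("""
  SELECT /*+ MERGE(t2) */ t1.id
  FROM range(10) AS t1 LEFT SEMI JOIN range(10) AS t2 ON t1.id = t2.id
""").explain()
{code}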



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32730) Improve LeftSemi SortMergeJoin right side buffering

2020-08-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186792#comment-17186792
 ] 

Apache Spark commented on SPARK-32730:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/29572

> Improve LeftSemi SortMergeJoin right side buffering
> ---
>
> Key: SPARK-32730
> URL: https://issues.apache.org/jira/browse/SPARK-32730
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Priority: Minor
>
> LeftSemi SortMergeJoin should not buffer all matching right side rows when 
> bound condition is empty.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32730) Improve LeftSemi SortMergeJoin right side buffering

2020-08-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32730:


Assignee: (was: Apache Spark)

> Improve LeftSemi SortMergeJoin right side buffering
> ---
>
> Key: SPARK-32730
> URL: https://issues.apache.org/jira/browse/SPARK-32730
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Priority: Minor
>
> LeftSemi SortMergeJoin should not buffer all matching right side rows when 
> bound condition is empty.






[jira] [Assigned] (SPARK-32730) Improve LeftSemi SortMergeJoin right side buffering

2020-08-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32730:


Assignee: Apache Spark

> Improve LeftSemi SortMergeJoin right side buffering
> ---
>
> Key: SPARK-32730
> URL: https://issues.apache.org/jira/browse/SPARK-32730
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Assignee: Apache Spark
>Priority: Minor
>
> LeftSemi SortMergeJoin should not buffer all matching right side rows when 
> bound condition is empty.






[jira] [Commented] (SPARK-32730) Improve LeftSemi SortMergeJoin right side buffering

2020-08-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186791#comment-17186791
 ] 

Apache Spark commented on SPARK-32730:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/29572

> Improve LeftSemi SortMergeJoin right side buffering
> ---
>
> Key: SPARK-32730
> URL: https://issues.apache.org/jira/browse/SPARK-32730
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Priority: Minor
>
> LeftSemi SortMergeJoin should not buffer all matching right side rows when 
> bound condition is empty.






[jira] [Created] (SPARK-32731) Added tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse

2020-08-28 Thread Muhammad Samir Khan (Jira)
Muhammad Samir Khan created SPARK-32731:
---

 Summary: Added tests for arrays/maps of nested structs to 
ReadSchemaSuite to test structs reuse
 Key: SPARK-32731
 URL: https://issues.apache.org/jira/browse/SPARK-32731
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan


Splitting out tests originally posted in 
[PR|https://github.com/apache/spark/pull/29352] for SPARK-32531. The added 
tests cover cases for maps and arrays of nested structs for different file 
formats. E.g., [https://github.com/apache/spark/pull/29353] and 
[https://github.com/apache/spark/pull/29354] add object reuse when reading ORC 
and Avro files. However, for dynamic data structures like arrays and maps, we 
cannot tell from the schema alone how large the data structure will be, so it 
has to be allocated while reading the data points. The added tests provide 
coverage so that objects are not accidentally reused when encountering maps 
and arrays.

AFAIK this is not covered by existing tests.
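For illustration, a minimal sketch (not the actual ReadSchemaSuite code; the path and names are made up) of the shape of data these tests exercise: arrays of nested structs written and read back, where incorrectly reusing one struct object across array elements would surface as duplicated values.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class Inner(a: Int, b: String)
case class Outer(id: Int, items: Seq[Inner])

val path = "/tmp/nested-structs-orc"  // hypothetical location

Seq(
  Outer(1, Seq(Inner(1, "x"), Inner(2, "y"))),
  Outer(2, Seq(Inner(3, "z")))
).toDF().write.mode("overwrite").orc(path)

// If the reader reused a single struct object for every array element, earlier
// elements would be overwritten by later ones when the array is materialized.
spark.read.orc(path).show(truncate = false)
{code}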






[jira] [Updated] (SPARK-32731) Add tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse

2020-08-28 Thread Muhammad Samir Khan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated SPARK-32731:

Summary: Add tests for arrays/maps of nested structs to ReadSchemaSuite to 
test structs reuse  (was: Added tests for arrays/maps of nested structs to 
ReadSchemaSuite to test structs reuse)

> Add tests for arrays/maps of nested structs to ReadSchemaSuite to test 
> structs reuse
> 
>
> Key: SPARK-32731
> URL: https://issues.apache.org/jira/browse/SPARK-32731
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Priority: Major
>
> Splitting tests originally posted in 
> [PR|[https://github.com/apache/spark/pull/29352]] for SPARK-32531. The added 
> tests cover cases for maps and arrays of nested structs for different file 
> formats. Eg, [https://github.com/apache/spark/pull/29353] and 
> [https://github.com/apache/spark/pull/29354] add object reuse when reading 
> ORC and Avro files. However, for dynamic data structures like arrays and 
> maps, we do not know just by looking at the schema what the size of the data 
> structure will be so it has to be allocated when reading the data points. The 
> added tests provide coverage so that objects are not accidentally reused when 
> encountering maps and arrays.
> AFAIK this is not covered by existing tests.






[jira] [Assigned] (SPARK-32731) Add tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse

2020-08-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32731:


Assignee: Apache Spark

> Add tests for arrays/maps of nested structs to ReadSchemaSuite to test 
> structs reuse
> 
>
> Key: SPARK-32731
> URL: https://issues.apache.org/jira/browse/SPARK-32731
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Assignee: Apache Spark
>Priority: Major
>
> Splitting tests originally posted in 
> [PR|[https://github.com/apache/spark/pull/29352]] for SPARK-32531. The added 
> tests cover cases for maps and arrays of nested structs for different file 
> formats. Eg, [https://github.com/apache/spark/pull/29353] and 
> [https://github.com/apache/spark/pull/29354] add object reuse when reading 
> ORC and Avro files. However, for dynamic data structures like arrays and 
> maps, we do not know just by looking at the schema what the size of the data 
> structure will be so it has to be allocated when reading the data points. The 
> added tests provide coverage so that objects are not accidentally reused when 
> encountering maps and arrays.
> AFAIK this is not covered by existing tests.






[jira] [Commented] (SPARK-32731) Add tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse

2020-08-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186825#comment-17186825
 ] 

Apache Spark commented on SPARK-32731:
--

User 'msamirkhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29573

> Add tests for arrays/maps of nested structs to ReadSchemaSuite to test 
> structs reuse
> 
>
> Key: SPARK-32731
> URL: https://issues.apache.org/jira/browse/SPARK-32731
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Priority: Major
>
> Splitting tests originally posted in 
> [PR|[https://github.com/apache/spark/pull/29352]] for SPARK-32531. The added 
> tests cover cases for maps and arrays of nested structs for different file 
> formats. Eg, [https://github.com/apache/spark/pull/29353] and 
> [https://github.com/apache/spark/pull/29354] add object reuse when reading 
> ORC and Avro files. However, for dynamic data structures like arrays and 
> maps, we do not know just by looking at the schema what the size of the data 
> structure will be so it has to be allocated when reading the data points. The 
> added tests provide coverage so that objects are not accidentally reused when 
> encountering maps and arrays.
> AFAIK this is not covered by existing tests.






[jira] [Assigned] (SPARK-32731) Add tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse

2020-08-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32731:


Assignee: (was: Apache Spark)

> Add tests for arrays/maps of nested structs to ReadSchemaSuite to test 
> structs reuse
> 
>
> Key: SPARK-32731
> URL: https://issues.apache.org/jira/browse/SPARK-32731
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Priority: Major
>
> Splitting tests originally posted in 
> [PR|[https://github.com/apache/spark/pull/29352]] for SPARK-32531. The added 
> tests cover cases for maps and arrays of nested structs for different file 
> formats. Eg, [https://github.com/apache/spark/pull/29353] and 
> [https://github.com/apache/spark/pull/29354] add object reuse when reading 
> ORC and Avro files. However, for dynamic data structures like arrays and 
> maps, we do not know just by looking at the schema what the size of the data 
> structure will be so it has to be allocated when reading the data points. The 
> added tests provide coverage so that objects are not accidentally reused when 
> encountering maps and arrays.
> AFAIK this is not covered by existing tests.






[jira] [Commented] (SPARK-32731) Add tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse

2020-08-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186826#comment-17186826
 ] 

Apache Spark commented on SPARK-32731:
--

User 'msamirkhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29573

> Add tests for arrays/maps of nested structs to ReadSchemaSuite to test 
> structs reuse
> 
>
> Key: SPARK-32731
> URL: https://issues.apache.org/jira/browse/SPARK-32731
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Muhammad Samir Khan
>Priority: Major
>
> Splitting tests originally posted in 
> [PR|[https://github.com/apache/spark/pull/29352]] for SPARK-32531. The added 
> tests cover cases for maps and arrays of nested structs for different file 
> formats. Eg, [https://github.com/apache/spark/pull/29353] and 
> [https://github.com/apache/spark/pull/29354] add object reuse when reading 
> ORC and Avro files. However, for dynamic data structures like arrays and 
> maps, we do not know just by looking at the schema what the size of the data 
> structure will be so it has to be allocated when reading the data points. The 
> added tests provide coverage so that objects are not accidentally reused when 
> encountering maps and arrays.
> AFAIK this is not covered by existing tests.






[jira] [Created] (SPARK-32732) Convert schema only once in OrcSerializer

2020-08-28 Thread Muhammad Samir Khan (Jira)
Muhammad Samir Khan created SPARK-32732:
---

 Summary: Convert schema only once in OrcSerializer
 Key: SPARK-32732
 URL: https://issues.apache.org/jira/browse/SPARK-32732
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Muhammad Samir Khan


This is to track a TODO item in [Pull 
Request|https://github.com/apache/spark/pull/29352] for SPARK-32532.






[jira] [Commented] (SPARK-32385) Publish a "bill of materials" (BOM) descriptor for Spark with correct versions of various dependencies

2020-08-28 Thread Vladimir Matveev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186846#comment-17186846
 ] 

Vladimir Matveev commented on SPARK-32385:
--

Sorry for the delayed response!

> This requires us fixing every version of every transitive dependency. How 
> does that get updated as the transitive dependency graph changes? this 
> exchanges one problem for another I think. That is, we are definitely not 
> trying to fix dependency versions except where necessary.

I don't think this is right — you don't have to fix more than just direct 
dependencies, like you already do. It's pretty much the same thing as defining 
the version numbers like 
[here|https://github.com/apache/spark/blob/a0bd273bb04d9a5684e291ec44617972dcd4accd/pom.xml#L121-L197]
 and then declaring specific dependencies with the versions below. It's just 
that it is done slightly differently, by using Maven's `dependencyManagement` 
mechanism and POM inheritance (for Maven; for Gradle, e.g., it would be the 
"platform" mechanism).

> Gradle isn't something that this project supports, but, wouldn't this be a 
> much bigger general problem if its resolution rules are different from Maven? 
> that is, surely gradle can emulate Maven if necessary.

I don't think Gradle can emulate Maven, and I personally don't think it should, 
because Maven's strategy for conflict resolution is quite unconventional, and 
is not used by most of the dependency management tools, not just in the Java 
world. Also, I naturally don't have statistics, so this is just my speculation, 
but it seems likely to me that most of the downstream projects which use Spark 
don't actually use Maven for dependency management, especially given its Scala 
heritage. Therefore, they can't take advantage of Maven's dependency resolution 
algorithm and the current Spark's POM configuration.

Also I'd like to point out again that this whole BOM mechanism is something 
which _Maven_ supports natively, it's not a Gradle extension or something. The 
BOM concept originated in Maven, and it is declared using Maven's 
{{dependencyManagement}} block, which is a part of POM syntax. Hopefully this 
would reduce some of the concerns about it.
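
For illustration only, this is roughly what downstream non-Maven builds do by hand today (an sbt sketch; the version numbers are placeholders, not Spark's actual pins), and what a published BOM/platform descriptor would let the build tool derive automatically:

{code:java}
// build.sbt (sketch): manually forcing transitive versions to match what Spark
// expects. A published Spark BOM would make this list unnecessary to maintain
// by hand. Versions below are placeholders.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.0" % Provided

dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core"    % "jackson-databind"     % "2.10.0",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.10.0",
  "com.google.guava"              % "guava"                % "14.0.1"
)
{code}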

> Publish a "bill of materials" (BOM) descriptor for Spark with correct 
> versions of various dependencies
> --
>
> Key: SPARK-32385
> URL: https://issues.apache.org/jira/browse/SPARK-32385
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Vladimir Matveev
>Priority: Major
>
> Spark has a lot of dependencies, many of them very common (e.g. Guava, 
> Jackson). Also, versions of these dependencies are not updated as frequently 
> as they are released upstream, which is totally understandable and natural, 
> but which also means that often Spark has a dependency on a lower version of 
> a library, which is incompatible with a higher, more recent version of the 
> same library. This incompatibility can manifest in different ways, e.g. as 
> classpath errors or runtime check errors (like with Jackson), in certain 
> cases.
>  
> Spark does attempt to "fix" versions of its dependencies by declaring them 
> explicitly in its {{pom.xml}} file. However, this approach, being somewhat 
> workable if the Spark-using project itself uses Maven, breaks down if another 
> build system is used, like Gradle. The reason is that Maven uses an 
> unconventional "nearest first" version conflict resolution strategy, while 
> many other tools like Gradle use the "highest first" strategy which resolves 
> the highest possible version number inside the entire graph of dependencies. 
> This means that other dependencies of the project can pull a higher version 
> of some dependency, which is incompatible with Spark.
>  
> One example would be an explicit or a transitive dependency on a higher 
> version of Jackson in the project. Spark itself depends on several modules of 
> Jackson; if only one of them gets a higher version, and others remain on the 
> lower version, this will result in runtime exceptions due to an internal 
> version check in Jackson.
>  
> A widely used solution for this kind of version issues is publishing of a 
> "bill of materials" descriptor (see here: 
> [https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html]
>  and here: 
> [https://docs.gradle.org/current/userguide/platforms.html#sub:bom_import]). 
> This descriptor would contain all versions of all dependencies of Spark; then 
> downstream projects will be able to use their build system's support for BOMs 
> to enforce version constraints required for Spark to function correctly.
>  
> One example of successful implementation of the BOM-based approach is Spring: 
> [https://w

[jira] [Comment Edited] (SPARK-32385) Publish a "bill of materials" (BOM) descriptor for Spark with correct versions of various dependencies

2020-08-28 Thread Vladimir Matveev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186846#comment-17186846
 ] 

Vladimir Matveev edited comment on SPARK-32385 at 8/28/20, 11:37 PM:
-

[~srowen] Sorry for the delayed response!

> This requires us fixing every version of every transitive dependency. How 
> does that get updated as the transitive dependency graph changes? this 
> exchanges one problem for another I think. That is, we are definitely not 
> trying to fix dependency versions except where necessary.

I don't think this is right — you don't have to fix more than just direct 
dependencies, like you already do. It's pretty much the same thing as defining 
the version numbers like 
[here|https://github.com/apache/spark/blob/a0bd273bb04d9a5684e291ec44617972dcd4accd/pom.xml#L121-L197]
 and then declaring specific dependencies with the versions below. It's just 
that it is done slightly differently, by using Maven's {{dependencyManagement}} 
mechanism and POM inheritance (for Maven; for Gradle, e.g., it would be the 
"platform" mechanism).

> Gradle isn't something that this project supports, but, wouldn't this be a 
> much bigger general problem if its resolution rules are different from Maven? 
> that is, surely gradle can emulate Maven if necessary.

I don't think Gradle can emulate Maven, and I personally don't think it should, 
because Maven's strategy for conflict resolution is quite unconventional, and 
is not used by most of the dependency management tools, not just in the Java 
world. Also, I naturally don't have statistics, so this is just my speculation, 
but it seems likely to me that most of the downstream projects which use Spark 
don't actually use Maven for dependency management, especially given its Scala 
heritage. Therefore, they can't take advantage of Maven's dependency resolution 
algorithm and the current Spark's POM configuration.

Also I'd like to point out again that this whole BOM mechanism is something 
which _Maven_ supports natively, it's not a Gradle extension or something. The 
BOM concept originated in Maven, and it is declared using Maven's 
{{dependencyManagement}} block, which is a part of POM syntax. Hopefully this 
would reduce some of the concerns about it.


was (Author: netvl):
Sorry for the delayed response!

> This requires us fixing every version of every transitive dependency. How 
> does that get updated as the transitive dependency graph changes? this 
> exchanges one problem for another I think. That is, we are definitely not 
> trying to fix dependency versions except where necessary.

I don't think this is right — you don't have to fix more than just direct 
dependencies, like you already do. It's pretty much the same thing as defining 
the version numbers like 
[here|https://github.com/apache/spark/blob/a0bd273bb04d9a5684e291ec44617972dcd4accd/pom.xml#L121-L197]
 and then declaring specific dependencies with the versions below. It's just 
that it is done slightly differently, by using Maven's `dependencyManagement` 
mechanism and POM inheritance (for Maven; for Gradle, e.g., it would be the 
"platform" mechanism).

> Gradle isn't something that this project supports, but, wouldn't this be a 
> much bigger general problem if its resolution rules are different from Maven? 
> that is, surely gradle can emulate Maven if necessary.

I don't think Gradle can emulate Maven, and I personally don't think it should, 
because Maven's strategy for conflict resolution is quite unconventional, and 
is not used by most of the dependency management tools, not just in the Java 
world. Also, I naturally don't have statistics, so this is just my speculation, 
but it seems likely to me that most of the downstream projects which use Spark 
don't actually use Maven for dependency management, especially given its Scala 
heritage. Therefore, they can't take advantage of Maven's dependency resolution 
algorithm and the current Spark's POM configuration.

Also I'd like to point out again that this whole BOM mechanism is something 
which _Maven_ supports natively, it's not a Gradle extension or something. The 
BOM concept originated in Maven, and it is declared using Maven's 
{{dependencyManagement}} block, which is a part of POM syntax. Hopefully this 
would reduce some of the concerns about it.

> Publish a "bill of materials" (BOM) descriptor for Spark with correct 
> versions of various dependencies
> --
>
> Key: SPARK-32385
> URL: https://issues.apache.org/jira/browse/SPARK-32385
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Vladimir Matveev
>Priority: Major
>
> Spark has a lot of dependencies, many of them very common (e.g. Guava, 
> Jackson). Also, versions of these dependencies are not updated as freq

[jira] [Commented] (SPARK-32385) Publish a "bill of materials" (BOM) descriptor for Spark with correct versions of various dependencies

2020-08-28 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186850#comment-17186850
 ] 

Sean R. Owen commented on SPARK-32385:
--

OK, so it's just a different theory of organizing the Maven dependencies? That 
could be OK to clean up, but I think we'd have to see a prototype of some of 
the change to understand what it's going to mean. I get it: BOMs are just a 
design pattern in Maven, not some new tool or thing.

I am still not quite sure what this gains if it doesn't change dependency 
resolution. Is it just that you declare one artifact POM to depend on that 
declares a bunch of dependent versions, so people don't go depending on 
different versions? For Spark artifacts? I mean people can already do that by 
setting some spark.version property in their build. If it doesn't change 
transitive dependency handling, what does it do - or does it?

I have no opinion on whether closest-first or latest-first resolution is more 
sound. I think Maven is still probably more widely used but don't particularly 
care. What is important is: if we change the build and it changes Spark's 
transitive dependencies for downstream users, that could be a breaking change. 
Or: anything we can do to make the dependency resolution consistent across SBT, 
Gradle etc is a win for sure.

> Publish a "bill of materials" (BOM) descriptor for Spark with correct 
> versions of various dependencies
> --
>
> Key: SPARK-32385
> URL: https://issues.apache.org/jira/browse/SPARK-32385
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Vladimir Matveev
>Priority: Major
>
> Spark has a lot of dependencies, many of them very common (e.g. Guava, 
> Jackson). Also, versions of these dependencies are not updated as frequently 
> as they are released upstream, which is totally understandable and natural, 
> but which also means that often Spark has a dependency on a lower version of 
> a library, which is incompatible with a higher, more recent version of the 
> same library. This incompatibility can manifest in different ways, e.g. as 
> classpath errors or runtime check errors (like with Jackson), in certain 
> cases.
>  
> Spark does attempt to "fix" versions of its dependencies by declaring them 
> explicitly in its {{pom.xml}} file. However, this approach, being somewhat 
> workable if the Spark-using project itself uses Maven, breaks down if another 
> build system is used, like Gradle. The reason is that Maven uses an 
> unconventional "nearest first" version conflict resolution strategy, while 
> many other tools like Gradle use the "highest first" strategy which resolves 
> the highest possible version number inside the entire graph of dependencies. 
> This means that other dependencies of the project can pull a higher version 
> of some dependency, which is incompatible with Spark.
>  
> One example would be an explicit or a transitive dependency on a higher 
> version of Jackson in the project. Spark itself depends on several modules of 
> Jackson; if only one of them gets a higher version, and others remain on the 
> lower version, this will result in runtime exceptions due to an internal 
> version check in Jackson.
>  
> A widely used solution for this kind of version issues is publishing of a 
> "bill of materials" descriptor (see here: 
> [https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html]
>  and here: 
> [https://docs.gradle.org/current/userguide/platforms.html#sub:bom_import]). 
> This descriptor would contain all versions of all dependencies of Spark; then 
> downstream projects will be able to use their build system's support for BOMs 
> to enforce version constraints required for Spark to function correctly.
>  
> One example of successful implementation of the BOM-based approach is Spring: 
> [https://www.baeldung.com/spring-maven-bom#spring-bom]. For different Spring 
> projects, e.g. Spring Boot, there are BOM descriptors published which can be 
> used in downstream projects to fix the versions of Spring components and 
> their dependencies, significantly reducing confusion around proper version 
> numbers.






[jira] [Updated] (SPARK-19256) Hive bucketing write support

2020-08-28 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-19256:
-
Affects Version/s: 3.1.0

> Hive bucketing write support
> 
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0
>Reporter: Tejas Patil
>Priority: Minor
>
> Update (2020 by Cheng Su):
> We use this JIRA to track progress on Hive bucketing write support in Spark. 
> The goal is for Spark to write Hive bucketed tables in a way that is 
> compatible with other compute engines (Hive and Presto).
>  
> Current status of Hive bucketed tables in Spark:
> No support for reading Hive bucketed tables: a bucketed table is read as a 
> non-bucketed table.
> Wrong behavior when writing Hive ORC and Parquet bucketed tables: an 
> ORC/Parquet bucketed table is written as a non-bucketed table (code path: 
> InsertIntoHadoopFsRelationCommand -> FileFormatWriter).
> Writing Hive non-ORC/Parquet bucketed tables is not allowed: an exception is 
> thrown by default (code path: InsertIntoHiveTable); the exception can be 
> disabled by setting `hive.enforce.bucketing`=false and 
> `hive.enforce.sorting`=false, in which case the table is written as a 
> non-bucketed table.
>  
> Current status for Hive bucketed table in Hive:
> Hive 3.0.0 and after: support writing bucketed table with Hive murmur3hash 
> (https://issues.apache.org/jira/browse/HIVE-18910).
> Hive 1.x.y and 2.x.y: support writing bucketed table with Hive hivehash.
> Hive on Tez: support zero and multiple files per bucket 
> (https://issues.apache.org/jira/browse/HIVE-14014). And more code pointer on 
> read path - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212]
>  .
>  
> Current status for Hive bucketed table in Presto (take presto-sql here):
> Support writing bucketed table with Hive murmur3hash and hivehash 
> ([https://github.com/prestosql/presto/pull/1697]).
> Support zero and multiple files per bucket 
> ([https://github.com/prestosql/presto/pull/822]).
>  
> TL;DR: the goal is Hive bucketed table compatibility across Spark, Presto and 
> Hive. With this JIRA, we need to add support for writing Hive bucketed tables 
> with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and 
> 2.x.y).
>  
> Allowing Spark to efficiently read Hive bucketed tables needs a more radical 
> change, so we decided to wait until Data Source V2 supports bucketing and do 
> the read path on Data Source V2. The read path is not covered by this JIRA.
>  
> Original description (2017 by Tejas Patil):
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing]
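> For illustration, a rough sketch (Scala; table and column names made up) of 
> the current write-path behavior described above for a non-ORC/Parquet Hive 
> bucketed table, including the configs mentioned above that disable the 
> exception:
> {code:java}
> import org.apache.spark.sql.SparkSession
> 
> // Assumes a deployment with Hive support available.
> val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
> 
> spark.sql(
>   """CREATE TABLE t_bucketed (key INT, value STRING)
>     |CLUSTERED BY (key) SORTED BY (key) INTO 8 BUCKETS
>     |STORED AS TEXTFILE""".stripMargin)
> 
> // By default this insert throws, because Spark cannot yet write
> // Hive-compatible bucketed output (code path: InsertIntoHiveTable):
> // spark.sql("INSERT OVERWRITE TABLE t_bucketed SELECT CAST(id AS INT), CAST(id AS STRING) FROM range(100)")
> 
> // Disabling the checks lets the insert proceed, but the data is then written
> // as if the table were not bucketed (not compatible with Hive/Presto bucketing).
> spark.conf.set("hive.enforce.bucketing", "false")
> spark.conf.set("hive.enforce.sorting", "false")
> spark.sql("INSERT OVERWRITE TABLE t_bucketed SELECT CAST(id AS INT), CAST(id AS STRING) FROM range(100)")
> {code}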






[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property

2020-08-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186853#comment-17186853
 ] 

Apache Spark commented on SPARK-32693:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29575

> Compare two dataframes with same schema except nullable property
> 
>
> Key: SPARK-32693
> URL: https://issues.apache.org/jira/browse/SPARK-32693
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.0
>Reporter: david bernuau
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.1.0
>
>
> My aim is to compare two dataframes with very close schemas: same number of 
> fields, with the same names, types and metadata. The only difference comes 
> from the fact that a given field might be nullable in one dataframe and not 
> in the other.
> Here is the code that I used:
> {code:java}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types._
> import java.sql.Timestamp
> import scala.collection.JavaConverters._
> 
> val session = SparkSession.builder().getOrCreate()
> 
> case class A(g: Timestamp, h: Option[Timestamp], i: Int)
> case class B(e: Int, f: Seq[A])
> case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int])
> case class D(e: Option[Int], f: Seq[C])
> 
> // Flat schemas: non-nullable fields in schema1, nullable fields in schema2.
> val schema1 = StructType(Array(StructField("a", IntegerType, false),
>   StructField("b", IntegerType, false), StructField("c", IntegerType, false)))
> val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2))
> val df1 = session.createDataFrame(rowSeq1.asJava, schema1)
> df1.printSchema()
> 
> val schema2 = StructType(Array(StructField("a", IntegerType),
>   StructField("b", IntegerType), StructField("c", IntegerType)))
> val rowSeq2: List[Row] = List(Row(10, 1, 1))
> val df2 = session.createDataFrame(rowSeq2.asJava, schema2)
> df2.printSchema()
> 
> println(s"Number of records for first case: ${df1.except(df2).count()}")
> 
> // Nested schemas: non-nullable fields in schema3, nullable fields in schema4.
> val schema3 = StructType(Array(
>   StructField("a", IntegerType, false),
>   StructField("b", IntegerType, false),
>   StructField("c", IntegerType, false),
>   StructField("d", ArrayType(StructType(Array(
>     StructField("e", IntegerType, false),
>     StructField("f", ArrayType(StructType(Array(
>       StructField("g", TimestampType),
>       StructField("h", TimestampType),
>       StructField("i", IntegerType, false)))))))))))
> val date1 = new Timestamp(1597589638L)
> val date2 = new Timestamp(1597599638L)
> val rowSeq3: List[Row] = List(
>   Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, 1))))),
>   Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2))))))
> val df3 = session.createDataFrame(rowSeq3.asJava, schema3)
> df3.printSchema()
> 
> val schema4 = StructType(Array(
>   StructField("a", IntegerType),
>   StructField("b", IntegerType),
>   StructField("c", IntegerType),
>   StructField("d", ArrayType(StructType(Array(
>     StructField("e", IntegerType),
>     StructField("f", ArrayType(StructType(Array(
>       StructField("g", TimestampType),
>       StructField("h", TimestampType),
>       StructField("i", IntegerType)))))))))))
> val rowSeq4: List[Row] = List(
>   Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, None, Some(1)))))))
> // Use the nullable schema4 here so that df3 and df4 differ only in nullability.
> val df4 = session.createDataFrame(rowSeq4.asJava, schema4)
> df4.printSchema()
> 
> println(s"Number of records for second case: ${df3.except(df4).count()}")
> {code}
> The preceding code shows what seems to me a bug in Spark:
>  * If you consider two dataframes (df1 and df2) having exactly the same 
> schema, except that fields are not nullable in the first dataframe and are 
> nullable in the second, then df1.except(df2).count() works well.
>  * Now, if you consider two other dataframes (df3 and df4) having the same 
> schema (with fields nullable on one side and not on the other), and these two 
> dataframes contain nested fields, then the action df3.except(df4).count() 
> gives the following exception: 
> java.lang.IllegalArgumentException: requirement failed: Join keys from two 
> sides should have same types






[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property

2020-08-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186854#comment-17186854
 ] 

Apache Spark commented on SPARK-32693:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29575

> Compare two dataframes with same schema except nullable property
> 
>
> Key: SPARK-32693
> URL: https://issues.apache.org/jira/browse/SPARK-32693
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.0
>Reporter: david bernuau
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.1.0
>
>
> My aim is to compare two dataframes with very close schemas: same number of 
> fields, with the same names, types and metadata. The only difference comes 
> from the fact that a given field might be nullable in one dataframe and not 
> in the other.
> Here is the code that I used:
> {code:java}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types._
> import java.sql.Timestamp
> import scala.collection.JavaConverters._
> 
> val session = SparkSession.builder().getOrCreate()
> 
> case class A(g: Timestamp, h: Option[Timestamp], i: Int)
> case class B(e: Int, f: Seq[A])
> case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int])
> case class D(e: Option[Int], f: Seq[C])
> 
> // Flat schemas: non-nullable fields in schema1, nullable fields in schema2.
> val schema1 = StructType(Array(StructField("a", IntegerType, false),
>   StructField("b", IntegerType, false), StructField("c", IntegerType, false)))
> val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2))
> val df1 = session.createDataFrame(rowSeq1.asJava, schema1)
> df1.printSchema()
> 
> val schema2 = StructType(Array(StructField("a", IntegerType),
>   StructField("b", IntegerType), StructField("c", IntegerType)))
> val rowSeq2: List[Row] = List(Row(10, 1, 1))
> val df2 = session.createDataFrame(rowSeq2.asJava, schema2)
> df2.printSchema()
> 
> println(s"Number of records for first case: ${df1.except(df2).count()}")
> 
> // Nested schemas: non-nullable fields in schema3, nullable fields in schema4.
> val schema3 = StructType(Array(
>   StructField("a", IntegerType, false),
>   StructField("b", IntegerType, false),
>   StructField("c", IntegerType, false),
>   StructField("d", ArrayType(StructType(Array(
>     StructField("e", IntegerType, false),
>     StructField("f", ArrayType(StructType(Array(
>       StructField("g", TimestampType),
>       StructField("h", TimestampType),
>       StructField("i", IntegerType, false)))))))))))
> val date1 = new Timestamp(1597589638L)
> val date2 = new Timestamp(1597599638L)
> val rowSeq3: List[Row] = List(
>   Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, 1))))),
>   Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2))))))
> val df3 = session.createDataFrame(rowSeq3.asJava, schema3)
> df3.printSchema()
> 
> val schema4 = StructType(Array(
>   StructField("a", IntegerType),
>   StructField("b", IntegerType),
>   StructField("c", IntegerType),
>   StructField("d", ArrayType(StructType(Array(
>     StructField("e", IntegerType),
>     StructField("f", ArrayType(StructType(Array(
>       StructField("g", TimestampType),
>       StructField("h", TimestampType),
>       StructField("i", IntegerType)))))))))))
> val rowSeq4: List[Row] = List(
>   Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, None, Some(1)))))))
> // Use the nullable schema4 here so that df3 and df4 differ only in nullability.
> val df4 = session.createDataFrame(rowSeq4.asJava, schema4)
> df4.printSchema()
> 
> println(s"Number of records for second case: ${df3.except(df4).count()}")
> {code}
> The preceding code shows what seems to me a bug in Spark:
>  * If you consider two dataframes (df1 and df2) having exactly the same 
> schema, except that fields are not nullable in the first dataframe and are 
> nullable in the second, then df1.except(df2).count() works well.
>  * Now, if you consider two other dataframes (df3 and df4) having the same 
> schema (with fields nullable on one side and not on the other), and these two 
> dataframes contain nested fields, then the action df3.except(df4).count() 
> gives the following exception: 
> java.lang.IllegalArgumentException: requirement failed: Join keys from two 
> sides should have same types






[jira] [Commented] (SPARK-32385) Publish a "bill of materials" (BOM) descriptor for Spark with correct versions of various dependencies

2020-08-28 Thread Vladimir Matveev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186855#comment-17186855
 ] 

Vladimir Matveev commented on SPARK-32385:
--

> I am still not quite sure what this gains if it doesn't change dependency 
> resolution.

It does not change dependency resolution by itself, it adds an ability for the 
user, if they want to, to automatically lock onto versions explicitly declared 
by the Spark project. So yeah, this:

> Is it just that you declare one artifact POM to depend on that declares a 
> bunch of dependent versions, so people don't go depending on different 
> versions?

pretty much summarizes it; this could be expanded to say that (depending on the 
build tool) it may also enforce these dependent versions in case of conflicts.

> I mean people can already do that by setting some spark.version property in 
> their build.

They can't in general, because while it will enforce Spark's own version, 
it won't necessarily determine the versions of transitive dependencies. The 
latter will only happen when the consumer also uses Maven, and when they have a 
particular order of dependencies in their POM declaration (e.g. no newer 
Jackson version declared transitively lexically earlier than Spark).

> What is important is: if we change the build and it changes Spark's 
> transitive dependencies for downstream users, that could be a breaking change.

My understanding is that this should not happen, unless the user explicitly 
opts into using the BOM, in which case it arguably changes the situation for 
the better in most cases, because now versions are guaranteed to align with 
Spark's declarations.

> Publish a "bill of materials" (BOM) descriptor for Spark with correct 
> versions of various dependencies
> --
>
> Key: SPARK-32385
> URL: https://issues.apache.org/jira/browse/SPARK-32385
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Vladimir Matveev
>Priority: Major
>
> Spark has a lot of dependencies, many of them very common (e.g. Guava, 
> Jackson). Also, versions of these dependencies are not updated as frequently 
> as they are released upstream, which is totally understandable and natural, 
> but which also means that often Spark has a dependency on a lower version of 
> a library, which is incompatible with a higher, more recent version of the 
> same library. This incompatibility can manifest in different ways, e.g. as 
> classpath errors or runtime check errors (like with Jackson), in certain 
> cases.
>  
> Spark does attempt to "fix" versions of its dependencies by declaring them 
> explicitly in its {{pom.xml}} file. However, this approach, being somewhat 
> workable if the Spark-using project itself uses Maven, breaks down if another 
> build system is used, like Gradle. The reason is that Maven uses an 
> unconventional "nearest first" version conflict resolution strategy, while 
> many other tools like Gradle use the "highest first" strategy which resolves 
> the highest possible version number inside the entire graph of dependencies. 
> This means that other dependencies of the project can pull a higher version 
> of some dependency, which is incompatible with Spark.
>  
> One example would be an explicit or a transitive dependency on a higher 
> version of Jackson in the project. Spark itself depends on several modules of 
> Jackson; if only one of them gets a higher version, and others remain on the 
> lower version, this will result in runtime exceptions due to an internal 
> version check in Jackson.
>  
> A widely used solution for this kind of version issues is publishing of a 
> "bill of materials" descriptor (see here: 
> [https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html]
>  and here: 
> [https://docs.gradle.org/current/userguide/platforms.html#sub:bom_import]). 
> This descriptor would contain all versions of all dependencies of Spark; then 
> downstream projects will be able to use their build system's support for BOMs 
> to enforce version constraints required for Spark to function correctly.
>  
> One example of successful implementation of the BOM-based approach is Spring: 
> [https://www.baeldung.com/spring-maven-bom#spring-bom]. For different Spring 
> projects, e.g. Spring Boot, there are BOM descriptors published which can be 
> used in downstream projects to fix the versions of Spring components and 
> their dependencies, significantly reducing confusion around proper version 
> numbers.




[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property

2020-08-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186857#comment-17186857
 ] 

Apache Spark commented on SPARK-32693:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29576

> Compare two dataframes with same schema except nullable property
> 
>
> Key: SPARK-32693
> URL: https://issues.apache.org/jira/browse/SPARK-32693
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.0
>Reporter: david bernuau
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.1.0
>
>
> My aim is to compare two dataframes with very close schemas: same number of 
> fields, with the same names, types and metadata. The only difference comes 
> from the fact that a given field might be nullable in one dataframe and not 
> in the other.
> Here is the code that I used:
> {code:java}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types._
> import java.sql.Timestamp
> import scala.collection.JavaConverters._
> 
> val session = SparkSession.builder().getOrCreate()
> 
> case class A(g: Timestamp, h: Option[Timestamp], i: Int)
> case class B(e: Int, f: Seq[A])
> case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int])
> case class D(e: Option[Int], f: Seq[C])
> 
> // Flat schemas: non-nullable fields in schema1, nullable fields in schema2.
> val schema1 = StructType(Array(StructField("a", IntegerType, false),
>   StructField("b", IntegerType, false), StructField("c", IntegerType, false)))
> val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2))
> val df1 = session.createDataFrame(rowSeq1.asJava, schema1)
> df1.printSchema()
> 
> val schema2 = StructType(Array(StructField("a", IntegerType),
>   StructField("b", IntegerType), StructField("c", IntegerType)))
> val rowSeq2: List[Row] = List(Row(10, 1, 1))
> val df2 = session.createDataFrame(rowSeq2.asJava, schema2)
> df2.printSchema()
> 
> println(s"Number of records for first case: ${df1.except(df2).count()}")
> 
> // Nested schemas: non-nullable fields in schema3, nullable fields in schema4.
> val schema3 = StructType(Array(
>   StructField("a", IntegerType, false),
>   StructField("b", IntegerType, false),
>   StructField("c", IntegerType, false),
>   StructField("d", ArrayType(StructType(Array(
>     StructField("e", IntegerType, false),
>     StructField("f", ArrayType(StructType(Array(
>       StructField("g", TimestampType),
>       StructField("h", TimestampType),
>       StructField("i", IntegerType, false)))))))))))
> val date1 = new Timestamp(1597589638L)
> val date2 = new Timestamp(1597599638L)
> val rowSeq3: List[Row] = List(
>   Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, 1))))),
>   Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2))))))
> val df3 = session.createDataFrame(rowSeq3.asJava, schema3)
> df3.printSchema()
> 
> val schema4 = StructType(Array(
>   StructField("a", IntegerType),
>   StructField("b", IntegerType),
>   StructField("c", IntegerType),
>   StructField("d", ArrayType(StructType(Array(
>     StructField("e", IntegerType),
>     StructField("f", ArrayType(StructType(Array(
>       StructField("g", TimestampType),
>       StructField("h", TimestampType),
>       StructField("i", IntegerType)))))))))))
> val rowSeq4: List[Row] = List(
>   Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, None, Some(1)))))))
> // Use the nullable schema4 here so that df3 and df4 differ only in nullability.
> val df4 = session.createDataFrame(rowSeq4.asJava, schema4)
> df4.printSchema()
> 
> println(s"Number of records for second case: ${df3.except(df4).count()}")
> {code}
> The preceding code shows what seems to me a bug in Spark:
>  * If you consider two dataframes (df1 and df2) having exactly the same 
> schema, except that fields are not nullable in the first dataframe and are 
> nullable in the second, then df1.except(df2).count() works well.
>  * Now, if you consider two other dataframes (df3 and df4) having the same 
> schema (with fields nullable on one side and not on the other), and these two 
> dataframes contain nested fields, then the action df3.except(df4).count() 
> gives the following exception: 
> java.lang.IllegalArgumentException: requirement failed: Join keys from two 
> sides should have same types






[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property

2020-08-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186859#comment-17186859
 ] 

Apache Spark commented on SPARK-32693:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29576

> Compare two dataframes with same schema except nullable property
> 
>
> Key: SPARK-32693
> URL: https://issues.apache.org/jira/browse/SPARK-32693
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.0
>Reporter: david bernuau
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.1.0
>
>
> My aim is to compare two dataframes with very close schemas: same number of 
> fields, with the same names, types and metadata. The only difference comes 
> from the fact that a given field might be nullable in one dataframe and not 
> in the other.
> Here is the code that I used:
> {code:java}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types._
> import java.sql.Timestamp
> import scala.collection.JavaConverters._
> 
> val session = SparkSession.builder().getOrCreate()
> 
> case class A(g: Timestamp, h: Option[Timestamp], i: Int)
> case class B(e: Int, f: Seq[A])
> case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int])
> case class D(e: Option[Int], f: Seq[C])
> 
> // Flat schemas: non-nullable fields in schema1, nullable fields in schema2.
> val schema1 = StructType(Array(StructField("a", IntegerType, false),
>   StructField("b", IntegerType, false), StructField("c", IntegerType, false)))
> val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2))
> val df1 = session.createDataFrame(rowSeq1.asJava, schema1)
> df1.printSchema()
> 
> val schema2 = StructType(Array(StructField("a", IntegerType),
>   StructField("b", IntegerType), StructField("c", IntegerType)))
> val rowSeq2: List[Row] = List(Row(10, 1, 1))
> val df2 = session.createDataFrame(rowSeq2.asJava, schema2)
> df2.printSchema()
> 
> println(s"Number of records for first case: ${df1.except(df2).count()}")
> 
> // Nested schemas: non-nullable fields in schema3, nullable fields in schema4.
> val schema3 = StructType(Array(
>   StructField("a", IntegerType, false),
>   StructField("b", IntegerType, false),
>   StructField("c", IntegerType, false),
>   StructField("d", ArrayType(StructType(Array(
>     StructField("e", IntegerType, false),
>     StructField("f", ArrayType(StructType(Array(
>       StructField("g", TimestampType),
>       StructField("h", TimestampType),
>       StructField("i", IntegerType, false)))))))))))
> val date1 = new Timestamp(1597589638L)
> val date2 = new Timestamp(1597599638L)
> val rowSeq3: List[Row] = List(
>   Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, 1))))),
>   Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2))))))
> val df3 = session.createDataFrame(rowSeq3.asJava, schema3)
> df3.printSchema()
> 
> val schema4 = StructType(Array(
>   StructField("a", IntegerType),
>   StructField("b", IntegerType),
>   StructField("c", IntegerType),
>   StructField("d", ArrayType(StructType(Array(
>     StructField("e", IntegerType),
>     StructField("f", ArrayType(StructType(Array(
>       StructField("g", TimestampType),
>       StructField("h", TimestampType),
>       StructField("i", IntegerType)))))))))))
> val rowSeq4: List[Row] = List(
>   Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, None, Some(1)))))))
> // Use the nullable schema4 here so that df3 and df4 differ only in nullability.
> val df4 = session.createDataFrame(rowSeq4.asJava, schema4)
> df4.printSchema()
> 
> println(s"Number of records for second case: ${df3.except(df4).count()}")
> {code}
> The preceding code shows what seems to me a bug in Spark:
>  * If you consider two dataframes (df1 and df2) having exactly the same 
> schema, except that fields are not nullable in the first dataframe and are 
> nullable in the second, then df1.except(df2).count() works well.
>  * Now, if you consider two other dataframes (df3 and df4) having the same 
> schema (with fields nullable on one side and not on the other), and these two 
> dataframes contain nested fields, then the action df3.except(df4).count() 
> gives the following exception: 
> java.lang.IllegalArgumentException: requirement failed: Join keys from two 
> sides should have same types


