[jira] [Created] (SPARK-32724) java.io.IOException: Stream is corrupted when tried to inner join 4 huge tables. Currently using pyspark version 2.4.0-cdh6.3.1
Kannan created SPARK-32724: -- Summary: java.io.IOException: Stream is corrupted when tried to inner join 4 huge tables. Currently using pyspark version 2.4.0-cdh6.3.1 Key: SPARK-32724 URL: https://issues.apache.org/jira/browse/SPARK-32724 Project: Spark Issue Type: Question Components: PySpark, Spark Core Affects Versions: 2.4.0 Reporter: Kannan When i try to join the 4 tables with 1M data i am getting below error. Py4JJavaError: An error occurred while calling o453.count. : org.apache.spark.SparkException: Job aborted due to stage failure: Aborting TaskSet 27.0 because task 9 (partition 9) cannot run anywhere due to node and executor blacklist. Most recent failure: Lost task 9.1 in stage 27.0 (TID 267, si-159l.de.des.com, executor 17): java.io.IOException: Stream is corrupted at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Blacklisting behavior can be configured via spark.blacklist.*. at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1877) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:929) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:929) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:929) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2111) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2060) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2049) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:740) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2081) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2102) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2121) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2146) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) at org.apache.spark.rdd.RDD.collect(RDD.scala:944) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299) at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2830) at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2829) at 
org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363) at org.apache.spark.sql.Dataset.count(Dataset.scala:2829) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) -- This message was sent by Atlassian Jira (v8.3.4#803005)
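A hedged mitigation sketch for the failure above (none of these settings is a confirmed fix for this ticket; they are configuration knobs commonly experimented with when shuffle/spill streams appear corrupted):

{code:python}
from pyspark.sql import SparkSession

# Workaround candidates only, not a confirmed fix for SPARK-32724.
spark = (
    SparkSession.builder
    .appName("spark-32724-workaround-sketch")
    # Use a different codec for shuffle/spill blocks (lz4 is the 2.4 default).
    .config("spark.io.compression.codec", "snappy")
    # Disable the ReadAheadInputStream wrapper seen in the stack trace (internal flag).
    .config("spark.unsafe.sorter.spill.read.ahead.enabled", "false")
    .getOrCreate()
)
{code}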
[jira] [Created] (SPARK-32725) Getting jar vulnerability for kryo-shaded for 4.0 version
Wasi created SPARK-32725: Summary: Getting jar vulnerability for kryo-shaded for 4.0 version Key: SPARK-32725 URL: https://issues.apache.org/jira/browse/SPARK-32725 Project: Spark Issue Type: Dependency upgrade Components: Spark Core Affects Versions: 2.4.0 Reporter: Wasi We are getting a jar vulnerability report for kryo-shaded 4.0. The report suggests upgrading to 5.x, but 5.x is not backward compatible. Doing the upgrade might require changes to org.apache.spark.serializer.KryoSerializer and the corresponding serializer instances. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
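For context, a minimal sketch of the user-facing Kryo settings such an upgrade would touch (the shaded kryo jar itself is bundled inside Spark and is not controlled from application code; this is not a fix for the vulnerability):

{code:python}
from pyspark.sql import SparkSession

# Where KryoSerializer is wired in from the application side in Spark 2.4.
spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Optional: require explicit registration to catch serialization surprises early.
    .config("spark.kryo.registrationRequired", "false")
    .getOrCreate()
)
{code}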
[jira] [Commented] (SPARK-32723) Security Vulnerability due to JQuery version in Spark Master/Worker UI
[ https://issues.apache.org/jira/browse/SPARK-32723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186403#comment-17186403 ] Rohit Mishra commented on SPARK-32723: -- [~ashish23aks], Please refrain from marking the target version. This is reserved for committers. > Security Vulnerability due to JQuery version in Spark Master/Worker UI > -- > > Key: SPARK-32723 > URL: https://issues.apache.org/jira/browse/SPARK-32723 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Ashish Kumar Singh >Priority: Major > Labels: Security > > Spark 3.0, Spark 2.4.x uses JQuery version < 3.5 which has known security > vulnerability in Spark Master UI and Spark Worker UI. > Can we please upgrade JQuery to 3.5 and above ? > [https://www.tenable.com/plugins/nessus/136929] > ??According to the self-reported version in the script, the version of JQuery > hosted on the remote web server is greater than or equal to 1.2 and prior to > 3.5.0. It is, therefore, affected by multiple cross site scripting > vulnerabilities.?? > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32726) Filter by column alias in where clause
Yuming Wang created SPARK-32726: --- Summary: Filter by column alias in where clause Key: SPARK-32726 URL: https://issues.apache.org/jira/browse/SPARK-32726 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang {{group by}} and {{order by}} clauses support it, but {{where}} does not support it: {noformat} spark-sql> select id + 1 as new_id from range(5) group by new_id order by new_id; 1 2 3 4 5 spark-sql> select id + 1 as new_id from range(5) where new_id > 2; Error in query: cannot resolve '`new_id`' given input columns: [id]; line 1 pos 44; 'Project [('id + 1) AS new_id#5] +- 'Filter ('new_id > 2) +- Range (0, 5, step=1, splits=None) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
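Until such an improvement lands, the usual workaround is to resolve the alias in an inline subquery or to repeat the expression in the filter; a minimal PySpark sketch:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Workaround 1: make the alias visible to WHERE via an inline subquery.
spark.sql("""
    SELECT new_id
    FROM (SELECT id + 1 AS new_id FROM range(5)) t
    WHERE new_id > 2
""").show()

# Workaround 2: repeat the expression in the filter.
spark.sql("SELECT id + 1 AS new_id FROM range(5) WHERE id + 1 > 2").show()
{code}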
[jira] [Updated] (SPARK-32723) Security Vulnerability due to JQuery version in Spark Master/Worker UI
[ https://issues.apache.org/jira/browse/SPARK-32723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohit Mishra updated SPARK-32723: - Target Version/s: (was: 3.1.0) > Security Vulnerability due to JQuery version in Spark Master/Worker UI > -- > > Key: SPARK-32723 > URL: https://issues.apache.org/jira/browse/SPARK-32723 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Ashish Kumar Singh >Priority: Major > Labels: Security > > Spark 3.0, Spark 2.4.x uses JQuery version < 3.5 which has known security > vulnerability in Spark Master UI and Spark Worker UI. > Can we please upgrade JQuery to 3.5 and above ? > [https://www.tenable.com/plugins/nessus/136929] > ??According to the self-reported version in the script, the version of JQuery > hosted on the remote web server is greater than or equal to 1.2 and prior to > 3.5.0. It is, therefore, affected by multiple cross site scripting > vulnerabilities.?? > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32727) replace CaseWhen with If when there is only one case
Tanel Kiis created SPARK-32727: -- Summary: replace CaseWhen with If when there is only one case Key: SPARK-32727 URL: https://issues.apache.org/jira/browse/SPARK-32727 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Tanel Kiis -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
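For illustration, the rewrite this targets is the equivalence between a single-branch CASE WHEN and if(); a quick PySpark check (column names here are only for the demo):

{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)

# A CASE WHEN with exactly one branch ...
case_when = F.when(F.col("id") > 2, "big").otherwise("small")
# ... returns the same values as the equivalent if() expression, which is the
# simpler form the proposed optimizer rule would produce.
if_expr = F.expr("if(id > 2, 'big', 'small')")

df.select(case_when.alias("case_when"), if_expr.alias("if_expr")).show()
{code}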
[jira] [Assigned] (SPARK-32727) replace CaseWhen with If when there is only one case
[ https://issues.apache.org/jira/browse/SPARK-32727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32727: Assignee: (was: Apache Spark) > replace CaseWhen with If when there is only one case > > > Key: SPARK-32727 > URL: https://issues.apache.org/jira/browse/SPARK-32727 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32727) replace CaseWhen with If when there is only one case
[ https://issues.apache.org/jira/browse/SPARK-32727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186446#comment-17186446 ] Apache Spark commented on SPARK-32727: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/29570 > replace CaseWhen with If when there is only one case > > > Key: SPARK-32727 > URL: https://issues.apache.org/jira/browse/SPARK-32727 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32726) Filter by column alias in where clause
[ https://issues.apache.org/jira/browse/SPARK-32726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-32726: Description: {{group by}} and {{order by}} clause support it. but {{where}} does not support it: {noformat} spark-sql> select id + 1 as new_id from range(5) group by new_id order by new_id; 1 2 3 4 5 spark-sql> select id + 1 as new_id from range(5) where new_id > 2; Error in query: cannot resolve '`new_id`' given input columns: [id]; line 1 pos 44; 'Project [('id + 1) AS new_id#5] +- 'Filter ('new_id > 2) +- Range (0, 5, step=1, splits=None spark-sql> select id + 1 as new_id, new_id + 1 as new_new_id from range(5); Error in query: cannot resolve '`new_id`' given input columns: [id]; line 1 pos 25; 'Project [(id#12L + cast(1 as bigint)) AS new_id#10L, ('new_id + 1) AS new_new_id#11] +- Range (0, 5, step=1, splits=None) {noformat} was: {{group by}} and {{order by}} clause support it. but {{where}} does not support it: {noformat} spark-sql> select id + 1 as new_id from range(5) group by new_id order by new_id; 1 2 3 4 5 spark-sql> select id + 1 as new_id from range(5) where new_id > 2; Error in query: cannot resolve '`new_id`' given input columns: [id]; line 1 pos 44; 'Project [('id + 1) AS new_id#5] +- 'Filter ('new_id > 2) +- Range (0, 5, step=1, splits=None {noformat} > Filter by column alias in where clause > -- > > Key: SPARK-32726 > URL: https://issues.apache.org/jira/browse/SPARK-32726 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > {{group by}} and {{order by}} clause support it. but {{where}} does not > support it: > {noformat} > spark-sql> select id + 1 as new_id from range(5) group by new_id order by > new_id; > 1 > 2 > 3 > 4 > 5 > spark-sql> select id + 1 as new_id from range(5) where new_id > 2; > Error in query: cannot resolve '`new_id`' given input columns: [id]; line 1 > pos 44; > 'Project [('id + 1) AS new_id#5] > +- 'Filter ('new_id > 2) >+- Range (0, 5, step=1, splits=None > spark-sql> select id + 1 as new_id, new_id + 1 as new_new_id from range(5); > Error in query: cannot resolve '`new_id`' given input columns: [id]; line 1 > pos 25; > 'Project [(id#12L + cast(1 as bigint)) AS new_id#10L, ('new_id + 1) AS > new_new_id#11] > +- Range (0, 5, step=1, splits=None) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32727) replace CaseWhen with If when there is only one case
[ https://issues.apache.org/jira/browse/SPARK-32727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32727: Assignee: Apache Spark > replace CaseWhen with If when there is only one case > > > Key: SPARK-32727 > URL: https://issues.apache.org/jira/browse/SPARK-32727 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32717) Add a AQEOptimizer for AdaptiveSparkPlanExec
[ https://issues.apache.org/jira/browse/SPARK-32717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32717. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29559 [https://github.com/apache/spark/pull/29559] > Add a AQEOptimizer for AdaptiveSparkPlanExec > > > Key: SPARK-32717 > URL: https://issues.apache.org/jira/browse/SPARK-32717 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.1.0 > > > Currently, AdaptiveSparkPlanExec has implemented an anonymous RuleExecutor to > apply the AQE specific rules on the plan. However, the anonymous class > usually could be inconvenient to maintain and extend for the long term. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32717) Add a AQEOptimizer for AdaptiveSparkPlanExec
[ https://issues.apache.org/jira/browse/SPARK-32717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32717: Assignee: wuyi > Add a AQEOptimizer for AdaptiveSparkPlanExec > > > Key: SPARK-32717 > URL: https://issues.apache.org/jira/browse/SPARK-32717 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > Currently, AdaptiveSparkPlanExec has implemented an anonymous RuleExecutor to > apply the AQE specific rules on the plan. However, the anonymous class > usually could be inconvenient to maintain and extend for the long term. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property
[ https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186517#comment-17186517 ] david bernuau commented on SPARK-32693: --- thanks > Compare two dataframes with same schema except nullable property > > > Key: SPARK-32693 > URL: https://issues.apache.org/jira/browse/SPARK-32693 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.0 >Reporter: david bernuau >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 3.1.0 > > > My aim is to compare two dataframes with very close schemas : same number of > fields, with the same names, types and metadata. The only difference comes > from the fact that a given field might be nullable in one dataframe and not > in the other. > Here is the code that i used : > {code:java} > val session = SparkSession.builder().getOrCreate() > import org.apache.spark.sql.Row > import java.sql.Timestamp > import scala.collection.JavaConverters._ > case class A(g: Timestamp, h: Option[Timestamp], i: Int) > case class B(e: Int, f: Seq[A]) > case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int]) > case class D(e: Option[Int], f: Seq[C]) > val schema1 = StructType(Array(StructField("a", IntegerType, false), > StructField("b", IntegerType, false), StructField("c", IntegerType, false))) > val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2)) > val df1 = session.createDataFrame(rowSeq1.asJava, schema1) > df1.printSchema() > val schema2 = StructType(Array(StructField("a", IntegerType), > StructField("b", IntegerType), StructField("c", IntegerType))) > val rowSeq2: List[Row] = List(Row(10, 1, 1)) > val df2 = session.createDataFrame(rowSeq2.asJava, schema2) > df2.printSchema() > println(s"Number of records for first case : ${df1.except(df2).count()}") > val schema3 = StructType( > Array( > StructField("a", IntegerType, false), > StructField("b", IntegerType, false), > StructField("c", IntegerType, false), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType, > false), StructField("f", ArrayType(StructType(Array(StructField("g", > TimestampType), StructField("h", TimestampType), StructField("i", > IntegerType, false) > > > ) > ) > val date1 = new Timestamp(1597589638L) > val date2 = new Timestamp(1597599638L) > val rowSeq3: List[Row] = List(Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, > 1), Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2)) > val df3 = session.createDataFrame(rowSeq3.asJava, schema3) > df3.printSchema() > val schema4 = StructType( > Array( > StructField("a", IntegerType), > StructField("b", IntegerType), > StructField("b", IntegerType), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType), > StructField("f", ArrayType(StructType(Array(StructField("g", TimestampType), > StructField("h", TimestampType), StructField("i", IntegerType) > > > ) > ) > val rowSeq4: List[Row] = List(Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, > None, Some(1))) > val df4 = session.createDataFrame(rowSeq4.asJava, schema3) > df4.printSchema() > println(s"Number of records for second case : ${df3.except(df4).count()}") > {code} > The preceeding code shows what seems to me a bug in Spark : > * If you consider two dataframes (df1 and df2) having exactly the same > schema, except fields are not nullable for the first dataframe and are > nullable for the second. Then, doing df1.except(df2).count() works well. 
> * Now, if you consider two other dataframes (df3 and df4) having the same > schema (with fields nullable on one side and not on the other). If these two > dataframes contain nested fields, then, this time, the action > df3.except(df4).count gives the following exception : > java.lang.IllegalArgumentException: requirement failed: Join keys from two > sides should have same types -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
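A minimal PySpark sketch of the reported pattern (schemas and values here are illustrative, not taken from the report): two frames whose nested schemas differ only in nullability, compared with an except-style operation.

{code:python}
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Same nested layout, differing only in the nullability of the fields.
inner_not_null = StructType([StructField("e", IntegerType(), False)])
inner_nullable = StructType([StructField("e", IntegerType(), True)])
schema1 = StructType([StructField("a", IntegerType(), False),
                      StructField("d", ArrayType(inner_not_null), False)])
schema2 = StructType([StructField("a", IntegerType(), True),
                      StructField("d", ArrayType(inner_nullable), True)])

df1 = spark.createDataFrame([(1, [Row(e=1)])], schema1)
df2 = spark.createDataFrame([(1, [Row(e=1)])], schema2)

# subtract() is the PySpark counterpart of Dataset.except(). On the affected
# versions this nested pattern reportedly fails with "Join keys from two sides
# should have same types", while the flat-schema case works.
print(df1.subtract(df2).count())
{code}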
[jira] [Updated] (SPARK-32728) Using groupby with rand creates different values when joining table with itself
[ https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joachim Bargsten updated SPARK-32728: - Description: When running following query on a cluster with *multiple workers (>1)*, the result is not 0.0, even though I would expect it to be. {code:java} import pyspark.sql.functions as F sdf = spark.range(100) sdf = ( sdf.withColumn("a", F.col("id") + 1) .withColumn("b", F.col("id") + 2) .withColumn("c", F.col("id") + 3) .withColumn("d", F.col("id") + 4) .withColumn("e", F.col("id") + 5) ) sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e")) sdf = sdf.withColumn("x", F.rand() * F.col("e")) sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"]) sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - F.col("xx"))).agg(F.sum("delta_x")) sum_delta_x = sdf2.head()[0] print(f"{sum_delta_x} should be 0.0") assert abs(sum_delta_x) < 0.001 {code} If the groupby statement is commented out, the code is working as expected. was: When running following query on a cluster with *multiple workers (>1)*, the result is not 0.0, even though I would expect it to be. {code:java} import pyspark.sql.functions as F sdf = spark.range(100) sdf = ( sdf.withColumn("a", F.col("id") + 1) .withColumn("b", F.col("id") + 2) .withColumn("c", F.col("id") + 3) .withColumn("d", F.col("id") + 4) .withColumn("e", F.col("id") + 5) ) sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e")) sdf = sdf.withColumn("x", F.rand() * F.col("e")) sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"]) sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - F.col("xx"))).agg(F.sum("delta_x")) sum_delta_x = sdf2.head()[0] print(f"{sum_delta_x} should be 0.0") assert abs(sum_delta_x) < 0.001 {code} {{}}If the groupby statement is commented out, the code is working as expected. > Using groupby with rand creates different values when joining table with > itself > --- > > Key: SPARK-32728 > URL: https://issues.apache.org/jira/browse/SPARK-32728 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 > Environment: I tested it with environment,Azure Databricks 7.2 > (includes Apache Spark 3.0.0, Scala 2.12) > Worker type: Standard_DS3_v2 (2 workers) > >Reporter: Joachim Bargsten >Priority: Minor > > When running following query on a cluster with *multiple workers (>1)*, the > result is not 0.0, even though I would expect it to be. > {code:java} > import pyspark.sql.functions as F > sdf = spark.range(100) > sdf = ( > sdf.withColumn("a", F.col("id") + 1) > .withColumn("b", F.col("id") + 2) > .withColumn("c", F.col("id") + 3) > .withColumn("d", F.col("id") + 4) > .withColumn("e", F.col("id") + 5) > ) > sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e")) > sdf = sdf.withColumn("x", F.rand() * F.col("e")) > sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"]) > sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - > F.col("xx"))).agg(F.sum("delta_x")) > sum_delta_x = sdf2.head()[0] > print(f"{sum_delta_x} should be 0.0") > assert abs(sum_delta_x) < 0.001 > {code} > If the groupby statement is commented out, the code is working as expected. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32728) Using groupby with rand creates different values when joining table with itself
Joachim Bargsten created SPARK-32728: Summary: Using groupby with rand creates different values when joining table with itself Key: SPARK-32728 URL: https://issues.apache.org/jira/browse/SPARK-32728 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 2.4.5 Environment: I tested it with environment,Azure Databricks 7.2 (includes Apache Spark 3.0.0, Scala 2.12) Worker type: Standard_DS3_v2 (2 workers) Reporter: Joachim Bargsten When running following query on a cluster with *multiple workers (>1)*, the result is not 0.0, even though I would expect it to be. {code:java} import pyspark.sql.functions as F sdf = spark.range(100) sdf = ( sdf.withColumn("a", F.col("id") + 1) .withColumn("b", F.col("id") + 2) .withColumn("c", F.col("id") + 3) .withColumn("d", F.col("id") + 4) .withColumn("e", F.col("id") + 5) ) sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e")) sdf = sdf.withColumn("x", F.rand() * F.col("e")) sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"]) sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - F.col("xx"))).agg(F.sum("delta_x")) sum_delta_x = sdf2.head()[0] print(f"{sum_delta_x} should be 0.0") assert abs(sum_delta_x) < 0.001 {code} {{}}If the groupby statement is commented out, the code is working as expected. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32728) Using groupby with rand creates different values when joining table with itself
[ https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joachim Bargsten updated SPARK-32728: - Environment: I tested it with environment,Azure Databricks 7.2 (& 6.6) (includes Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11)) Worker type: Standard_DS3_v2 (2 workers) was: I tested it with environment,Azure Databricks 7.2 (includes Apache Spark 3.0.0, Scala 2.12) Worker type: Standard_DS3_v2 (2 workers) > Using groupby with rand creates different values when joining table with > itself > --- > > Key: SPARK-32728 > URL: https://issues.apache.org/jira/browse/SPARK-32728 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 > Environment: I tested it with environment,Azure Databricks 7.2 (& > 6.6) (includes Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11)) > Worker type: Standard_DS3_v2 (2 workers) > >Reporter: Joachim Bargsten >Priority: Minor > > When running following query on a cluster with *multiple workers (>1)*, the > result is not 0.0, even though I would expect it to be. > {code:java} > import pyspark.sql.functions as F > sdf = spark.range(100) > sdf = ( > sdf.withColumn("a", F.col("id") + 1) > .withColumn("b", F.col("id") + 2) > .withColumn("c", F.col("id") + 3) > .withColumn("d", F.col("id") + 4) > .withColumn("e", F.col("id") + 5) > ) > sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e")) > sdf = sdf.withColumn("x", F.rand() * F.col("e")) > sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"]) > sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - > F.col("xx"))).agg(F.sum("delta_x")) > sum_delta_x = sdf2.head()[0] > print(f"{sum_delta_x} should be 0.0") > assert abs(sum_delta_x) < 0.001 > {code} > If the groupby statement is commented out, the code is working as expected. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32728) Using groupby with rand creates different values when joining table with itself
[ https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joachim Bargsten updated SPARK-32728: - Environment: I tested it with Azure Databricks 7.2 (& 6.6) (includes Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11)) Worker type: Standard_DS3_v2 (2 workers) was: I tested it with environment,Azure Databricks 7.2 (& 6.6) (includes Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11)) Worker type: Standard_DS3_v2 (2 workers) > Using groupby with rand creates different values when joining table with > itself > --- > > Key: SPARK-32728 > URL: https://issues.apache.org/jira/browse/SPARK-32728 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 > Environment: I tested it with Azure Databricks 7.2 (& 6.6) (includes > Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11)) > Worker type: Standard_DS3_v2 (2 workers) > >Reporter: Joachim Bargsten >Priority: Minor > > When running following query on a cluster with *multiple workers (>1)*, the > result is not 0.0, even though I would expect it to be. > {code:java} > import pyspark.sql.functions as F > sdf = spark.range(100) > sdf = ( > sdf.withColumn("a", F.col("id") + 1) > .withColumn("b", F.col("id") + 2) > .withColumn("c", F.col("id") + 3) > .withColumn("d", F.col("id") + 4) > .withColumn("e", F.col("id") + 5) > ) > sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e")) > sdf = sdf.withColumn("x", F.rand() * F.col("e")) > sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"]) > sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - > F.col("xx"))).agg(F.sum("delta_x")) > sum_delta_x = sdf2.head()[0] > print(f"{sum_delta_x} should be 0.0") > assert abs(sum_delta_x) < 0.001 > {code} > If the groupby statement is commented out, the code is working as expected. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32728) Using groupby with rand creates different values when joining table with itself
[ https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joachim Bargsten updated SPARK-32728: - Description: When running following query in a python3 notebook on a cluster with *multiple workers (>1)*, the result is not 0.0, even though I would expect it to be. {code:java} import pyspark.sql.functions as F sdf = spark.range(100) sdf = ( sdf.withColumn("a", F.col("id") + 1) .withColumn("b", F.col("id") + 2) .withColumn("c", F.col("id") + 3) .withColumn("d", F.col("id") + 4) .withColumn("e", F.col("id") + 5) ) sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e")) sdf = sdf.withColumn("x", F.rand() * F.col("e")) sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"]) sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - F.col("xx"))).agg(F.sum("delta_x")) sum_delta_x = sdf2.head()[0] print(f"{sum_delta_x} should be 0.0") assert abs(sum_delta_x) < 0.001 {code} If the groupby statement is commented out, the code is working as expected. was: When running following query on a cluster with *multiple workers (>1)*, the result is not 0.0, even though I would expect it to be. {code:java} import pyspark.sql.functions as F sdf = spark.range(100) sdf = ( sdf.withColumn("a", F.col("id") + 1) .withColumn("b", F.col("id") + 2) .withColumn("c", F.col("id") + 3) .withColumn("d", F.col("id") + 4) .withColumn("e", F.col("id") + 5) ) sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e")) sdf = sdf.withColumn("x", F.rand() * F.col("e")) sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"]) sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - F.col("xx"))).agg(F.sum("delta_x")) sum_delta_x = sdf2.head()[0] print(f"{sum_delta_x} should be 0.0") assert abs(sum_delta_x) < 0.001 {code} If the groupby statement is commented out, the code is working as expected. > Using groupby with rand creates different values when joining table with > itself > --- > > Key: SPARK-32728 > URL: https://issues.apache.org/jira/browse/SPARK-32728 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 > Environment: I tested it with Azure Databricks 7.2 (& 6.6) (includes > Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11)) > Worker type: Standard_DS3_v2 (2 workers) > >Reporter: Joachim Bargsten >Priority: Minor > > When running following query in a python3 notebook on a cluster with > *multiple workers (>1)*, the result is not 0.0, even though I would expect it > to be. > {code:java} > import pyspark.sql.functions as F > sdf = spark.range(100) > sdf = ( > sdf.withColumn("a", F.col("id") + 1) > .withColumn("b", F.col("id") + 2) > .withColumn("c", F.col("id") + 3) > .withColumn("d", F.col("id") + 4) > .withColumn("e", F.col("id") + 5) > ) > sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e")) > sdf = sdf.withColumn("x", F.rand() * F.col("e")) > sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"]) > sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - > F.col("xx"))).agg(F.sum("delta_x")) > sum_delta_x = sdf2.head()[0] > print(f"{sum_delta_x} should be 0.0") > assert abs(sum_delta_x) < 0.001 > {code} > If the groupby statement is commented out, the code is working as expected. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
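A commonly suggested mitigation for this report (not a fix of the underlying behaviour): materialize the random column before the self-join so both sides of the join see the same values. A hedged sketch:

{code:python}
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sdf = spark.range(100).withColumn("e", F.col("id") + 5)
sdf = sdf.groupby("id").agg(F.sum("e").alias("e"))
# rand() is non-deterministic, so each side of the self-join may re-evaluate it.
# Persisting (or, more robustly, checkpointing / writing out) pins the values first.
sdf = sdf.withColumn("x", F.rand() * F.col("e")).persist()
sdf.count()  # force materialization before the join
sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), "id")
sdf2.select(F.sum(F.abs(F.col("x") - F.col("xx")))).show()  # expected: 0.0
{code}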
[jira] [Created] (SPARK-32729) Fill up missing since version for math expressions
Kent Yao created SPARK-32729: Summary: Fill up missing since version for math expressions Key: SPARK-32729 URL: https://issues.apache.org/jira/browse/SPARK-32729 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 3.1.0 Reporter: Kent Yao the mark of since version is absent for more than 40 math expressions, which were added in different versions, v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc we should fill them up properly and may visit other groups of functions latter for the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
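The missing field is visible from SQL; a quick way to check any expression:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The extended output includes a "Since:" line when the expression declares it;
# that line is what is absent for the ~40 math expressions in question.
spark.sql("DESCRIBE FUNCTION EXTENDED atan2").show(truncate=False)
{code}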
[jira] [Assigned] (SPARK-32729) Fill up missing since version for math expressions
[ https://issues.apache.org/jira/browse/SPARK-32729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32729: Assignee: (was: Apache Spark) > Fill up missing since version for math expressions > -- > > Key: SPARK-32729 > URL: https://issues.apache.org/jira/browse/SPARK-32729 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Priority: Minor > > the mark of since version is absent for more than 40 math expressions, which > were added in different versions, v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc > we should fill them up properly and may visit other groups of functions > latter for the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32729) Fill up missing since version for math expressions
[ https://issues.apache.org/jira/browse/SPARK-32729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32729: Assignee: Apache Spark > Fill up missing since version for math expressions > -- > > Key: SPARK-32729 > URL: https://issues.apache.org/jira/browse/SPARK-32729 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Minor > > the mark of since version is absent for more than 40 math expressions, which > were added in different versions, v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc > we should fill them up properly and may visit other groups of functions > latter for the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32729) Fill up missing since version for math expressions
[ https://issues.apache.org/jira/browse/SPARK-32729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186589#comment-17186589 ] Apache Spark commented on SPARK-32729: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/29571 > Fill up missing since version for math expressions > -- > > Key: SPARK-32729 > URL: https://issues.apache.org/jira/browse/SPARK-32729 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Priority: Minor > > the mark of since version is absent for more than 40 math expressions, which > were added in different versions, v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc > we should fill them up properly and may visit other groups of functions > latter for the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32729) Fill up missing since version for math expressions
[ https://issues.apache.org/jira/browse/SPARK-32729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-32729: - Description: the mark of since version is absent for more than 40 math expressions, which were added in different versions, v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc we should fill them up properly and may visit other groups of functions later for the same. was: the mark of since version is absent for more than 40 math expressions, which were added in different versions, v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc we should fill them up properly and may visit other groups of functions latter for the same. > Fill up missing since version for math expressions > -- > > Key: SPARK-32729 > URL: https://issues.apache.org/jira/browse/SPARK-32729 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Priority: Minor > > the mark of since version is absent for more than 40 math expressions, which > were added in different versions, v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc > we should fill them up properly and may visit other groups of functions later > for the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32729) Fill up missing since version for math expressions
[ https://issues.apache.org/jira/browse/SPARK-32729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32729. -- Fix Version/s: 3.1.0 Assignee: Kent Yao Resolution: Fixed Fixed in https://github.com/apache/spark/pull/29571 > Fill up missing since version for math expressions > -- > > Key: SPARK-32729 > URL: https://issues.apache.org/jira/browse/SPARK-32729 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.1.0 > > > the mark of since version is absent for more than 40 math expressions, which > were added in different versions, v1.1.1, v1.4.0, v1.5.0, v2.0.0, v2.3.0, etc > we should fill them up properly and may visit other groups of functions later > for the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32721) Simplify if clauses with null and boolean
[ https://issues.apache.org/jira/browse/SPARK-32721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-32721: - Description: The following if clause: {code:sql} if(p, null, false) {code} can be simplified to: {code:sql} and(p, null) {code} And similarly, the following clause: {code:sql} if(p, null, true) {code} can be simplified to: {code:sql} or(not(p), null) {code} iff predicate {{p}} is deterministic, i.e., can be evaluated to either true or false, but not null. was: The following if clause: {code:sql} if(p, null, false) {code} can be simplified to: {code:sql} and(p, null) {code} And similarly, the following clause: {code:sql} if(p, null, true) {code} can be simplified to: {code:sql} or(p, null) {code} > Simplify if clauses with null and boolean > - > > Key: SPARK-32721 > URL: https://issues.apache.org/jira/browse/SPARK-32721 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chao Sun >Priority: Major > > The following if clause: > {code:sql} > if(p, null, false) > {code} > can be simplified to: > {code:sql} > and(p, null) > {code} > And similarly, the following clause: > {code:sql} > if(p, null, true) > {code} > can be simplified to: > {code:sql} > or(not(p), null) > {code} > iff predicate {{p}} is deterministic, i.e., can be evaluated to either true > or false, but not null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32721) Simplify if clauses with null and boolean
[ https://issues.apache.org/jira/browse/SPARK-32721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-32721: - Description: The following if clause: {code:sql} if(p, null, false) {code} can be simplified to: {code:sql} and(p, null) {code} And similarly, the following clause: {code:sql} if(p, null, true) {code} can be simplified to: {code:sql} or(not(p), null) {code} iff predicate {{p}} is deterministic, i.e., can be evaluated to either true or false, but not null. {{and}} and {{or}} clauses are more optimization friendly. For instance, by converting {{if(col > 42, null, false)}} to {{and(col > 42, null)}}, we can potentially push the filter {{col > 42}} down to data sources to avoid unnecessary IO. was: The following if clause: {code:sql} if(p, null, false) {code} can be simplified to: {code:sql} and(p, null) {code} And similarly, the following clause: {code:sql} if(p, null, true) {code} can be simplified to: {code:sql} or(not(p), null) {code} iff predicate {{p}} is deterministic, i.e., can be evaluated to either true or false, but not null. > Simplify if clauses with null and boolean > - > > Key: SPARK-32721 > URL: https://issues.apache.org/jira/browse/SPARK-32721 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chao Sun >Priority: Major > > The following if clause: > {code:sql} > if(p, null, false) > {code} > can be simplified to: > {code:sql} > and(p, null) > {code} > And similarly, the following clause: > {code:sql} > if(p, null, true) > {code} > can be simplified to: > {code:sql} > or(not(p), null) > {code} > iff predicate {{p}} is deterministic, i.e., can be evaluated to either true > or false, but not null. > {{and}} and {{or}} clauses are more optimization friendly. For instance, by > converting {{if(col > 42, null, false)}} to {{and(col > 42, null)}}, we can > potentially push the filter {{col > 42}} down to data sources to avoid > unnecessary IO. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
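A quick truth-table check of the two rewrites (the p = null row shows why the rule is restricted to predicates that cannot evaluate to null):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# if(p, null, false) vs p AND null, and if(p, null, true) vs (NOT p) OR null.
# The pairs agree for p IN (true, false); the p = null row is where they diverge,
# which is why the description requires p to be non-null.
spark.sql("""
    SELECT p,
           if(p, null, false) AS if_null_false,
           p AND null         AS p_and_null,
           if(p, null, true)  AS if_null_true,
           (NOT p) OR null    AS not_p_or_null
    FROM VALUES (true), (false), (CAST(NULL AS BOOLEAN)) AS t(p)
""").show()
{code}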
[jira] [Assigned] (SPARK-32704) Logging plan changes for execution
[ https://issues.apache.org/jira/browse/SPARK-32704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32704: --- Assignee: Takeshi Yamamuro > Logging plan changes for execution > -- > > Key: SPARK-32704 > URL: https://issues.apache.org/jira/browse/SPARK-32704 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > > Since we only log plan changes for analyzer/optimizer now, this ticket > targets adding code to log plan changes in the preparation phase in > QueryExecution for execution. > {code} > scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN") > scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan > ... > 20/08/26 09:32:36 WARN PlanChangeLogger: > === Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages === > !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, > count#23L]) *(1) HashAggregate(keys=[id#19L], > functions=[count(1)], output=[id#19L, count#23L]) > !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], > output=[id#19L, count#27L]) +- *(1) HashAggregate(keys=[id#19L], > functions=[partial_count(1)], output=[id#19L, count#27L]) > ! +- Range (0, 10, step=1, splits=4) > +- *(1) Range (0, 10, step=1, splits=4) > > 20/08/26 09:32:36 WARN PlanChangeLogger: > === Result of Batch Preparations === > !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, > count#23L]) *(1) HashAggregate(keys=[id#19L], > functions=[count(1)], output=[id#19L, count#23L]) > !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], > output=[id#19L, count#27L]) +- *(1) HashAggregate(keys=[id#19L], > functions=[partial_count(1)], output=[id#19L, count#27L]) > ! +- Range (0, 10, step=1, splits=4) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32704) Logging plan changes for execution
[ https://issues.apache.org/jira/browse/SPARK-32704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32704. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29544 [https://github.com/apache/spark/pull/29544] > Logging plan changes for execution > -- > > Key: SPARK-32704 > URL: https://issues.apache.org/jira/browse/SPARK-32704 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.1.0 > > > Since we only log plan changes for analyzer/optimizer now, this ticket > targets adding code to log plan changes in the preparation phase in > QueryExecution for execution. > {code} > scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN") > scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan > ... > 20/08/26 09:32:36 WARN PlanChangeLogger: > === Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages === > !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, > count#23L]) *(1) HashAggregate(keys=[id#19L], > functions=[count(1)], output=[id#19L, count#23L]) > !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], > output=[id#19L, count#27L]) +- *(1) HashAggregate(keys=[id#19L], > functions=[partial_count(1)], output=[id#19L, count#27L]) > ! +- Range (0, 10, step=1, splits=4) > +- *(1) Range (0, 10, step=1, splits=4) > > 20/08/26 09:32:36 WARN PlanChangeLogger: > === Result of Batch Preparations === > !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, > count#23L]) *(1) HashAggregate(keys=[id#19L], > functions=[count(1)], output=[id#19L, count#23L]) > !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], > output=[id#19L, count#27L]) +- *(1) HashAggregate(keys=[id#19L], > functions=[partial_count(1)], output=[id#19L, count#27L]) > ! +- Range (0, 10, step=1, splits=4) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
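The same knob is reachable from PySpark; a minimal sketch (with this change, the preparation rules applied to the executed plan are logged as well as the analyzer/optimizer ones):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same setting as in the scala> session shown in the description.
spark.conf.set("spark.sql.optimizer.planChangeLog.level", "WARN")
spark.range(10).groupBy("id").count().explain()  # triggers plan preparation
{code}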
[jira] [Resolved] (SPARK-32639) Support GroupType parquet mapkey field
[ https://issues.apache.org/jira/browse/SPARK-32639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32639. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29451 [https://github.com/apache/spark/pull/29451] > Support GroupType parquet mapkey field > -- > > Key: SPARK-32639 > URL: https://issues.apache.org/jira/browse/SPARK-32639 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Chen Zhang >Priority: Major > Fix For: 3.1.0 > > Attachments: 000.snappy.parquet > > > I have a parquet file, and the MessageType recorded in the file is: > {code:java} > message parquet_schema { > optional group value (MAP) { > repeated group key_value { > required group key { > optional binary first (UTF8); > optional binary middle (UTF8); > optional binary last (UTF8); > } > optional binary value (UTF8); > } > } > }{code} > > Use +spark.read.parquet("000.snappy.parquet")+ to read the file. Spark will > throw an exception when converting Parquet MessageType to Spark SQL > StructType: > {code:java} > AssertionError(Map key type is expected to be a primitive type, but found...) > {code} > > Use +spark.read.schema("value MAP<STRUCT<first:STRING, middle:STRING, last:STRING>, STRING>").parquet("000.snappy.parquet")+ to read the file, > spark returns the correct result. > According to the parquet project document > (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), > the mapKey in the parquet format does not need to be a primitive type. > > Note: This parquet file is not written by spark, because spark will write > additional sparkSchema string information in the parquet file. When Spark > reads, it will directly use the additional sparkSchema information in the > file instead of converting Parquet MessageType to Spark SQL StructType. > I will submit a PR later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
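The explicit-schema workaround from the description, spelled out as a PySpark call (the map key fields mirror the MessageType shown above):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# An explicit schema sidesteps the MessageType-to-StructType conversion that
# asserts on non-primitive map keys.
df = spark.read.schema(
    "value MAP<STRUCT<first: STRING, middle: STRING, last: STRING>, STRING>"
).parquet("000.snappy.parquet")
df.printSchema()
{code}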
[jira] [Assigned] (SPARK-32639) Support GroupType parquet mapkey field
[ https://issues.apache.org/jira/browse/SPARK-32639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32639: --- Assignee: Chen Zhang > Support GroupType parquet mapkey field > -- > > Key: SPARK-32639 > URL: https://issues.apache.org/jira/browse/SPARK-32639 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Fix For: 3.1.0 > > Attachments: 000.snappy.parquet > > > I have a parquet file, and the MessageType recorded in the file is: > {code:java} > message parquet_schema { > optional group value (MAP) { > repeated group key_value { > required group key { > optional binary first (UTF8); > optional binary middle (UTF8); > optional binary last (UTF8); > } > optional binary value (UTF8); > } > } > }{code} > > Use +spark.read.parquet("000.snappy.parquet")+ to read the file. Spark will > throw an exception when converting Parquet MessageType to Spark SQL > StructType: > {code:java} > AssertionError(Map key type is expected to be a primitive type, but found...) > {code} > > Use +spark.read.schema("value MAP<STRUCT<first:STRING, middle:STRING, last:STRING>, STRING>").parquet("000.snappy.parquet")+ to read the file, > spark returns the correct result. > According to the parquet project document > (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), > the mapKey in the parquet format does not need to be a primitive type. > > Note: This parquet file is not written by spark, because spark will write > additional sparkSchema string information in the parquet file. When Spark > reads, it will directly use the additional sparkSchema information in the > file instead of converting Parquet MessageType to Spark SQL StructType. > I will submit a PR later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32730) Improve LeftSemi SortMergeJoin right side buffering
Peter Toth created SPARK-32730: -- Summary: Improve LeftSemi SortMergeJoin right side buffering Key: SPARK-32730 URL: https://issues.apache.org/jira/browse/SPARK-32730 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Peter Toth LeftSemi SortMergeJoin should not buffer all matching right side rows when bound condition is empty. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
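For reference, the query shape this targets is a left-semi sort-merge join with an equi-join condition only; a hedged PySpark sketch of such a query (table sizes and the merge hint are illustrative):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.range(1000).withColumnRenamed("id", "key")
right = spark.range(100).withColumnRenamed("id", "key")

# Left-semi join on the key only: there is no extra ("bound") condition, so the
# proposed change avoids buffering all matching right-side rows per key.
result = left.join(right.hint("merge"), "key", "left_semi")
result.explain()
{code}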
[jira] [Commented] (SPARK-32730) Improve LeftSemi SortMergeJoin right side buffering
[ https://issues.apache.org/jira/browse/SPARK-32730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186792#comment-17186792 ] Apache Spark commented on SPARK-32730: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/29572 > Improve LeftSemi SortMergeJoin right side buffering > --- > > Key: SPARK-32730 > URL: https://issues.apache.org/jira/browse/SPARK-32730 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Peter Toth >Priority: Minor > > LeftSemi SortMergeJoin should not buffer all matching right side rows when > bound condition is empty. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32730) Improve LeftSemi SortMergeJoin right side buffering
[ https://issues.apache.org/jira/browse/SPARK-32730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32730: Assignee: (was: Apache Spark) > Improve LeftSemi SortMergeJoin right side buffering > --- > > Key: SPARK-32730 > URL: https://issues.apache.org/jira/browse/SPARK-32730 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Peter Toth >Priority: Minor > > LeftSemi SortMergeJoin should not buffer all matching right side rows when > bound condition is empty. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32730) Improve LeftSemi SortMergeJoin right side buffering
[ https://issues.apache.org/jira/browse/SPARK-32730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32730: Assignee: Apache Spark > Improve LeftSemi SortMergeJoin right side buffering > --- > > Key: SPARK-32730 > URL: https://issues.apache.org/jira/browse/SPARK-32730 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Peter Toth >Assignee: Apache Spark >Priority: Minor > > LeftSemi SortMergeJoin should not buffer all matching right side rows when > bound condition is empty. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32730) Improve LeftSemi SortMergeJoin right side buffering
[ https://issues.apache.org/jira/browse/SPARK-32730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186791#comment-17186791 ] Apache Spark commented on SPARK-32730: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/29572 > Improve LeftSemi SortMergeJoin right side buffering > --- > > Key: SPARK-32730 > URL: https://issues.apache.org/jira/browse/SPARK-32730 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Peter Toth >Priority: Minor > > LeftSemi SortMergeJoin should not buffer all matching right side rows when > bound condition is empty. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32731) Added tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse
Muhammad Samir Khan created SPARK-32731: --- Summary: Added tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse Key: SPARK-32731 URL: https://issues.apache.org/jira/browse/SPARK-32731 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Muhammad Samir Khan Splitting tests originally posted in [PR|https://github.com/apache/spark/pull/29352] for SPARK-32531. The added tests cover cases for maps and arrays of nested structs for different file formats. E.g., [https://github.com/apache/spark/pull/29353] and [https://github.com/apache/spark/pull/29354] add object reuse when reading ORC and Avro files. However, for dynamic data structures like arrays and maps, we do not know just by looking at the schema what the size of the data structure will be, so it has to be allocated when reading the data points. The added tests provide coverage so that objects are not accidentally reused when encountering maps and arrays. AFAIK this is not covered by existing tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
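To make the object-reuse concern concrete, the hypothetical sketch below (not one of the added ReadSchemaSuite tests; the case classes and output path are made up) writes and reads back a dataset whose schema contains an array and a map of nested structs through the built-in ORC format. If a reader reused a single struct object for every array or map entry without copying it, the collected rows would no longer carry distinct per-element values, which is what tests of this kind guard against.

{code:scala}
import org.apache.spark.sql.SparkSession

object NestedStructReuseSketch {
  // Hypothetical schema: an array and a map whose values are nested structs.
  case class Inner(g: Long, i: Int)
  case class Outer(id: Int, items: Seq[Inner], lookup: Map[String, Inner])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nested-struct-reuse-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val data = Seq(
      Outer(1, Seq(Inner(10L, 1), Inner(20L, 2)), Map("a" -> Inner(30L, 3))),
      Outer(2, Seq(Inner(40L, 4)), Map("b" -> Inner(50L, 5)))
    )
    val path = "/tmp/nested-struct-reuse-orc" // arbitrary scratch location
    data.toDS().write.mode("overwrite").orc(path)

    // Each array/map element must come back with its own values; a reader that reused
    // one struct object without copying would let later elements overwrite earlier ones.
    val back = spark.read.orc(path).as[Outer].collect().sortBy(_.id)
    assert(back.head.items == Seq(Inner(10L, 1), Inner(20L, 2)))
    assert(back.head.lookup == Map("a" -> Inner(30L, 3)))

    spark.stop()
  }
}
{code}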
[jira] [Updated] (SPARK-32731) Add tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse
[ https://issues.apache.org/jira/browse/SPARK-32731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan updated SPARK-32731: Summary: Add tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse (was: Added tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse) > Add tests for arrays/maps of nested structs to ReadSchemaSuite to test > structs reuse > > > Key: SPARK-32731 > URL: https://issues.apache.org/jira/browse/SPARK-32731 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Splitting tests originally posted in > [PR|[https://github.com/apache/spark/pull/29352]] for SPARK-32531. The added > tests cover cases for maps and arrays of nested structs for different file > formats. Eg, [https://github.com/apache/spark/pull/29353] and > [https://github.com/apache/spark/pull/29354] add object reuse when reading > ORC and Avro files. However, for dynamic data structures like arrays and > maps, we do not know just by looking at the schema what the size of the data > structure will be so it has to be allocated when reading the data points. The > added tests provide coverage so that objects are not accidentally reused when > encountering maps and arrays. > AFAIK this is not covered by existing tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32731) Add tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse
[ https://issues.apache.org/jira/browse/SPARK-32731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32731: Assignee: Apache Spark > Add tests for arrays/maps of nested structs to ReadSchemaSuite to test > structs reuse > > > Key: SPARK-32731 > URL: https://issues.apache.org/jira/browse/SPARK-32731 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Assignee: Apache Spark >Priority: Major > > Splitting tests originally posted in > [PR|[https://github.com/apache/spark/pull/29352]] for SPARK-32531. The added > tests cover cases for maps and arrays of nested structs for different file > formats. Eg, [https://github.com/apache/spark/pull/29353] and > [https://github.com/apache/spark/pull/29354] add object reuse when reading > ORC and Avro files. However, for dynamic data structures like arrays and > maps, we do not know just by looking at the schema what the size of the data > structure will be so it has to be allocated when reading the data points. The > added tests provide coverage so that objects are not accidentally reused when > encountering maps and arrays. > AFAIK this is not covered by existing tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32731) Add tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse
[ https://issues.apache.org/jira/browse/SPARK-32731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186825#comment-17186825 ] Apache Spark commented on SPARK-32731: -- User 'msamirkhan' has created a pull request for this issue: https://github.com/apache/spark/pull/29573 > Add tests for arrays/maps of nested structs to ReadSchemaSuite to test > structs reuse > > > Key: SPARK-32731 > URL: https://issues.apache.org/jira/browse/SPARK-32731 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Splitting tests originally posted in > [PR|[https://github.com/apache/spark/pull/29352]] for SPARK-32531. The added > tests cover cases for maps and arrays of nested structs for different file > formats. Eg, [https://github.com/apache/spark/pull/29353] and > [https://github.com/apache/spark/pull/29354] add object reuse when reading > ORC and Avro files. However, for dynamic data structures like arrays and > maps, we do not know just by looking at the schema what the size of the data > structure will be so it has to be allocated when reading the data points. The > added tests provide coverage so that objects are not accidentally reused when > encountering maps and arrays. > AFAIK this is not covered by existing tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32731) Add tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse
[ https://issues.apache.org/jira/browse/SPARK-32731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32731: Assignee: (was: Apache Spark) > Add tests for arrays/maps of nested structs to ReadSchemaSuite to test > structs reuse > > > Key: SPARK-32731 > URL: https://issues.apache.org/jira/browse/SPARK-32731 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Splitting tests originally posted in > [PR|[https://github.com/apache/spark/pull/29352]] for SPARK-32531. The added > tests cover cases for maps and arrays of nested structs for different file > formats. Eg, [https://github.com/apache/spark/pull/29353] and > [https://github.com/apache/spark/pull/29354] add object reuse when reading > ORC and Avro files. However, for dynamic data structures like arrays and > maps, we do not know just by looking at the schema what the size of the data > structure will be so it has to be allocated when reading the data points. The > added tests provide coverage so that objects are not accidentally reused when > encountering maps and arrays. > AFAIK this is not covered by existing tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32731) Add tests for arrays/maps of nested structs to ReadSchemaSuite to test structs reuse
[ https://issues.apache.org/jira/browse/SPARK-32731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186826#comment-17186826 ] Apache Spark commented on SPARK-32731: -- User 'msamirkhan' has created a pull request for this issue: https://github.com/apache/spark/pull/29573 > Add tests for arrays/maps of nested structs to ReadSchemaSuite to test > structs reuse > > > Key: SPARK-32731 > URL: https://issues.apache.org/jira/browse/SPARK-32731 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Splitting tests originally posted in > [PR|[https://github.com/apache/spark/pull/29352]] for SPARK-32531. The added > tests cover cases for maps and arrays of nested structs for different file > formats. Eg, [https://github.com/apache/spark/pull/29353] and > [https://github.com/apache/spark/pull/29354] add object reuse when reading > ORC and Avro files. However, for dynamic data structures like arrays and > maps, we do not know just by looking at the schema what the size of the data > structure will be so it has to be allocated when reading the data points. The > added tests provide coverage so that objects are not accidentally reused when > encountering maps and arrays. > AFAIK this is not covered by existing tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32732) Convert schema only once in OrcSerializer
Muhammad Samir Khan created SPARK-32732: --- Summary: Convert schema only once in OrcSerializer Key: SPARK-32732 URL: https://issues.apache.org/jira/browse/SPARK-32732 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Muhammad Samir Khan This is to track a TODO item in [Pull Request|https://github.com/apache/spark/pull/29352] for SPARK-32532. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
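As a rough illustration of the pattern named in the summary (hypothetical types, not Spark's actual OrcSerializer internals), the point of the improvement is to perform the schema conversion once, when the serializer is constructed, rather than redoing it for every record:

{code:scala}
// Hypothetical shapes only; this is not Spark's real OrcSerializer API. The contrast
// is where the (potentially expensive) schema conversion happens.
final case class TableSchema(fields: Seq[String])

class ConvertPerRecord(schema: TableSchema, convert: TableSchema => String) {
  // Redoes the schema conversion for every single record.
  def serialize(record: Seq[Any]): String = s"${convert(schema)}|${record.mkString(",")}"
}

class ConvertOnce(schema: TableSchema, convert: TableSchema => String) {
  private val converted: String = convert(schema) // computed once, at construction time
  def serialize(record: Seq[Any]): String = s"$converted|${record.mkString(",")}"
}

object ConvertOnceDemo extends App {
  val convert: TableSchema => String = s => {
    println("converting schema") // printed once with ConvertOnce, once per record otherwise
    s.fields.mkString("<", ",", ">")
  }
  val once = new ConvertOnce(TableSchema(Seq("a", "b")), convert)
  Seq(Seq(1, "x"), Seq(2, "y")).foreach(record => println(once.serialize(record)))
}
{code}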
[jira] [Commented] (SPARK-32385) Publish a "bill of materials" (BOM) descriptor for Spark with correct versions of various dependencies
[ https://issues.apache.org/jira/browse/SPARK-32385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186846#comment-17186846 ] Vladimir Matveev commented on SPARK-32385: -- Sorry for the delayed response! > This requires us fixing every version of every transitive dependency. How > does that get updated as the transitive dependency graph changes? this > exchanges one problem for another I think. That is, we are definitely not > trying to fix dependency versions except where necessary. I don't think this is right — you don't have to fix more than just direct dependencies, like you already do. It's pretty much the same thing as defining the version numbers like [here|https://github.com/apache/spark/blob/a0bd273bb04d9a5684e291ec44617972dcd4accd/pom.xml#L121-L197] and then declaring specific dependencies with the versions below. It's just that it is done slightly differently, by using Maven's `dependencyManagement` mechanism and POM inheritance (for Maven; for Gradle e.g. it would be this "platform" thing). > Gradle isn't something that this project supports, but, wouldn't this be a > much bigger general problem if its resolution rules are different from Maven? > that is, surely gradle can emulate Maven if necessary. I don't think Gradle can emulate Maven, and I personally don't think it should, because Maven's strategy for conflict resolution is quite unconventional, and is not used by most of the dependency management tools, not just in the Java world. Also, I naturally don't have statistics, so this is just my speculation, but it seems likely to me that most of the downstream projects which use Spark don't actually use Maven for dependency management, especially given its Scala heritage. Therefore, they can't take advantage of Maven's dependency resolution algorithm and the current Spark's POM configuration. Also I'd like to point out again that this whole BOM mechanism is something which _Maven_ supports natively, it's not a Gradle extension or something. The BOM concept originated in Maven, and it is declared using Maven's {{dependencyManagement}} block, which is a part of POM syntax. Hopefully this would reduce some of the concerns about it. > Publish a "bill of materials" (BOM) descriptor for Spark with correct > versions of various dependencies > -- > > Key: SPARK-32385 > URL: https://issues.apache.org/jira/browse/SPARK-32385 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Vladimir Matveev >Priority: Major > > Spark has a lot of dependencies, many of them very common (e.g. Guava, > Jackson). Also, versions of these dependencies are not updated as frequently > as they are released upstream, which is totally understandable and natural, > but which also means that often Spark has a dependency on a lower version of > a library, which is incompatible with a higher, more recent version of the > same library. This incompatibility can manifest in different ways, e.g as > classpath errors or runtime check errors (like with Jackson), in certain > cases. > > Spark does attempt to "fix" versions of its dependencies by declaring them > explicitly in its {{pom.xml}} file. However, this approach, being somewhat > workable if the Spark-using project itself uses Maven, breaks down if another > build system is used, like Gradle. 
The reason is that Maven uses an > unconventional "nearest first" version conflict resolution strategy, while > many other tools like Gradle use the "highest first" strategy which resolves > the highest possible version number inside the entire graph of dependencies. > This means that other dependencies of the project can pull a higher version > of some dependency, which is incompatible with Spark. > > One example would be an explicit or a transitive dependency on a higher > version of Jackson in the project. Spark itself depends on several modules of > Jackson; if only one of them gets a higher version, and others remain on the > lower version, this will result in runtime exceptions due to an internal > version check in Jackson. > > A widely used solution for this kind of version issues is publishing of a > "bill of materials" descriptor (see here: > [https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html] > and here: > [https://docs.gradle.org/current/userguide/platforms.html#sub:bom_import]). > This descriptor would contain all versions of all dependencies of Spark; then > downstream projects will be able to use their build system's support for BOMs > to enforce version constraints required for Spark to function correctly. > > One example of successful implementation of the BOM-based approach is Spring: > [https://w
[jira] [Comment Edited] (SPARK-32385) Publish a "bill of materials" (BOM) descriptor for Spark with correct versions of various dependencies
[ https://issues.apache.org/jira/browse/SPARK-32385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186846#comment-17186846 ] Vladimir Matveev edited comment on SPARK-32385 at 8/28/20, 11:37 PM: - [~srowen] Sorry for the delayed response! > This requires us fixing every version of every transitive dependency. How > does that get updated as the transitive dependency graph changes? this > exchanges one problem for another I think. That is, we are definitely not > trying to fix dependency versions except where necessary. I don't think this is right — you don't have to fix more than just direct dependencies, like you already do. It's pretty much the same thing as defining the version numbers like [here|https://github.com/apache/spark/blob/a0bd273bb04d9a5684e291ec44617972dcd4accd/pom.xml#L121-L197] and then declaring specific dependencies with the versions below. It's just it is done slightly differently, by using Maven's {{}} mechanism and POM inheritance (for Maven; for Gradle e.g. it would be this "platform" thing). > Gradle isn't something that this project supports, but, wouldn't this be a > much bigger general problem if its resolution rules are different from Maven? > that is, surely gradle can emulate Maven if necessary. I don't think Gradle can emulate Maven, and I personally don't think it should, because Maven's strategy for conflict resolution is quite unconventional, and is not used by most of the dependency management tools, not just in the Java world. Also, I naturally don't have statistics, so this is just my speculation, but it seems likely to me that most of the downstream projects which use Spark don't actually use Maven for dependency management, especially given its Scala heritage. Therefore, they can't take advantage of Maven's dependency resolution algorithm and the current Spark's POM configuration. Also I'd like to point out again that this whole BOM mechanism is something which _Maven_ supports natively, it's not a Gradle extension or something. The BOM concept originated in Maven, and it is declared using Maven's {{}} block, which is a part of POM syntax. Hopefully this would reduce some of the concerns about it. was (Author: netvl): Sorry for the delayed response! > This requires us fixing every version of every transitive dependency. How > does that get updated as the transitive dependency graph changes? this > exchanges one problem for another I think. That is, we are definitely not > trying to fix dependency versions except where necessary. I don't think this is right — you don't have to fix more than just direct dependencies, like you already do. It's pretty much the same thing as defining the version numbers like [here|https://github.com/apache/spark/blob/a0bd273bb04d9a5684e291ec44617972dcd4accd/pom.xml#L121-L197] and then declaring specific dependencies with the versions below. It's just it is done slightly differently, by using Maven's `` mechanism and POM inheritance (for Maven; for Gradle e.g. it would be this "platform" thing). > Gradle isn't something that this project supports, but, wouldn't this be a > much bigger general problem if its resolution rules are different from Maven? > that is, surely gradle can emulate Maven if necessary. I don't think Gradle can emulate Maven, and I personally don't think it should, because Maven's strategy for conflict resolution is quite unconventional, and is not used by most of the dependency management tools, not just in the Java world. 
Also, I naturally don't have statistics, so this is just my speculation, but it seems likely to me that most of the downstream projects which use Spark don't actually use Maven for dependency management, especially given its Scala heritage. Therefore, they can't take advantage of Maven's dependency resolution algorithm and the current Spark's POM configuration. Also I'd like to point out again that this whole BOM mechanism is something which _Maven_ supports natively, it's not a Gradle extension or something. The BOM concept originated in Maven, and it is declared using Maven's {{}} block, which is a part of POM syntax. Hopefully this would reduce some of the concerns about it. > Publish a "bill of materials" (BOM) descriptor for Spark with correct > versions of various dependencies > -- > > Key: SPARK-32385 > URL: https://issues.apache.org/jira/browse/SPARK-32385 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Vladimir Matveev >Priority: Major > > Spark has a lot of dependencies, many of them very common (e.g. Guava, > Jackson). Also, versions of these dependencies are not updated as freq
[jira] [Commented] (SPARK-32385) Publish a "bill of materials" (BOM) descriptor for Spark with correct versions of various dependencies
[ https://issues.apache.org/jira/browse/SPARK-32385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186850#comment-17186850 ] Sean R. Owen commented on SPARK-32385: -- OK, so it's just a different theory of organizing the Maven dependencies? that could be OK to clean up, but I think we'd have to see a prototype of some of the change to understand what it's going to mean. I get it, BOMs are just a design pattern in Maven, not some new tool or thing. I am still not quite sure what this gains if it doesn't change dependency resolution. Is it just that you declare one artifact POM to depend on that declares a bunch of dependent versions, so people don't go depending on different versions? For Spark artifacts? I mean people can already do that by setting some spark.version property in their build. If it doesn't change transitive dependency handling, what does it do - or does it? I have no opinion on whether closest-first or latest-first resolution is more sound. I think Maven is still probably more widely used but don't particularly care. What is important is: if we change the build and it changes Spark's transitive dependencies for downstream users, that could be a breaking change. Or: anything we can do to make the dependency resolution consistent across SBT, Gradle etc is a win for sure. > Publish a "bill of materials" (BOM) descriptor for Spark with correct > versions of various dependencies > -- > > Key: SPARK-32385 > URL: https://issues.apache.org/jira/browse/SPARK-32385 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Vladimir Matveev >Priority: Major > > Spark has a lot of dependencies, many of them very common (e.g. Guava, > Jackson). Also, versions of these dependencies are not updated as frequently > as they are released upstream, which is totally understandable and natural, > but which also means that often Spark has a dependency on a lower version of > a library, which is incompatible with a higher, more recent version of the > same library. This incompatibility can manifest in different ways, e.g as > classpath errors or runtime check errors (like with Jackson), in certain > cases. > > Spark does attempt to "fix" versions of its dependencies by declaring them > explicitly in its {{pom.xml}} file. However, this approach, being somewhat > workable if the Spark-using project itself uses Maven, breaks down if another > build system is used, like Gradle. The reason is that Maven uses an > unconventional "nearest first" version conflict resolution strategy, while > many other tools like Gradle use the "highest first" strategy which resolves > the highest possible version number inside the entire graph of dependencies. > This means that other dependencies of the project can pull a higher version > of some dependency, which is incompatible with Spark. > > One example would be an explicit or a transitive dependency on a higher > version of Jackson in the project. Spark itself depends on several modules of > Jackson; if only one of them gets a higher version, and others remain on the > lower version, this will result in runtime exceptions due to an internal > version check in Jackson. > > A widely used solution for this kind of version issues is publishing of a > "bill of materials" descriptor (see here: > [https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html] > and here: > [https://docs.gradle.org/current/userguide/platforms.html#sub:bom_import]). 
> This descriptor would contain all versions of all dependencies of Spark; then > downstream projects will be able to use their build system's support for BOMs > to enforce version constraints required for Spark to function correctly. > > One example of successful implementation of the BOM-based approach is Spring: > [https://www.baeldung.com/spring-maven-bom#spring-bom]. For different Spring > projects, e.g. Spring Boot, there are BOM descriptors published which can be > used in downstream projects to fix the versions of Spring components and > their dependencies, significantly reducing confusion around proper version > numbers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19256) Hive bucketing write support
[ https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-19256: - Affects Version/s: 3.1.0 > Hive bucketing write support > > > Key: SPARK-19256 > URL: https://issues.apache.org/jira/browse/SPARK-19256 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0 >Reporter: Tejas Patil >Priority: Minor > > Update (2020 by Cheng Su): > We use this JIRA to track progress for Hive bucketing write support in Spark. > The goal is for Spark to write Hive bucketed tables, to be compatible with > other compute engines (Hive and Presto). > > Current status for Hive bucketed tables in Spark: > No support for reading Hive bucketed tables: reads a bucketed table as a > non-bucketed table. > Wrong behavior for writing Hive ORC and Parquet bucketed tables: writes an > ORC/Parquet bucketed table as a non-bucketed table (code path: > InsertIntoHadoopFsRelationCommand -> FileFormatWriter). > Writing Hive non-ORC/Parquet bucketed tables is not allowed: an exception is thrown > by default when writing a non-ORC/Parquet bucketed table (code path: > InsertIntoHiveTable); the exception can be disabled by setting the configs > `hive.enforce.bucketing`=false and `hive.enforce.sorting`=false, which will > write the data as a non-bucketed table. > > Current status for Hive bucketed tables in Hive: > Hive 3.0.0 and later: supports writing bucketed tables with Hive murmur3hash > (https://issues.apache.org/jira/browse/HIVE-18910). > Hive 1.x.y and 2.x.y: support writing bucketed tables with Hive hivehash. > Hive on Tez: supports zero and multiple files per bucket > (https://issues.apache.org/jira/browse/HIVE-14014). A code pointer for the > read path: > [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212] > . > > Current status for Hive bucketed tables in Presto (taking presto-sql here): > Supports writing bucketed tables with Hive murmur3hash and hivehash > ([https://github.com/prestosql/presto/pull/1697]). > Supports zero and multiple files per bucket > ([https://github.com/prestosql/presto/pull/822]). > > TL;DR: the goal is to achieve Hive bucketed table compatibility across Spark, Presto and > Hive. With this JIRA, we need to add support for writing Hive bucketed tables > with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and > 2.x.y). > > To allow Spark to efficiently read Hive bucketed tables, a more radical > change is needed, and we decided to wait until data source v2 supports bucketing and do > the read path on data source v2. The read path is not covered by this JIRA. > > Original description (2017 by Tejas Patil): > JIRA to track design discussions and tasks related to Hive bucketing support > in Spark. > Proposal: > [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
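A small sketch of the write-path status described above, assuming a Hive-enabled local Spark build; the table name and data are made up, and the comments simply restate the status notes from the description rather than verified behavior of any particular release.

{code:scala}
import org.apache.spark.sql.SparkSession

object HiveBucketedWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-bucketed-write-sketch")
      .master("local[*]")
      .enableHiveSupport() // requires a Spark build with Hive support on the classpath
      .getOrCreate()
    import spark.implicits._

    // A Hive bucketed table in a non-ORC/Parquet format; per the status notes above,
    // the InsertIntoHiveTable path rejects writes to such a table by default.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS bucketed_src (id INT, v STRING)
        |CLUSTERED BY (id) SORTED BY (id) INTO 4 BUCKETS
        |STORED AS TEXTFILE""".stripMargin)

    // The configs quoted in the description are the documented escape hatch; with them
    // set to false the write proceeds, but the data is laid out as a non-bucketed table.
    spark.sql("SET hive.enforce.bucketing=false")
    spark.sql("SET hive.enforce.sorting=false")

    Seq((1, "a"), (2, "b")).toDF("id", "v")
      .write.insertInto("bucketed_src")

    spark.stop()
  }
}
{code}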
[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property
[ https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186853#comment-17186853 ] Apache Spark commented on SPARK-32693: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/29575 > Compare two dataframes with same schema except nullable property > > > Key: SPARK-32693 > URL: https://issues.apache.org/jira/browse/SPARK-32693 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.0 >Reporter: david bernuau >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 3.1.0 > > > My aim is to compare two dataframes with very close schemas : same number of > fields, with the same names, types and metadata. The only difference comes > from the fact that a given field might be nullable in one dataframe and not > in the other. > Here is the code that i used : > {code:java} > val session = SparkSession.builder().getOrCreate() > import org.apache.spark.sql.Row > import java.sql.Timestamp > import scala.collection.JavaConverters._ > case class A(g: Timestamp, h: Option[Timestamp], i: Int) > case class B(e: Int, f: Seq[A]) > case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int]) > case class D(e: Option[Int], f: Seq[C]) > val schema1 = StructType(Array(StructField("a", IntegerType, false), > StructField("b", IntegerType, false), StructField("c", IntegerType, false))) > val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2)) > val df1 = session.createDataFrame(rowSeq1.asJava, schema1) > df1.printSchema() > val schema2 = StructType(Array(StructField("a", IntegerType), > StructField("b", IntegerType), StructField("c", IntegerType))) > val rowSeq2: List[Row] = List(Row(10, 1, 1)) > val df2 = session.createDataFrame(rowSeq2.asJava, schema2) > df2.printSchema() > println(s"Number of records for first case : ${df1.except(df2).count()}") > val schema3 = StructType( > Array( > StructField("a", IntegerType, false), > StructField("b", IntegerType, false), > StructField("c", IntegerType, false), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType, > false), StructField("f", ArrayType(StructType(Array(StructField("g", > TimestampType), StructField("h", TimestampType), StructField("i", > IntegerType, false) > > > ) > ) > val date1 = new Timestamp(1597589638L) > val date2 = new Timestamp(1597599638L) > val rowSeq3: List[Row] = List(Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, > 1), Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2)) > val df3 = session.createDataFrame(rowSeq3.asJava, schema3) > df3.printSchema() > val schema4 = StructType( > Array( > StructField("a", IntegerType), > StructField("b", IntegerType), > StructField("b", IntegerType), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType), > StructField("f", ArrayType(StructType(Array(StructField("g", TimestampType), > StructField("h", TimestampType), StructField("i", IntegerType) > > > ) > ) > val rowSeq4: List[Row] = List(Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, > None, Some(1))) > val df4 = session.createDataFrame(rowSeq4.asJava, schema3) > df4.printSchema() > println(s"Number of records for second case : ${df3.except(df4).count()}") > {code} > The preceeding code shows what seems to me a bug in Spark : > * If you consider two dataframes (df1 and df2) having exactly the same > schema, except fields are not nullable for the first dataframe and are > nullable for the second. Then, doing df1.except(df2).count() works well. 
> * Now, if you consider two other dataframes (df3 and df4) having the same > schema (with fields nullable on one side and not on the other). If these two > dataframes contain nested fields, then, this time, the action > df3.except(df4).count gives the following exception : > java.lang.IllegalArgumentException: requirement failed: Join keys from two > sides should have same types -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property
[ https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186854#comment-17186854 ] Apache Spark commented on SPARK-32693: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/29575 > Compare two dataframes with same schema except nullable property > > > Key: SPARK-32693 > URL: https://issues.apache.org/jira/browse/SPARK-32693 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.0 >Reporter: david bernuau >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 3.1.0 > > > My aim is to compare two dataframes with very close schemas : same number of > fields, with the same names, types and metadata. The only difference comes > from the fact that a given field might be nullable in one dataframe and not > in the other. > Here is the code that i used : > {code:java} > val session = SparkSession.builder().getOrCreate() > import org.apache.spark.sql.Row > import java.sql.Timestamp > import scala.collection.JavaConverters._ > case class A(g: Timestamp, h: Option[Timestamp], i: Int) > case class B(e: Int, f: Seq[A]) > case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int]) > case class D(e: Option[Int], f: Seq[C]) > val schema1 = StructType(Array(StructField("a", IntegerType, false), > StructField("b", IntegerType, false), StructField("c", IntegerType, false))) > val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2)) > val df1 = session.createDataFrame(rowSeq1.asJava, schema1) > df1.printSchema() > val schema2 = StructType(Array(StructField("a", IntegerType), > StructField("b", IntegerType), StructField("c", IntegerType))) > val rowSeq2: List[Row] = List(Row(10, 1, 1)) > val df2 = session.createDataFrame(rowSeq2.asJava, schema2) > df2.printSchema() > println(s"Number of records for first case : ${df1.except(df2).count()}") > val schema3 = StructType( > Array( > StructField("a", IntegerType, false), > StructField("b", IntegerType, false), > StructField("c", IntegerType, false), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType, > false), StructField("f", ArrayType(StructType(Array(StructField("g", > TimestampType), StructField("h", TimestampType), StructField("i", > IntegerType, false) > > > ) > ) > val date1 = new Timestamp(1597589638L) > val date2 = new Timestamp(1597599638L) > val rowSeq3: List[Row] = List(Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, > 1), Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2)) > val df3 = session.createDataFrame(rowSeq3.asJava, schema3) > df3.printSchema() > val schema4 = StructType( > Array( > StructField("a", IntegerType), > StructField("b", IntegerType), > StructField("b", IntegerType), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType), > StructField("f", ArrayType(StructType(Array(StructField("g", TimestampType), > StructField("h", TimestampType), StructField("i", IntegerType) > > > ) > ) > val rowSeq4: List[Row] = List(Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, > None, Some(1))) > val df4 = session.createDataFrame(rowSeq4.asJava, schema3) > df4.printSchema() > println(s"Number of records for second case : ${df3.except(df4).count()}") > {code} > The preceeding code shows what seems to me a bug in Spark : > * If you consider two dataframes (df1 and df2) having exactly the same > schema, except fields are not nullable for the first dataframe and are > nullable for the second. Then, doing df1.except(df2).count() works well. 
> * Now, if you consider two other dataframes (df3 and df4) having the same > schema (with fields nullable on one side and not on the other). If these two > dataframes contain nested fields, then, this time, the action > df3.except(df4).count gives the following exception : > java.lang.IllegalArgumentException: requirement failed: Join keys from two > sides should have same types -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32385) Publish a "bill of materials" (BOM) descriptor for Spark with correct versions of various dependencies
[ https://issues.apache.org/jira/browse/SPARK-32385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186855#comment-17186855 ] Vladimir Matveev commented on SPARK-32385: -- > I am still not quite sure what this gains if it doesn't change dependency > resolution. It does not change dependency resolution by itself, it adds an ability for the user, if they want to, to automatically lock onto versions explicitly declared by the Spark project. So yeah, this: > Is it just that you declare one artifact POM to depend on that declares a > bunch of dependent versions, so people don't go depending on different > versions? pretty much summarizes it; this could be expanded to say that (depending on the build tool) it may also enforce these dependent versions in case of conflicts. > I mean people can already do that by setting some spark.version property in > their build. They can't in general, because while it will enforce the Spark's own version, it won't necessarily determine the versions of transitive dependencies. The latter will only happen when the consumer also uses Maven, and when they have a particular order of dependencies in their POM declaration (e.g. no newer Jackson version declared transitively lexically earlier than Spark). > What is important is: if we change the build and it changes Spark's > transitive dependencies for downstream users, that could be a breaking change. My understanding is that this should not happen, unless the user explicitly opts into using the BOM, in which case it arguably changes the situation for the better in most cases, because now versions are guaranteed to align with Spark's declarations. > Publish a "bill of materials" (BOM) descriptor for Spark with correct > versions of various dependencies > -- > > Key: SPARK-32385 > URL: https://issues.apache.org/jira/browse/SPARK-32385 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Vladimir Matveev >Priority: Major > > Spark has a lot of dependencies, many of them very common (e.g. Guava, > Jackson). Also, versions of these dependencies are not updated as frequently > as they are released upstream, which is totally understandable and natural, > but which also means that often Spark has a dependency on a lower version of > a library, which is incompatible with a higher, more recent version of the > same library. This incompatibility can manifest in different ways, e.g as > classpath errors or runtime check errors (like with Jackson), in certain > cases. > > Spark does attempt to "fix" versions of its dependencies by declaring them > explicitly in its {{pom.xml}} file. However, this approach, being somewhat > workable if the Spark-using project itself uses Maven, breaks down if another > build system is used, like Gradle. The reason is that Maven uses an > unconventional "nearest first" version conflict resolution strategy, while > many other tools like Gradle use the "highest first" strategy which resolves > the highest possible version number inside the entire graph of dependencies. > This means that other dependencies of the project can pull a higher version > of some dependency, which is incompatible with Spark. > > One example would be an explicit or a transitive dependency on a higher > version of Jackson in the project. 
Spark itself depends on several modules of > Jackson; if only one of them gets a higher version, and others remain on the > lower version, this will result in runtime exceptions due to an internal > version check in Jackson. > > A widely used solution for this kind of version issues is publishing of a > "bill of materials" descriptor (see here: > [https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html] > and here: > [https://docs.gradle.org/current/userguide/platforms.html#sub:bom_import]). > This descriptor would contain all versions of all dependencies of Spark; then > downstream projects will be able to use their build system's support for BOMs > to enforce version constraints required for Spark to function correctly. > > One example of successful implementation of the BOM-based approach is Spring: > [https://www.baeldung.com/spring-maven-bom#spring-bom]. For different Spring > projects, e.g. Spring Boot, there are BOM descriptors published which can be > used in downstream projects to fix the versions of Spring components and > their dependencies, significantly reducing confusion around proper version > numbers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark
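For downstream builds that cannot consume a Maven-style BOM directly, the manual equivalent of what is proposed here is to pin the conflicting transitive modules to whatever versions Spark's pom.xml declares. The build.sbt sketch below is hypothetical: the extra library coordinate is made up, and the Jackson version number is only illustrative and would need to be taken from the pom.xml of the Spark release actually in use.

{code:scala}
// build.sbt for a hypothetical downstream project. All version numbers are
// illustrative; the Jackson version must match whatever the pom.xml of the
// Spark release in use declares.
val sparkVersion = "3.1.0"
val sparkJacksonVersion = "2.10.0" // assumption, for illustration only

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  // Some unrelated dependency may transitively pull a newer jackson-databind...
  "com.example" %% "some-library" % "1.0.0" // made-up coordinate
)

// ...so force every Jackson module back to the version Spark was built against,
// keeping jackson-core, jackson-databind and the Scala module aligned with each other.
dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core" % "jackson-core" % sparkJacksonVersion,
  "com.fasterxml.jackson.core" % "jackson-databind" % sparkJacksonVersion,
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % sparkJacksonVersion
)
{code}

With a published BOM, a Maven or Gradle build could import the same constraints through a single dependencyManagement import or platform declaration instead of maintaining such an override list by hand.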
[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property
[ https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186857#comment-17186857 ] Apache Spark commented on SPARK-32693: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/29576 > Compare two dataframes with same schema except nullable property > > > Key: SPARK-32693 > URL: https://issues.apache.org/jira/browse/SPARK-32693 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.0 >Reporter: david bernuau >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 3.1.0 > > > My aim is to compare two dataframes with very close schemas : same number of > fields, with the same names, types and metadata. The only difference comes > from the fact that a given field might be nullable in one dataframe and not > in the other. > Here is the code that i used : > {code:java} > val session = SparkSession.builder().getOrCreate() > import org.apache.spark.sql.Row > import java.sql.Timestamp > import scala.collection.JavaConverters._ > case class A(g: Timestamp, h: Option[Timestamp], i: Int) > case class B(e: Int, f: Seq[A]) > case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int]) > case class D(e: Option[Int], f: Seq[C]) > val schema1 = StructType(Array(StructField("a", IntegerType, false), > StructField("b", IntegerType, false), StructField("c", IntegerType, false))) > val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2)) > val df1 = session.createDataFrame(rowSeq1.asJava, schema1) > df1.printSchema() > val schema2 = StructType(Array(StructField("a", IntegerType), > StructField("b", IntegerType), StructField("c", IntegerType))) > val rowSeq2: List[Row] = List(Row(10, 1, 1)) > val df2 = session.createDataFrame(rowSeq2.asJava, schema2) > df2.printSchema() > println(s"Number of records for first case : ${df1.except(df2).count()}") > val schema3 = StructType( > Array( > StructField("a", IntegerType, false), > StructField("b", IntegerType, false), > StructField("c", IntegerType, false), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType, > false), StructField("f", ArrayType(StructType(Array(StructField("g", > TimestampType), StructField("h", TimestampType), StructField("i", > IntegerType, false) > > > ) > ) > val date1 = new Timestamp(1597589638L) > val date2 = new Timestamp(1597599638L) > val rowSeq3: List[Row] = List(Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, > 1), Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2)) > val df3 = session.createDataFrame(rowSeq3.asJava, schema3) > df3.printSchema() > val schema4 = StructType( > Array( > StructField("a", IntegerType), > StructField("b", IntegerType), > StructField("b", IntegerType), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType), > StructField("f", ArrayType(StructType(Array(StructField("g", TimestampType), > StructField("h", TimestampType), StructField("i", IntegerType) > > > ) > ) > val rowSeq4: List[Row] = List(Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, > None, Some(1))) > val df4 = session.createDataFrame(rowSeq4.asJava, schema3) > df4.printSchema() > println(s"Number of records for second case : ${df3.except(df4).count()}") > {code} > The preceeding code shows what seems to me a bug in Spark : > * If you consider two dataframes (df1 and df2) having exactly the same > schema, except fields are not nullable for the first dataframe and are > nullable for the second. Then, doing df1.except(df2).count() works well. 
> * Now, if you consider two other dataframes (df3 and df4) having the same > schema (with fields nullable on one side and not on the other). If these two > dataframes contain nested fields, then, this time, the action > df3.except(df4).count gives the following exception : > java.lang.IllegalArgumentException: requirement failed: Join keys from two > sides should have same types -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property
[ https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186859#comment-17186859 ] Apache Spark commented on SPARK-32693: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/29576 > Compare two dataframes with same schema except nullable property > > > Key: SPARK-32693 > URL: https://issues.apache.org/jira/browse/SPARK-32693 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.0 >Reporter: david bernuau >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 3.1.0 > > > My aim is to compare two dataframes with very close schemas : same number of > fields, with the same names, types and metadata. The only difference comes > from the fact that a given field might be nullable in one dataframe and not > in the other. > Here is the code that i used : > {code:java} > val session = SparkSession.builder().getOrCreate() > import org.apache.spark.sql.Row > import java.sql.Timestamp > import scala.collection.JavaConverters._ > case class A(g: Timestamp, h: Option[Timestamp], i: Int) > case class B(e: Int, f: Seq[A]) > case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int]) > case class D(e: Option[Int], f: Seq[C]) > val schema1 = StructType(Array(StructField("a", IntegerType, false), > StructField("b", IntegerType, false), StructField("c", IntegerType, false))) > val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2)) > val df1 = session.createDataFrame(rowSeq1.asJava, schema1) > df1.printSchema() > val schema2 = StructType(Array(StructField("a", IntegerType), > StructField("b", IntegerType), StructField("c", IntegerType))) > val rowSeq2: List[Row] = List(Row(10, 1, 1)) > val df2 = session.createDataFrame(rowSeq2.asJava, schema2) > df2.printSchema() > println(s"Number of records for first case : ${df1.except(df2).count()}") > val schema3 = StructType( > Array( > StructField("a", IntegerType, false), > StructField("b", IntegerType, false), > StructField("c", IntegerType, false), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType, > false), StructField("f", ArrayType(StructType(Array(StructField("g", > TimestampType), StructField("h", TimestampType), StructField("i", > IntegerType, false) > > > ) > ) > val date1 = new Timestamp(1597589638L) > val date2 = new Timestamp(1597599638L) > val rowSeq3: List[Row] = List(Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, > 1), Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2)) > val df3 = session.createDataFrame(rowSeq3.asJava, schema3) > df3.printSchema() > val schema4 = StructType( > Array( > StructField("a", IntegerType), > StructField("b", IntegerType), > StructField("b", IntegerType), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType), > StructField("f", ArrayType(StructType(Array(StructField("g", TimestampType), > StructField("h", TimestampType), StructField("i", IntegerType) > > > ) > ) > val rowSeq4: List[Row] = List(Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, > None, Some(1))) > val df4 = session.createDataFrame(rowSeq4.asJava, schema3) > df4.printSchema() > println(s"Number of records for second case : ${df3.except(df4).count()}") > {code} > The preceeding code shows what seems to me a bug in Spark : > * If you consider two dataframes (df1 and df2) having exactly the same > schema, except fields are not nullable for the first dataframe and are > nullable for the second. Then, doing df1.except(df2).count() works well. 
> * Now, if you consider two other dataframes (df3 and df4) having the same > schema (with fields nullable on one side and not on the other). If these two > dataframes contain nested fields, then, this time, the action > df3.except(df4).count gives the following exception : > java.lang.IllegalArgumentException: requirement failed: Join keys from two > sides should have same types -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org