[jira] [Created] (SPARK-43973) Structured Streaming UI should display failed queries correctly
Kris Mok created SPARK-43973:

Summary: Structured Streaming UI should display failed queries correctly
Key: SPARK-43973
URL: https://issues.apache.org/jira/browse/SPARK-43973
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 3.4.0, 3.3.0, 3.2.0, 3.1.0
Reporter: Kris Mok

The Structured Streaming UI is designed to be able to show a query's status (active/finished/failed) and if failed, the error message.
Due to a bug in the implementation, the error message in {{QueryTerminatedEvent}} isn't being tracked by the UI data, so in turn the UI always shows failed queries as "finished".

Example:
{code:scala}
implicit val ctx = spark.sqlContext
import org.apache.spark.sql.execution.streaming.MemoryStream

spark.conf.set("spark.sql.ansi.enabled", "true")
val inputData = MemoryStream[(Int, Int)]
val df = inputData.toDF().selectExpr("_1 / _2 as a")
inputData.addData((1, 2), (3, 4), (5, 6), (7, 0))
val testQuery = df.writeStream.format("memory").queryName("kristest").outputMode("append").start
testQuery.processAllAvailable()
{code}

Here we intentionally fail a query, but the Spark UI's Structured Streaming tab would show this as "FINISHED" without any errors, which is wrong.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42851) EquivalentExpressions methods need to be consistently guarded by supportedExpression
Kris Mok created SPARK-42851:

Summary: EquivalentExpressions methods need to be consistently guarded by supportedExpression
Key: SPARK-42851
URL: https://issues.apache.org/jira/browse/SPARK-42851
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.3.2, 3.4.0
Reporter: Kris Mok

SPARK-41468 tried to fix a bug but introduced a new regression. Its change to {{EquivalentExpressions}} added a {{supportedExpression()}} guard to the {{addExprTree()}} and {{getExprState()}} methods, but didn't add the same guard to the other "add" entry point -- {{addExpr()}}.

As such, callers that add single expressions to CSE via {{addExpr()}} may succeed, but a later retrieval via {{getExprState()}} can inconsistently return {{None}} because the expression fails the guard.

We need to make sure the "add" and "get" methods are consistent. This could be done by either:
1. Adding the same {{supportedExpression()}} guard to {{addExpr()}}, or
2. Removing the guard from {{getExprState()}}, relying solely on the guard on the "add" path to make sure only intended state is added.
(or other alternative refactorings that fuse the guard into the various methods to make it more efficient)

There are pros and cons to both directions: {{addExpr()}} used to allow more (potentially incorrect) expressions to get CSE'd, so making it more restrictive may cause performance regressions (for the cases that happened to work).
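Option 1 above can be illustrated with a toy model. This is only a minimal sketch under stated assumptions, not Spark's {{EquivalentExpressions}}: the {{Expr}} hierarchy, {{EquivalentExprs}}, {{supported}}, and {{getUseCount}} are hypothetical names invented for the illustration. The point is that the same guard is applied on both the "add" and the "get" path, so the two can never disagree.

```scala
import scala.collection.mutable

object GuardedCseSketch {
  // Toy expression tree; NOT Spark's Expression class (hypothetical).
  sealed trait Expr
  case class Leaf(name: String) extends Expr
  case class Lambda(body: Expr) extends Expr // stands in for an unsupported expression
  case class Add(left: Expr, right: Expr) extends Expr

  final class EquivalentExprs {
    private val counts = mutable.HashMap.empty[Expr, Int]

    // Single guard shared by every entry point.
    private def supported(e: Expr): Boolean = e match {
      case _: Lambda => false
      case Add(l, r) => supported(l) && supported(r)
      case _: Leaf   => true
    }

    /** Returns true if `e` had been added before; unsupported expressions are never recorded. */
    def addExpr(e: Expr): Boolean = {
      if (!supported(e)) return false
      val seen = counts.getOrElse(e, 0)
      counts(e) = seen + 1
      seen > 0
    }

    /** Applies the same guard as `addExpr`, so "add" and "get" stay consistent. */
    def getUseCount(e: Expr): Option[Int] =
      if (supported(e)) counts.get(e) else None
  }
}
```

With this shape, an expression rejected by the guard is neither stored nor found, which is the consistency property the issue asks for. The cost, as noted above, is that expressions that previously happened to be deduplicated no longer are.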
Example:
{code:sql}
select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2)
{code}

Running this query on Spark 3.2 branch returns the correct value:
{code}
scala> spark.sql("select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2)").collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(1),WrappedArray(1)])
{code}
Here, {{transform(array(id), x -> x)}} is an {{AggregateExpression}} that was (potentially unsafely) recognized by {{addExpr()}} as a common subexpression, and {{getExprState()}} doesn't do extra guarding, so during physical planning, in {{PhysicalAggregation}} this expression gets CSE'd in both the aggregation expression list and the result expressions list.
{code}
AdaptiveSparkPlan isFinalPlan=false
+- SortAggregate(key=[], functions=[max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11]
      +- SortAggregate(key=[], functions=[partial_max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))])
         +- Range (0, 2, step=1, splits=16)
{code}

Running the same query on current master triggers an error when binding the result expression to the aggregate expression in the Aggregate operators (for a WSCG-enabled operator like {{HashAggregateExec}}, the same error would show up during codegen):
{code}
ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 16) (ip-10-110-16-93.us-west-2.compute.internal executor driver): java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), lambdafunction(lambda x#2L, lambda x#2L, false)))#4 in [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))#3]
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:517)
  at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1249)
  at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1248)
  at org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:532)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:517)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:456)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94)
  at
[jira] [Created] (SPARK-40380) Constant-folding of InvokeLike should not result in non-serializable result
Kris Mok created SPARK-40380:

Summary: Constant-folding of InvokeLike should not result in non-serializable result
Key: SPARK-40380
URL: https://issues.apache.org/jira/browse/SPARK-40380
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.3.0
Reporter: Kris Mok

SPARK-37907 added constant-folding support to the {{InvokeLike}} family of expressions. Unfortunately it introduced a regression for cases where a constant-folded {{InvokeLike}} expression returns a non-serializable result. {{ExpressionEncoder}}s are one area where this problem may be exposed, e.g. when using sparksql-scalapb on Spark 3.3.0+.

Below is a minimal repro to demonstrate this issue:
{code:scala}
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, StaticInvoke}
import org.apache.spark.sql.types.{LongType, ObjectType}

class NotSerializableBoxedLong(longVal: Long) { def add(other: Long): Long = longVal + other }
case class SerializableBoxedLong(longVal: Long) { def toNotSerializable(): NotSerializableBoxedLong = new NotSerializableBoxedLong(longVal) }

val litExpr = Literal.fromObject(SerializableBoxedLong(42L), ObjectType(classOf[SerializableBoxedLong]))
val toNotSerializableExpr = Invoke(litExpr, "toNotSerializable", ObjectType(classOf[NotSerializableBoxedLong]))
val addExpr = Invoke(toNotSerializableExpr, "add", LongType, Seq(UnresolvedAttribute.quotedString("id")))
val df = spark.range(2).select(new Column(addExpr))
df.collect
{code}

Before SPARK-37907, this example would run fine and result in {{[[42], [43]]}}. But after SPARK-37907, it'd fail with:
{code:none}
...
Caused by: java.io.NotSerializableException: NotSerializableBoxedLong
Serialization stack:
  - object not serializable (class: NotSerializableBoxedLong, value: NotSerializableBoxedLong@71231636)
  - element of array (index: 1)
  - array (class [Ljava.lang.Object;, size 2)
  - element of array (index: 1)
  - array (class [Ljava.lang.Object;, size 3)
  - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
  - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.WholeStageCodegenExec, functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=3])
  - writeReplace data (class: java.lang.invoke.SerializedLambda)
  - object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389, org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389@185db22c)
  at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
  at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49)
  at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115)
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:441)
{code}
[jira] [Commented] (SPARK-40303) The performance will be worse after codegen
[ https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599616#comment-17599616 ]

Kris Mok commented on SPARK-40303:
--

Nice findings [~LuciferYang]!

{quote}
After some experiments, I found when the number of parameters exceeds 50, the performance of the case in the Jira description will significant deterioration.
{quote}
Sounds reasonable. Note that in a STD compilation, the only things that need to be live at the method entry are the method parameters (both implicit ones like {{this}}, and explicit ones); however, for an OSR compilation, it would be all of the parameters/local variables that are live at the loop entry point, so in this case both the {{doConsume}} parameters and the local variables contribute to the problem.

Just FYI I have an old write-up on PrintCompilation and OSR here: https://gist.github.com/rednaxelafx/1165804#file-notes-md
(Gee, just realized that was from 11 years ago...)

{quote}
maybe try to make the input parameters of the `doConsume` method fixed length will help, such as using a List or Array
{quote}
Welp, hoisting the parameters into an Arguments object is rather common in "code splitting" in code generators. Since we're already doing codegen, it's possible to generate tailor-made Arguments classes that retain the type information. Using a List/Array would require extra boxing for primitive types, which is less ideal.
(An array-based box is already used in Spark SQL's codegen in the form of the {{references}} array. Indeed the type info is lost at the interface level and you have to do a cast when you get data out of it. It's still usable though.)
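The trade-off between a tailor-made Arguments class and an array-based box can be sketched roughly as follows. This is a hypothetical illustration, not code that Spark's codegen emits: {{ConsumeArgs}}, {{consumeTyped}}, and {{consumeBoxed}} are made-up names, and the three fields merely stand in for a long {{doConsume}} parameter list.

```scala
object ArgumentsHoistingSketch {
  // Hypothetical tailor-made arguments class: fields keep their primitive types,
  // so nothing is boxed and no casts are needed when reading them back.
  final class ConsumeArgs(val value0: Long, val isNull0: Boolean, val value1: Long)

  // The method takes one parameter no matter how many fields the holder has.
  def consumeTyped(args: ConsumeArgs): Long =
    if (args.isNull0) args.value1 else args.value0 + args.value1

  // Array-based alternative: primitives are boxed when stored in Array[Any],
  // and the static type information has to be recovered with casts.
  def consumeBoxed(args: Array[Any]): Long = {
    val value0 = args(0).asInstanceOf[Long]
    val isNull0 = args(1).asInstanceOf[Boolean]
    val value1 = args(2).asInstanceOf[Long]
    if (isNull0) value1 else value0 + value1
  }

  def main(args: Array[String]): Unit = {
    println(consumeTyped(new ConsumeArgs(40L, false, 2L))) // 42
    println(consumeBoxed(Array[Any](40L, false, 2L)))      // 42
  }
}
```

Both variants reduce the number of method parameters to one; the typed holder just does so without the boxing and casting that the array version needs.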
> The performance will be worse after codegen
> ---
>
> Key: SPARK-40303
> URL: https://issues.apache.org/jira/browse/SPARK-40303
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Yuming Wang
> Priority: Major
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> val dir = "/tmp/spark/benchmark"
> val N = 200
> val columns = Range(0, 100).map(i => s"id % $i AS id$i")
> spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)
> // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
> Seq(60).foreach{ cnt =>
>   val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => s"count(distinct $c)")
>   val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 1)
>   benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
>     withSQLConf(
>       "spark.sql.codegen.wholeStage" -> "true",
>       "spark.sql.codegen.factoryMode" -> "FALLBACK") {
>       spark.read.parquet(dir).selectExpr(selectExps: _*).write.format("noop").mode("Overwrite").save()
>     }
>   }
>   benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
>     withSQLConf(
>       "spark.sql.codegen.wholeStage" -> "false",
>       "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
>       spark.read.parquet(dir).selectExpr(selectExps: _*).write.format("noop").mode("Overwrite").save()
>     }
>   }
>   benchmark.run()
> }
> {code}
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Benchmark count distinct:             Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
>
> 60 count distinct with codegen               628146         628146           0         0.0      314072.8       1.0X
> 60 count distinct without codegen            147635         147635           0         0.0       73817.5       4.3X
> {noformat}
[jira] [Commented] (SPARK-40303) The performance will be worse after codegen
[ https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599302#comment-17599302 ]

Kris Mok commented on SPARK-40303:
--

Interesting, thanks for posting your findings, [~LuciferYang]!

In the JDK8 case, the message:
{code}
COMPILE SKIPPED: unsupported calling sequence (not retryable)
{code}
indicates that the compiler decided this method should not be compiled again on any tier of compilation, and because this is an OSR compilation (i.e. loop compilation), this will mark the method as "never try to perform OSR compilation again on all tiers". On JDK17 this is relaxed a bit and it'll try tier 1.

Note that OSR compilation and STD compilation (normal compilation) are separate things. Marking one as not compilable doesn't affect the other.
I haven't checked the IR yet, but if I had to guess, the reason this method is recorded as not OSR compilable is that there are too many live local variables at the OSR (loop) entry point, beyond what the HotSpot JVM can support. So only OSR is affected; STD should still be fine.

In general, the tiered compilation system in the HotSpot JVM works as follows:
- tier 0: interpreter
- tier 1: C1 no profiling (best code quality for C1, same as HotSpot Client VM's C1)
- tier 2: C1 basic profiling (lower performance than tier 1, only used when the target level is tier 3 but the C1 queue is too long)
- tier 3: C1 full profiling (lower performance than tier 2, for collecting profile to perform tier 4 profile-guided optimization)
- tier 4: C2

As such, tiers 2 and 3 are only useful if tier 4 is available. So if a method is recorded as "not compilable on tier 4", the only realistic option left is to try tier 1.
[jira] [Created] (SPARK-39839) Handle special case of null variable-length Decimal with non-zero offsetAndSize in UnsafeRow structural integrity check
Kris Mok created SPARK-39839:

Summary: Handle special case of null variable-length Decimal with non-zero offsetAndSize in UnsafeRow structural integrity check
Key: SPARK-39839
URL: https://issues.apache.org/jira/browse/SPARK-39839
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.3.0, 3.2.0, 3.1.0
Reporter: Kris Mok

The {{UnsafeRow}} structural integrity check in {{UnsafeRowUtils.validateStructuralIntegrity}} was added in Spark 3.1.0. It's supposed to validate that a given {{UnsafeRow}} conforms to the format that the {{UnsafeRowWriter}} would have produced.

Currently the check expects every field that is marked as null to also have its fixed-length part set to all zeros. It needs to be updated to handle a special case for variable-length {{Decimal}}s, where the {{UnsafeRowWriter}} may mark a field as null but also leave the fixed-length part of the field as {{OffsetAndSize(offset=current_offset, size=0)}}. This may happen when the {{Decimal}} being written is either a real {{null}} or has overflowed the specified precision.

Logic in {{UnsafeRowWriter}}:

in general:
{code:java}
public void setNullAt(int ordinal) {
  BitSetMethods.set(getBuffer(), startingOffset, ordinal); // set null bit
  write(ordinal, 0L);                                      // also zero out the fixed-length field
}
{code}

special case for {{DecimalType}}:
{code:java}
// Make sure Decimal object has the same scale as DecimalType.
// Note that we may pass in null Decimal object to set null for it.
if (input == null || !input.changePrecision(precision, scale)) {
  BitSetMethods.set(getBuffer(), startingOffset, ordinal); // set null bit
  // keep the offset for future update
  setOffsetAndSize(ordinal, 0);                            // doesn't zero out the fixed-length field
}
{code}

The special case was introduced to allow all {{DecimalType}}s (both fixed-length and variable-length ones) to be mutable, so space has to be reserved for the variable-length field even if it's currently null.

Note that this special case in {{UnsafeRowWriter}} has been there since Spark 1.6.0, whereas the integrity check was added in Spark 3.1.0. The check was originally added for Structured Streaming's checkpoint evolution validation, so that a newer version of Spark can check whether an older checkpoint file for Structured Streaming queries can be supported, and/or whether the contents of the checkpoint file are corrupted.
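The relaxed rule for null fields described above could look roughly like the following. This is a minimal sketch and not the actual change to {{UnsafeRowUtils}}: {{NullFieldCheckSketch}}, {{offsetAndSize}}, and {{isValidNullSlot}} are hypothetical names, and the sketch assumes the usual {{UnsafeRow}} packing of offset in the upper 32 bits and size in the lower 32 bits of the fixed-length word. A real check would presumably also verify that the offset stays within the row's bounds.

```scala
object NullFieldCheckSketch {
  // Pack (offset, size) into one 8-byte word: offset in the upper 32 bits,
  // size in the lower 32 bits (assumed layout, see lead-in).
  def offsetAndSize(offset: Int, size: Int): Long =
    (offset.toLong << 32) | (size.toLong & 0xFFFFFFFFL)

  // Validates the fixed-length word of a field whose null bit is set.
  def isValidNullSlot(fixedWord: Long, isVarLenDecimal: Boolean): Boolean =
    if (isVarLenDecimal) {
      // Special case: the offset may be non-zero, but the size must be 0.
      (fixedWord & 0xFFFFFFFFL) == 0L
    } else {
      // General case: setNullAt zeroes the whole fixed-length word.
      fixedWord == 0L
    }
}
```

Under this rule, a null variable-length decimal written as {{OffsetAndSize(offset=current_offset, size=0)}} passes, while any other null field with a non-zero word is still rejected.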
[jira] [Commented] (SPARK-39140) JavaSerializer doesn't serialize the fields of superclass
[ https://issues.apache.org/jira/browse/SPARK-39140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534633#comment-17534633 ]

Kris Mok commented on SPARK-39140:
--

This isn't a "bug" in Spark's {{JavaSerializer}} per se, but a feature working as designed that users of this class can easily overlook.

The underlying issue can be demonstrated with a simple tweak to the code snippet in the issue description:
{code:scala}
abstract class AA extends Serializable {
  val ts = System.nanoTime()
}
case class BB(x: Int) extends AA {
}
{code}
Now run the rest of the original code snippet and you should see that {{input.ts}} and {{obj1.ts}} are now the same, i.e. the field is correctly serialized and then deserialized.

What is it telling us? Well, in Scala, a {{case class}} implicitly implements the {{scala.Serializable}} trait, which has the effect of {{java.io.Serializable}}. The {{JavaSerializer}} uses the Java standard library's builtin serialization mechanism, which specifies that only fields on {{Serializable}} classes are serialized by default. Effectively, you can have a subclass that is serializable, but fields on all of its non-serializable supertypes will just be ignored.
There are ways to customize the Java serialization mechanism to make subclasses also write out fields from the superclass, but there's no builtin way to do that -- you have to roll your own code. Scala doesn't help here either.

Reflection-based third-party serializers, like Kryo in this example, usually ignore the language-level serialization markers and just serialize everything -- except for fields explicitly marked as transient (excluded from serialization). That's how {{KryoSerializer}} "works" here.

https://docs.oracle.com/javase/8/docs/platform/serialization/spec/serial-arch.html#a4176
{quote}Special handling is required for arrays, enum constants, and objects of type Class, ObjectStreamClass, and String.
Other objects must implement either the Serializable or the Externalizable interface to be saved in or restored from a stream.{quote}
and
https://docs.oracle.com/javase/8/docs/platform/serialization/spec/output.html#a861
{quote}Each subclass of a serializable object may define its own writeObject method. If a class does not implement the method, the default serialization provided by defaultWriteObject will be used. When implemented, the class is only responsible for writing its own fields, not those of its supertypes or subtypes.{quote}
and
https://docs.oracle.com/javase/8/docs/api/java/io/ObjectOutputStream.html
{quote}Serialization does not write out the fields of any object that does not implement the java.io.Serializable interface. Subclasses of Objects that are not serializable can be serializable. In this case the non-serializable class must have a no-arg constructor to allow its fields to be initialized. In this case it is the responsibility of the subclass to save and restore the state of the non-serializable class.
It is frequently the case that the fields of that class are accessible (public, package, or protected) or that there are get and set methods that can be used to restore the state.{quote}

> JavaSerializer doesn't serialize the fields of superclass
> -
>
> Key: SPARK-39140
> URL: https://issues.apache.org/jira/browse/SPARK-39140
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: Gengliang Wang
> Priority: Major
>
> To reproduce:
>
> {code:java}
> abstract class AA {
>   val ts = System.nanoTime()
> }
> case class BB(x: Int) extends AA {
> }
> val input = BB(1)
> println("original ts: " + input.ts)
> val javaSerializer = new JavaSerializer(new SparkConf())
> val javaInstance = javaSerializer.newInstance()
> val bytes1 = javaInstance.serialize[BB](input)
> val obj1 = javaInstance.deserialize[BB](bytes1)
> println("deserialization result from java: " + obj1.ts)
> val kryoSerializer = new KryoSerializer(new SparkConf())
> val kryoInstance = kryoSerializer.newInstance()
> val bytes2 = kryoInstance.serialize[BB](input)
> val obj2 = kryoInstance.deserialize[BB](bytes2)
> println("deserialization result from kryo: " + obj2.ts) {code}
>
>
> The output is
>
> {code:java}
> original ts: 115014173658666
> deserialization result from java: 115014306794333
> deserialization result from kryo: 115014173658666{code}
>
> We can see that the fields from the superclass AA are not serialized with
> JavaSerializer. When switching to KryoSerializer, it works.
> This caused bugs in the project SPARK-38615: TreeNode.origin with actual
> information is not serialized to executors when a plan can't be executed with
> whole-staged-codegen.
> It could also lead to bugs in serializing the lambda function within RDD API
> like
>
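The behavior explained in the SPARK-39140 comment above can be reproduced without Spark, using only {{java.io}}. The following is a minimal sketch: {{SuperclassFieldDemo}}, {{PlainParent}}, {{SerParent}}, and {{roundTrip}} are hypothetical names introduced for the illustration, and {{roundTrip}} uses plain {{ObjectOutputStream}}/{{ObjectInputStream}} rather than Spark's {{JavaSerializer}}.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object SuperclassFieldDemo {
  // Parent is NOT Serializable: its field is skipped when writing, and its
  // no-arg constructor runs again on deserialization, re-initializing `ts`.
  abstract class PlainParent { val ts: Long = System.nanoTime() }
  case class PlainChild(x: Int) extends PlainParent

  // Parent IS Serializable: `ts` is written out and restored as-is.
  abstract class SerParent extends Serializable { val ts: Long = System.nanoTime() }
  case class SerChild(x: Int) extends SerParent

  // Serialize and deserialize through plain Java serialization.
  def roundTrip[T](obj: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val plain = PlainChild(1)
    val ser = SerChild(1)
    // Typically prints two different timestamps for the first pair,
    // and always the same timestamp for the second pair.
    println(s"non-serializable parent: ${plain.ts} -> ${roundTrip(plain).ts}")
    println(s"serializable parent:     ${ser.ts} -> ${roundTrip(ser).ts}")
  }
}
```

The subclass field {{x}} survives in both cases; only the state owned by the non-serializable parent is lost, which matches the quoted {{ObjectOutputStream}} documentation.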
[jira] [Updated] (SPARK-34607) NewInstance.resolved should not throw malformed class name error
[ https://issues.apache.org/jira/browse/SPARK-34607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kris Mok updated SPARK-34607:
-
Description:
I'd like to seek community help on fixing this issue:

Related to SPARK-34596, one of our users hit an issue with {{ExpressionEncoder}} when running Spark code in a Scala REPL, where {{NewInstance.resolved}} was throwing a {{"Malformed class name"}} error, with the following kind of stack trace:
{code}
java.lang.InternalError: Malformed class name
  at java.lang.Class.getSimpleBinaryName(Class.java:1450)
  at java.lang.Class.isMemberClass(Class.java:1433)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved$lzycompute(objects.scala:447)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved(objects.scala:441)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionBottomUp(Analyzer.scala:1935)
  ...
Caused by: sbt.ForkMain$ForkError: java.lang.StringIndexOutOfBoundsException: String index out of range: -83
  at java.lang.String.substring(String.java:1931)
  at java.lang.Class.getSimpleBinaryName(Class.java:1448)
  at java.lang.Class.isMemberClass(Class.java:1433)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved$lzycompute(objects.scala:447)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved(objects.scala:441)
  ...
{code}

The most important point in the stack trace is this:
{code}
java.lang.InternalError: Malformed class name
  at java.lang.Class.getSimpleBinaryName(Class.java:1450)
  at java.lang.Class.isMemberClass(Class.java:1433)
{code}
The most common way to hit the {{"Malformed class name"}} issue in Spark is via {{java.lang.Class.getSimpleName}}. But as this stack trace demonstrates, it can happen via other code paths from the JDK as well.

If we want to fix it in a similar fashion to {{Utils.getSimpleName}}, we'd have to emulate {{java.lang.Class.isMemberClass}} in Spark's {{Utils}}, and then use it in the {{NewInstance.resolved}} code path.

Here's a reproducer test case (in diff form against Spark master's {{ExpressionEncoderSuite}}), which uses explicit nested classes to emulate the code structure that'd be generated by Scala's REPL:
(the level of nesting can be further reduced and still reproduce the issue)
{code}
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
index 2635264..fd1b23d 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
@@ -217,6 +217,95 @@ class ExpressionEncoderSuite extends CodegenInterpretedPlanTest with AnalysisTes
       "nested Scala class should work")
   }
 
+  object OuterLevelWithVeryVeryVeryLongClassName1 {
+    object OuterLevelWithVeryVeryVeryLongClassName2 {
+      object OuterLevelWithVeryVeryVeryLongClassName3 {
+        object OuterLevelWithVeryVeryVeryLongClassName4 {
+          object OuterLevelWithVeryVeryVeryLongClassName5 {
+            object OuterLevelWithVeryVeryVeryLongClassName6 {
+              object OuterLevelWithVeryVeryVeryLongClassName7 {
+                object OuterLevelWithVeryVeryVeryLongClassName8 {
+                  object OuterLevelWithVeryVeryVeryLongClassName9 {
+                    object OuterLevelWithVeryVeryVeryLongClassName10 {
+                      object OuterLevelWithVeryVeryVeryLongClassName11 {
+                        object OuterLevelWithVeryVeryVeryLongClassName12 {
+                          object OuterLevelWithVeryVeryVeryLongClassName13 {
+                            object OuterLevelWithVeryVeryVeryLongClassName14 {
+                              object OuterLevelWithVeryVeryVeryLongClassName15 {
+                                object OuterLevelWithVeryVeryVeryLongClassName16 {
+                                  object OuterLevelWithVeryVeryVeryLongClassName17 {
+                                    object OuterLevelWithVeryVeryVeryLongClassName18 {
+                                      object OuterLevelWithVeryVeryVeryLongClassName19 {
+                                        object OuterLevelWithVeryVeryVeryLongClassName20 {
+                                          case class MalformedNameExample2(x: Int)
+                                        }
+                                      }
+                                    }
+                                  }
+                                }
+                              }
+                            }
+                          }
+                        }
+                      }
+                    }
+
[jira] [Updated] (SPARK-34607) NewInstance.resolved should not throw malformed class name error
[ https://issues.apache.org/jira/browse/SPARK-34607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kris Mok updated SPARK-34607:
-
Description:
I'd like to seek community help on fixing this issue:

Related to SPARK-34596, one of our users hit an issue with {{ExpressionEncoder}} when running Spark code in a Scala REPL, where {{NewInstance.resolved}} was throwing a {{"Malformed class name"}} error, with the following kind of stack trace:
{code}
java.lang.InternalError: Malformed class name
  at java.lang.Class.getSimpleBinaryName(Class.java:1450)
  at java.lang.Class.isMemberClass(Class.java:1433)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved$lzycompute(objects.scala:447)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved(objects.scala:441)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionBottomUp(Analyzer.scala:1935)
  ...
Caused by: sbt.ForkMain$ForkError: java.lang.StringIndexOutOfBoundsException: String index out of range: -83
  at java.lang.String.substring(String.java:1931)
  at java.lang.Class.getSimpleBinaryName(Class.java:1448)
  at java.lang.Class.isMemberClass(Class.java:1433)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved$lzycompute(objects.scala:447)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved(objects.scala:441)
  ...
{code}

The most important point in the stack trace is this:
{code}
java.lang.InternalError: Malformed class name
  at java.lang.Class.getSimpleBinaryName(Class.java:1450)
  at java.lang.Class.isMemberClass(Class.java:1433)
{code}
The most common way to hit the {{"Malformed class name"}} issue in Spark is via {{java.lang.Class.getSimpleName}}. But as this stack trace demonstrates, it can happen via other code paths from the JDK as well.

If we want to fix it in a similar fashion to {{Utils.getSimpleName}}, we'd have to emulate {{java.lang.Class.isMemberClass}} in Spark's {{Utils}}, and then use it in the {{NewInstance.resolved}} code path.

Here's a reproducer test case (in diff form against Spark master's {{ExpressionEncoderSuite}}):
(the level of nesting can be further reduced and still reproduce the issue)
{code}
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
index 2635264..fd1b23d 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
@@ -217,6 +217,95 @@ class ExpressionEncoderSuite extends CodegenInterpretedPlanTest with AnalysisTes
       "nested Scala class should work")
   }
 
+  object OuterLevelWithVeryVeryVeryLongClassName1 {
+    object OuterLevelWithVeryVeryVeryLongClassName2 {
+      object OuterLevelWithVeryVeryVeryLongClassName3 {
+        object OuterLevelWithVeryVeryVeryLongClassName4 {
+          object OuterLevelWithVeryVeryVeryLongClassName5 {
+            object OuterLevelWithVeryVeryVeryLongClassName6 {
+              object OuterLevelWithVeryVeryVeryLongClassName7 {
+                object OuterLevelWithVeryVeryVeryLongClassName8 {
+                  object OuterLevelWithVeryVeryVeryLongClassName9 {
+                    object OuterLevelWithVeryVeryVeryLongClassName10 {
+                      object OuterLevelWithVeryVeryVeryLongClassName11 {
+                        object OuterLevelWithVeryVeryVeryLongClassName12 {
+                          object OuterLevelWithVeryVeryVeryLongClassName13 {
+                            object OuterLevelWithVeryVeryVeryLongClassName14 {
+                              object OuterLevelWithVeryVeryVeryLongClassName15 {
+                                object OuterLevelWithVeryVeryVeryLongClassName16 {
+                                  object OuterLevelWithVeryVeryVeryLongClassName17 {
+                                    object OuterLevelWithVeryVeryVeryLongClassName18 {
+                                      object OuterLevelWithVeryVeryVeryLongClassName19 {
+                                        object OuterLevelWithVeryVeryVeryLongClassName20 {
+                                          case class MalformedNameExample2(x: Int)
+                                        }
+                                      }
+                                    }
+                                  }
+                                }
+                              }
+                            }
+                          }
+                        }
+                      }
+                    }
+                  }
+                }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
[jira] [Created] (SPARK-34607) NewInstance.resolved should not throw malformed class name error
Kris Mok created SPARK-34607:

Summary: NewInstance.resolved should not throw malformed class name error
Key: SPARK-34607
URL: https://issues.apache.org/jira/browse/SPARK-34607
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.1.1, 3.0.2, 2.4.7
Reporter: Kris Mok

I'd like to ask for community help with fixing this issue:

Related to SPARK-34596, one of our users hit an issue with {{ExpressionEncoder}} when running Spark code in a Scala REPL, where {{NewInstance.resolved}} threw a {{"Malformed class name"}} error, with the following kind of stack trace:
{code}
java.lang.InternalError: Malformed class name
  at java.lang.Class.getSimpleBinaryName(Class.java:1450)
  at java.lang.Class.isMemberClass(Class.java:1433)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved$lzycompute(objects.scala:447)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved(objects.scala:441)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionBottomUp(Analyzer.scala:1935)
  ...
Caused by: sbt.ForkMain$ForkError: java.lang.StringIndexOutOfBoundsException: String index out of range: -83
  at java.lang.String.substring(String.java:1931)
  at java.lang.Class.getSimpleBinaryName(Class.java:1448)
  at java.lang.Class.isMemberClass(Class.java:1433)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved$lzycompute(objects.scala:447)
  at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved(objects.scala:441)
  ...
{code}
The most important part of the stack trace is this:
{code}
java.lang.InternalError: Malformed class name
  at java.lang.Class.getSimpleBinaryName(Class.java:1450)
  at java.lang.Class.isMemberClass(Class.java:1433)
{code}
The most common way to hit the {{"Malformed class name"}} issue in Spark is via {{java.lang.Class.getSimpleName}}. But as this stack trace demonstrates, it can happen via other JDK code paths as well.
If we want to fix it in a similar fashion as {{Utils.getSimpleName}}, we'd have to emulate {{java.lang.Class.isMemberClass}} in Spark's {{Utils}}, and then use it in the {{NewInstance.resolved}} code path. Here's a reproducer test case (in diff form against Spark master's {{ExpressionEncoderSuite}} ): {code} diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala index 2635264..fd1b23d 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala @@ -217,6 +217,95 @@ class ExpressionEncoderSuite extends CodegenInterpretedPlanTest with AnalysisTes "nested Scala class should work") } + object OuterLevelWithVeryVeryVeryLongClassName1 { +object OuterLevelWithVeryVeryVeryLongClassName2 { + object OuterLevelWithVeryVeryVeryLongClassName3 { +object OuterLevelWithVeryVeryVeryLongClassName4 { + object OuterLevelWithVeryVeryVeryLongClassName5 { +object OuterLevelWithVeryVeryVeryLongClassName6 { + object OuterLevelWithVeryVeryVeryLongClassName7 { +object OuterLevelWithVeryVeryVeryLongClassName8 { + object OuterLevelWithVeryVeryVeryLongClassName9 { +object OuterLevelWithVeryVeryVeryLongClassName10 { + object OuterLevelWithVeryVeryVeryLongClassName11 { +object OuterLevelWithVeryVeryVeryLongClassName12 { + object OuterLevelWithVeryVeryVeryLongClassName13 { +object OuterLevelWithVeryVeryVeryLongClassName14 { + object OuterLevelWithVeryVeryVeryLongClassName15 { +object OuterLevelWithVeryVeryVeryLongClassName16 { + object OuterLevelWithVeryVeryVeryLongClassName17 { +object OuterLevelWithVeryVeryVeryLongClassName18 { + object OuterLevelWithVeryVeryVeryLongClassName19 { +object OuterLevelWithVeryVeryVeryLongClassName20 { + case class MalformedNameExample2(x: Int) +} + } +} + } +} + } +} + 
} +} + } +
[jira] [Created] (SPARK-34596) NewInstance.doGenCode should not throw malformed class name error
Kris Mok created SPARK-34596:

Summary: NewInstance.doGenCode should not throw malformed class name error
Key: SPARK-34596
URL: https://issues.apache.org/jira/browse/SPARK-34596
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.2, 2.4.7, 3.1.0
Reporter: Kris Mok

Similar to SPARK-32238 and SPARK-32999, the use of {{java.lang.Class.getSimpleName}} in {{NewInstance.doGenCode}} is problematic because Scala classes may trigger {{java.lang.InternalError: Malformed class name}}. This happens more often with nested classes in Scala (or with classes declared in the Scala REPL, which implies class nesting).

Note that on newer versions of the JDK the underlying malformed class name no longer reproduces (fixed in the JDK by https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8057919), so it's less of an issue there. But the problem still exists on JDK8u, so we still have to fix it.
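One possible shape for a workaround, sketched under the assumption that a best-effort name is good enough for codegen and debug output: try the JDK call first and fall back to plain string manipulation on the binary name when it throws. {{SafeNames}}, {{safeSimpleName}} and {{stripDollars}} are hypothetical names used only for illustration; this is not Spark's actual helper.

```scala
object SafeNames {
  // Best-effort simple name: prefer the JDK's answer, and fall back to string
  // manipulation on the binary name if the JDK throws InternalError.
  def safeSimpleName(cls: Class[_]): String =
    try cls.getSimpleName
    catch { case _: InternalError => stripDollars(cls.getName) }

  // Hypothetical fallback: drop the package prefix, drop trailing '$'
  // (Scala module classes), then keep what follows the last remaining '$'.
  def stripDollars(binaryName: String): String = {
    val noPackage = binaryName.substring(binaryName.lastIndexOf('.') + 1)
    val trimmed = noPackage.reverse.dropWhile(_ == '$').reverse
    trimmed.substring(trimmed.lastIndexOf('$') + 1)
  }
}
```

The fallback never consults the class's enclosing-class metadata, so it cannot hit the JDK code path that raises the error. The trade-off is that it is only a heuristic for unusual names.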
[jira] [Created] (SPARK-32999) TreeNode.nodeName should not throw malformed class name error
Kris Mok created SPARK-32999:

Summary: TreeNode.nodeName should not throw malformed class name error
Key: SPARK-32999
URL: https://issues.apache.org/jira/browse/SPARK-32999
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.0, 2.4.0, 3.1.0
Reporter: Kris Mok

Similar to SPARK-32238, the use of {{java.lang.Class.getSimpleName}} in {{TreeNode.nodeName}} is problematic because Scala classes may trigger {{java.lang.InternalError: Malformed class name}}. This happens more often with nested classes in Scala (or with classes declared in the Scala REPL, which implies class nesting).

Note that on newer versions of the JDK the underlying malformed class name no longer reproduces, so it's less of an issue there. But the problem still exists on JDK8u, so we still have to fix it.
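To see why nesting matters, it can help to look at the binary names the Scala compiler produces for nested definitions. A minimal sketch follows; {{Outer}}, {{Inner}}, {{Leaf}} and {{NestedNameDemo}} are made-up names for illustration.

```scala
object Outer {
  object Inner {
    case class Leaf(x: Int)
  }
}

object NestedNameDemo {
  // The JVM binary name of a class nested in Scala objects embeds the
  // enclosing object names, separated by '$'.
  val binaryName: String = classOf[Outer.Inner.Leaf].getName

  def main(args: Array[String]): Unit =
    println(binaryName) // contains '$' separators and ends with "Leaf"
}
```

Names of this shape are what the JDK 8 simple-name logic has to take apart, which is where the malformed class name error originates.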
[jira] [Comment Edited] (SPARK-31399) Closure cleaner broken in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096191#comment-17096191 ]

Kris Mok edited comment on SPARK-31399 at 4/30/20, 6:32 AM:

Hi [~dongjoon], I've been working on a fix for this issue and will send out a WIP PR as soon as possible.

I've pretty much done an analysis of the situation in parallel to [~joshrosen]'s analysis above and have arrived at very similar conclusions. The fact is, Scala 2.12+'s indylambda (aka LMF-based closures) does still have an equivalent of an {{"$outer"}}, just under a different name. Thus the logic inside the {{ClosureCleaner}} for Scala 2.11 support has to be ported basically verbatim to Scala 2.12+/indylambda. That's exactly what I'm working on right now, and it's the main content of the WIP PR.

A separate issue is that the test coverage of {{ClosureCleaner}} in the Spark repo is far from sufficient. Neither {{ClosureCleanerSuite}} nor {{ClosureCleanerSuite2}} covers anything related to the Scala REPL. There needs to be a separate suite, similar to {{ReplSuite}}, that fires up an actual Scala REPL and triggers ClosureCleaner in it to bridge the gap in test coverage. I will do that as a second step of the PR, and once the new test suite is in, the PR can be considered complete and ready for final review.

was (Author: rednaxelafx):
Hi [~dongjoon], I've been working on a fix of this issue and will send out a WIP PR as soon as possible. I've pretty much done an analysis of the situation in parallel to [~joshrosen]'s analysis above and have arrived at very similar conclusions. The fact is, Scala 2.12+'s indylambda (aka LMF-based closures) does still have an equivalent of an "$outer", just under a different name. Thus the logic inside the `ClosureCleaner` for Scala 2.11 support has to be ported basically verbatim to Scala 2.12+/indylambda. That's exactly what I'm working on right now, and it's the main contents of the WIP PR.
A separate issue is that the test coverage of ClosureCleaner in the Spark repo is very insufficient. There needs to be a separate suite, similar to `ReplSuite`, that fires up an actual Scala REPL and trigger ClosureCleaner in it to bridge the gap in test coverage. I will do that as a second step of the PR, and once the new test suite is in, the PR can be considered complete and ready for final review. > Closure cleaner broken in Scala 2.12 > > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Wenchen Fan >Assignee: Kris Mok >Priority: Blocker > > The `ClosureCleaner` only support Scala functions and it uses the following > check to catch closures > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala > functions become Java lambdas. > As an example, the following code works well in Spark 2.4 Spark Shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But fails in 3.0 > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. 
> org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > -
[jira] [Commented] (SPARK-31399) Closure cleaner broken in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096191#comment-17096191 ]

Kris Mok commented on SPARK-31399:

Hi [~dongjoon], I've been working on a fix for this issue and will send out a WIP PR as soon as possible.

I've pretty much done an analysis of the situation in parallel to [~joshrosen]'s analysis above and have arrived at very similar conclusions. The fact is, Scala 2.12+'s indylambda (aka LMF-based closures) does still have an equivalent of an "$outer", just under a different name. Thus the logic inside the `ClosureCleaner` for Scala 2.11 support has to be ported basically verbatim to Scala 2.12+/indylambda. That's exactly what I'm working on right now, and it's the main content of the WIP PR.

A separate issue is that the test coverage of ClosureCleaner in the Spark repo is far from sufficient. There needs to be a separate suite, similar to `ReplSuite`, that fires up an actual Scala REPL and triggers ClosureCleaner in it to bridge the gap in test coverage. I will do that as a second step of the PR, and once the new test suite is in, the PR can be considered complete and ready for final review.

> Closure cleaner broken in Scala 2.12
>
> Key: SPARK-31399
> URL: https://issues.apache.org/jira/browse/SPARK-31399
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.5, 3.0.0
> Reporter: Wenchen Fan
> Assignee: Kris Mok
> Priority: Blocker
>
> The `ClosureCleaner` only support Scala functions and it uses the following
> check to catch closures
> {code}
> // Check whether a class represents a Scala closure
> private def isClosure(cls: Class[_]): Boolean = {
>   cls.getName.contains("$anonfun$")
> }
> {code}
> This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala
> functions become Java lambdas.
> As an example, the following code works well in Spark 2.4 Spark Shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But fails in 3.0 > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 
39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at >
[jira] [Created] (SPARK-31240) Constant fold deterministic Scala UDFs with foldable arguments
Kris Mok created SPARK-31240:

Summary: Constant fold deterministic Scala UDFs with foldable arguments
Key: SPARK-31240
URL: https://issues.apache.org/jira/browse/SPARK-31240
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Kris Mok

Constant fold deterministic Scala UDFs with foldable arguments, conservatively.

ScalaUDFs that meet all of the following criteria are subject to constant folding in this feature:
* deterministic
* all arguments are foldable
* does not throw an exception when evaluating the UDF for constant folding
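The three criteria above can be sketched on a toy expression tree. Everything here ({{Expr}}, {{Lit}}, {{Col}}, {{Udf}}, {{ConservativeFold}}) is a hypothetical stand-in and not Catalyst's API; the sketch only shows the "fold if safe, otherwise leave the expression untouched" idea.

```scala
import scala.util.Try

// Toy expression tree; these are not Spark's Catalyst classes.
sealed trait Expr { def foldable: Boolean }
case class Lit(value: Any) extends Expr { val foldable: Boolean = true }
case class Col(name: String) extends Expr { val foldable: Boolean = false }
case class Udf(f: Seq[Any] => Any, args: Seq[Expr], deterministic: Boolean) extends Expr {
  val foldable: Boolean = false
}

object ConservativeFold {
  // Fold a UDF call into a literal only when it is deterministic, all of its
  // arguments are literals, and evaluating it does not throw.
  def fold(e: Expr): Expr = e match {
    case Udf(f, args, true) if args.forall(_.foldable) =>
      val values = args.collect { case Lit(v) => v }
      // On any non-fatal exception, keep the original expression unchanged.
      Try(f(values)).map(v => Lit(v): Expr).getOrElse(e)
    case other => other
  }
}
```

Leaving the expression in place when evaluation throws means any error still surfaces at execution time, and only if the expression is actually evaluated there.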
[jira] [Created] (SPARK-31187) Sort the whole-stage codegen debug output by codegenStageId
Kris Mok created SPARK-31187:

Summary: Sort the whole-stage codegen debug output by codegenStageId
Key: SPARK-31187
URL: https://issues.apache.org/jira/browse/SPARK-31187
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 3.0.0
Reporter: Kris Mok

Spark SQL's whole-stage codegen (WSCG) supports dumping the generated code to help with debugging. One way to get the generated code is through {{df.queryExecution.debug.codegen}}, or the SQL {{explain codegen}} statement.

The generated code is currently printed in no particular order, which can make debugging a bit annoying. This ticket tracks a minor improvement to sort the codegen dump by the {{codegenStageId}}, ascending.

After this change, the following query:
{code}
spark.range(10).agg(sum('id)).queryExecution.debug.codegen
{code}
will always dump the generated code in a natural, stable order.

The number of codegen stages within a single SQL query tends to be very small, most likely < 50, so the overhead of adding the sorting shouldn't be significant.
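The proposed ordering can be illustrated with a small, self-contained sketch. {{StageDump}} and {{CodegenDumpOrder}} are hypothetical stand-ins for whatever structure holds each stage's generated source, and the header format is made up; only the sort key mirrors the ticket.

```scala
// Hypothetical stand-in for one whole-stage codegen subtree and its source.
case class StageDump(codegenStageId: Int, source: String)

object CodegenDumpOrder {
  // Sort ascending by stage id so the dump order is stable across runs.
  def ordered(dumps: Seq[StageDump]): Seq[StageDump] =
    dumps.sortBy(_.codegenStageId)

  // Render the dumps in sorted order, one made-up header line per stage.
  def render(dumps: Seq[StageDump]): String =
    ordered(dumps)
      .map(d => s"== Subtree ${d.codegenStageId} ==\n${d.source}")
      .mkString("\n")
}
```

With only a few dozen stages per query at most, a plain {{sortBy}} like this adds negligible cost.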
[jira] [Commented] (SPARK-31099) Create migration script for metastore_db
[ https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056549#comment-17056549 ] Kris Mok commented on SPARK-31099: -- Just documenting the fact that users may encounter migration issues when upgrading from earlier versions of Spark to Spark 3.0 due to the Hive profile upgrade sounds good to me. Derby migration is unlikely to be a production issue, and for other databases (MySQL / PG etc) they’re heavy enough that folks would probably realize it’s a Hive metastore migration issue just like what’d happen in Hive. But the documentation should at the very least describe: * upgraded Hive profile * what kind of error messages could occur * links to Hive documentation of how to perform the upgrade WDYT? > Create migration script for metastore_db > > > Key: SPARK-31099 > URL: https://issues.apache.org/jira/browse/SPARK-31099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > When an existing Derby database exists (in ./metastore_db) created by Hive > 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile. > Repro steps: > 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive > -Phive-thriftserver. Make sure there's no existing ./metastore_db directory > in the repo. > 2. Run bin/spark-shell, and then spark.sql("show databases"). This will > populate the ./metastore_db directory, where the Derby-based metastore > database is hosted. This database is populated from Hive 1.2.x. > 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops > the Hive 1.2 profile, which makes it use the default Hive 2.3 profile) > 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby > database created in Step (2), which triggers an upgrade step, and that's > where the following error will be reported. > 5. Delete the ./metastore_db and re-run Step (4). The error is no longer > reported. 
> {code:java} > 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS > ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN > ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has > been specified as NOT NULL and either the DEFAULT clause was not specified or > was specified as DEFAULT NULL. > java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column > 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT > clause was not specified or was specified as DEFAULT NULL. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830) > at > org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896) > at > org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119) > at > 
org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672) > at > org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347) > at org.datanucleus.store.query.Query.executeQuery(Query.java:1816) > at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744) > at org.datanucleus.store.query.Query.execute(Query.java:1726) > at
[jira] [Created] (SPARK-30795) Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()
Kris Mok created SPARK-30795:

Summary: Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()
Key: SPARK-30795
URL: https://issues.apache.org/jira/browse/SPARK-30795
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 3.0.0
Reporter: Kris Mok

The {{code()}} string interpolator in Spark SQL's code generator should treat escapes like Scala's builtin {{StringContext.s()}} interpolator, i.e. it should process escapes in the code parts, and should not process escapes in the input arguments.

For example,
{code}
val arg = "This is an argument."
val str = s"This is string part 1. $arg This is string part 2."
val code = code"This is string part 1. $arg This is string part 2."
assert(code.toString == str)
{code}
We should expect the {{code()}} interpolator to produce the same result as the {{StringContext.s()}} interpolator, where only escapes in the string parts are processed, while the args are kept verbatim.

But in the current implementation, due to the eager folding of code parts and literal input args, escape processing is incorrectly applied to both code parts and literal args. That causes a problem when an arg contains escape sequences that need to be preserved in the final produced code string. For example, in the {{Like}} expression's codegen, there's an ugly workaround for this bug:
{code}
// We need double escape to avoid org.codehaus.commons.compiler.CompileException.
// '\\' will cause exception 'Single quote must be backslash-escaped in character literal'.
// '\"' will cause exception 'Line break in literal not allowed'.
val newEscapeChar = if (escapeChar == '\"' || escapeChar == '\\') {
  s"""\\$escapeChar"""
} else {
  escapeChar
}
{code}
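The intended behavior can be sketched with a tiny custom interpolator that processes escapes in the literal parts only and splices arguments in untouched. {{EscapeDemo}} and {{codeLike}} are hypothetical names; this is not Spark's {{code()}} implementation, just an illustration of the rule using {{StringContext.processEscapes}} from the Scala standard library.

```scala
object EscapeDemo {
  implicit class CodeLikeInterpolator(sc: StringContext) {
    // Process escapes in the literal parts only; splice args in verbatim.
    def codeLike(args: Any*): String = {
      // Custom interpolators receive the raw, unprocessed literal parts.
      val parts = sc.parts.iterator.map(StringContext.processEscapes)
      val sb = new StringBuilder(parts.next())
      args.foreach { arg =>
        sb.append(arg)          // argument text is never escape-processed
        sb.append(parts.next()) // the literal part following this argument
      }
      sb.toString
    }
  }
}
```

After {{import EscapeDemo._}}, an argument holding a literal backslash followed by {{t}} stays as those two characters, while a {{\t}} written in the literal part becomes a tab, which is the same split that {{s"..."}} makes.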
[jira] [Commented] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators
[ https://issues.apache.org/jira/browse/SPARK-26741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753310#comment-16753310 ]

Kris Mok commented on SPARK-26741:

Note that there's a similar issue with non-aggregate functions. Here's an example:
{code:none}
spark.sql("create table foo (id int, blob binary)")
val df = spark.sql("select length(blob) from foo where id = 1 order by length(blob) limit 10")
df.explain(true)
{code}
{code:none}
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Sort ['length('blob) ASC NULLS FIRST], true
      +- 'Project [unresolvedalias('length('blob), None)]
         +- 'Filter ('id = 1)
            +- 'UnresolvedRelation `foo`

== Analyzed Logical Plan ==
length(blob): int
GlobalLimit 10
+- LocalLimit 10
   +- Project [length(blob)#25]
      +- Sort [length(blob#24) ASC NULLS FIRST], true
         +- Project [length(blob#24) AS length(blob)#25, blob#24]
            +- Filter (id#23 = 1)
               +- SubqueryAlias `default`.`foo`
                  +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24]

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Project [length(blob)#25]
      +- Sort [length(blob#24) ASC NULLS FIRST], true
         +- Project [length(blob#24) AS length(blob)#25, blob#24]
            +- Filter (isnotnull(id#23) && (id#23 = 1))
               +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24]

== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[length(blob#24) ASC NULLS FIRST], output=[length(blob)#25])
+- *(1) Project [length(blob#24) AS length(blob)#25, blob#24]
   +- *(1) Filter (isnotnull(id#23) && (id#23 = 1))
      +- Scan hive default.foo [blob#24, id#23], HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24]
{code}
Note how the {{Sort}} operator performs the {{length()}} again, even though there's already one in the projection right below it.
The root cause of this problem in the Analyzer is the same as the main example in this ticket, although this example is not as harmful (at least it still runs...) > Analyzer incorrectly resolves aggregate function outside of Aggregate > operators > --- > > Key: SPARK-26741 > URL: https://issues.apache.org/jira/browse/SPARK-26741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kris Mok >Priority: Major > > The analyzer can sometimes hit issues with resolving functions. e.g. > {code:sql} > select max(id) > from range(10) > group by id > having count(1) >= 1 > order by max(id) > {code} > The analyzed plan of this query is: > {code:none} > == Analyzed Logical Plan == > max(id): bigint > Project [max(id)#91L] > +- Sort [max(id#88L) ASC NULLS FIRST], true >+- Project [max(id)#91L, id#88L] > +- Filter (count(1)#93L >= cast(1 as bigint)) > +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS > count(1)#93L, id#88L] > +- Range (0, 10, step=1, splits=None) > {code} > Note how an aggregate function is outside of {{Aggregate}} operators in the > fully analyzed plan: > {{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid. 
> Trying to run this query will lead to weird issues in codegen, but the root > cause is in the analyzer: > {code:none} > java.lang.UnsupportedOperationException: Cannot generate code for expression: > max(input[1, bigint, false]) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290) > at > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138) > at scala.Option.getOrElse(Option.scala:138) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:237) > at scala.collection.TraversableLike.map$(TraversableLike.scala:230) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at >
[jira] [Created] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators
Kris Mok created SPARK-26741:

Summary: Analyzer incorrectly resolves aggregate function outside of Aggregate operators
Key: SPARK-26741
URL: https://issues.apache.org/jira/browse/SPARK-26741
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.0
Reporter: Kris Mok

The analyzer can sometimes hit issues with resolving functions. e.g.
{code:sql}
select max(id)
from range(10)
group by id
having count(1) >= 1
order by max(id)
{code}
The analyzed plan of this query is:
{code:none}
== Analyzed Logical Plan ==
max(id): bigint
Project [max(id)#91L]
+- Sort [max(id#88L) ASC NULLS FIRST], true
   +- Project [max(id)#91L, id#88L]
      +- Filter (count(1)#93L >= cast(1 as bigint))
         +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS count(1)#93L, id#88L]
            +- Range (0, 10, step=1, splits=None)
{code}
Note how an aggregate function is outside of {{Aggregate}} operators in the fully analyzed plan: {{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid.

Trying to run this query will lead to weird issues in codegen, but the root cause is in the analyzer:
{code:none}
java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[1, bigint, false])
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
  at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
  at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
  at scala.Option.getOrElse(Option.scala:138)
  at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
  at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at scala.collection.TraversableLike.map(TraversableLike.scala:237)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
  at scala.collection.AbstractTraversable.map(Traversable.scala:108)
  at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.createOrderKeys(GenerateOrdering.scala:82)
  at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:91)
  at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:152)
  at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:44)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1194)
  at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.<init>(GenerateOrdering.scala:195)
  at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.<init>(GenerateOrdering.scala:192)
  at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:153)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3302)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2470)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3291)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3287)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2470)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2684)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:262)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:299)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:753)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:712)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:721)
{code}
The test case {{SPARK-23957 Remove redundant sort from subquery plan(scalar subquery)}} in {{SubquerySuite}} has been disabled because of hitting this issue, caught by SPARK-26735. We should re-enable that test once this bug is fixed.
[jira] [Created] (SPARK-26735) Verify plan integrity for special expressions
Kris Mok created SPARK-26735: Summary: Verify plan integrity for special expressions Key: SPARK-26735 URL: https://issues.apache.org/jira/browse/SPARK-26735 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Kris Mok Add verification of plan integrity with regard to special expressions being hosted only in supported operators. Specifically: * AggregateExpression: should only be hosted in Aggregate, or indirectly in Window * WindowExpression: should only be hosted in Window * Generator: should only be hosted in Generate
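The hosting rules listed in SPARK-26735 can be illustrated with a toy model. This is a minimal sketch, not Catalyst code: `Expr`, `Plan`, the operator case classes and `checkSpecialExpressions` are all hypothetical stand-ins, and "indirectly in Window" is assumed here to mean "nested inside a window expression".

```scala
// Toy model of the hosting rules; every name here is hypothetical, not a Catalyst class.
object PlanIntegritySketch {
  sealed trait Expr { def children: Seq[Expr] }
  case class Attr(name: String) extends Expr { def children: Seq[Expr] = Nil }
  case class AggregateExpr(child: Expr) extends Expr { def children: Seq[Expr] = Seq(child) }
  case class WindowExpr(function: Expr) extends Expr { def children: Seq[Expr] = Seq(function) }
  case class GeneratorExpr(child: Expr) extends Expr { def children: Seq[Expr] = Seq(child) }

  sealed trait Plan { def expressions: Seq[Expr]; def children: Seq[Plan] }
  case class Relation(name: String) extends Plan {
    def expressions: Seq[Expr] = Nil
    def children: Seq[Plan] = Nil
  }
  case class Project(expressions: Seq[Expr], child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
  }
  case class Aggregate(expressions: Seq[Expr], child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
  }
  case class Window(expressions: Seq[Expr], child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
  }
  case class Generate(expressions: Seq[Expr], child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
  }

  // Checks one expression tree; `insideWindow` is true below a WindowExpr.
  private def exprOk(e: Expr, host: Plan, insideWindow: Boolean): Boolean = {
    val selfOk = e match {
      case _: AggregateExpr =>
        host.isInstanceOf[Aggregate] || (host.isInstanceOf[Window] && insideWindow)
      case _: WindowExpr    => host.isInstanceOf[Window]
      case _: GeneratorExpr => host.isInstanceOf[Generate]
      case _                => true
    }
    selfOk && e.children.forall(exprOk(_, host, insideWindow || e.isInstanceOf[WindowExpr]))
  }

  // True when every special expression in the plan sits in a supported operator.
  def checkSpecialExpressions(plan: Plan): Boolean =
    plan.expressions.forall(exprOk(_, plan, insideWindow = false)) &&
      plan.children.forall(checkSpecialExpressions)
}
```

In this model an `AggregateExpr` directly under `Project` fails the check, while the same expression under `Aggregate`, or wrapped in a `WindowExpr` under `Window`, passes.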
[jira] [Created] (SPARK-26664) Make DecimalType's minimum adjusted scale configurable
Kris Mok created SPARK-26664: Summary: Make DecimalType's minimum adjusted scale configurable Key: SPARK-26664 URL: https://issues.apache.org/jira/browse/SPARK-26664 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Kris Mok Introduce a new conf flag that allows the user to set the value of {{DecimalType.MINIMAL_ADJUSTED_SCALE}}, currently a constant of 6, to match their workloads' needs. The new flag will be {{spark.sql.decimalOperations.minimumAdjustedScale}}. SPARK-22036 introduced a new conf flag {{spark.sql.decimalOperations.allowPrecisionLoss}} to match SQL Server's and new Hive's behavior of allowing precision loss when performing multiplication/division on big and small decimal numbers. Along with this feature, a fixed {{MINIMAL_ADJUSTED_SCALE}} was set to 6 for when precision loss is allowed. Some customer workloads may need a larger adjusted scale to match their business needs, and in exchange they may be willing to tolerate some more calculations overflowing the max precision, leading to nulls. So they would like the minimum adjusted scale to be configurable. Thus the need for a new conf. The default behavior after introducing this conf flag is not changed.
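To make the trade-off in SPARK-26664 concrete, here is a minimal sketch of how a minimum adjusted scale could bound the result type when precision loss is allowed. It is modeled on my understanding of the adjustment logic, not copied from Spark: `DecimalScaleSketch`, `adjustPrecisionScale` and the `minimumAdjustedScale` parameter are illustrative names, and only non-negative scales are handled.

```scala
object DecimalScaleSketch {
  val MaxPrecision = 38 // maximum decimal precision supported by Spark SQL

  // Returns the adjusted (precision, scale). `minimumAdjustedScale` stands in for the
  // proposed conf value; 6 is the constant described in the ticket.
  def adjustPrecisionScale(
      precision: Int,
      scale: Int,
      minimumAdjustedScale: Int = 6): (Int, Int) = {
    require(scale >= 0 && precision >= scale, "sketch handles non-negative scales only")
    if (precision <= MaxPrecision) {
      (precision, scale) // fits as-is, nothing to adjust
    } else {
      val intDigits = precision - scale
      // Keep at least min(scale, minimumAdjustedScale) fractional digits ...
      val minScale = math.min(scale, minimumAdjustedScale)
      // ... and otherwise give the integral part as many digits as it needs.
      val adjustedScale = math.max(MaxPrecision - intDigits, minScale)
      (MaxPrecision, adjustedScale)
    }
  }
}
```

For example, a product that would need precision 77 and scale 20 is adjusted to (38, 6) with the default, and to (38, 12) when the minimum adjusted scale is raised to 12. The larger scale leaves fewer integral digits, which is why more calculations may overflow to null.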
[jira] [Created] (SPARK-26661) Show actual class name of the writing command in CTAS explain
Kris Mok created SPARK-26661: Summary: Show actual class name of the writing command in CTAS explain Key: SPARK-26661 URL: https://issues.apache.org/jira/browse/SPARK-26661 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Kris Mok The explain output of the Hive CTAS command, regardless of whether it's actually writing via Hive's SerDe or converted into using Spark's data source, would always show that it's using {{InsertIntoHiveTable}} because it's hardcoded. e.g. {code:none} Execute OptimizedCreateHiveTableAsSelectCommand [Database:default, TableName: foo, InsertIntoHiveTable] {code} This CTAS is converted into using Spark's data source, but it still says {{InsertIntoHiveTable}} in the explain output. It's better to show the actual class name of the writing command used. For the example above, it'd be: {code:none} Execute OptimizedCreateHiveTableAsSelectCommand [Database:default, TableName: foo, InsertIntoHadoopFsRelationCommand] {code}
[jira] [Created] (SPARK-26659) Duplicate cmd.nodeName in the explain output of DataWritingCommandExec
Kris Mok created SPARK-26659: Summary: Duplicate cmd.nodeName in the explain output of DataWritingCommandExec Key: SPARK-26659 URL: https://issues.apache.org/jira/browse/SPARK-26659 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Kris Mok {{DataWritingCommandExec}} generates {{cmd.nodeName}} twice in its explain output, e.g. when running this query {code:none} spark.sql("create table foo stored as parquet as select id, id % 10 as cat1, id % 20 as cat2 from range(10)") {code} The query plan's explain output would be: {code:none} Execute OptimizedCreateHiveTableAsSelectCommand OptimizedCreateHiveTableAsSelectCommand [Database:default, TableName: foo, InsertIntoHiveTable] +- *(1) Project [id#2L, (id#2L % 10) AS cat1#0L, (id#2L % 20) AS cat2#1L] +- *(1) Range (0, 10, step=1, splits=8) {code}
[jira] [Created] (SPARK-26633) Add ExecutorClassLoader.getResourceAsStream
Kris Mok created SPARK-26633: Summary: Add ExecutorClassLoader.getResourceAsStream Key: SPARK-26633 URL: https://issues.apache.org/jira/browse/SPARK-26633 Project: Spark Issue Type: Improvement Components: Spark Shell Affects Versions: 3.0.0 Reporter: Kris Mok {{ExecutorClassLoader}} is capable of loading dynamically generated classes from the REPL via either RPC or HDFS, but right now it always delegates resource loading to the parent class loader. That makes the dynamically generated classes unavailable to uses other than class loading. Such a need may arise, for example, when json4s wants to parse the Class file to extract parameter name information. Internally it'd call the class loader's {{getResourceAsStream}} to obtain the Class file content as an {{InputStream}}. This ticket tracks an improvement to the {{ExecutorClassLoader}} to allow fetching dynamically generated Class files from the REPL as resource streams.
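The idea in SPARK-26633 can be sketched with a small class loader that serves Class file bytes as resources. This is a sketch under assumptions, not the {{ExecutorClassLoader}} implementation: `InMemoryClassLoader` and its `classBytes` map are hypothetical, the bytes live in memory instead of being fetched via RPC or HDFS, and trying the parent first is my own choice.

```scala
import java.io.{ByteArrayInputStream, InputStream}

// Hypothetical loader: `classBytes` maps a class name such as "a.B" to its Class file bytes.
class InMemoryClassLoader(classBytes: Map[String, Array[Byte]], parent: ClassLoader)
    extends ClassLoader(parent) {

  // Regular class loading path: define the class from the stored bytes.
  override def findClass(name: String): Class[_] =
    classBytes.get(name) match {
      case Some(bytes) => defineClass(name, bytes, 0, bytes.length)
      case None        => throw new ClassNotFoundException(name)
    }

  // Resource path: try the parent first, then serve "a/B.class" from the in-memory map.
  override def getResourceAsStream(name: String): InputStream = {
    val fromParent = super.getResourceAsStream(name)
    if (fromParent != null) {
      fromParent
    } else if (name.endsWith(".class")) {
      val className = name.stripSuffix(".class").replace('/', '.')
      classBytes.get(className).map(b => new ByteArrayInputStream(b): InputStream).orNull
    } else {
      null
    }
  }
}
```

With such an override, a library that asks the loader for `demo/Generated.class` gets the same bytes the loader would use to define `demo.Generated`, instead of `null`.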
[jira] [Created] (SPARK-26545) Fix typo in EqualNullSafe's truth table comment
Kris Mok created SPARK-26545: Summary: Fix typo in EqualNullSafe's truth table comment Key: SPARK-26545 URL: https://issues.apache.org/jira/browse/SPARK-26545 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Kris Mok The truth table comment in {{EqualNullSafe}} incorrectly marked FALSE results as UNKNOWN.
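For reference, the difference between the two equality operators can be modeled in a few lines. This is a plain Scala illustration, not Spark code: `None` stands for SQL NULL, and the object and method names are made up for the example.

```scala
object NullSafeEqualitySketch {
  // Regular `=`: the result is UNKNOWN (None) whenever either side is NULL.
  def equalTo(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // Null-safe `<=>`: never UNKNOWN. NULL <=> NULL is TRUE, NULL <=> value is FALSE.
  def equalNullSafe(a: Option[Int], b: Option[Int]): Boolean = a == b
}
```

So the rows of the truth table where exactly one operand is NULL should read FALSE for the null-safe operator, which is the correction this ticket describes.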
[jira] [Updated] (SPARK-26352) join reordering should not change the order of output attributes
[ https://issues.apache.org/jira/browse/SPARK-26352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kris Mok updated SPARK-26352: - Summary: join reordering should not change the order of output attributes (was: ReorderJoin should not change the order of columns) > join reordering should not change the order of output attributes > > > Key: SPARK-26352 > URL: https://issues.apache.org/jira/browse/SPARK-26352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Kris Mok >Priority: Major > > The optimizer rule {{org.apache.spark.sql.catalyst.optimizer.ReorderJoin}} > performs join reordering on inner joins. This was introduced from SPARK-12032 > in 2015-12. > After it had reordered the joins, though, it didn't check whether or not the > column order (in terms of the {{output}} attribute list) is still the same as > before. Thus, it's possible to have a mismatch between the reordered column > order vs the schema that a DataFrame thinks it has. 
> This can be demonstrated with the example: > {code:none} > spark.sql("create table table_a (x int, y int) using parquet") > spark.sql("create table table_b (i int, j int) using parquet") > spark.sql("create table table_c (a int, b int) using parquet") > val df = spark.sql("with df1 as (select * from table_a cross join table_b) > select * from df1 join table_c on a = x and b = i") > {code} > here's what the DataFrame thinks: > {code:none} > scala> df.printSchema > root > |-- x: integer (nullable = true) > |-- y: integer (nullable = true) > |-- i: integer (nullable = true) > |-- j: integer (nullable = true) > |-- a: integer (nullable = true) > |-- b: integer (nullable = true) > {code} > here's what the optimized plan thinks, after join reordering: > {code:none} > scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- > ${a.name}: ${a.dataType.typeName}")) > |-- x: integer > |-- y: integer > |-- a: integer > |-- b: integer > |-- i: integer > |-- j: integer > {code} > If we exclude the {{ReorderJoin}} rule (using Spark 2.4's optimizer rule > exclusion feature), it's back to normal: > {code:none} > scala> spark.conf.set("spark.sql.optimizer.excludedRules", > "org.apache.spark.sql.catalyst.optimizer.ReorderJoin") > scala> val df = spark.sql("with df1 as (select * from table_a cross join > table_b) select * from df1 join table_c on a = x and b = i") > df: org.apache.spark.sql.DataFrame = [x: int, y: int ... 4 more fields] > scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- > ${a.name}: ${a.dataType.typeName}")) > |-- x: integer > |-- y: integer > |-- i: integer > |-- j: integer > |-- a: integer > |-- b: integer > {code} > Note that this column ordering problem leads to data corruption, and can > manifest itself in various symptoms: > * Silently corrupting data, if the reordered columns happen to either have > matching types or have sufficiently-compatible types (e.g. 
all fixed length > primitive types are considered as "sufficiently compatible" in an UnsafeRow), > then only the resulting data is going to be wrong but it might not trigger > any alarms immediately. Or > * Weird Java-level exceptions like {{java.lang.NegativeArraySizeException}}, > or even SIGSEGVs.
[jira] [Created] (SPARK-26352) ReorderJoin should not change the order of columns
Kris Mok created SPARK-26352: Summary: ReorderJoin should not change the order of columns Key: SPARK-26352 URL: https://issues.apache.org/jira/browse/SPARK-26352 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 2.3.0 Reporter: Kris Mok The optimizer rule {{org.apache.spark.sql.catalyst.optimizer.ReorderJoin}} performs join reordering on inner joins. This was introduced from SPARK-12032 in 2015-12. After it had reordered the joins, though, it didn't check whether or not the column order (in terms of the {{output}} attribute list) is still the same as before. Thus, it's possible to have a mismatch between the reordered column order vs the schema that a DataFrame thinks it has. This can be demonstrated with the example: {code:none} spark.sql("create table table_a (x int, y int) using parquet") spark.sql("create table table_b (i int, j int) using parquet") spark.sql("create table table_c (a int, b int) using parquet") val df = spark.sql("with df1 as (select * from table_a cross join table_b) select * from df1 join table_c on a = x and b = i") {code} here's what the DataFrame thinks: {code:none} scala> df.printSchema root |-- x: integer (nullable = true) |-- y: integer (nullable = true) |-- i: integer (nullable = true) |-- j: integer (nullable = true) |-- a: integer (nullable = true) |-- b: integer (nullable = true) {code} here's what the optimized plan thinks, after join reordering: {code:none} scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- ${a.name}: ${a.dataType.typeName}")) |-- x: integer |-- y: integer |-- a: integer |-- b: integer |-- i: integer |-- j: integer {code} If we exclude the {{ReorderJoin}} rule (using Spark 2.4's optimizer rule exclusion feature), it's back to normal: {code:none} scala> spark.conf.set("spark.sql.optimizer.excludedRules", "org.apache.spark.sql.catalyst.optimizer.ReorderJoin") scala> val df = spark.sql("with df1 as (select * from table_a cross join table_b) select * from df1 join table_c on 
a = x and b = i") df: org.apache.spark.sql.DataFrame = [x: int, y: int ... 4 more fields] scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- ${a.name}: ${a.dataType.typeName}")) |-- x: integer |-- y: integer |-- i: integer |-- j: integer |-- a: integer |-- b: integer {code} Note that this column ordering problem leads to data corruption, and can manifest itself in various symptoms: * Silently corrupting data, if the reordered columns happen to either have matching types or have sufficiently-compatible types (e.g. all fixed length primitive types are considered as "sufficiently compatible" in an UnsafeRow), then only the resulting data is going to be wrong but it might not trigger any alarms immediately. Or * Weird Java-level exceptions like {{java.lang.NegativeArraySizeException}}, or even SIGSEGVs.
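One way to picture a guard against the mismatch reported in SPARK-26352 is to compare the output order before and after reordering and, if it changed, compute a projection that restores the original order. The following is a toy sketch on attribute names only; `OutputOrderSketch` and `restoreOrder` are hypothetical, and real Catalyst attributes are identified by expression IDs, not by name.

```scala
object OutputOrderSketch {
  // For each attribute of the original output, returns its position in the reordered
  // output. Returns None when the order is unchanged and no extra projection is needed.
  def restoreOrder(original: Seq[String], reordered: Seq[String]): Option[Seq[Int]] = {
    require(original.sorted == reordered.sorted, "both outputs must hold the same attributes")
    if (original == reordered) None
    else Some(original.map(name => reordered.indexOf(name)))
  }
}
```

Applied to the example in the ticket, the original order `x, y, i, j, a, b` and the reordered order `x, y, a, b, i, j` yield the positions `0, 1, 4, 5, 2, 3`, i.e. the columns to select from the reordered join to get the schema the DataFrame expects.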
[jira] [Created] (SPARK-26107) Extend ReplaceNullWithFalseInPredicate to support higher-order functions: ArrayExists, ArrayFilter, MapFilter
Kris Mok created SPARK-26107: Summary: Extend ReplaceNullWithFalseInPredicate to support higher-order functions: ArrayExists, ArrayFilter, MapFilter Key: SPARK-26107 URL: https://issues.apache.org/jira/browse/SPARK-26107 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Kris Mok Extend the {{ReplaceNullWithFalse}} optimizer rule introduced in SPARK-25860 to also support optimizing predicates in higher-order functions of {{ArrayExists}}, {{ArrayFilter}}, {{MapFilter}}. Also rename the rule to {{ReplaceNullWithFalseInPredicate}} to better reflect its intent.
[jira] [Created] (SPARK-26061) Reduce the number of unused UnsafeRowWriters created in whole-stage codegen
Kris Mok created SPARK-26061: Summary: Reduce the number of unused UnsafeRowWriters created in whole-stage codegen Key: SPARK-26061 URL: https://issues.apache.org/jira/browse/SPARK-26061 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0 Reporter: Kris Mok Reduce the number of unused UnsafeRowWriters created in whole-stage generated code. They come from CodegenSupport.consume() calling prepareRowVar(), which uses GenerateUnsafeProjection.createCode() and registers an UnsafeRowWriter mutable state, regardless of whether the downstream (parent) operator will use the rowVar. Even when the downstream doConsume function doesn't use the rowVar (i.e. doesn't put row.code as a part of this operator's codegen template), the registered UnsafeRowWriter stays there, which makes the init function of the generated code a bit bloated. This ticket doesn't track the root issue, but makes it slightly less painful: when the doConsume function is split out, the prepareRowVar() function is called twice, so it's double the pain of unused UnsafeRowWriters. This fix simply moves the original call to prepareRowVar() down into the doConsume split/no-split branch so that we're back to just 1x the pain. To fix the root issue, something that allows the CodegenSupport operators to indicate whether or not they're going to use the rowVar would be needed. That's a much more elaborate change so I'd like to just make a minor fix first.
[jira] [Commented] (SPARK-25961) Random numbers are not supported when handling data skew
[ https://issues.apache.org/jira/browse/SPARK-25961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680152#comment-16680152 ] Kris Mok commented on SPARK-25961: -- It looks like the current restriction makes sense, because the expressions in join condition may eventually be evaluated multiple times depending on which physical join operator is chosen. It doesn't make a lot of sense to allow non-deterministic expression directly in the Join operator. Instead, if we have to support having non-deterministic expression in the join condition and retain an "evaluated-once" semantic, it'd be better to have a rule in the Analyzer to extract non-deterministic expressions from the join condition and put it into a child Project operator on the appropriate side. [~zengxl] does that make sense to you? > Random numbers are not supported when handling data skew > > > Key: SPARK-25961 > URL: https://issues.apache.org/jira/browse/SPARK-25961 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: spark on yarn 2.3.1 >Reporter: zengxl >Priority: Major > > my query sql use two table join,one table join key has null value,i use rand > value instead of null value,but has error,the error info as follows: > Error in query: nondeterministic expressions are only allowed in > Project, Filter, Aggregate or Window, found > > > scan spark source code is > org.apache.spark.sql.catalyst.analysis.CheckAnalysis check sql, because the > number of random variables is uncertain, it is prohibited > case o if o.expressions.exists(!_.deterministic) && > !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] && > !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] => > // The rule above is used to check Aggregate operator. 
> failAnalysis( > s"""nondeterministic expressions are only allowed in > |Project, Filter, Aggregate or Window, found:| > |${o.expressions.map(_.sql).mkString(",")}| > |in operator ${operator.simpleString} > """.stripMargin)| > > Is it possible to add Join to this code? It's not yet tested. And whether > there will be other effects > case o if o.expressions.exists(!_.deterministic) && > !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] && > !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] && > !o.isInstanceOf[Join] => > // The rule above is used to check Aggregate operator. > failAnalysis( > s"""nondeterministic expressions are only allowed in > |Project, Filter, Aggregate or Window or Join, found:| > |${o.expressions.map(_.sql).mkString(",")}| > |in operator ${operator.simpleString} > """.stripMargin)| > > this is my sparksql: > SELECT > T1.CUST_NO AS CUST_NO , > T3.CON_LAST_NAME AS CUST_NAME , > T3.CON_SEX_MF AS SEX_CODE , > T3.X_POSITION AS POST_LV_CODE > FROM tmp.ICT_CUST_RANGE_INFO T1 > LEFT join tmp.F_CUST_BASE_INFO_ALL T3 ON CASE WHEN coalesce(T1.CUST_NO,'') > ='' THEN concat('cust_no',RAND()) ELSE T1.CUST_NO END = T3.BECIF and > T3.DATE='20181105' > WHERE T1.DATE='20181105'
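The "evaluated-once" concern in the SPARK-25961 comment can be shown without Spark. The sketch below is a plain Scala illustration with hypothetical names (`Cust`, `saltedKey`, `projectKeys`): it mimics the `concat('cust_no', RAND())` salting from the report and contrasts re-evaluating the key with computing it once per row, as a child Project would, before joining.

```scala
import scala.util.Random

object SaltedJoinKeySketch {
  // One row of the left table; None or "" models a missing customer number.
  case class Cust(custNo: Option[String])

  // Non-deterministic key: missing customer numbers get a random suffix.
  def saltedKey(c: Cust, rng: Random): String =
    c.custNo.filter(_.nonEmpty).getOrElse("cust_no" + rng.nextDouble())

  // "Project first": evaluate the key exactly once per row and keep it with the row,
  // so that any later join compares a stable, already-materialized value.
  def projectKeys(rows: Seq[Cust], rng: Random): Seq[(String, Cust)] =
    rows.map(c => (saltedKey(c, rng), c))
}
```

Calling `saltedKey` twice on the same row with a missing customer number gives two different keys, which is what happens when a join operator evaluates its condition more than once. After `projectKeys`, each row carries a single fixed key regardless of how often the join reads it.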
[jira] [Created] (SPARK-25494) Upgrade Spark's use of Janino to 3.0.10
Kris Mok created SPARK-25494: Summary: Upgrade Spark's use of Janino to 3.0.10 Key: SPARK-25494 URL: https://issues.apache.org/jira/browse/SPARK-25494 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Kris Mok This ticket proposes to upgrade Spark's use of Janino from 3.0.9 to 3.0.10. Note that 3.0.10 is an out-of-band release specifically for fixing an integer overflow issue in Janino's {{ClassFile}} reader. It is otherwise exactly the same as 3.0.9, so it's a low-risk and compatible upgrade. The integer overflow issue affects Spark SQL's codegen stats collection: when a generated Class file is huge, especially when the constant pool size is above {{Short.MAX_VALUE}}, Janino's {{ClassFile}} reader will throw an exception when Spark wants to parse the generated Class file to collect stats. So we'll miss the stats of some huge Class files. The Janino fix is tracked by this issue: https://github.com/janino-compiler/janino/issues/58
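Why {{Short.MAX_VALUE}} matters in SPARK-25494 can be shown with a small experiment. A Class file stores its constant pool count as an unsigned 16-bit value, so reading it as a signed short goes wrong above 32767. The sketch below only illustrates that class of bug; it is not Janino's code, and `ConstantPoolCountSketch` and `readCounts` are names made up for the example.

```scala
import java.io.{ByteArrayInputStream, DataInputStream}

object ConstantPoolCountSketch {
  // Encodes `count` as two big-endian bytes, then reads it back both ways.
  // Returns (value read as signed short, value read as unsigned short).
  def readCounts(count: Int): (Int, Int) = {
    val bytes = Array(((count >> 8) & 0xFF).toByte, (count & 0xFF).toByte)
    val signed = new DataInputStream(new ByteArrayInputStream(bytes)).readShort().toInt
    val unsigned = new DataInputStream(new ByteArrayInputStream(bytes)).readUnsignedShort()
    (signed, unsigned)
  }
}
```

For a count of 100 both reads agree. For a count of 40000 the signed read yields -25536 while the unsigned read yields 40000, and a negative count is the kind of value that makes a parser fail on otherwise valid, very large generated classes.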
[jira] [Commented] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field
[ https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587820#comment-16587820 ] Kris Mok commented on SPARK-25178: -- FYI: I've just come across this behavior and raised a ticket so that we don't forget about it, but I'm not working on this one (yet). > Use dummy name for xxxHashMapGenerator key/value schema field > - > > Key: SPARK-25178 > URL: https://issues.apache.org/jira/browse/SPARK-25178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > Following SPARK-18952 and SPARK-22273, this ticket proposes to change the > generated field name of the keySchema / valueSchema to a dummy name instead > of using {{key.name}}. > In previous discussion from SPARK-18952's PR [1], it was already suggested > that the field names weren't being used, so it's not worth capturing the strings > as reference objects here. Josh suggested merging the original fix as-is due > to backportability / pickability concerns. Now that we're coming up to a new > release, this can be revisited. > [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719
[jira] [Created] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field
Kris Mok created SPARK-25178: Summary: Use dummy name for xxxHashMapGenerator key/value schema field Key: SPARK-25178 URL: https://issues.apache.org/jira/browse/SPARK-25178 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Kris Mok Following SPARK-18952 and SPARK-22273, this ticket proposes to change the generated field name of the keySchema / valueSchema to a dummy name instead of using {{key.name}}. In previous discussion from SPARK-18952's PR [1], it was already suggested that the field names weren't being used, so it's not worth capturing the strings as reference objects here. Josh suggested merging the original fix as-is due to backportability / pickability concerns. Now that we're coming up to a new release, this can be revisited. [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719
[jira] [Commented] (SPARK-25140) Add optional logging to UnsafeProjection.create when it falls back to interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-25140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583486#comment-16583486 ] Kris Mok commented on SPARK-25140: -- Thanks maropu-san! No, I'm not working on this ticket; I'm currently swamped by other tasks close to a deadline so I couldn't work on this one right now. I was just thinking about resetting my mental model for the Spark 2.4.x code execution performance, and saw that after the fallback was implemented, it's very hard for the user to figure out whether or not the interpreter fallback is taking effect and/or whether or not it's contributing to slow performance. This would be valuable information to have for query tuning, etc. > Add optional logging to UnsafeProjection.create when it falls back to > interpreted mode > -- > > Key: SPARK-25140 > URL: https://issues.apache.org/jira/browse/SPARK-25140 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > SPARK-23711 implemented a nice graceful handling of allowing UnsafeProjection > to fall back to an interpreter mode when codegen fails. That makes Spark much > more usable even when codegen is unable to handle the given query. > But in its current form, the fallback handling can also be a mystery in terms > of performance cliffs. Users may be left wondering why a query runs fine with > some expressions, but then with just one extra expression the performance > goes 2x, 3x (or more) slower. > It'd be nice to have optional logging of the fallback behavior, so that for > users that care about monitoring performance cliffs, they can opt-in to log > when a fallback to interpreter mode was taken. i.e. 
at > https://github.com/apache/spark/blob/a40ffc656d62372da85e0fa932b67207839e7fde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L183
[jira] [Updated] (SPARK-25140) Add optional logging to UnsafeProjection.create when it falls back to interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-25140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kris Mok updated SPARK-25140: - Description: SPARK-23711 implemented a nice graceful handling of allowing UnsafeProjection to fall back to an interpreter mode when codegen fails. That makes Spark much more usable even when codegen is unable to handle the given query. But in its current form, the fallback handling can also be a mystery in terms of performance cliffs. Users may be left wondering why a query runs fine with some expressions, but then with just one extra expression the performance goes 2x, 3x (or more) slower. It'd be nice to have optional logging of the fallback behavior, so that for users that care about monitoring performance cliffs, they can opt-in to log when a fallback to interpreter mode was taken. i.e. at https://github.com/apache/spark/blob/a40ffc656d62372da85e0fa932b67207839e7fde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L183 was: SPARK-23711 implemented a nice graceful handling of allowing UnsafeProjection to fall back to an interpreter mode when codegen fails. That makes Spark much more usable even when codegen is unable to handle the given query. But in its current form, the fallback handling can also be a mystery in terms of performance cliffs. Users may be left wondering why a query runs fine with some expressions, but then with just one extra expression the performance goes 2x, 3x (or more) slower. It'd be nice to have optional logging of the fallback behavior, so that for users that care about monitoring performance cliffs, they can opt-in to log when a fallback to interpreter mode was taken. 
> Add optional logging to UnsafeProjection.create when it falls back to > interpreted mode > -- > > Key: SPARK-25140 > URL: https://issues.apache.org/jira/browse/SPARK-25140 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > SPARK-23711 implemented a nice graceful handling of allowing UnsafeProjection > to fall back to an interpreter mode when codegen fails. That makes Spark much > more usable even when codegen is unable to handle the given query. > But in its current form, the fallback handling can also be a mystery in terms > of performance cliffs. Users may be left wondering why a query runs fine with > some expressions, but then with just one extra expression the performance > goes 2x, 3x (or more) slower. > It'd be nice to have optional logging of the fallback behavior, so that for > users that care about monitoring performance cliffs, they can opt-in to log > when a fallback to interpreter mode was taken. i.e. at > https://github.com/apache/spark/blob/a40ffc656d62372da85e0fa932b67207839e7fde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L183 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25140) Add optional logging to UnsafeProjection.create when it falls back to interpreted mode
Kris Mok created SPARK-25140: Summary: Add optional logging to UnsafeProjection.create when it falls back to interpreted mode Key: SPARK-25140 URL: https://issues.apache.org/jira/browse/SPARK-25140 Project: Spark Issue Type: Wish Components: SQL Affects Versions: 2.4.0 Reporter: Kris Mok SPARK-23711 implemented graceful handling that allows UnsafeProjection to fall back to an interpreted mode when codegen fails. That makes Spark much more usable even when codegen is unable to handle the given query. But in its current form, the fallback handling can also be a mystery in terms of performance cliffs. Users may be left wondering why a query runs fine with some expressions, but then with just one extra expression the performance becomes 2x, 3x (or more) slower. It'd be nice to have optional logging of the fallback behavior, so that users who care about monitoring performance cliffs can opt in to logging when a fallback to interpreted mode is taken.
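The wish in SPARK-25140 boils down to a try-codegen-then-fall-back pattern with an opt-in log line. Here is a minimal generic sketch of that pattern; it is not the {{UnsafeProjection.create}} code, and `createWithFallback`, the `logFallback` switch (standing in for a hypothetical conf flag) and the `log` callback are all assumptions made for illustration.

```scala
import scala.util.control.NonFatal

object FallbackLoggingSketch {
  // Tries the code-generated path first. On a non-fatal failure, optionally logs the
  // reason and then returns the interpreted implementation instead.
  def createWithFallback[T](
      codegen: () => T,
      interpreted: () => T,
      logFallback: Boolean,
      log: String => Unit): T =
    try {
      codegen()
    } catch {
      case NonFatal(e) =>
        if (logFallback) {
          log(s"Codegen failed, falling back to interpreted mode: ${e.getMessage}")
        }
        interpreted()
    }
}
```

With `logFallback` off the behavior is the same silent fallback as before. With it on, a user investigating a sudden slowdown gets a message at the point where the fallback was taken.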
[jira] [Created] (SPARK-25138) Spark Shell should show the Scala prompt after initialization is complete
Kris Mok created SPARK-25138: Summary: Spark Shell should show the Scala prompt after initialization is complete Key: SPARK-25138 URL: https://issues.apache.org/jira/browse/SPARK-25138 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 2.4.0 Reporter: Kris Mok In previous Spark versions, the Spark Shell used to only show the Scala prompt *after* Spark has initialized. i.e. when the user is able to enter code, the Spark context, Spark session etc have all completed initialization, so {{sc}}, {{spark}} are all ready to use. In the current Spark master branch (to become Spark 2.4.0), the Scala prompt shows up immediately, while Spark itself is still in initialization in the background. It's very easy for the user to feel as if the shell is ready and start typing, only to find that Spark isn't ready yet, and Spark's initialization logs get in the way of typing. This new behavior is rather annoying from a usability's perspective. A typical startup of the Spark Shell in current master: {code:none} $ bin/spark-shell 18/08/16 23:18:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.0-SNAPSHOT /_/ Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) Type in expressions to have them evaluated. Type :help for more information. scala> spark.range(1)Spark context Web UI available at http://localhost:4040 Spark context available as 'sc' (master = local[*], app id = local-1534486692744). Spark session available as 'spark'. .show +---+ | id| +---+ | 0| +---+ scala> {code} Could you see that it was running {{spark.range(1).show}} ? 
In contrast, previous versions of Spark Shell would wait for Spark to fully initialize: {code:none} $ bin/spark-shell 18/08/16 23:20:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://10.0.0.76:4040 Spark context available as 'sc' (master = local[*], app id = local-1534486813159). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.3-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) Type in expressions to have them evaluated. Type :help for more information. scala> spark.range(1).show +---+ | id| +---+ | 0| +---+ scala> {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25113) Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit
Kris Mok created SPARK-25113: Summary: Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit Key: SPARK-25113 URL: https://issues.apache.org/jira/browse/SPARK-25113 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Kris Mok Add logging to help collect statistics on how often real world usage sees the {{CodeGenerator}} generating methods whose bytecode size goes above the 8000 bytes (HugeMethodLimit) threshold. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
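A minimal sketch of the kind of check SPARK-25113 asks for, written as plain Scala with no Spark dependency. The names {{HugeMethodLogSketch}} and {{oversizedMethods}}, and the idea of passing method sizes in as a map, are assumptions for illustration only; the 8000-byte threshold is the one quoted in the description above.

```scala
object HugeMethodLogSketch {
  // Threshold quoted in the ticket (8000 bytes of bytecode per method).
  val HugeMethodLimit: Int = 8000

  // Hypothetical helper: returns one log line per method whose
  // bytecode size is strictly above the limit, ordered by method name.
  def oversizedMethods(bytecodeSizes: Map[String, Int]): Seq[String] =
    bytecodeSizes.toSeq
      .filter { case (_, size) => size > HugeMethodLimit }
      .sortBy { case (name, _) => name }
      .map { case (name, size) =>
        s"Generated method $name is $size bytes, above HugeMethodLimit ($HugeMethodLimit)"
      }
}
```

A real implementation would presumably feed these lines to the logger inside {{CodeGenerator}}; returning them as strings here just keeps the sketch easy to check.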
[jira] [Created] (SPARK-24659) GenericArrayData.equals should respect element type differences
Kris Mok created SPARK-24659: Summary: GenericArrayData.equals should respect element type differences Key: SPARK-24659 URL: https://issues.apache.org/jira/browse/SPARK-24659 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1, 2.3.0, 2.4.0 Reporter: Kris Mok Right now, Spark SQL's {{GenericArrayData.equals}} doesn't always respect element type differences, due to a caveat in Scala's {{==}} operator. e.g. {{new GenericArrayData(Array[Int](123)).equals(new GenericArrayData(Array[Long](123L)))}} currently returns true. But that's against the semantics of Spark SQL's array type, where {{array<int>}} and {{array<bigint>}} are considered to be incompatible types and thus should never be equal. This ticket proposes to fix the implementation of {{GenericArrayData.equals}} so that it's more aligned with Spark SQL's array type semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
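The {{==}} caveat in SPARK-24659 can be reproduced without Spark. The sketch below is only an illustration under stated assumptions: {{ArrayEqualsSketch}}, {{looseEquals}} and {{strictEquals}} are hypothetical names, and the runtime-class check is one possible way to respect element types, not necessarily what the actual fix does.

```scala
object ArrayEqualsSketch {
  // Element-wise comparison using Scala's ==, which treats boxed
  // numbers of different types as equal when their values match.
  def looseEquals(a: Array[Any], b: Array[Any]): Boolean =
    a.length == b.length && a.indices.forall(i => a(i) == b(i))

  // Hypothetical stricter variant: elements must also share a runtime class.
  def strictEquals(a: Array[Any], b: Array[Any]): Boolean =
    a.length == b.length && a.indices.forall { i =>
      val x = a(i)
      val y = b(i)
      if (x == null || y == null) x == null && y == null
      else x.getClass == y.getClass && x == y
    }
}
```

With these definitions, an {{Int}} element 123 and a {{Long}} element 123L compare equal under {{looseEquals}} but not under {{strictEquals}}.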
[jira] [Created] (SPARK-24321) Extract common code from Divide/Remainder to a base trait
Kris Mok created SPARK-24321: Summary: Extract common code from Divide/Remainder to a base trait Key: SPARK-24321 URL: https://issues.apache.org/jira/browse/SPARK-24321 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Kris Mok There's a lot of code duplication between the {{Divide}} and {{Remainder}} expression types. The duplicated parts are mostly the codegen template (which is exactly the same, with just cosmetic differences), the eval function structure, etc. It's tedious to have to update multiple places whenever we improve the codegen templates in the future. This ticket proposes to refactor the duplicate code into a common base trait for these two classes. Non-goal: There's another class, {{Pmod}}, that is also similar to {{Divide}} and {{Remainder}}, so in theory we can make a deeper refactoring to accommodate this class as well. But the "operation" part of its codegen template is harder to factor into the base trait, so this ticket only handles {{Divide}} and {{Remainder}} for now. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
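The shape of the refactoring proposed in SPARK-24321 might look roughly like the following plain-Scala sketch. Everything here is hypothetical ({{DivModLikeSketch}}, {{DivideSketch}}, {{RemainderSketch}}, and the use of {{Option[Long]}} to model SQL NULL); it only shows the idea of keeping the shared null/zero handling in one base trait and leaving the "operation" part to each subclass.

```scala
// Hypothetical base trait: the shared eval structure lives here once.
trait DivModLikeSketch {
  def symbol: String

  // The only part that differs between the two expression types.
  protected def operation(left: Long, right: Long): Long

  // None models SQL NULL; a NULL or zero divisor yields NULL.
  final def eval(left: Option[Long], right: Option[Long]): Option[Long] =
    right match {
      case None | Some(0L) => None
      case Some(r)         => left.map(l => operation(l, r))
    }
}

object DivideSketch extends DivModLikeSketch {
  val symbol = "/"
  protected def operation(left: Long, right: Long): Long = left / right
}

object RemainderSketch extends DivModLikeSketch {
  val symbol = "%"
  protected def operation(left: Long, right: Long): Long = left % right
}
```

The same split would apply to the codegen template: the base trait would own the null checks and the divisor test, and each subclass would contribute only the operator.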
[jira] [Created] (SPARK-23960) Mark HashAggregateExec.bufVars as transient
Kris Mok created SPARK-23960: Summary: Mark HashAggregateExec.bufVars as transient Key: SPARK-23960 URL: https://issues.apache.org/jira/browse/SPARK-23960 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Kris Mok {{HashAggregateExec.bufVars}} is only used during codegen for global aggregation. Specifically, it's only used while {{doProduceWithoutKeys()}} is on the stack. Currently, if a {{HashAggregateExec}} is ever captured for serialization, {{bufVars}} would be needlessly serialized. This ticket proposes a minor change to mark the {{bufVars}} field as transient to avoid serializing it. Also, null out this field at the end of {{doProduceWithoutKeys()}} to shorten its lifetime so that the {{Seq[ExprCode]}} being referenced can be GC'd sooner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
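To illustrate the two parts of the SPARK-23960 proposal (a transient field, plus nulling it out once codegen is done), here is a small sketch using plain Java serialization. {{OperatorSketch}} and {{TransientSketch}} are hypothetical stand-ins, not Spark classes, and {{Seq[String]}} stands in for {{Seq[ExprCode]}}.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for an operator holding codegen-only scratch state.
class OperatorSketch(val name: String) extends Serializable {
  // Skipped by Java serialization because of @transient.
  @transient var bufVars: Seq[String] = null

  def produce(): String = {
    bufVars = Seq("bufValue_0", "bufValue_1")
    try bufVars.mkString(name + "(", ", ", ")")
    finally bufVars = null // drop the reference as soon as codegen is done
  }
}

object TransientSketch {
  // Serializes and deserializes the operator in memory.
  def roundTrip(op: OperatorSketch): OperatorSketch = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(op) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[OperatorSketch] finally in.close()
  }
}
```

After a round trip the copy keeps {{name}} but comes back with {{bufVars}} unset, which is the intended effect of marking the field transient.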
[jira] [Created] (SPARK-23947) Add hashUTF8String convenience method to hasher classes
Kris Mok created SPARK-23947: Summary: Add hashUTF8String convenience method to hasher classes Key: SPARK-23947 URL: https://issues.apache.org/jira/browse/SPARK-23947 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Kris Mok Add {{hashUTF8String()}} to the hasher classes to allow Spark SQL codegen to generate cleaner code for hashing {{UTF8String}}. No change in behavior otherwise. Although with the introduction of SPARK-10399, the code size for hashing {{UTF8String}} is already smaller, it's still good to extract a separate function in the hasher classes so that the generated code can stay clean. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23760) CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly
Kris Mok created SPARK-23760: Summary: CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly Key: SPARK-23760 URL: https://issues.apache.org/jira/browse/SPARK-23760 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.2.1, 2.2.0 Reporter: Kris Mok There's a bug in {{CodegenContext.withSubExprEliminationExprs()}} that makes it effectively always clear the subexpression elimination state after it's called. The original intent of this function was that it should save the old state, set the given new state as current and perform codegen (invoke {{Expression.genCode()}}), and at the end restore the subexpression elimination state back to the old state. This ticket tracks a fix to actually implement the original intent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
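The save/set/run/restore intent described in SPARK-23760 can be sketched as follows. This is a simplified model, not the real {{CodegenContext}}: {{CodegenContextSketch}}, {{subExprState}} and {{withSubExprState}} are hypothetical names, the state is reduced to a {{Map[String, String]}}, and the use of {{try}}/{{finally}} is a choice made for this sketch.

```scala
// Hypothetical, heavily simplified model of a codegen context.
class CodegenContextSketch {
  // Stand-in for the subexpression elimination state.
  var subExprState: Map[String, String] = Map.empty

  // Saves the old state, installs newState, runs body, then restores the old state.
  def withSubExprState[T](newState: Map[String, String])(body: => T): T = {
    val oldState = subExprState
    subExprState = newState
    try body
    finally subExprState = oldState
  }
}
```

The bug described above corresponds to ending up with an empty state after the call instead of {{oldState}}.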
[jira] [Created] (SPARK-23447) Cleanup codegen template for Literal
Kris Mok created SPARK-23447: Summary: Cleanup codegen template for Literal Key: SPARK-23447 URL: https://issues.apache.org/jira/browse/SPARK-23447 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.1, 2.2.0 Reporter: Kris Mok Ideally, the codegen templates for {{Literal}} should emit literals in the {{isNull}} and {{value}} fields of {{ExprCode}} so that they can be effectively inlined into their use sites. But currently there are a couple of paths where {{Literal.doGenCode()}} returns an {{ExprCode}} with a non-trivial {{code}} field, and all of those are actually unnecessary. We can make a simple refactoring to make sure all codegen templates for {{Literal}} return an empty {{code}} and simple literal/constant expressions in {{isNull}} and {{value}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
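As a rough picture of the goal in SPARK-23447, the sketch below returns an empty {{code}} for every supported literal and puts plain Java literal text in {{isNull}} and {{value}}. {{ExprCodeSketch}} and {{LiteralCodegenSketch}} are hypothetical simplifications (the real {{ExprCode}} and {{Literal.doGenCode()}} deal with many more types), and using {{"null"}} as the value text of a null literal is an assumption of this sketch.

```scala
// Hypothetical simplification of ExprCode: three pieces of Java source text.
final case class ExprCodeSketch(code: String, isNull: String, value: String)

object LiteralCodegenSketch {
  // Every supported literal yields empty code plus inlinable constants.
  def gen(literal: Any): ExprCodeSketch = literal match {
    case null       => ExprCodeSketch("", "true", "null")
    case b: Boolean => ExprCodeSketch("", "false", b.toString)
    case i: Int     => ExprCodeSketch("", "false", i.toString)
    case l: Long    => ExprCodeSketch("", "false", s"${l}L")
    case other =>
      throw new IllegalArgumentException(s"unsupported literal in this sketch: $other")
  }
}
```

Because {{code}} is always empty, a consumer can splice {{isNull}} and {{value}} straight into its own generated expression without emitting any setup statements first.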
[jira] [Commented] (SPARK-23021) AnalysisBarrier should not cut off the explain output for Parsed Logical Plan
[ https://issues.apache.org/jira/browse/SPARK-23021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324318#comment-16324318 ] Kris Mok commented on SPARK-23021: -- Thank you very much for explaining the differences between {{children}} and {{innerChildren}}, [~viirya] and [~cloud_fan]! TIL :-) > AnalysisBarrier should not cut off the explain output for Parsed Logical Plan > - > > Key: SPARK-23021 > URL: https://issues.apache.org/jira/browse/SPARK-23021 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kris Mok > > In PR#20094 as a follow up to SPARK-20392, there were some fixes to the > handling of {{AnalysisBarrier}}, but there seem to be more cases that need to > be fixed. > One such case is that right now the Parsed Logical Plan in explain output > would be cutoff by {{AnalysisBarrier}}, e.g. > {code:none} > scala> val df1 = spark.range(1).select('id as 'x, 'id + 1 as > 'y).repartition(1).select('x === 'y) > df1: org.apache.spark.sql.DataFrame = [(x = y): boolean] > scala> df1.explain(true) > == Parsed Logical Plan == > 'Project [('x = 'y) AS (x = y)#22] > +- AnalysisBarrier Repartition 1, true > == Analyzed Logical Plan == > (x = y): boolean > Project [(x#16L = y#17L) AS (x = y)#22] > +- Repartition 1, true >+- Project [id#13L AS x#16L, (id#13L + cast(1 as bigint)) AS y#17L] > +- Range (0, 1, step=1, splits=Some(8)) > == Optimized Logical Plan == > Project [(x#16L = y#17L) AS (x = y)#22] > +- Repartition 1, true >+- Project [id#13L AS x#16L, (id#13L + 1) AS y#17L] > +- Range (0, 1, step=1, splits=Some(8)) > == Physical Plan == > *Project [(x#16L = y#17L) AS (x = y)#22] > +- Exchange RoundRobinPartitioning(1) >+- *Project [id#13L AS x#16L, (id#13L + 1) AS y#17L] > +- *Range (0, 1, step=1, splits=8) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23021) AnalysisBarrier should not cut off the explain output for Parsed Logical Plan
[ https://issues.apache.org/jira/browse/SPARK-23021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323672#comment-16323672 ] Kris Mok commented on SPARK-23021: -- Hi [~maropu]-san, Thanks! I'm not familiar with this part of the code, but IIUC there's two points to this: 1. The explain output shouldn't be changed because of the newly added {{AnalysisBarrier}}. The current behavior (with the cutoff) should be considered a regression, although it's not a major behavioral one. 2. The reason why {{AnalysisBarrier}} didn't override {{innerChildren}} was by design: it wanted to cut off the recursion to help avoid stack overflows when the analyzer needs to walk the plan tree. [~cloud_fan] may have recently fixed a similar issue where the "analyzed logical plan" was cut off by {{AnalysisBarrier}} in the same way. Is that the case, [~cloud_fan]? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23021) AnalysisBarrier should not cut off the explain output for Parsed Logical Plan
[ https://issues.apache.org/jira/browse/SPARK-23021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323612#comment-16323612 ] Kris Mok commented on SPARK-23021: -- Hi [~maropu]-san, Thanks for looking at it! No, I only noticed this behavior while working on something else. I wasn't planning on working on this ticket, unless no one takes it and I get too curious ;-) Would you like to submit that patch as a PR? Thanks, Kris -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec
[ https://issues.apache.org/jira/browse/SPARK-23032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kris Mok updated SPARK-23032: - Description: Proposing to add a per-query ID to the codegen stages as represented by {{WholeStageCodegenExec}} operators. This ID will be used in * the explain output of the physical plan, and in * the generated class name. Specifically, this ID will be stable within a query, counting up from 1 in depth-first post-order for all the {{WholeStageCodegenExec}} inserted into a plan. The ID value 0 is reserved for "free-floating" {{WholeStageCodegenExec}} objects, which may have been created for one-off purposes, e.g. for fallback handling of codegen stages that failed to codegen the whole stage and wishes to codegen a subset of the children operators. Example: for the following query: {code:none} scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1) scala> val df1 = spark.range(10).select('id as 'x, 'id + 1 as 'y).orderBy('x).select('x + 1 as 'z, 'y) df1: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint] scala> val df2 = spark.range(5) df2: org.apache.spark.sql.Dataset[Long] = [id: bigint] scala> val query = df1.join(df2, 'z === 'id) query: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint ... 1 more field] {code} The explain output before the change is: {code:none} scala> query.explain == Physical Plan == *SortMergeJoin [z#9L], [id#13L], Inner :- *Sort [z#9L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(z#9L, 200) : +- *Project [(x#3L + 1) AS z#9L, y#4L] :+- *Sort [x#3L ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200) : +- *Project [id#0L AS x#3L, (id#0L + 1) AS y#4L] : +- *Range (0, 10, step=1, splits=8) +- *Sort [id#13L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#13L, 200) +- *Range (0, 5, step=1, splits=8) {code} Note how codegen'd operators are annotated with a prefix {{"*"}}. 
and after this change it'll be: {code:none} scala> query.explain == Physical Plan == *(6) SortMergeJoin [z#9L], [id#13L], Inner :- *(3) Sort [z#9L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(z#9L, 200) : +- *(2) Project [(x#3L + 1) AS z#9L, y#4L] :+- *(2) Sort [x#3L ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200) : +- *(1) Project [id#0L AS x#3L, (id#0L + 1) AS y#4L] : +- *(1) Range (0, 10, step=1, splits=8) +- *(5) Sort [id#13L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#13L, 200) +- *(4) Range (0, 5, step=1, splits=8) {code} Note that the annotated prefix becomes {{"*(id) "}} It'll also show up in the name of the generated class, as a suffix in the format of {code:none} GeneratedClass$GeneratedIterator$id {code} for example, note how {{GeneratedClass$GeneratedIteratorForCodegenStage3}} and {{GeneratedClass$GeneratedIteratorForCodegenStage6}} in the following stack trace corresponds to the IDs shown in the explain output above: {code:none} "Executor task launch worker for task 424@12957" daemon prio=5 tid=0x58 nid=NA runnable java.lang.Thread.State: RUNNABLE at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.sort_addToSorter$(generated.java:32) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(generated.java:41) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$9$$anon$1.hasNext(WholeStageCodegenExec.scala:494) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.findNextInnerJoinRows$(generated.java:42) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(generated.java:101) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$2.hasNext(WholeStageCodegenExec.scala:513) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at
[jira] [Created] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec
Kris Mok created SPARK-23032: Summary: Add a per-query codegenStageId to WholeStageCodegenExec Key: SPARK-23032 URL: https://issues.apache.org/jira/browse/SPARK-23032 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Kris Mok Proposing to add a per-query ID to the codegen stages as represented by {{WholeStageCodegenExec}} operators. This ID will be used in * the explain output of the physical plan, and in * the generated class name. Specifically, this ID will be stable within a query, counting up from 1 in depth-first post-order for all the {{WholeStageCodegenExec}} inserted into a plan. The ID value 0 is reserved for "free-floating" {{WholeStageCodegenExec}} objects, which may have been created for one-off purposes, e.g. for fallback handling of codegen stages that failed to codegen the whole stage and wishes to codegen a subset of the children operators. Example: for the following query: {code:none} scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1) scala> val df1 = spark.range(10).select('id as 'x, 'id + 1 as 'y).orderBy('x).select('x + 1 as 'z, 'y) df1: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint] scala> val df2 = spark.range(5) df2: org.apache.spark.sql.Dataset[Long] = [id: bigint] scala> val query = df1.join(df2, 'z === 'id) query: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint ... 
1 more field] {code} The explain output before the change is: {code:none} scala> query.explain == Physical Plan == *SortMergeJoin [z#9L], [id#13L], Inner :- *Sort [z#9L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(z#9L, 200) : +- *Project [(x#3L + 1) AS z#9L, y#4L] :+- *Sort [x#3L ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200) : +- *Project [id#0L AS x#3L, (id#0L + 1) AS y#4L] : +- *Range (0, 10, step=1, splits=8) +- *Sort [id#13L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#13L, 200) +- *Range (0, 5, step=1, splits=8) {code} Note how codegen'd operators are annotated with a prefix {{"*"}}. and after this change it'll be: {code:none} scala> query.explain == Physical Plan == *(6) SortMergeJoin [z#9L], [id#13L], Inner :- *(3) Sort [z#9L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(z#9L, 200) : +- *(2) Project [(x#3L + 1) AS z#9L, y#4L] :+- *(2) Sort [x#3L ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200) : +- *(1) Project [id#0L AS x#3L, (id#0L + 1) AS y#4L] : +- *(1) Range (0, 10, step=1, splits=8) +- *(5) Sort [id#13L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#13L, 200) +- *(4) Range (0, 5, step=1, splits=8) {code} Note that the annotated prefix becomes {{"*(id) "}} It'll also show up in the name of the generated class, as a suffix in the format of {code:none} GeneratedClass$GeneratedIterator$id {code} for example, note how {{GeneratedClass$GeneratedIterator$3}} and {{GeneratedClass$GeneratedIterator$6}} in the following stack trace corresponds to the IDs shown in the explain output above: {code:none} "Executor task launch worker for task 424@12957" daemon prio=5 tid=0x58 nid=NA runnable java.lang.Thread.State: RUNNABLE at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$3.sort_addToSorter$(generated.java:32) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$3.processNext(generated.java:41) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$9$$anon$1.hasNext(WholeStageCodegenExec.scala:494) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$6.findNextInnerJoinRows$(generated.java:42) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$6.processNext(generated.java:101) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$2.hasNext(WholeStageCodegenExec.scala:513) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828) at
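The "counting up from 1 in depth-first post-order" numbering from the SPARK-23032 description can be sketched on a toy tree. {{StageNode}} and {{StageIdSketch}} are hypothetical names, and the tree below models only the codegen stages of the example query, not the full physical plan.

```scala
// Hypothetical model: one node per codegen stage, children are upstream stages.
final case class StageNode(name: String, children: List[StageNode] = Nil)

object StageIdSketch {
  // Visits children first, then the node itself, handing out IDs from 1.
  def assignIds(root: StageNode): List[(String, Int)] = {
    var counter = 0
    val assigned = List.newBuilder[(String, Int)]
    def visit(node: StageNode): Unit = {
      node.children.foreach(visit)
      counter += 1
      assigned += (node.name -> counter)
    }
    visit(root)
    assigned.result()
  }

  // Stage layout of the example query: the join stage has two input chains.
  val examplePlan: StageNode =
    StageNode("SortMergeJoin", List(
      StageNode("Sort z", List(
        StageNode("Project z, Sort x", List(
          StageNode("Project x y, Range 10"))))),
      StageNode("Sort id", List(
        StageNode("Range 5")))))
}
```

On {{examplePlan}} this yields IDs 1 to 3 along the left chain, 4 and 5 on the right chain, and 6 for the join, which lines up with the {{*(id)}} prefixes in the explain output quoted above.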
[jira] [Created] (SPARK-23021) AnalysisBarrier should not cut off the explain output for Parsed Logical Plan
Kris Mok created SPARK-23021: Summary: AnalysisBarrier should not cut off the explain output for Parsed Logical Plan Key: SPARK-23021 URL: https://issues.apache.org/jira/browse/SPARK-23021 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Kris Mok Priority: Minor In PR#20094 as a follow up to SPARK-20392, there were some fixes to the handling of {{AnalysisBarrier}}, but there seem to be more cases that need to be fixed. One such case is that right now the Parsed Logical Plan in explain output would be cutoff by {{AnalysisBarrier}}, e.g. {code:none} scala> val df1 = spark.range(1).select('id as 'x, 'id + 1 as 'y).repartition(1).select('x === 'y) df1: org.apache.spark.sql.DataFrame = [(x = y): boolean] scala> df1.explain(true) == Parsed Logical Plan == 'Project [('x = 'y) AS (x = y)#22] +- AnalysisBarrier Repartition 1, true == Analyzed Logical Plan == (x = y): boolean Project [(x#16L = y#17L) AS (x = y)#22] +- Repartition 1, true +- Project [id#13L AS x#16L, (id#13L + cast(1 as bigint)) AS y#17L] +- Range (0, 1, step=1, splits=Some(8)) == Optimized Logical Plan == Project [(x#16L = y#17L) AS (x = y)#22] +- Repartition 1, true +- Project [id#13L AS x#16L, (id#13L + 1) AS y#17L] +- Range (0, 1, step=1, splits=Some(8)) == Physical Plan == *Project [(x#16L = y#17L) AS (x = y)#22] +- Exchange RoundRobinPartitioning(1) +- *Project [id#13L AS x#16L, (id#13L + 1) AS y#17L] +- *Range (0, 1, step=1, splits=8) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
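The cutoff in SPARK-23021 comes down to the difference between {{children}} (what tree traversals recurse into) and {{innerChildren}} (extra nodes that are only shown when printing). The toy model below is an assumption-laden sketch: {{PlanSketch}}, {{LeafSketch}}, {{UnarySketch}} and {{BarrierSketch}} are hypothetical classes, not Catalyst's {{TreeNode}}, and the printing format is invented.

```scala
sealed trait PlanSketch {
  def name: String
  // Nodes that traversals recurse into.
  def children: Seq[PlanSketch]
  // Extra nodes shown only when printing the tree.
  def innerChildren: Seq[PlanSketch] = Nil

  // Printing follows both innerChildren and children.
  def treeString(depth: Int = 0): String = {
    val self = ("  " * depth) + name
    (self +: (innerChildren ++ children).map(_.treeString(depth + 1))).mkString("\n")
  }

  // A traversal that, like an analyzer walk, follows children only.
  def nodesReachedByTraversal: Int = 1 + children.map(_.nodesReachedByTraversal).sum
}

final case class LeafSketch(name: String) extends PlanSketch {
  val children: Seq[PlanSketch] = Nil
}

final case class UnarySketch(name: String, child: PlanSketch) extends PlanSketch {
  val children: Seq[PlanSketch] = Seq(child)
}

// The barrier hides its child from traversal but still exposes it for printing.
final case class BarrierSketch(child: PlanSketch) extends PlanSketch {
  val name = "AnalysisBarrier"
  val children: Seq[PlanSketch] = Nil
  override val innerChildren: Seq[PlanSketch] = Seq(child)
}
```

In this model a traversal stops at the barrier, while {{treeString}} still prints the plan underneath it, which is the behavior the ticket asks for in the Parsed Logical Plan section.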
[jira] [Created] (SPARK-22966) Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime
Kris Mok created SPARK-22966: Summary: Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime Key: SPARK-22966 URL: https://issues.apache.org/jira/browse/SPARK-22966 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.1, 2.2.0 Reporter: Kris Mok Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} (which should correspond to a Spark SQL {{timestamp}} type), it gets unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand internally, and will thus give incorrect results. e.g. {code:python} >>> import datetime >>> from pyspark.sql import * >>> py_date = udf(datetime.date) >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == >>> lit(datetime.date(2017, 10, 30))).show() ++ |(date(2017, 10, 30) = DATE '2017-10-30')| ++ | false| ++ {code} (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 'date')}} doesn't work either) We should correctly handle Python UDFs that return objects of such types. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
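On the JVM side, handling the values described in SPARK-22966 means turning a {{java.util.Calendar}} into the forms Spark SQL uses internally: days since the epoch for {{date}} and microseconds since the epoch for {{timestamp}}. The sketch below shows one possible conversion in plain Scala; {{CalendarToSqlSketch}} and its method names are hypothetical, and this is not necessarily how the eventual fix is implemented.

```scala
import java.time.LocalDate
import java.util.Calendar

object CalendarToSqlSketch {
  // Days since 1970-01-01, read from the calendar's own year/month/day fields.
  // Calendar months are 0-based, LocalDate months are 1-based.
  // Dates in the BC era are not handled in this sketch.
  def toEpochDays(c: Calendar): Int =
    LocalDate
      .of(c.get(Calendar.YEAR), c.get(Calendar.MONTH) + 1, c.get(Calendar.DAY_OF_MONTH))
      .toEpochDay
      .toInt

  // Microseconds since the epoch, derived from the calendar's instant.
  def toEpochMicros(c: Calendar): Long = c.getTimeInMillis * 1000L
}
```

For example, a {{GregorianCalendar}} set to 2017-10-30 maps to the same day count as {{LocalDate.of(2017, 10, 30).toEpochDay}}, so a comparison against a {{DATE '2017-10-30'}} literal would then be between like values.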
[jira] [Updated] (SPARK-22966) Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime
[ https://issues.apache.org/jira/browse/SPARK-22966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kris Mok updated SPARK-22966: - Description: Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} (which should correspond to a Spark SQL {{timestamp}} type), it gets unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand internally, and will thus give incorrect results. e.g. {code:none} >>> import datetime >>> from pyspark.sql import * >>> py_date = udf(datetime.date) >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == >>> lit(datetime.date(2017, 10, 30))).show() ++ |(date(2017, 10, 30) = DATE '2017-10-30')| ++ | false| ++ {code} (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 'date')}} doesn't work either) We should correctly handle Python UDFs that return objects of such types. was: Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} (which should correspond to a Spark SQL {{timestamp}} type), it gets unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand internally, and will thus give incorrect results. e.g. {code:python} >>> import datetime >>> from pyspark.sql import * >>> py_date = udf(datetime.date) >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == >>> lit(datetime.date(2017, 10, 30))).show() ++ |(date(2017, 10, 30) = DATE '2017-10-30')| ++ | false| ++ {code} (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 'date')}} doesn't work either) We should correctly handle Python UDFs that return objects of such types. 
> Spark SQL should handle Python UDFs that return a datetime.date or > datetime.datetime > > > Key: SPARK-22966 > URL: https://issues.apache.org/jira/browse/SPARK-22966 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Kris Mok > > Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which > should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} > (which should correspond to a Spark SQL {{timestamp}} type), it gets > unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand > internally, and will thus give incorrect results. > e.g. > {code:none} > >>> import datetime > >>> from pyspark.sql import * > >>> py_date = udf(datetime.date) > >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == > >>> lit(datetime.date(2017, 10, 30))).show() > ++ > |(date(2017, 10, 30) = DATE '2017-10-30')| > ++ > | false| > ++ > {code} > (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, > 'date')}} doesn't work either) > We should correctly handle Python UDFs that return objects of such types. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21927) Spark pom.xml's dependency management is broken
[ https://issues.apache.org/jira/browse/SPARK-21927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154426#comment-16154426 ] Kris Mok commented on SPARK-21927: -- [~sowen] Could you please take a look and see if recent change to POM (e.g. SPARK-14280) would have caused this issue? Thanks! > Spark pom.xml's dependency management is broken > --- > > Key: SPARK-21927 > URL: https://issues.apache.org/jira/browse/SPARK-21927 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 > Environment: Apache Spark current master (commit > 12ab7f7e89ec9e102859ab3b710815d3058a2e8d) >Reporter: Kris Mok > > When building the current Spark master just now (commit > 12ab7f7e89ec9e102859ab3b710815d3058a2e8d), I noticed the build prints a lot > of warning messages such as the following. Looks like the dependency > management in the POMs are somehow broken recently. > {code:none} > .../workspace/apache-spark/master (master) $ build/sbt clean package > Attempting to fetch sbt > Launching sbt from build/sbt-launch-0.13.16.jar > [info] Loading project definition from > .../workspace/apache-spark/master/project > [info] Updating > {file:.../workspace/apache-spark/master/project/}master-build... > [info] Resolving org.fusesource.jansi#jansi;1.4 ... > [info] downloading > https://repo1.maven.org/maven2/org/scalastyle/scalastyle-sbt-plugin_2.10_0.13/1.0.0/scalastyle-sbt-plugin-1.0.0.jar > ... > [info] [SUCCESSFUL ] > org.scalastyle#scalastyle-sbt-plugin;1.0.0!scalastyle-sbt-plugin.jar (239ms) > [info] downloading > https://repo1.maven.org/maven2/org/scalastyle/scalastyle_2.10/1.0.0/scalastyle_2.10-1.0.0.jar > ... > [info] [SUCCESSFUL ] > org.scalastyle#scalastyle_2.10;1.0.0!scalastyle_2.10.jar (465ms) > [info] Done updating. 
> [warn] Found version conflict(s) in library dependencies; some are suspected > to be binary incompatible: > [warn] > [warn] * org.apache.maven.wagon:wagon-provider-api:2.2 is selected over > 1.0-beta-6 > [warn] +- org.apache.maven:maven-compat:3.0.4(depends > on 2.2) > [warn] +- org.apache.maven.wagon:wagon-file:2.2 (depends > on 2.2) > [warn] +- org.spark-project:sbt-pom-reader:1.0.0-spark > (scalaVersion=2.10, sbtVersion=0.13) (depends on 2.2) > [warn] +- org.apache.maven.wagon:wagon-http-shared4:2.2 (depends > on 2.2) > [warn] +- org.apache.maven.wagon:wagon-http:2.2 (depends > on 2.2) > [warn] +- org.apache.maven.wagon:wagon-http-lightweight:2.2 (depends > on 2.2) > [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1 (depends > on 1.0-beta-6) > [warn] > [warn] * org.codehaus.plexus:plexus-utils:3.0 is selected over {2.0.7, > 2.0.6, 2.1, 1.5.5} > [warn] +- org.apache.maven.wagon:wagon-provider-api:2.2 (depends > on 3.0) > [warn] +- org.apache.maven:maven-compat:3.0.4(depends > on 2.0.6) > [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.3.0 (depends > on 2.0.6) > [warn] +- org.apache.maven:maven-artifact:3.0.4 (depends > on 2.0.6) > [warn] +- org.apache.maven:maven-core:3.0.4 (depends > on 2.0.6) > [warn] +- org.sonatype.plexus:plexus-sec-dispatcher:1.3 (depends > on 2.0.6) > [warn] +- org.apache.maven:maven-embedder:3.0.4 (depends > on 2.0.6) > [warn] +- org.apache.maven:maven-settings:3.0.4 (depends > on 2.0.6) > [warn] +- org.apache.maven:maven-settings-builder:3.0.4 (depends > on 2.0.6) > [warn] +- org.apache.maven:maven-model-builder:3.0.4 (depends > on 2.0.7) > [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1 (depends > on 2.0.7) > [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.2.3 (depends > on 2.0.7) > [warn] +- org.apache.maven:maven-model:3.0.4 (depends > on 2.0.7) > [warn] +- org.apache.maven:maven-aether-provider:3.0.4 (depends > on 2.0.7) > [warn] +- org.apache.maven:maven-repository-metadata:3.0.4 (depends > on 2.0.7) > 
[warn] > [warn] * cglib:cglib is evicted completely > [warn] +- org.sonatype.sisu:sisu-guice:3.0.3 (depends > on 2.2.2) > [warn] > [warn] * asm:asm is evicted completely > [warn] +- cglib:cglib:2.2.2 (depends > on 3.3.1) > [warn] > [warn] Run 'evicted' to see detailed eviction warnings > {code}
[jira] [Created] (SPARK-21927) Spark pom.xml's dependency management is broken
Kris Mok created SPARK-21927: Summary: Spark pom.xml's dependency management is broken Key: SPARK-21927 URL: https://issues.apache.org/jira/browse/SPARK-21927 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.1 Environment: Apache Spark current master (commit 12ab7f7e89ec9e102859ab3b710815d3058a2e8d) Reporter: Kris Mok While building the current Spark master (commit 12ab7f7e89ec9e102859ab3b710815d3058a2e8d), I noticed that the build prints many warning messages such as the following. It looks like the dependency management in the POMs was broken recently. {code:none} .../workspace/apache-spark/master (master) $ build/sbt clean package Attempting to fetch sbt Launching sbt from build/sbt-launch-0.13.16.jar [info] Loading project definition from .../workspace/apache-spark/master/project [info] Updating {file:.../workspace/apache-spark/master/project/}master-build... [info] Resolving org.fusesource.jansi#jansi;1.4 ... [info] downloading https://repo1.maven.org/maven2/org/scalastyle/scalastyle-sbt-plugin_2.10_0.13/1.0.0/scalastyle-sbt-plugin-1.0.0.jar ... [info] [SUCCESSFUL ] org.scalastyle#scalastyle-sbt-plugin;1.0.0!scalastyle-sbt-plugin.jar (239ms) [info] downloading https://repo1.maven.org/maven2/org/scalastyle/scalastyle_2.10/1.0.0/scalastyle_2.10-1.0.0.jar ... [info] [SUCCESSFUL ] org.scalastyle#scalastyle_2.10;1.0.0!scalastyle_2.10.jar (465ms) [info] Done updating. 
[warn] Found version conflict(s) in library dependencies; some are suspected to be binary incompatible: [warn] [warn] * org.apache.maven.wagon:wagon-provider-api:2.2 is selected over 1.0-beta-6 [warn] +- org.apache.maven:maven-compat:3.0.4(depends on 2.2) [warn] +- org.apache.maven.wagon:wagon-file:2.2 (depends on 2.2) [warn] +- org.spark-project:sbt-pom-reader:1.0.0-spark (scalaVersion=2.10, sbtVersion=0.13) (depends on 2.2) [warn] +- org.apache.maven.wagon:wagon-http-shared4:2.2 (depends on 2.2) [warn] +- org.apache.maven.wagon:wagon-http:2.2 (depends on 2.2) [warn] +- org.apache.maven.wagon:wagon-http-lightweight:2.2 (depends on 2.2) [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1 (depends on 1.0-beta-6) [warn] [warn] * org.codehaus.plexus:plexus-utils:3.0 is selected over {2.0.7, 2.0.6, 2.1, 1.5.5} [warn] +- org.apache.maven.wagon:wagon-provider-api:2.2 (depends on 3.0) [warn] +- org.apache.maven:maven-compat:3.0.4(depends on 2.0.6) [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.3.0 (depends on 2.0.6) [warn] +- org.apache.maven:maven-artifact:3.0.4 (depends on 2.0.6) [warn] +- org.apache.maven:maven-core:3.0.4 (depends on 2.0.6) [warn] +- org.sonatype.plexus:plexus-sec-dispatcher:1.3 (depends on 2.0.6) [warn] +- org.apache.maven:maven-embedder:3.0.4 (depends on 2.0.6) [warn] +- org.apache.maven:maven-settings:3.0.4 (depends on 2.0.6) [warn] +- org.apache.maven:maven-settings-builder:3.0.4 (depends on 2.0.6) [warn] +- org.apache.maven:maven-model-builder:3.0.4 (depends on 2.0.7) [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1 (depends on 2.0.7) [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.2.3 (depends on 2.0.7) [warn] +- org.apache.maven:maven-model:3.0.4 (depends on 2.0.7) [warn] +- org.apache.maven:maven-aether-provider:3.0.4 (depends on 2.0.7) [warn] +- org.apache.maven:maven-repository-metadata:3.0.4 (depends on 2.0.7) [warn] [warn] * cglib:cglib is evicted completely [warn] +- org.sonatype.sisu:sisu-guice:3.0.3 (depends on 
2.2.2) [warn] [warn] * asm:asm is evicted completely [warn] +- cglib:cglib:2.2.2 (depends on 3.3.1) [warn] [warn] Run 'evicted' to see detailed eviction warnings {code}
[jira] [Updated] (SPARK-21041) With whole-stage codegen, SparkSession.range()'s behavior is inconsistent with SparkContext.range()
[ https://issues.apache.org/jira/browse/SPARK-21041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kris Mok updated SPARK-21041: - Description: When whole-stage codegen is enabled, SparkSession.range() behaves differently in the face of integer overflow than it does when codegen is turned off; the latter behavior is consistent with SparkContext.range(). The following Spark Shell session shows the inconsistency: {code:java} scala> sc.range def range(start: Long,end: Long,step: Long,numSlices: Int): org.apache.spark.rdd.RDD[Long] scala> spark.range def range(start: Long,end: Long,step: Long,numPartitions: Int): org.apache.spark.sql.Dataset[Long] def range(start: Long,end: Long,step: Long): org.apache.spark.sql.Dataset[Long] def range(start: Long,end: Long): org.apache.spark.sql.Dataset[Long] def range(end: Long): org.apache.spark.sql.Dataset[Long] scala> sc.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect res1: Array[Long] = Array() scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 9223372036854775806) scala> spark.conf.set("spark.sql.codegen.wholeStage", false) scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect res5: Array[Long] = Array() {code} was: When whole-stage codegen is enabled, SparkSession.range() behaves differently in the face of integer overflow than it does when codegen is turned off; the latter behavior is consistent with SparkContext.range(). 
The following Spark Shell session shows the inconsistency: {code:scala} scala> sc.range def range(start: Long,end: Long,step: Long,numSlices: Int): org.apache.spark.rdd.RDD[Long] scala> spark.range def range(start: Long,end: Long,step: Long,numPartitions: Int): org.apache.spark.sql.Dataset[Long] def range(start: Long,end: Long,step: Long): org.apache.spark.sql.Dataset[Long] def range(start: Long,end: Long): org.apache.spark.sql.Dataset[Long] def range(end: Long): org.apache.spark.sql.Dataset[Long] scala> sc.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect res1: Array[Long] = Array() scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 9223372036854775806) scala> spark.conf.set("spark.sql.codegen.wholeStage", false) scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect res5: Array[Long] = Array() {code} > With whole-stage codegen, SparkSession.range()'s behavior is inconsistent > with SparkContext.range() > --- > > Key: SPARK-21041 > URL: https://issues.apache.org/jira/browse/SPARK-21041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kris Mok > > When whole-stage codegen is enabled, in face of integer overflow, > SparkSession.range()'s behavior is inconsistent with when codegen is turned > off, while the latter is consistent with SparkContext.range()'s behavior. 
> The following Spark Shell session shows the inconsistency: > {code:java} > scala> sc.range >def range(start: Long,end: Long,step: Long,numSlices: Int): > org.apache.spark.rdd.RDD[Long] > scala> spark.range > > > def range(start: Long,end: Long,step: Long,numPartitions: Int): > org.apache.spark.sql.Dataset[Long] > def range(start: Long,end: Long,step: Long): > org.apache.spark.sql.Dataset[Long] > def range(start: Long,end: Long): org.apache.spark.sql.Dataset[Long] > > def range(end: Long): org.apache.spark.sql.Dataset[Long] > scala> sc.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, > 1).collect > res1: Array[Long] = Array() > scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + > 2, 1).collect > res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, > 9223372036854775806) > scala> spark.conf.set("spark.sql.codegen.wholeStage", false) > scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + > 2, 1).collect > res5: Array[Long] = Array() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (SPARK-21041) With whole-stage codegen, SparkSession.range()'s behavior is inconsistent with SparkContext.range()
Kris Mok created SPARK-21041: Summary: With whole-stage codegen, SparkSession.range()'s behavior is inconsistent with SparkContext.range() Key: SPARK-21041 URL: https://issues.apache.org/jira/browse/SPARK-21041 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Kris Mok When whole-stage codegen is enabled, SparkSession.range() behaves differently in the face of integer overflow than it does when codegen is turned off; the latter behavior is consistent with SparkContext.range(). The following Spark Shell session shows the inconsistency: {code:scala} scala> sc.range def range(start: Long,end: Long,step: Long,numSlices: Int): org.apache.spark.rdd.RDD[Long] scala> spark.range def range(start: Long,end: Long,step: Long,numPartitions: Int): org.apache.spark.sql.Dataset[Long] def range(start: Long,end: Long,step: Long): org.apache.spark.sql.Dataset[Long] def range(start: Long,end: Long): org.apache.spark.sql.Dataset[Long] def range(end: Long): org.apache.spark.sql.Dataset[Long] scala> sc.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect res1: Array[Long] = Array() scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 9223372036854775806) scala> spark.conf.set("spark.sql.codegen.wholeStage", false) scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect res5: Array[Long] = Array() {code}
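As a rough illustration of why these bounds are treacherous, the Java sketch below shows how wrapping 64-bit arithmetic can make an empty range look non-empty. It is not a claim about how either Spark code path computes its element count; the class and method names are hypothetical.

```java
import java.math.BigInteger;

// Hypothetical helper, not Spark code: two ways to count the elements of
// a range [start, end) with a non-zero step.
final class RangeCount {

    // Naive count in plain long arithmetic; (end - start) wraps silently.
    static long naiveCount(long start, long end, long step) {
        return (end - start) / step;
    }

    // Checks the direction first, then counts in arbitrary precision.
    static BigInteger safeCount(long start, long end, long step) {
        if (step == 0) {
            throw new IllegalArgumentException("step must be non-zero");
        }
        boolean empty = step > 0 ? start >= end : start <= end;
        if (empty) {
            return BigInteger.ZERO;
        }
        BigInteger span = BigInteger.valueOf(end).subtract(BigInteger.valueOf(start));
        // span and step have the same sign here, so the quotient is non-negative
        // and rounding up gives the ceiling.
        BigInteger[] qr = span.divideAndRemainder(BigInteger.valueOf(step));
        return qr[1].signum() == 0 ? qr[0] : qr[0].add(BigInteger.ONE);
    }

    public static void main(String[] args) {
        long start = Long.MAX_VALUE - 3;
        long end = Long.MIN_VALUE + 2;
        System.out.println("naive: " + naiveCount(start, end, 1)); // wraps to a positive value
        System.out.println("safe:  " + safeCount(start, end, 1));  // the range is empty
    }
}
```

With the bounds from the session above, the wrapped difference is a small positive number even though end is far below start, so any count derived from it without a direction check would report elements in a range that should be empty.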
[jira] [Created] (SPARK-21038) Reduce redundant generated init code in Catalyst codegen
Kris Mok created SPARK-21038: Summary: Reduce redundant generated init code in Catalyst codegen Key: SPARK-21038 URL: https://issues.apache.org/jira/browse/SPARK-21038 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Kris Mok Priority: Minor In Java, instance fields are guaranteed to be initialized to their corresponding default values (zero values) before the constructor is invoked. Thus, explicit code that initializes fields to their zero values is redundant and should be avoided. It's usually harmless to have such code in hand-written constructors, but in mechanically generated code it can account for a significant portion of the code size and cause issues. This ticket is a step toward reducing the likelihood of hitting the 64KB bytecode method size limit in Java class files. The proposal is to use some simple heuristics in CodegenContext.addMutableState to filter out redundant code that initializes mutable state to its default (zero) value.
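A minimal Java sketch of the idea follows. The first class shows the default values the language guarantees; the second shows one possible string-matching heuristic for spotting a redundant initializer. The names and the exact matching rules are assumptions for illustration and are not the actual CodegenContext.addMutableState logic.

```java
// Fields with no initializer: Java guarantees zero / false / null.
final class Defaults {
    int count;
    long total;
    boolean seen;
    Object row;
}

// Hypothetical heuristic, not Spark code: decide whether a generated
// initializer only re-assigns the field's default value.
final class MutableStateInit {

    static boolean isRedundantInit(String javaType, String field, String initCode) {
        // Ignore whitespace so "count = 0;" and "count=0;" compare equal.
        String code = initCode.replaceAll("\\s+", "");
        switch (javaType) {
            case "boolean":
                return code.equals(field + "=false;");
            case "byte":
            case "short":
            case "int":
                return code.equals(field + "=0;");
            case "long":
                return code.equals(field + "=0;") || code.equals(field + "=0L;");
            case "char":
            case "float":
            case "double":
                return false; // left alone in this sketch
            default:
                // Anything else is treated as a reference type.
                return code.equals(field + "=null;");
        }
    }

    public static void main(String[] args) {
        Defaults d = new Defaults();
        System.out.println(d.count + " " + d.total + " " + d.seen + " " + d.row);
        System.out.println(isRedundantInit("int", "count", "count = 0;"));
        System.out.println(isRedundantInit("int", "count", "count = -1;"));
    }
}
```

A textual check like this is deliberately conservative: it only drops initializers it recognizes exactly and keeps everything else, which trades some missed savings for never removing an initializer that matters.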
[jira] [Updated] (SPARK-20872) ShuffleExchange.nodeName should handle null coordinator
[ https://issues.apache.org/jira/browse/SPARK-20872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kris Mok updated SPARK-20872: - Description: A ShuffleExchange's coordinator can sometimes be null, and when toString() is called on the ShuffleExchange, the call goes to ShuffleExchange.nodeName() and throws a MatchError there because of a non-exhaustive match -- the match only handles Some and None, but doesn't handle null. An example of this issue is when trying to inspect a Catalyst physical operator on the Executor side in an IDE: {code:none} child = {WholeStageCodegenExec@13881} Method threw 'scala.MatchError' exception. Cannot evaluate org.apache.spark.sql.execution.WholeStageCodegenExec.toString() {code} where this WholeStageCodegenExec transitively references a ShuffleExchange. On the Executor side, this ShuffleExchange instance is deserialized from the data sent over from the Driver, and because the coordinator field is marked transient, it's not carried over to the Executor, which is why it can be null upon inspection. was: A ShuffleExchange's coordinator can sometimes be null, and when toString() is called on the ShuffleExchange, the call goes to ShuffleExchange.nodeName() and throws a MatchError there because of a non-exhaustive match -- the match only handles Some and None, but doesn't handle null. An example of this issue is when trying to inspect a Catalyst physical operator in an IDE: {code:none} child = {WholeStageCodegenExec@13881} Method threw 'scala.MatchError' exception. Cannot evaluate org.apache.spark.sql.execution.WholeStageCodegenExec.toString() {code} where this WholeStageCodegenExec transitively references a ShuffleExchange. 
> ShuffleExchange.nodeName should handle null coordinator > --- > > Key: SPARK-20872 > URL: https://issues.apache.org/jira/browse/SPARK-20872 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kris Mok >Priority: Trivial > Labels: easyfix > > A ShuffleExchange's coordinator can sometimes be null, and when toString() is > called on the ShuffleExchange, the call goes to ShuffleExchange.nodeName() and > throws a MatchError there because of a non-exhaustive match -- the match only > handles Some and None, but doesn't handle null. > An example of this issue is when trying to inspect a Catalyst physical > operator on the Executor side in an IDE: > {code:none} > child = {WholeStageCodegenExec@13881} Method threw 'scala.MatchError' > exception. Cannot evaluate > org.apache.spark.sql.execution.WholeStageCodegenExec.toString() > {code} > where this WholeStageCodegenExec transitively references a ShuffleExchange. > On the Executor side, this ShuffleExchange instance is deserialized from the > data sent over from the Driver, and because the coordinator field is marked > transient, it's not carried over to the Executor, which is why it can be null > upon inspection.
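The transient-field behavior described above can be reproduced in isolation. The Java sketch below is only an analog of the Scala code in question: Exchange, coordinator, and nodeName() are hypothetical stand-ins, and Optional plays the role of Scala's Option. It shows a transient field coming back as null after a serialization round trip, and a nodeName() that tolerates that.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Optional;

// Hypothetical analog, not Spark code.
final class TransientDemo {

    static final class Exchange implements Serializable {
        private static final long serialVersionUID = 1L;

        final String name;
        // transient: skipped by Java serialization, so it is null after readObject().
        transient Optional<String> coordinator;

        Exchange(String name, Optional<String> coordinator) {
            this.name = name;
            this.coordinator = coordinator;
        }

        // Treats a null coordinator the same as an absent one instead of failing.
        String nodeName() {
            Optional<String> c = coordinator == null ? Optional.empty() : coordinator;
            return c.map(id -> name + "(coordinator id: " + id + ")").orElse(name);
        }
    }

    // Serializes and deserializes in memory, as a stand-in for Driver-to-Executor transfer.
    static Exchange roundTrip(Exchange e) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(e);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Exchange) in.readObject();
            }
        } catch (IOException | ClassNotFoundException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Exchange original = new Exchange("Exchange", Optional.of("42"));
        Exchange copy = roundTrip(original);
        System.out.println(original.nodeName());
        System.out.println(copy.coordinator == null);
        System.out.println(copy.nodeName());
    }
}
```

Normalizing null to "absent" at the point of use keeps toString()-style methods safe to call from a debugger on either side of the wire, which is the situation the ticket describes.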
[jira] [Commented] (SPARK-20872) ShuffleExchange.nodeName should handle null coordinator
[ https://issues.apache.org/jira/browse/SPARK-20872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023421#comment-16023421 ] Kris Mok commented on SPARK-20872: -- The matching logic in question in ShuffleExchange.nodeName() was introduced by SPARK-9858. > ShuffleExchange.nodeName should handle null coordinator > --- > > Key: SPARK-20872 > URL: https://issues.apache.org/jira/browse/SPARK-20872 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kris Mok >Priority: Trivial > Labels: easyfix > > A ShuffleExchange's coordinator can sometimes be null, and when toString() is > called on the ShuffleExchange, the call goes to ShuffleExchange.nodeName() and > throws a MatchError there because of a non-exhaustive match -- the match only > handles Some and None, but doesn't handle null. > An example of this issue is when trying to inspect a Catalyst physical > operator in an IDE: > {code:none} > child = {WholeStageCodegenExec@13881} Method threw 'scala.MatchError' > exception. Cannot evaluate > org.apache.spark.sql.execution.WholeStageCodegenExec.toString() > {code} > where this WholeStageCodegenExec transitively references a ShuffleExchange.
[jira] [Created] (SPARK-20872) ShuffleExchange.nodeName should handle null coordinator
Kris Mok created SPARK-20872: Summary: ShuffleExchange.nodeName should handle null coordinator Key: SPARK-20872 URL: https://issues.apache.org/jira/browse/SPARK-20872 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Kris Mok Priority: Trivial A ShuffleExchange's coordinator can sometimes be null, and when toString() is called on the ShuffleExchange, the call goes to ShuffleExchange.nodeName() and throws a MatchError there because of a non-exhaustive match -- the match only handles Some and None, but doesn't handle null. An example of this issue is when trying to inspect a Catalyst physical operator in an IDE: {code:none} child = {WholeStageCodegenExec@13881} Method threw 'scala.MatchError' exception. Cannot evaluate org.apache.spark.sql.execution.WholeStageCodegenExec.toString() {code} where this WholeStageCodegenExec transitively references a ShuffleExchange.
[jira] [Created] (SPARK-20482) Resolving Casts is too strict on having time zone set
Kris Mok created SPARK-20482: Summary: Resolving Casts is too strict on having time zone set Key: SPARK-20482 URL: https://issues.apache.org/jira/browse/SPARK-20482 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Kris Mok