from:"Kris Mok \(Jira\)"

[jira] [Created] (SPARK-43973) Structured Streaming UI should display failed queries correctly

2023-06-05 Thread Kris Mok (Jira)

Kris Mok created SPARK-43973:


 Summary: Structured Streaming UI should display failed queries 
correctly
 Key: SPARK-43973
 URL: https://issues.apache.org/jira/browse/SPARK-43973
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.4.0, 3.3.0, 3.2.0, 3.1.0
Reporter: Kris Mok


The Structured Streaming UI is designed to be able to show a query's status 
(active/finished/failed) and if failed, the error message.
Due to a bug in the implementation, the error message in 
{{QueryTerminatedEvent}} isn't being tracked by the UI data, so in turn the UI 
always shows failed queries as "finished".

Example:
{code:scala}
implicit val ctx = spark.sqlContext
import org.apache.spark.sql.execution.streaming.MemoryStream

spark.conf.set("spark.sql.ansi.enabled", "true")

val inputData = MemoryStream[(Int, Int)]

val df = inputData.toDF().selectExpr("_1 / _2 as a")

inputData.addData((1, 2), (3, 4), (5, 6), (7, 0))
val testQuery = 
df.writeStream.format("memory").queryName("kristest").outputMode("append").start
testQuery.processAllAvailable()
{code}

Here we intentionally fail a query, but the Spark UI's Structured Streaming tab 
would show this as "FINISHED" without any errors, which is wrong.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-42851) EquivalentExpressions methods need to be consistently guarded by supportedExpression

2023-03-17 Thread Kris Mok (Jira)

Kris Mok created SPARK-42851:


 Summary: EquivalentExpressions methods need to be consistently 
guarded by supportedExpression
 Key: SPARK-42851
 URL: https://issues.apache.org/jira/browse/SPARK-42851
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.2, 3.4.0
Reporter: Kris Mok


SPARK-41468 tried to fix a bug but introduced a new regression. Its change to 
{{EquivalentExpressions}} added a {{supportedExpression()}} guard to the 
{{addExprTree()}} and {{getExprState()}} methods, but didn't add the same guard 
to the other "add" entry point -- {{addExpr()}}.

As such, uses that add single expressions to CSE via {{addExpr()}} may succeed, 
but upon retrieval via {{getExprState()}} it'd inconsistently get a {{None}} 
due to failing the guard.

We need to make sure the "add" and "get" methods are consistent. It could be 
done by one of:
1. Adding the same {{supportedExpression()}} guard to {{addExpr()}}, or
2. Removing the guard from {{getExprState()}}, relying solely on the guard on 
the "add" path to make sure only intended state is added.
(or other alternative refactorings to fuse the guard into various methods to 
make it more efficient)

There are pros and cons to the two directions above, because {{addExpr()}} used 
to allow (potentially incorrect) more expressions to get CSE'd, making it more 
restrictive may cause performance regressions (for the cases that happened to 
work).

Example:
{code:sql}
select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) 
from range(2)
{code}

Running this query on Spark 3.2 branch returns the correct value:
{code}
scala> spark.sql("select max(transform(array(id), x -> x)), 
max(transform(array(id), x -> x)) from range(2)").collect
res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(1),WrappedArray(1)])
{code}
Here, {{transform(array(id), x -> x)}} is an {{AggregateExpression}} that was 
(potentially unsafely) recognized by {{addExpr()}} as a common subexpression, 
and {{getExprState()}} doesn't do extra guarding, so during physical planning, 
in {{PhysicalAggregation}} this expression gets CSE'd in both the aggregation 
expression list and the result expressions list.
{code}
AdaptiveSparkPlan isFinalPlan=false
+- SortAggregate(key=[], functions=[max(transform(array(id#0L), 
lambdafunction(lambda x#1L, lambda x#1L, false)))])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11]
  +- SortAggregate(key=[], functions=[partial_max(transform(array(id#0L), 
lambdafunction(lambda x#1L, lambda x#1L, false)))])
 +- Range (0, 2, step=1, splits=16)
{code}

Running the same query on current master triggers an error when binding the 
result expression to the aggregate expression in the Aggregate operators (for a 
WSCG-enabled operator like {{HashAggregateExec}}, the same error would show up 
during codegen):
{code}
ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 
16) (ip-10-110-16-93.us-west-2.compute.internal executor driver): 
java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), 
lambdafunction(lambda x#2L, lambda x#2L, false)))#4 in 
[max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, 
false)))#3]
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:517)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1249)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1248)
at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:532)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:517)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:456)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94)
at

[jira] [Created] (SPARK-40380) Constant-folding of InvokeLike should not result in non-serializable result

2022-09-07 Thread Kris Mok (Jira)

Kris Mok created SPARK-40380:


 Summary: Constant-folding of InvokeLike should not result in 
non-serializable result
 Key: SPARK-40380
 URL: https://issues.apache.org/jira/browse/SPARK-40380
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Kris Mok


SPARK-37907 added constant-folding support to the {{InvokeLike}} family of 
expressions. Unfortunately it introduced a regression for cases when a 
constant-folded {{InvokeLike}} expression returned a non-serializable result. 
{{ExpressionEncoder}}s is an area where this problem may be exposed, e.g. when 
using sparksql-scalapb on Spark 3.3.0+.

Below is a minimal repro to demonstrate this issue:
{code:scala}
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, StaticInvoke}
import org.apache.spark.sql.types.{LongType, ObjectType}
class NotSerializableBoxedLong(longVal: Long) { def add(other: Long): Long = 
longVal + other }
case class SerializableBoxedLong(longVal: Long) { def toNotSerializable(): 
NotSerializableBoxedLong = new NotSerializableBoxedLong(longVal) }
val litExpr = Literal.fromObject(SerializableBoxedLong(42L), 
ObjectType(classOf[SerializableBoxedLong]))
val toNotSerializableExpr = Invoke(litExpr, "toNotSerializable", 
ObjectType(classOf[NotSerializableBoxedLong]))
val addExpr = Invoke(toNotSerializableExpr, "add", LongType, 
Seq(UnresolvedAttribute.quotedString("id")))
val df = spark.range(2).select(new Column(addExpr))
df.collect
{code}

Before SPARK-37907, this example would run fine and result in {{[[42], [43]]}}. 
But after SPARK-37907, it'd fail with:
{code:none}
...
Caused by: java.io.NotSerializableException: NotSerializableBoxedLong
Serialization stack:
- object not serializable (class: NotSerializableBoxedLong, value: 
NotSerializableBoxedLong@71231636)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 2)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, 
type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, 
SerializedLambda[capturingClass=class 
org.apache.spark.sql.execution.WholeStageCodegenExec, 
functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
 implementation=invokeStatic 
org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
 
instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
 numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class 
org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389, 
org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389@185db22c)
  at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
  at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49)
  at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115)
  at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:441)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40303) The performance will be worse after codegen

2022-09-02 Thread Kris Mok (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599616#comment-17599616
 ] 

Kris Mok commented on SPARK-40303:
--

Nice findings [~LuciferYang]!

{quote}
After some experiments, I found when the number of parameters exceeds 50, the 
performance of the case in the Jira description will significant deterioration.
{quote}
Sounds reasonable. Note that in a STD compilation, the only things that need to 
be live at the method entry are the method parameters (both implicit ones like 
{{this}}, and explicit ones); however, for an OSR compilation, it would be all 
of the parameters/local variables that are live at the loop entry point, so in 
this case both the {{doConsume}} parameters and the local variables contribute 
to the problem.

Just FYI I have an old write on PrintCompilation and OSR here: 
https://gist.github.com/rednaxelafx/1165804#file-notes-md
(Gee, just realized that was from 11 years ago...)

{quote}
maybe try to make the input parameters of the `doConsume` method fixed length 
will help, such as using a List or Array
{quote}
Welp, hoisting the parameters into an Arguments object is rather common in 
"code splitting" in code generators. Since we're already doing codegen, it's 
possible to generate tailor-made Arguments classes to retain the type 
information. Using List/Array would require extra boxing for primitive types 
and it's less ideal.
(An array-based box is already used in Spark SQL's codegen in the form of the 
{{references}} array. Indeed the type info is lost on the interface level and 
you'd have to do a cast when you get data out of it. It's still usable though.)

> The performance will be worse after codegen
> ---
>
> Key: SPARK-40303
> URL: https://issues.apache.org/jira/browse/SPARK-40303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> val dir = "/tmp/spark/benchmark"
> val N = 200
> val columns = Range(0, 100).map(i => s"id % $i AS id$i")
> spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)
> // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
> Seq(60).foreach{ cnt =>
>   val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => 
> s"count(distinct $c)")
>   val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 
> 1)
>   benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "true",
>   "spark.sql.codegen.factoryMode" -> "FALLBACK") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "false",
>   "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.run()
> }
> {code}
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Benchmark count distinct: Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> 60 count distinct with codegen   628146 628146
>0  0.0  314072.8   1.0X
> 60 count distinct without codegen147635 147635
>0  0.0   73817.5   4.3X
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-40303) The performance will be worse after codegen

2022-09-02 Thread Kris Mok (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599302#comment-17599302
 ] 

Kris Mok commented on SPARK-40303:
--

Interesting, thanks for posting your findings, [~LuciferYang]!

In the JDK8 case, the message:
{code}
COMPILE SKIPPED: unsupported calling sequence (not retryable)
{code}
Indicates that the compiler deemed this method should not be attempted to 
compile again on any tier of compilation, and because this is an OSR 
compilation (i.e. loop compilation), this will mark the method as "never try to 
perform OSR compilation again on all tiers". On JDK17 this is relaxed a bit and 
it'll try tier 1.

Note that OSR compilation and STD compilation (normal compilation) are separate 
things. Marking one as not compilation doesn't affect the other. I haven't 
checked the IR yet but if I had to guess, the reason why this method is 
recorded as not OSR compilable is because there are too many live local 
variables at the OSR (loop) entry point, beyond what the HotSpot JVM could 
support.
So only OSR is affected, STD should still be fine.

In general, the tiered compilation system in the HotSpot JVM works as:
- tier 0: interpreter
- tier 1: C1 no profiling (best code quality for C1, same as HotSpot Client 
VM's C1)
- tier 2: C1 basic profiling (lower performance than tier 1, only used when the 
target level is tier 3 but the C1 queue is too long)
- tier 3: C1 full profiling (lower performance than tier 2, for collecting 
profile to perform tier 4 profile-guided optimization)
- tier 4: C2

As such, tiers 2 and 3 are only useful if tier 4 is available. So if a method 
is recorded as "not compilable on tier 4", the only realistic option left is to 
try tier 1.

> The performance will be worse after codegen
> ---
>
> Key: SPARK-40303
> URL: https://issues.apache.org/jira/browse/SPARK-40303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> val dir = "/tmp/spark/benchmark"
> val N = 200
> val columns = Range(0, 100).map(i => s"id % $i AS id$i")
> spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)
> // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
> Seq(60).foreach{ cnt =>
>   val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => 
> s"count(distinct $c)")
>   val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 
> 1)
>   benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "true",
>   "spark.sql.codegen.factoryMode" -> "FALLBACK") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "false",
>   "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.run()
> }
> {code}
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Benchmark count distinct: Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> 60 count distinct with codegen   628146 628146
>0  0.0  314072.8   1.0X
> 60 count distinct without codegen147635 147635
>0  0.0   73817.5   4.3X
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-39839) Handle special case of null variable-length Decimal with non-zero offsetAndSize in UnsafeRow structural integrity check

2022-07-22 Thread Kris Mok (Jira)

Kris Mok created SPARK-39839:


 Summary: Handle special case of null variable-length Decimal with 
non-zero offsetAndSize in UnsafeRow structural integrity check
 Key: SPARK-39839
 URL: https://issues.apache.org/jira/browse/SPARK-39839
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0, 3.2.0, 3.1.0
Reporter: Kris Mok


The {{UnsafeRow}} structural integrity check in 
{{UnsafeRowUtils.validateStructuralIntegrity}} is added in Spark 3.1.0. It’s 
supposed to validate that a given {{UnsafeRow}} conforms to the format that the 
{{UnsafeRowWriter}} would have produced.

Currently the check expects all fields that are marked as null should also have 
its field (i.e. the fixed-length part) set to all zeros. It needs to be updated 
to handle a special case for variable-length {{{}Decimal{}}}s, where the 
{{UnsafeRowWriter}} may mark a field as null but also leave the fixed-length 
part of the field as {{OffsetAndSize(offset=current_offset, size=0)}}. This may 
happen when the {{Decimal}} being written is either a real {{null}} or has 
overflowed the specified precision.

Logic in {{UnsafeRowWriter}}:

in general:
{code:scala}
  public void setNullAt(int ordinal) {
    BitSetMethods.set(getBuffer(), startingOffset, ordinal); // set null bit
    write(ordinal, 0L);                                      // also zero out 
the fixed-length field
  } {code}
special case for {{DecimalType}}:
{code:scala}
      // Make sure Decimal object has the same scale as DecimalType.
      // Note that we may pass in null Decimal object to set null for it.
      if (input == null || !input.changePrecision(precision, scale)) {
        BitSetMethods.set(getBuffer(), startingOffset, ordinal); // set null bit
        // keep the offset for future update
        setOffsetAndSize(ordinal, 0);                            // doesn't 
zero out the fixed-length field
      } {code}
The special case is introduced to allow all {{DecimalType}}s (including both 
fixed-length and variable-length ones) to be mutable – thus need to leave space 
for the variable-length field even if it’s currently null.

Note that this special case in {{UnsafeRowWriter}} has been there since Spark 
1.6.0, where as the integrity check was added in Spark 3.1.0. The check was 
originally added for Structured Streaming’s checkpoint evolution validation, so 
that a newer version of Spark can check whether or not an older checkpoint file 
for Structured Streaming queries can be supported, and/or if the contents of 
the checkpoint file is corrupted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-39140) JavaSerializer doesn't serialize the fields of superclass

2022-05-10 Thread Kris Mok (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-39140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534633#comment-17534633
 ] 

Kris Mok commented on SPARK-39140:
--

This isn't a "bug" in Spark's {{JavaSerializer}} per-se, but a feature that's 
working by-design that could be overlooked by the users of this class.

The underlying issue can be demonstrated with a simple tweak to the code 
snippet in the issue description:
{code:scala}
abstract class AA extends Serializable {
  val ts = System.nanoTime()
}
case class BB(x: Int) extends AA {
}
{code}
Now run the rest of the original code snipper and you should see that 
{{input.ts}} and {{obj1.ts}} are now the same, i.e. it's correctly serialized 
and then deserialized.

What is it telling us? Well, in Scala, {{case class}} imply implementing the 
{{scala.Serializable}} trait, which has the effect of {{java.io.Serializable}}. 
The {{JavaSerializer}} uses Java's standard library's builtin serialization 
mechanism, which specifies that only fields on {{Serializable}} classes are 
serialized by default. Effectively, you can have a subclass that is 
serializable, but fields on all of its non-serializable supertypes will just be 
ignored.

There are ways to customize the Java serialization mechanism to make subclasses 
also write out fields from the superclass, but there's no builtin way to do 
that -- you have to roll your own code. Scala doesn't help here either.

Reflection-based third-party serializers, like Kryo in this example, usually 
ignores the language-level serialization markers and just serializes everything 
-- except for those explicitly marked as transient (excluded from 
serialization). That's how {{KryoSerializer}} "works" here.

https://docs.oracle.com/javase/8/docs/platform/serialization/spec/serial-arch.html#a4176
{quote}Special handling is required for arrays, enum constants, and objects of 
type Class, ObjectStreamClass, and String. Other objects must implement either 
the Serializable or the Externalizable interface to be saved in or restored 
from a stream.{quote}

and 
https://docs.oracle.com/javase/8/docs/platform/serialization/spec/output.html#a861
{quote}Each subclass of a serializable object may define its own writeObject 
method. If a class does not implement the method, the default serialization 
provided by defaultWriteObject will be used. When implemented, the class is 
only responsible for writing its own fields, not those of its supertypes or 
subtypes.{quote}

and https://docs.oracle.com/javase/8/docs/api/java/io/ObjectOutputStream.html
{quote}Serialization does not write out the fields of any object that does not 
implement the java.io.Serializable interface. Subclasses of Objects that are 
not serializable can be serializable. In this case the non-serializable class 
must have a no-arg constructor to allow its fields to be initialized. In this 
case it is the responsibility of the subclass to save and restore the state of 
the non-serializable class. It is frequently the case that the fields of that 
class are accessible (public, package, or protected) or that there are get and 
set methods that can be used to restore the state.{quote}

> JavaSerializer doesn't serialize the fields of superclass
> -
>
> Key: SPARK-39140
> URL: https://issues.apache.org/jira/browse/SPARK-39140
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Priority: Major
>
> To reproduce:
>  
> {code:java}
> abstract class AA {
>   val ts = System.nanoTime()
> }
> case class BB(x: Int) extends AA {
> }
> val input = BB(1)
> println("original ts: " + input.ts)
> val javaSerializer = new JavaSerializer(new SparkConf())
> val javaInstance = javaSerializer.newInstance()
> val bytes1 = javaInstance.serialize[BB](input)
> val obj1 = javaInstance.deserialize[BB](bytes1)
> println("deserialization result from java: " + obj1.ts)
> val kryoSerializer = new KryoSerializer(new SparkConf())
> val kryoInstance = kryoSerializer.newInstance()
> val bytes2 = kryoInstance.serialize[BB](input)
> val obj2 = kryoInstance.deserialize[BB](bytes2)
> println("deserialization result from kryo: " + obj2.ts) {code}
>  
>  
> The output is
>  
> {code:java}
> original ts: 115014173658666
> deserialization result from java: 115014306794333
> deserialization result from kryo: 115014173658666{code}
>  
> We can see that the fields from the superclass AA are not serialized with 
> JavaSerializer. When switching to KryoSerializer, it works.
> This caused bugs in the project SPARK-38615: TreeNode.origin with actual 
> information is not serialized to executors when a plan can't be executed with 
> whole-staged-codegen.
> It could also lead to bugs in serializing the lambda function within RDD API 
> like 
>

[jira] [Updated] (SPARK-34607) NewInstance.resolved should not throw malformed class name error

2021-03-03 Thread Kris Mok (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-34607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kris Mok updated SPARK-34607:
-
Description: 
I'd like to seek for community help on fixing this issue:

Related to SPARK-34596, one of our users had hit an issue with 
{{ExpressionEncoder}} when running Spark code in a Scala REPL, where 
{{NewInstance.resolved}} was throwing {{"Malformed class name"}} error, with 
the following kind of stack trace:
{code}
java.lang.InternalError: Malformed class name
at java.lang.Class.getSimpleBinaryName(Class.java:1450)
at java.lang.Class.isMemberClass(Class.java:1433)
at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved$lzycompute(objects.scala:447)
at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved(objects.scala:441)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionBottomUp(Analyzer.scala:1935)
...
 Caused by: sbt.ForkMain$ForkError: java.lang.StringIndexOutOfBoundsException: 
String index out of range: -83
at java.lang.String.substring(String.java:1931)
at java.lang.Class.getSimpleBinaryName(Class.java:1448)
at java.lang.Class.isMemberClass(Class.java:1433)
at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved$lzycompute(objects.scala:447)
at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved(objects.scala:441)
  ...
{code}

The most important point in the stack trace is this:
{code}
java.lang.InternalError: Malformed class name
at java.lang.Class.getSimpleBinaryName(Class.java:1450)
at java.lang.Class.isMemberClass(Class.java:1433)
{code}
The most common way to hit the {{"Malformed class name"}} issue in Spark is via 
{{java.lang.Class.getSimpleName}}. But as this stack trace demonstrates, it can 
happen via other code paths from the JDK as well.

If we want to fix it in a similar fashion as {{Utils.getSimpleName}}, we'd have 
to emulate {{java.lang.Class.isMemberClass}} in Spark's {{Utils}}, and then use 
it in the {{NewInstance.resolved}} code path.

Here's a reproducer test case (in diff form against Spark master's 
{{ExpressionEncoderSuite}} ), which uses explicit nested classes to emulate the 
code structure that'd be generated by Scala's REPL:
(the level of nesting can be further reduced and still reproduce the issue)
{code}
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
index 2635264..fd1b23d 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
@@ -217,6 +217,95 @@ class ExpressionEncoderSuite extends 
CodegenInterpretedPlanTest with AnalysisTes
   "nested Scala class should work")
   }
 
+  object OuterLevelWithVeryVeryVeryLongClassName1 {
+object OuterLevelWithVeryVeryVeryLongClassName2 {
+  object OuterLevelWithVeryVeryVeryLongClassName3 {
+object OuterLevelWithVeryVeryVeryLongClassName4 {
+  object OuterLevelWithVeryVeryVeryLongClassName5 {
+object OuterLevelWithVeryVeryVeryLongClassName6 {
+  object OuterLevelWithVeryVeryVeryLongClassName7 {
+object OuterLevelWithVeryVeryVeryLongClassName8 {
+  object OuterLevelWithVeryVeryVeryLongClassName9 {
+object OuterLevelWithVeryVeryVeryLongClassName10 {
+  object OuterLevelWithVeryVeryVeryLongClassName11 {
+object OuterLevelWithVeryVeryVeryLongClassName12 {
+  object OuterLevelWithVeryVeryVeryLongClassName13 {
+object OuterLevelWithVeryVeryVeryLongClassName14 {
+  object OuterLevelWithVeryVeryVeryLongClassName15 
{
+object 
OuterLevelWithVeryVeryVeryLongClassName16 {
+  object 
OuterLevelWithVeryVeryVeryLongClassName17 {
+object 
OuterLevelWithVeryVeryVeryLongClassName18 {
+  object 
OuterLevelWithVeryVeryVeryLongClassName19 {
+object 
OuterLevelWithVeryVeryVeryLongClassName20 {
+  case class MalformedNameExample2(x: 
Int)
+}
+  }
+}
+  }
+}
+  }
+}
+  }
+}
+  }
+}
+

[jira] [Updated] (SPARK-34607) NewInstance.resolved should not throw malformed class name error

2021-03-03 Thread Kris Mok (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-34607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kris Mok updated SPARK-34607:
-
Description: 
I'd like to seek for community help on fixing this issue:

Related to SPARK-34596, one of our users had hit an issue with 
{{ExpressionEncoder}} when running Spark code in a Scala REPL, where 
{{NewInstance.resolved}} was throwing {{"Malformed class name"}} error, with 
the following kind of stack trace:
{code}
java.lang.InternalError: Malformed class name
at java.lang.Class.getSimpleBinaryName(Class.java:1450)
at java.lang.Class.isMemberClass(Class.java:1433)
at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved$lzycompute(objects.scala:447)
at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved(objects.scala:441)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionBottomUp(Analyzer.scala:1935)
...
 Caused by: sbt.ForkMain$ForkError: java.lang.StringIndexOutOfBoundsException: 
String index out of range: -83
at java.lang.String.substring(String.java:1931)
at java.lang.Class.getSimpleBinaryName(Class.java:1448)
at java.lang.Class.isMemberClass(Class.java:1433)
at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved$lzycompute(objects.scala:447)
at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved(objects.scala:441)
  ...
{code}

The most important point in the stack trace is this:
{code}
java.lang.InternalError: Malformed class name
at java.lang.Class.getSimpleBinaryName(Class.java:1450)
at java.lang.Class.isMemberClass(Class.java:1433)
{code}
The most common way to hit the {{"Malformed class name"}} issue in Spark is via 
{{java.lang.Class.getSimpleName}}. But as this stack trace demonstrates, it can 
happen via other code paths from the JDK as well.

If we want to fix it in a similar fashion as {{Utils.getSimpleName}}, we'd have 
to emulate {{java.lang.Class.isMemberClass}} in Spark's {{Utils}}, and then use 
it in the {{NewInstance.resolved}} code path.

Here's a reproducer test case (in diff form against Spark master's 
{{ExpressionEncoderSuite}} ):
(the level of nesting can be further reduced and still reproduce the issue)
{code}
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
index 2635264..fd1b23d 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
@@ -217,6 +217,95 @@ class ExpressionEncoderSuite extends 
CodegenInterpretedPlanTest with AnalysisTes
   "nested Scala class should work")
   }
 
+  object OuterLevelWithVeryVeryVeryLongClassName1 {
+object OuterLevelWithVeryVeryVeryLongClassName2 {
+  object OuterLevelWithVeryVeryVeryLongClassName3 {
+object OuterLevelWithVeryVeryVeryLongClassName4 {
+  object OuterLevelWithVeryVeryVeryLongClassName5 {
+object OuterLevelWithVeryVeryVeryLongClassName6 {
+  object OuterLevelWithVeryVeryVeryLongClassName7 {
+object OuterLevelWithVeryVeryVeryLongClassName8 {
+  object OuterLevelWithVeryVeryVeryLongClassName9 {
+object OuterLevelWithVeryVeryVeryLongClassName10 {
+  object OuterLevelWithVeryVeryVeryLongClassName11 {
+object OuterLevelWithVeryVeryVeryLongClassName12 {
+  object OuterLevelWithVeryVeryVeryLongClassName13 {
+object OuterLevelWithVeryVeryVeryLongClassName14 {
+  object OuterLevelWithVeryVeryVeryLongClassName15 
{
+object 
OuterLevelWithVeryVeryVeryLongClassName16 {
+  object 
OuterLevelWithVeryVeryVeryLongClassName17 {
+object 
OuterLevelWithVeryVeryVeryLongClassName18 {
+  object 
OuterLevelWithVeryVeryVeryLongClassName19 {
+object 
OuterLevelWithVeryVeryVeryLongClassName20 {
+  case class MalformedNameExample2(x: 
Int)
+}
+  }
+}
+  }
+}
+  }
+}
+  }
+}
+  }
+}
+  }
+}
+  }
+}
+  }
+}
+  }
+}
+  }

[jira] [Created] (SPARK-34607) NewInstance.resolved should not throw malformed class name error

2021-03-03 Thread Kris Mok (Jira)

Kris Mok created SPARK-34607:


 Summary: NewInstance.resolved should not throw malformed class 
name error
 Key: SPARK-34607
 URL: https://issues.apache.org/jira/browse/SPARK-34607
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1, 3.0.2, 2.4.7
Reporter: Kris Mok


I'd like to seek for community help on fixing this issue:

Related to SPARK-34596, one of our users had hit an issue with 
{{ExpressionEncoder}} when running Spark code in a Scala REPL, where 
{{NewInstance.resolved}} was throwing {{"Malformed class name"}} error, with 
the following kind of stack trace:
{code}
java.lang.InternalError: Malformed class name
at java.lang.Class.getSimpleBinaryName(Class.java:1450)
at java.lang.Class.isMemberClass(Class.java:1433)
at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved$lzycompute(objects.scala:447)
at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved(objects.scala:441)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionBottomUp(Analyzer.scala:1935)
...
 Caused by: sbt.ForkMain$ForkError: java.lang.StringIndexOutOfBoundsException: 
String index out of range: -83
at java.lang.String.substring(String.java:1931)
at java.lang.Class.getSimpleBinaryName(Class.java:1448)
at java.lang.Class.isMemberClass(Class.java:1433)
at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved$lzycompute(objects.scala:447)
at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.resolved(objects.scala:441)
  ...
{code}

The most important point in the stack trace is this:
{code}
java.lang.InternalError: Malformed class name
at java.lang.Class.getSimpleBinaryName(Class.java:1450)
at java.lang.Class.isMemberClass(Class.java:1433)
{code}
The most common way to hit the {{"Malformed class name"}} issue in Spark is via 
{{java.lang.Class.getSimpleName}}. But as this stack trace demonstrates, it can 
happen via other code paths from the JDK as well.

If we want to fix it in a similar fashion as {{Utils.getSimpleName}}, we'd have 
to emulate {{java.lang.Class.isMemberClass}} in Spark's {{Utils}}, and then use 
it in the {{NewInstance.resolved}} code path.

Here's a reproducer test case (in diff form against Spark master's 
{{ExpressionEncoderSuite}} ):
{code}
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
index 2635264..fd1b23d 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
@@ -217,6 +217,95 @@ class ExpressionEncoderSuite extends 
CodegenInterpretedPlanTest with AnalysisTes
   "nested Scala class should work")
   }
 
+  object OuterLevelWithVeryVeryVeryLongClassName1 {
+object OuterLevelWithVeryVeryVeryLongClassName2 {
+  object OuterLevelWithVeryVeryVeryLongClassName3 {
+object OuterLevelWithVeryVeryVeryLongClassName4 {
+  object OuterLevelWithVeryVeryVeryLongClassName5 {
+object OuterLevelWithVeryVeryVeryLongClassName6 {
+  object OuterLevelWithVeryVeryVeryLongClassName7 {
+object OuterLevelWithVeryVeryVeryLongClassName8 {
+  object OuterLevelWithVeryVeryVeryLongClassName9 {
+object OuterLevelWithVeryVeryVeryLongClassName10 {
+  object OuterLevelWithVeryVeryVeryLongClassName11 {
+object OuterLevelWithVeryVeryVeryLongClassName12 {
+  object OuterLevelWithVeryVeryVeryLongClassName13 {
+object OuterLevelWithVeryVeryVeryLongClassName14 {
+  object OuterLevelWithVeryVeryVeryLongClassName15 
{
+object 
OuterLevelWithVeryVeryVeryLongClassName16 {
+  object 
OuterLevelWithVeryVeryVeryLongClassName17 {
+object 
OuterLevelWithVeryVeryVeryLongClassName18 {
+  object 
OuterLevelWithVeryVeryVeryLongClassName19 {
+object 
OuterLevelWithVeryVeryVeryLongClassName20 {
+  case class MalformedNameExample2(x: 
Int)
+}
+  }
+}
+  }
+}
+  }
+}
+  }
+}
+  }
+

[jira] [Created] (SPARK-34596) NewInstance.doGenCode should not throw malformed class name error

2021-03-02 Thread Kris Mok (Jira)

Kris Mok created SPARK-34596:


 Summary: NewInstance.doGenCode should not throw malformed class 
name error
 Key: SPARK-34596
 URL: https://issues.apache.org/jira/browse/SPARK-34596
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.2, 2.4.7, 3.1.0
Reporter: Kris Mok


Similar to SPARK-32238 and SPARK-32999, the use of 
{{java.lang.Class.getSimpleName}} in {{NewInstance.doGenCode}} is problematic 
because Scala classes may trigger {{java.lang.InternalError: Malformed class 
name}}.

This happens more often when using nested classes in Scala (or declaring 
classes in Scala REPL which implies class nesting).

Note that on newer versions of JDK the underlying malformed class name no 
longer reproduces (fixed in the JDK by 
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8057919), so it's less of 
an issue there. But on JDK8u this problem still exists so we still have to fix 
it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-32999) TreeNode.nodeName should not throw malformed class name error

2020-09-25 Thread Kris Mok (Jira)

Kris Mok created SPARK-32999:


 Summary: TreeNode.nodeName should not throw malformed class name 
error
 Key: SPARK-32999
 URL: https://issues.apache.org/jira/browse/SPARK-32999
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 2.4.0, 3.1.0
Reporter: Kris Mok


Similar to SPARK-32238, the use of {{java.lang.Class.getSimpleName}} in 
{{TreeNode.nodeName}} is problematic because Scala classes may trigger 
{{java.lang.InternalError: Malformed class name}}.

This happens more often when using nested classes in Scala (or declaring 
classes in Scala REPL which implies class nesting).

Note that on newer versions of JDK the underlying malformed class name no 
longer reproduces, so it's less of an issue there. But on JDK8u this problem 
still exists so we still have to fix it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-31399) Closure cleaner broken in Scala 2.12

2020-04-30 Thread Kris Mok (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096191#comment-17096191
 ] 

Kris Mok edited comment on SPARK-31399 at 4/30/20, 6:32 AM:


Hi [~dongjoon], I've been working on a fix of this issue and will send out a 
WIP PR as soon as possible. I've pretty much done an analysis of the situation 
in parallel to [~joshrosen]'s analysis above and have arrived at very similar 
conclusions.

The fact is, Scala 2.12+'s indylambda (aka LMF-based closures) does still have 
an equivalent of an {{"$outer"}}, just under a different name. Thus the logic 
inside the {{ClosureCleaner}} for Scala 2.11 support has to be ported basically 
verbatim to Scala 2.12+/indylambda. That's exactly what I'm working on right 
now, and it's the main contents of the WIP PR.

A separate issue is that the test coverage of {{ClosureCleaner}} in the Spark 
repo is very insufficient. Neither {{ClosureCleanerSuite}} nor 
{{ClosureCleanerSuite2}} cover anything related to the Scala REPL. There needs 
to be a separate suite, similar to {{ReplSuite}}, that fires up an actual Scala 
REPL and trigger ClosureCleaner in it to bridge the gap in test coverage. I 
will do that as a second step of the PR, and once the new test suite is in, the 
PR can be considered complete and ready for final review.


was (Author: rednaxelafx):
Hi [~dongjoon], I've been working on a fix of this issue and will send out a 
WIP PR as soon as possible. I've pretty much done an analysis of the situation 
in parallel to [~joshrosen]'s analysis above and have arrived at very similar 
conclusions.

The fact is, Scala 2.12+'s indylambda (aka LMF-based closures) does still have 
an equivalent of an "$outer", just under a different name. Thus the logic 
inside the `ClosureCleaner` for Scala 2.11 support has to be ported basically 
verbatim to Scala 2.12+/indylambda. That's exactly what I'm working on right 
now, and it's the main contents of the WIP PR.

A separate issue is that the test coverage of ClosureCleaner in the Spark repo 
is very insufficient. There needs to be a separate suite, similar to 
`ReplSuite`, that fires up an actual Scala REPL and trigger ClosureCleaner in 
it to bridge the gap in test coverage. I will do that as a second step of the 
PR, and once the new test suite is in, the PR can be considered complete and 
ready for final review.

> Closure cleaner broken in Scala 2.12
> 
>
> Key: SPARK-31399
> URL: https://issues.apache.org/jira/browse/SPARK-31399
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Kris Mok
>Priority: Blocker
>
> The `ClosureCleaner` only support Scala functions and it uses the following 
> check to catch closures
> {code}
>   // Check whether a class represents a Scala closure
>   private def isClosure(cls: Class[_]): Boolean = {
> cls.getName.contains("$anonfun$")
>   }
> {code}
> This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala 
> functions become Java lambdas.
> As an example, the following code works well in Spark 2.4 Spark Shell:
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> import org.apache.spark.sql.functions.lit
> defined class Foo
> col: org.apache.spark.sql.Column = 123
> df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20
> {code}
> But fails in 3.0
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2371)
>   at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:421)
>   ... 39 elided
> Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column
> Serialization stack:
>   - object not serializable (class: org.apache.spark.sql.Column, value: 
> 123)
>   -

[jira] [Commented] (SPARK-31399) Closure cleaner broken in Scala 2.12

2020-04-30 Thread Kris Mok (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096191#comment-17096191
 ] 

Kris Mok commented on SPARK-31399:
--

Hi [~dongjoon], I've been working on a fix of this issue and will send out a 
WIP PR as soon as possible. I've pretty much done an analysis of the situation 
in parallel to [~joshrosen]'s analysis above and have arrived at very similar 
conclusions.

The fact is, Scala 2.12+'s indylambda (aka LMF-based closures) does still have 
an equivalent of an "$outer", just under a different name. Thus the logic 
inside the `ClosureCleaner` for Scala 2.11 support has to be ported basically 
verbatim to Scala 2.12+/indylambda. That's exactly what I'm working on right 
now, and it's the main contents of the WIP PR.

A separate issue is that the test coverage of ClosureCleaner in the Spark repo 
is very insufficient. There needs to be a separate suite, similar to 
`ReplSuite`, that fires up an actual Scala REPL and trigger ClosureCleaner in 
it to bridge the gap in test coverage. I will do that as a second step of the 
PR, and once the new test suite is in, the PR can be considered complete and 
ready for final review.

> Closure cleaner broken in Scala 2.12
> 
>
> Key: SPARK-31399
> URL: https://issues.apache.org/jira/browse/SPARK-31399
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Kris Mok
>Priority: Blocker
>
> The `ClosureCleaner` only support Scala functions and it uses the following 
> check to catch closures
> {code}
>   // Check whether a class represents a Scala closure
>   private def isClosure(cls: Class[_]): Boolean = {
> cls.getName.contains("$anonfun$")
>   }
> {code}
> This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala 
> functions become Java lambdas.
> As an example, the following code works well in Spark 2.4 Spark Shell:
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> import org.apache.spark.sql.functions.lit
> defined class Foo
> col: org.apache.spark.sql.Column = 123
> df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20
> {code}
> But fails in 3.0
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2371)
>   at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:421)
>   ... 39 elided
> Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column
> Serialization stack:
>   - object not serializable (class: org.apache.spark.sql.Column, value: 
> 123)
>   - field (class: $iw, name: col, type: class org.apache.spark.sql.Column)
>   - object (class $iw, $iw@2d87ac2b)
>   - element of array (index: 0)
>   - array (class [Ljava.lang.Object;, size 1)
>   - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, 
> type: class [Ljava.lang.Object;)
>   - object (class java.lang.invoke.SerializedLambda, 
> SerializedLambda[capturingClass=class $iw, 
> functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, 
> instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1])
>   - writeReplace data (class: java.lang.invoke.SerializedLambda)
>   - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43)
>   at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at 
>

[jira] [Created] (SPARK-31240) Constant fold deterministic Scala UDFs with foldable arguments

2020-03-24 Thread Kris Mok (Jira)

Kris Mok created SPARK-31240:


 Summary: Constant fold deterministic Scala UDFs with foldable 
arguments
 Key: SPARK-31240
 URL: https://issues.apache.org/jira/browse/SPARK-31240
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kris Mok


Constant fold deterministic Scala UDFs with foldable arguments, conservatively.

ScalaUDFs that meet all following criteria are subject to constant folding in 
this feature:
* deterministic
* all arguments are foldable
* does not throw an exception when evaluating the UDF for constant folding



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-31187) Sort the whole-stage codegen debug output by codegenStageId

2020-03-19 Thread Kris Mok (Jira)

Kris Mok created SPARK-31187:


 Summary: Sort the whole-stage codegen debug output by 
codegenStageId
 Key: SPARK-31187
 URL: https://issues.apache.org/jira/browse/SPARK-31187
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 3.0.0
Reporter: Kris Mok


Spark SQL's whole-stage codegen (WSCG) supports dumping the generated code to 
help with debugging. One way to get the generated code is through 
{{df.queryExecution.debug.codegen}}, or SQL {{explain codegen}} statement.

The generated code is currently printed without specific ordering, which can 
make debugging a bit annoying. This ticket tracks a minor improvement to sort 
the codegen dump by the {{codegenStageId}}, ascending.

After this change, the following query:
{code}
spark.range(10).agg(sum('id)).queryExecution.debug.codegen
{code}
will always dump the generated code in a natural, stable order.

The number of codegen stages within a single SQL query tends to be very small, 
most likely < 50, so the overhead of adding the sorting shouldn't be 
significant.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-31099) Create migration script for metastore_db

2020-03-10 Thread Kris Mok (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056549#comment-17056549
 ] 

Kris Mok commented on SPARK-31099:
--

Just documenting the fact that users may encounter migration issues when 
upgrading from earlier versions of Spark to Spark 3.0 due to the Hive profile 
upgrade sounds good to me.

Derby migration is unlikely to be a production issue, and for other databases 
(MySQL / PG etc) they’re heavy enough that folks would probably realize it’s a 
Hive metastore migration issue just like what’d happen in Hive.

But the documentation should at the very least describe:
* upgraded Hive profile
* what kind of error messages could occur
* links to Hive documentation of how to perform the upgrade

WDYT?

> Create migration script for metastore_db
> 
>
> Key: SPARK-31099
> URL: https://issues.apache.org/jira/browse/SPARK-31099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> When an existing Derby database exists (in ./metastore_db) created by Hive 
> 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile.
> Repro steps:
> 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive 
> -Phive-thriftserver. Make sure there's no existing ./metastore_db directory 
> in the repo.
> 2. Run bin/spark-shell, and then spark.sql("show databases"). This will 
> populate the ./metastore_db directory, where the Derby-based metastore 
> database is hosted. This database is populated from Hive 1.2.x.
> 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops 
> the Hive 1.2 profile, which makes it use the default Hive 2.3 profile)
> 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby 
> database created in Step (2), which triggers an upgrade step, and that's 
> where the following error will be reported.
> 5. Delete the ./metastore_db and re-run Step (4). The error is no longer 
> reported.
> {code:java}
> 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS 
> ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN 
> ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has 
> been specified as NOT NULL and either the DEFAULT clause was not specified or 
> was specified as DEFAULT NULL.
> java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column 
> 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT 
> clause was not specified or was specified as DEFAULT NULL.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830)
>   at 
> org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896)
>   at 
> org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672)
>   at 
> org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347)
>   at org.datanucleus.store.query.Query.executeQuery(Query.java:1816)
>   at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744)
>   at org.datanucleus.store.query.Query.execute(Query.java:1726)
>   at

[jira] [Created] (SPARK-30795) Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()

2020-02-11 Thread Kris Mok (Jira)

Kris Mok created SPARK-30795:


 Summary: Spark SQL codegen's code() interpolator should treat 
escapes like Scala's StringContext.s()
 Key: SPARK-30795
 URL: https://issues.apache.org/jira/browse/SPARK-30795
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 3.0.0
Reporter: Kris Mok


The {{code()}} string interpolator in Spark SQL's code generator should treat 
escapes like Scala's builtin {{StringContext.s()}} interpolator, i.e. it should 
treat escapes in the code parts, and should not treat escapes in the input 
arguments.

For example,
{code}
val arg = "This is an argument."
val str = s"This is string part 1. $arg This is string part 2."
val code = code"This is string part 1. $arg This is string part 2."
assert(code.toString == str)
{code}
We should expect the {{code()}} interpolator produce the same thing as the 
{{StringContext.s()}} interpolator, where only escapes in the string parts 
should be treated, while the args should be kept verbatim.

But in the current implementation, due to the eager folding of code parts and 
literal input args, the escape treatment is incorrectly done on both code parts 
and literal args.
That causes a problem when an arg contains escape sequences and wants to 
preserve that in the final produced code string. For example, in {{Like}} 
expression's codegen, there's an ugly workaround for this bug:
{code}
  // We need double escape to avoid 
org.codehaus.commons.compiler.CompileException.
  // '\\' will cause exception 'Single quote must be backslash-escaped in 
character literal'.
  // '\"' will cause exception 'Line break in literal not allowed'.
  val newEscapeChar = if (escapeChar == '\"' || escapeChar == '\\') {
s"""\\$escapeChar"""
  } else {
escapeChar
  }
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators

2019-01-26 Thread Kris Mok (JIRA)



[ 
https://issues.apache.org/jira/browse/SPARK-26741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753310#comment-16753310
 ] 

Kris Mok commented on SPARK-26741:
--

Note that there's a similar issue with non-aggregate functions. Here's an 
example:
{code:none}
spark.sql("create table foo (id int, blob binary)")
val df = spark.sql("select length(blob) from foo where id = 1 order by 
length(blob) limit 10")
df.explain(true)
{code}
{code:none}
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Sort ['length('blob) ASC NULLS FIRST], true
  +- 'Project [unresolvedalias('length('blob), None)]
 +- 'Filter ('id = 1)
+- 'UnresolvedRelation `foo`

== Analyzed Logical Plan ==
length(blob): int
GlobalLimit 10
+- LocalLimit 10
   +- Project [length(blob)#25]
  +- Sort [length(blob#24) ASC NULLS FIRST], true
 +- Project [length(blob#24) AS length(blob)#25, blob#24]
+- Filter (id#23 = 1)
   +- SubqueryAlias `default`.`foo`
  +- HiveTableRelation `default`.`foo`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24]

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Project [length(blob)#25]
  +- Sort [length(blob#24) ASC NULLS FIRST], true
 +- Project [length(blob#24) AS length(blob)#25, blob#24]
+- Filter (isnotnull(id#23) && (id#23 = 1))
   +- HiveTableRelation `default`.`foo`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24]

== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[length(blob#24) ASC NULLS FIRST], 
output=[length(blob)#25])
+- *(1) Project [length(blob#24) AS length(blob)#25, blob#24]
   +- *(1) Filter (isnotnull(id#23) && (id#23 = 1))
  +- Scan hive default.foo [blob#24, id#23], HiveTableRelation 
`default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, 
blob#24]
{code}

Note how the {{Sort}} operator performs the {{length()}} again, despite there's 
one in the projection right below it. The root cause of this problem in the 
Analyzer is the same as the main example in this ticket, although this example 
is not as harmful (at least it still runs...)

> Analyzer incorrectly resolves aggregate function outside of Aggregate 
> operators
> ---
>
> Key: SPARK-26741
> URL: https://issues.apache.org/jira/browse/SPARK-26741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kris Mok
>Priority: Major
>
> The analyzer can sometimes hit issues with resolving functions. e.g.
> {code:sql}
> select max(id)
> from range(10)
> group by id
> having count(1) >= 1
> order by max(id)
> {code}
> The analyzed plan of this query is:
> {code:none}
> == Analyzed Logical Plan ==
> max(id): bigint
> Project [max(id)#91L]
> +- Sort [max(id#88L) ASC NULLS FIRST], true
>+- Project [max(id)#91L, id#88L]
>   +- Filter (count(1)#93L >= cast(1 as bigint))
>  +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS 
> count(1)#93L, id#88L]
> +- Range (0, 10, step=1, splits=None)
> {code}
> Note how an aggregate function is outside of {{Aggregate}} operators in the 
> fully analyzed plan:
> {{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid.
> Trying to run this query will lead to weird issues in codegen, but the root 
> cause is in the analyzer:
> {code:none}
> java.lang.UnsupportedOperationException: Cannot generate code for expression: 
> max(input[1, bigint, false])
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:237)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
>

[jira] [Created] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators

2019-01-26 Thread Kris Mok (JIRA)

Kris Mok created SPARK-26741:


 Summary: Analyzer incorrectly resolves aggregate function outside 
of Aggregate operators
 Key: SPARK-26741
 URL: https://issues.apache.org/jira/browse/SPARK-26741
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kris Mok


The analyzer can sometimes hit issues with resolving functions. e.g.
{code:sql}
select max(id)
from range(10)
group by id
having count(1) >= 1
order by max(id)
{code}
The analyzed plan of this query is:
{code:none}
== Analyzed Logical Plan ==
max(id): bigint
Project [max(id)#91L]
+- Sort [max(id#88L) ASC NULLS FIRST], true
   +- Project [max(id)#91L, id#88L]
  +- Filter (count(1)#93L >= cast(1 as bigint))
 +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS 
count(1)#93L, id#88L]
+- Range (0, 10, step=1, splits=None)
{code}
Note how an aggregate function is outside of {{Aggregate}} operators in the 
fully analyzed plan:
{{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid.

Trying to run this query will lead to weird issues in codegen, but the root 
cause is in the analyzer:
{code:none}
java.lang.UnsupportedOperationException: Cannot generate code for expression: 
max(input[1, bigint, false])
  at 
org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
  at 
org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
  at 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
  at scala.Option.getOrElse(Option.scala:138)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at scala.collection.TraversableLike.map(TraversableLike.scala:237)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
  at scala.collection.AbstractTraversable.map(Traversable.scala:108)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.createOrderKeys(GenerateOrdering.scala:82)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:91)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:152)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:44)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1194)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.(GenerateOrdering.scala:195)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.(GenerateOrdering.scala:192)
  at 
org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:153)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3302)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2470)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3291)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3287)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2470)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2684)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:262)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:299)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:753)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:712)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:721)
{code}

The test case {{SPARK-23957 Remove redundant sort from subquery plan(scalar 
subquery)}} in {{SubquerySuite}} has been disabled because of hitting this 
issue, caught by SPARK-26735. We should re-enable that test once this bug is 
fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-26735) Verify plan integrity for special expressions

2019-01-25 Thread Kris Mok (JIRA)

Kris Mok created SPARK-26735:


 Summary: Verify plan integrity for special expressions
 Key: SPARK-26735
 URL: https://issues.apache.org/jira/browse/SPARK-26735
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kris Mok


Add verification of plan integrity with regards to special expressions being 
hosted only in supported operators. Specifically:
* AggregateExpression: should only be hosted in Aggregate, or indirectly in 
Window
* WindowExpression: should only be hosted in Window
* Generator: should only be hosted in Generate



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-26664) Make DecimalType's minimum adjusted scale configurable

2019-01-18 Thread Kris Mok (JIRA)

Kris Mok created SPARK-26664:


 Summary: Make DecimalType's minimum adjusted scale configurable
 Key: SPARK-26664
 URL: https://issues.apache.org/jira/browse/SPARK-26664
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kris Mok


Introduce a new conf flag that allows the user to set the value of 
{{DecimalType.MINIMAL_ADJUSTED_SCALE}}, currently a constant of 6, to match 
their workloads' needs.

The new flag will be {{spark.sql.decimalOperations.minimumAdjustedScale}}.

SPARK-22036 introduced a new conf flag 
{{spark.sql.decimalOperations.allowPrecisionLoss}} to match SQL Server's and 
new Hive's behavior of allowing precision loss when performing 
multiplication/division on big and small decimal numbers.
Along with this feature, a fixed {{MINIMAL_ADJUSTED_SCALE}} was set to 6 for 
when precision loss is allowed.

Some customer workload may needed a larger adjusted scale to match their 
business needs, and in exchange they may be willing to tolerate some more 
calculations overflowing the max precision, leading to nulls. So they would 
like the minimum adjusted scale to be configurable. Thus the need for a new 
conf.

The default behavior after introducing this conf flag is not changed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-26661) Show actual class name of the writing command in CTAS explain

2019-01-18 Thread Kris Mok (JIRA)

Kris Mok created SPARK-26661:


 Summary: Show actual class name of the writing command in CTAS 
explain
 Key: SPARK-26661
 URL: https://issues.apache.org/jira/browse/SPARK-26661
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kris Mok


The explain output of the Hive CTAS command, regardless of whether it's 
actually writing via Hive's SerDe or converted into using Spark's data source, 
would always show that it's using {{InsertIntoHiveTable}} because it's 
hardcoded.

e.g.
{code:none}
Execute OptimizedCreateHiveTableAsSelectCommand [Database:default, TableName: 
foo, InsertIntoHiveTable]
{code}
This CTAS is converted into using Spark's data source, but it still says 
{{InsertIntoHiveTable}} in the explain output.

It's better to show the actual class name of the writing command used. For the 
example above, it'd be:
{code:none}
Execute OptimizedCreateHiveTableAsSelectCommand [Database:default, TableName: 
foo, InsertIntoHadoopFsRelationCommand]
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-26659) Duplicate cmd.nodeName in the explain output of DataWritingCommandExec

2019-01-17 Thread Kris Mok (JIRA)

Kris Mok created SPARK-26659:


 Summary: Duplicate cmd.nodeName in the explain output of 
DataWritingCommandExec
 Key: SPARK-26659
 URL: https://issues.apache.org/jira/browse/SPARK-26659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kris Mok


{{DataWritingCommandExec}} generates {{cmd.nodeName}} twice in its explain 
output, e.g. when running this query

{code:none}
spark.sql("create table foo stored as parquet as select id, id % 10 as cat1, id 
% 20 as cat2 from range(10)")
{code}

The query plan's explain output would be:

{code:none}
Execute OptimizedCreateHiveTableAsSelectCommand 
OptimizedCreateHiveTableAsSelectCommand [Database:default, TableName: foo, 
InsertIntoHiveTable]
+- *(1) Project [id#2L, (id#2L % 10) AS cat1#0L, (id#2L % 20) AS cat2#1L]
   +- *(1) Range (0, 10, step=1, splits=8)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-26633) Add ExecutorClassLoader.getResourceAsStream

2019-01-15 Thread Kris Mok (JIRA)

Kris Mok created SPARK-26633:


 Summary: Add ExecutorClassLoader.getResourceAsStream
 Key: SPARK-26633
 URL: https://issues.apache.org/jira/browse/SPARK-26633
 Project: Spark
  Issue Type: Improvement
  Components: Spark Shell
Affects Versions: 3.0.0
Reporter: Kris Mok


{{ExecutorClassLoader}} is capable of loading dynamically generated classes 
from the REPL via either RPC or HDFS, but right now it always delegates 
resource loading to the parent class loader. That makes the dynamically 
generated classes unavailable to uses other than class loading.

Such need may arise, for example, when json4s wants to parse the Class file to 
extract parameter name information. Internally it'd call the class loader's 
{{getResourceAsStream}} to obtain the Class file content as an {{InputStream}}.

This ticket tracks an improvement to the {{ExecutorClassLoader}} to allow 
fetching dynamically generated Class files from the REPL as resource streams.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-26545) Fix typo in EqualNullSafe's truth table comment

2019-01-05 Thread Kris Mok (JIRA)

Kris Mok created SPARK-26545:


 Summary: Fix typo in EqualNullSafe's truth table comment
 Key: SPARK-26545
 URL: https://issues.apache.org/jira/browse/SPARK-26545
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kris Mok


The truth table comment in {{EqualNullSafe}} incorrectly marked FALSE results 
as UNKNOWN



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-26352) join reordering should not change the order of output attributes

2018-12-13 Thread Kris Mok (JIRA)



 [ 
https://issues.apache.org/jira/browse/SPARK-26352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kris Mok updated SPARK-26352:
-
Summary: join reordering should not change the order of output attributes  
(was: ReorderJoin should not change the order of columns)

> join reordering should not change the order of output attributes
> 
>
> Key: SPARK-26352
> URL: https://issues.apache.org/jira/browse/SPARK-26352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Kris Mok
>Priority: Major
>
> The optimizer rule {{org.apache.spark.sql.catalyst.optimizer.ReorderJoin}} 
> performs join reordering on inner joins. This was introduced from SPARK-12032 
> in 2015-12.
> After it had reordered the joins, though, it didn't check whether or not the 
> column order (in terms of the {{output}} attribute list) is still the same as 
> before. Thus, it's possible to have a mismatch between the reordered column 
> order vs the schema that a DataFrame thinks it has.
> This can be demonstrated with the example:
> {code:none}
> spark.sql("create table table_a (x int, y int) using parquet")
> spark.sql("create table table_b (i int, j int) using parquet")
> spark.sql("create table table_c (a int, b int) using parquet")
> val df = spark.sql("with df1 as (select * from table_a cross join table_b) 
> select * from df1 join table_c on a = x and b = i")
> {code}
> here's what the DataFrame thinks:
> {code:none}
> scala> df.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: integer (nullable = true)
>  |-- i: integer (nullable = true)
>  |-- j: integer (nullable = true)
>  |-- a: integer (nullable = true)
>  |-- b: integer (nullable = true)
> {code}
> here's what the optimized plan thinks, after join reordering:
> {code:none}
> scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- 
> ${a.name}: ${a.dataType.typeName}"))
> |-- x: integer
> |-- y: integer
> |-- a: integer
> |-- b: integer
> |-- i: integer
> |-- j: integer
> {code}
> If we exclude the {{ReorderJoin}} rule (using Spark 2.4's optimizer rule 
> exclusion feature), it's back to normal:
> {code:none}
> scala> spark.conf.set("spark.sql.optimizer.excludedRules", 
> "org.apache.spark.sql.catalyst.optimizer.ReorderJoin")
> scala> val df = spark.sql("with df1 as (select * from table_a cross join 
> table_b) select * from df1 join table_c on a = x and b = i")
> df: org.apache.spark.sql.DataFrame = [x: int, y: int ... 4 more fields]
> scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- 
> ${a.name}: ${a.dataType.typeName}"))
> |-- x: integer
> |-- y: integer
> |-- i: integer
> |-- j: integer
> |-- a: integer
> |-- b: integer
> {code}
> Note that this column ordering problem leads to data corruption, and can 
> manifest itself in various symptoms:
> * Silently corrupting data, if the reordered columns happen to either have 
> matching types or have sufficiently-compatible types (e.g. all fixed length 
> primitive types are considered as "sufficiently compatible" in an UnsafeRow), 
> then only the resulting data is going to be wrong but it might not trigger 
> any alarms immediately. Or
> * Weird Java-level exceptions like {{java.lang.NegativeArraySizeException}}, 
> or even SIGSEGVs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-26352) ReorderJoin should not change the order of columns

2018-12-12 Thread Kris Mok (JIRA)

Kris Mok created SPARK-26352:


 Summary: ReorderJoin should not change the order of columns
 Key: SPARK-26352
 URL: https://issues.apache.org/jira/browse/SPARK-26352
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.3.0
Reporter: Kris Mok


The optimizer rule {{org.apache.spark.sql.catalyst.optimizer.ReorderJoin}} 
performs join reordering on inner joins. This was introduced from SPARK-12032 
in 2015-12.

After it had reordered the joins, though, it didn't check whether or not the 
column order (in terms of the {{output}} attribute list) is still the same as 
before. Thus, it's possible to have a mismatch between the reordered column 
order vs the schema that a DataFrame thinks it has.

This can be demonstrated with the example:
{code:none}
spark.sql("create table table_a (x int, y int) using parquet")
spark.sql("create table table_b (i int, j int) using parquet")
spark.sql("create table table_c (a int, b int) using parquet")
val df = spark.sql("with df1 as (select * from table_a cross join table_b) 
select * from df1 join table_c on a = x and b = i")
{code}
here's what the DataFrame thinks:
{code:none}
scala> df.printSchema
root
 |-- x: integer (nullable = true)
 |-- y: integer (nullable = true)
 |-- i: integer (nullable = true)
 |-- j: integer (nullable = true)
 |-- a: integer (nullable = true)
 |-- b: integer (nullable = true)
{code}
here's what the optimized plan thinks, after join reordering:
{code:none}
scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- 
${a.name}: ${a.dataType.typeName}"))
|-- x: integer
|-- y: integer
|-- a: integer
|-- b: integer
|-- i: integer
|-- j: integer
{code}

If we exclude the {{ReorderJoin}} rule (using Spark 2.4's optimizer rule 
exclusion feature), it's back to normal:
{code:none}
scala> spark.conf.set("spark.sql.optimizer.excludedRules", 
"org.apache.spark.sql.catalyst.optimizer.ReorderJoin")

scala> val df = spark.sql("with df1 as (select * from table_a cross join 
table_b) select * from df1 join table_c on a = x and b = i")
df: org.apache.spark.sql.DataFrame = [x: int, y: int ... 4 more fields]

scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- 
${a.name}: ${a.dataType.typeName}"))
|-- x: integer
|-- y: integer
|-- i: integer
|-- j: integer
|-- a: integer
|-- b: integer
{code}

Note that this column ordering problem leads to data corruption, and can 
manifest itself in various symptoms:
* Silently corrupting data, if the reordered columns happen to either have 
matching types or have sufficiently-compatible types (e.g. all fixed length 
primitive types are considered as "sufficiently compatible" in an UnsafeRow), 
then only the resulting data is going to be wrong but it might not trigger any 
alarms immediately. Or
* Weird Java-level exceptions like {{java.lang.NegativeArraySizeException}}, or 
even SIGSEGVs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-26107) Extend ReplaceNullWithFalseInPredicate to support higher-order functions: ArrayExists, ArrayFilter, MapFilter

2018-11-18 Thread Kris Mok (JIRA)

Kris Mok created SPARK-26107:


 Summary: Extend ReplaceNullWithFalseInPredicate to support 
higher-order functions: ArrayExists, ArrayFilter, MapFilter
 Key: SPARK-26107
 URL: https://issues.apache.org/jira/browse/SPARK-26107
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kris Mok


Extend the {{ReplaceNullWithFalse}} optimizer rule introduced in SPARK-25860 to 
also support optimizing predicates in higher-order functions of 
{{ArrayExists}}, {{ArrayFilter}}, {{MapFilter}}.

Also rename the rule to {{ReplaceNullWithFalseInPredicate}} to better reflect 
its intent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-26061) Reduce the number of unused UnsafeRowWriters created in whole-stage codegen

2018-11-14 Thread Kris Mok (JIRA)

Kris Mok created SPARK-26061:


 Summary: Reduce the number of unused UnsafeRowWriters created in 
whole-stage codegen
 Key: SPARK-26061
 URL: https://issues.apache.org/jira/browse/SPARK-26061
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0
Reporter: Kris Mok


Reduce the number of unused UnsafeRowWriters created in whole-stage generated 
code.
They come from the CodegenSupport.consume() calling prepareRowVar(), which uses 
GenerateUnsafeProjection.createCode() and registers an UnsafeRowWriter mutable 
state, regardless of whether or not the downstream (parent) operator will use 
the rowVar or not.
Even when the downstream doConsume function doesn't use the rowVar (i.e. 
doesn't put row.code as a part of this operator's codegen template), the 
registered UnsafeRowWriter stays there, which makes the init function of the 
generated code a bit bloated.

This ticket doesn't track the root issue, but makes it slightly less painful: 
when the doConsume function is split out, the prepareRowVar() function is 
called twice, so it's double the pain of unused UnsafeRowWriters. This fix 
simply moves the original call to prepareRowVar() down into the doConsume 
split/no-split branch so that we're back to just 1x the pain.

To fix the root issue, something that allows the CodegenSupport operators to 
indicate whether or not they're going to use the rowVar would be needed. That's 
a much more elaborate change so I'd like to just make a minor fix first.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-25961) Random numbers are not supported when handling data skew

2018-11-08 Thread Kris Mok (JIRA)



[ 
https://issues.apache.org/jira/browse/SPARK-25961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680152#comment-16680152
 ] 

Kris Mok commented on SPARK-25961:
--

It looks like the current restriction makes sense, because the expressions in 
join condition may eventually be evaluated multiple times depending on which 
physical join operator is chosen. It doesn't make a lot of sense to allow 
non-deterministic expression directly in the Join operator.

Instead, if we have to support having non-deterministic expression in the join 
condition and retain an "evaluated-once" semantic, it'd be better to have a 
rule in the Analyzer to extract non-deterministic expressions from the join 
condition and put it into a child Project operator on the appropriate side.

[~zengxl] does that make sense to you?

> Random numbers are not supported when handling data skew
> 
>
> Key: SPARK-25961
> URL: https://issues.apache.org/jira/browse/SPARK-25961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: spark on yarn 2.3.1
>Reporter: zengxl
>Priority: Major
>
> my query sql use two table join,one table join key has null value,i use rand 
> value instead of null value,but has error,the error info as follows：
> Error in query: nondeterministic expressions are only allowed in
> Project, Filter, Aggregate or Window, found
>  
>  
> scan spark source code is 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis check sql, because the 
> number of random variables is uncertain, it is prohibited
> case o if o.expressions.exists(!_.deterministic) &&
>  !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
>  !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] =>
>  // The rule above is used to check Aggregate operator.
>  failAnalysis(
>  s"""nondeterministic expressions are only allowed in
> |Project, Filter, Aggregate or Window, found:|
> |${o.expressions.map(_.sql).mkString(",")}|
> |in operator ${operator.simpleString}
>  """.stripMargin)|
>  
> Is it possible to add Join to this code? It's not yet tested.And whether 
> there will be other effects
> case o if o.expressions.exists(!_.deterministic) &&
>  !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
>  !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] +{color:#d04437}&& 
> !o.isInstanceOf[Join]{color}+ =>
>  // The rule above is used to check Aggregate operator.
>  failAnalysis(
>  s"""nondeterministic expressions are only allowed in
> |Project, Filter, Aggregate or Window or Join, found:|
> |${o.expressions.map(_.sql).mkString(",")}|
> |in operator ${operator.simpleString}
>  """.stripMargin)|
>  
> this is my sparksql：
> SELECT
>  T1.CUST_NO AS CUST_NO ,
>  T3.CON_LAST_NAME AS CUST_NAME ,
>  T3.CON_SEX_MF AS SEX_CODE ,
>  T3.X_POSITION AS POST_LV_CODE 
>  FROM tmp.ICT_CUST_RANGE_INFO T1
>  LEFT join tmp.F_CUST_BASE_INFO_ALL T3 ON CASE WHEN coalesce(T1.CUST_NO,'') 
> ='' THEN concat('cust_no',RAND()) ELSE T1.CUST_NO END = T3.BECIF and 
> T3.DATE='20181105'
>  WHERE T1.DATE='20181105'



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-25494) Upgrade Spark's use of Janino to 3.0.10

2018-09-20 Thread Kris Mok (JIRA)

Kris Mok created SPARK-25494:


 Summary: Upgrade Spark's use of Janino to 3.0.10
 Key: SPARK-25494
 URL: https://issues.apache.org/jira/browse/SPARK-25494
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kris Mok


This ticket proposes to upgrade Spark's use of Janino from 3.0.9 to 3.0.10.
Note that 3.0.10 is a out-of-band release specifically for fixing an integer 
overflow issue in Janino's {{ClassFile}} reader. It is otherwise exactly the 
same as 3.0.9, so it's a low risk and compatible upgrade.

The integer overflow issue affects Spark SQL's codegen stats collection: when a 
generated Class file is huge, especially when the constant pool size is above 
{{Short.MAX_VALUE}}, Janino's {{ClassFile}} reader will throw an exception when 
Spark wants to parse the generated Class file to collect stats. So we'll miss 
the stats of some huge Class files.

The Janino fix is tracked by this issue: 
https://github.com/janino-compiler/janino/issues/58



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field

2018-08-21 Thread Kris Mok (JIRA)



[ 
https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587820#comment-16587820
 ] 

Kris Mok commented on SPARK-25178:
--

FYI: I've just come across this behavior and raised a ticket so that we don't 
forget about it, but I'm not working on this one (yet).

> Use dummy name for xxxHashMapGenerator key/value schema field
> -
>
> Key: SPARK-25178
> URL: https://issues.apache.org/jira/browse/SPARK-25178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> Following SPARK-18952 and SPARK-22273, this ticket proposes to change the 
> generated field name of the keySchema / valueSchema to a dummy name instead 
> of using {{key.name}}.
> In previous discussion from SPARK-18952's PR [1], it was already suggested 
> that the field names were being used, so it's not worth capturing the strings 
> as reference objects here. Josh suggested merging the original fix as-is due 
> to backportability / pickability concerns. Now that we're coming up to a new 
> release, this can be revisited.
> [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field

2018-08-21 Thread Kris Mok (JIRA)

Kris Mok created SPARK-25178:


 Summary: Use dummy name for xxxHashMapGenerator key/value schema 
field
 Key: SPARK-25178
 URL: https://issues.apache.org/jira/browse/SPARK-25178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kris Mok


Following SPARK-18952 and SPARK-22273, this ticket proposes to change the 
generated field name of the keySchema / valueSchema to a dummy name instead of 
using {{key.name}}.

In previous discussion from SPARK-18952's PR [1], it was already suggested that 
the field names were being used, so it's not worth capturing the strings as 
reference objects here. Josh suggested merging the original fix as-is due to 
backportability / pickability concerns. Now that we're coming up to a new 
release, this can be revisited.

[1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-25140) Add optional logging to UnsafeProjection.create when it falls back to interpreted mode

2018-08-17 Thread Kris Mok (JIRA)



[ 
https://issues.apache.org/jira/browse/SPARK-25140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583486#comment-16583486
 ] 

Kris Mok commented on SPARK-25140:
--

Thanks maropu-san! No, I'm not working on this ticket; I'm currently swamped by 
other tasks close to a deadline so I couldn't work on this one right now.

I was just thinking about resetting my mental model for the Spark 2.4.x code 
execution performance, and saw that after the fallback was implemented, it's 
very hard for the user to figure out whether or not the interpreter fallback is 
taking effect and/or whether or not it's contributing to slow performance. This 
would be valuable information to have for query tuning, etc.

> Add optional logging to UnsafeProjection.create when it falls back to 
> interpreted mode
> --
>
> Key: SPARK-25140
> URL: https://issues.apache.org/jira/browse/SPARK-25140
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> SPARK-23711 implemented a nice graceful handling of allowing UnsafeProjection 
> to fall back to an interpreter mode when codegen fails. That makes Spark much 
> more usable even when codegen is unable to handle the given query.
> But in its current form, the fallback handling can also be a mystery in terms 
> of performance cliffs. Users may be left wondering why a query runs fine with 
> some expressions, but then with just one extra expression the performance 
> goes 2x, 3x (or more) slower.
> It'd be nice to have optional logging of the fallback behavior, so that for 
> users that care about monitoring performance cliffs, they can opt-in to log 
> when a fallback to interpreter mode was taken. i.e. at
> https://github.com/apache/spark/blob/a40ffc656d62372da85e0fa932b67207839e7fde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L183



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-25140) Add optional logging to UnsafeProjection.create when it falls back to interpreted mode

2018-08-17 Thread Kris Mok (JIRA)



 [ 
https://issues.apache.org/jira/browse/SPARK-25140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kris Mok updated SPARK-25140:
-
Description: 
SPARK-23711 implemented a nice graceful handling of allowing UnsafeProjection 
to fall back to an interpreter mode when codegen fails. That makes Spark much 
more usable even when codegen is unable to handle the given query.

But in its current form, the fallback handling can also be a mystery in terms 
of performance cliffs. Users may be left wondering why a query runs fine with 
some expressions, but then with just one extra expression the performance goes 
2x, 3x (or more) slower.

It'd be nice to have optional logging of the fallback behavior, so that for 
users that care about monitoring performance cliffs, they can opt-in to log 
when a fallback to interpreter mode was taken. i.e. at
https://github.com/apache/spark/blob/a40ffc656d62372da85e0fa932b67207839e7fde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L183

  was:
SPARK-23711 implemented a nice graceful handling of allowing UnsafeProjection 
to fall back to an interpreter mode when codegen fails. That makes Spark much 
more usable even when codegen is unable to handle the given query.

But in its current form, the fallback handling can also be a mystery in terms 
of performance cliffs. Users may be left wondering why a query runs fine with 
some expressions, but then with just one extra expression the performance goes 
2x, 3x (or more) slower.

It'd be nice to have optional logging of the fallback behavior, so that for 
users that care about monitoring performance cliffs, they can opt-in to log 
when a fallback to interpreter mode was taken.


> Add optional logging to UnsafeProjection.create when it falls back to 
> interpreted mode
> --
>
> Key: SPARK-25140
> URL: https://issues.apache.org/jira/browse/SPARK-25140
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> SPARK-23711 implemented a nice graceful handling of allowing UnsafeProjection 
> to fall back to an interpreter mode when codegen fails. That makes Spark much 
> more usable even when codegen is unable to handle the given query.
> But in its current form, the fallback handling can also be a mystery in terms 
> of performance cliffs. Users may be left wondering why a query runs fine with 
> some expressions, but then with just one extra expression the performance 
> goes 2x, 3x (or more) slower.
> It'd be nice to have optional logging of the fallback behavior, so that for 
> users that care about monitoring performance cliffs, they can opt-in to log 
> when a fallback to interpreter mode was taken. i.e. at
> https://github.com/apache/spark/blob/a40ffc656d62372da85e0fa932b67207839e7fde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L183



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-25140) Add optional logging to UnsafeProjection.create when it falls back to interpreted mode

2018-08-17 Thread Kris Mok (JIRA)

Kris Mok created SPARK-25140:


 Summary: Add optional logging to UnsafeProjection.create when it 
falls back to interpreted mode
 Key: SPARK-25140
 URL: https://issues.apache.org/jira/browse/SPARK-25140
 Project: Spark
  Issue Type: Wish
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kris Mok


SPARK-23711 implemented a nice graceful handling of allowing UnsafeProjection 
to fall back to an interpreter mode when codegen fails. That makes Spark much 
more usable even when codegen is unable to handle the given query.

But in its current form, the fallback handling can also be a mystery in terms 
of performance cliffs. Users may be left wondering why a query runs fine with 
some expressions, but then with just one extra expression the performance goes 
2x, 3x (or more) slower.

It'd be nice to have optional logging of the fallback behavior, so that for 
users that care about monitoring performance cliffs, they can opt-in to log 
when a fallback to interpreter mode was taken.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-25138) Spark Shell should show the Scala prompt after initialization is complete

2018-08-17 Thread Kris Mok (JIRA)

Kris Mok created SPARK-25138:


 Summary: Spark Shell should show the Scala prompt after 
initialization is complete
 Key: SPARK-25138
 URL: https://issues.apache.org/jira/browse/SPARK-25138
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.4.0
Reporter: Kris Mok


In previous Spark versions, the Spark Shell used to only show the Scala prompt 
*after* Spark has initialized. i.e. when the user is able to enter code, the 
Spark context, Spark session etc have all completed initialization, so {{sc}}, 
{{spark}} are all ready to use.

In the current Spark master branch (to become Spark 2.4.0), the Scala prompt 
shows up immediately, while Spark itself is still in initialization in the 
background. It's very easy for the user to feel as if the shell is ready and 
start typing, only to find that Spark isn't ready yet, and Spark's 
initialization logs get in the way of typing. This new behavior is rather 
annoying from a usability's perspective.

A typical startup of the Spark Shell in current master:
{code:none}
$ bin/spark-shell
18/08/16 23:18:05 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
  /_/
 
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.range(1)Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1534486692744).
Spark session available as 'spark'.
.show
+---+
| id|
+---+
|  0|
+---+


scala> 
{code}

Could you see that it was running {{spark.range(1).show}} ?

In contrast, previous versions of Spark Shell would wait for Spark to fully 
initialization:
{code:none}
$ bin/spark-shell
18/08/16 23:20:05 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context Web UI available at http://10.0.0.76:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1534486813159).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.3-SNAPSHOT
  /_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.range(1).show
+---+
| id|
+---+
|  0|
+---+


scala> 
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-25113) Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit

2018-08-14 Thread Kris Mok (JIRA)

Kris Mok created SPARK-25113:


 Summary: Add logging to CodeGenerator when any generated method's 
bytecode size goes above HugeMethodLimit
 Key: SPARK-25113
 URL: https://issues.apache.org/jira/browse/SPARK-25113
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kris Mok


Add logging to help collect statistics on how often real world usage sees the 
{{CodeGenerator}} generating methods whose bytecode size goes above the 8000 
bytes (HugeMethodLimit) threshold.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-24659) GenericArrayData.equals should respect element type differences

2018-06-26 Thread Kris Mok (JIRA)

Kris Mok created SPARK-24659:


 Summary: GenericArrayData.equals should respect element type 
differences
 Key: SPARK-24659
 URL: https://issues.apache.org/jira/browse/SPARK-24659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1, 2.3.0, 2.4.0
Reporter: Kris Mok


Right now, Spark SQL's {{GenericArrayData.equals}} doesn't always respect 
element type differences, due to a caveat in Scala's {{==}} operator.

e.g. {{new GenericArrayData(Array[Int](123)).equals(new 
GenericArrayData(Array[Long](123L)))}} currently returns true. But that's 
against the semantics of Spark SQL's array type, where {{array}} and 
{{array}} are considered to be incompatible types and thus should never 
be equal.

This ticket proposes to fix the implementation of {{GenericArrayData.equals}} 
so that it's more aligned to Spark SQL's array type semantics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-24321) Extract common code from Divide/Remainder to a base trait

2018-05-18 Thread Kris Mok (JIRA)

Kris Mok created SPARK-24321:


 Summary: Extract common code from Divide/Remainder to a base trait
 Key: SPARK-24321
 URL: https://issues.apache.org/jira/browse/SPARK-24321
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kris Mok


There's a lot of code duplication between {{Divide}} and {{Remainder}} 
expression types. They're mostly the codegen template (which is exactly the 
same, with just cosmetic differences), the eval function structure, etc.

It tedious to have to update multiple places in case we make improvements to 
the codegen templates in the future. This ticket proposes to refactor the 
duplicate code into a common base trait for these two classes.

Non-goal: There another class, {{Pmod}}, that is also similiar to {{Divide}} 
and {{Remainder}}, so in theory we can make a deeper refactoring to accommodate 
this class as well. But the "operation" part of its codegen template is harder 
to factor into the base trait, so this ticket only handles {{Divide}} and 
{{Remainder}} for now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-23960) Mark HashAggregateExec.bufVars as transient

2018-04-11 Thread Kris Mok (JIRA)

Kris Mok created SPARK-23960:


 Summary: Mark HashAggregateExec.bufVars as transient
 Key: SPARK-23960
 URL: https://issues.apache.org/jira/browse/SPARK-23960
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kris Mok


{{HashAggregateExec.bufVars}} is only used during codegen for global 
aggregation. Specifically it's only used while {{doProduceWithoutKeys()}} is on 
the stack.
Currently, if an {{HashAggregateExec}} is ever captured for serialization, the 
{{bufVars}} would be needlessly serialized.

This ticket proposes a minor change to mark the {{bufVars}} field as transient 
to avoid serializing it. Also, null out this field at the end of 
{{doProduceWithoutKeys()}} to reduce its lifecycle so that the 
{{Seq[ExprCode]}} being referenced can be GC'd sooner.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-23947) Add hashUTF8String convenience method to hasher classes

2018-04-09 Thread Kris Mok (JIRA)

Kris Mok created SPARK-23947:


 Summary: Add hashUTF8String convenience method to hasher classes
 Key: SPARK-23947
 URL: https://issues.apache.org/jira/browse/SPARK-23947
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kris Mok


Add {{hashUTF8String()}} to the hasher classes to allow Spark SQL codegen to 
generate cleaner code for hashing {{UTF8String}}. No change in behavior 
otherwise.

Although with the introduction of SPARK-10399, the code size for hashing 
{{UTF8String}} is already smaller, it's still good to extract a separate 
function in the hasher classes so that the generated code can stay clean.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-23760) CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly

2018-03-20 Thread Kris Mok (JIRA)

Kris Mok created SPARK-23760:


 Summary: CodegenContext.withSubExprEliminationExprs should 
save/restore CSE state correctly
 Key: SPARK-23760
 URL: https://issues.apache.org/jira/browse/SPARK-23760
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.2.1, 2.2.0
Reporter: Kris Mok


There's a bug in {{CodegenContext.withSubExprEliminationExprs()}} that makes it 
effectively always clear the subexpression elimination state after it's called.

The original intent of this function was that it should save the old state, set 
the given new state as current and perform codegen (invoke 
{{Expression.genCode()}}), and at the end restore the subexpression elimination 
state back to the old state. This ticket tracks a fix to actually implement the 
original intent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-23447) Cleanup codegen template for Literal

2018-02-15 Thread Kris Mok (JIRA)

Kris Mok created SPARK-23447:


 Summary: Cleanup codegen template for Literal
 Key: SPARK-23447
 URL: https://issues.apache.org/jira/browse/SPARK-23447
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1, 2.2.0
Reporter: Kris Mok


Ideally, the codegen templates for {{Literal}} should emit literals in the 
{{isNull}} and {{value}} fields of {{ExprCode}} so that they can be effectively 
inlined into their use sites.
But currently there are a couple of paths where {{Literal.doGenCode()}} return 
{{ExprCode}} that has non-trivial {{code}} field, and all of those are actually 
unnecessary.

We can make a simple refactoring to make sure all codegen templates for 
{{Literal}} return empty {{code}} and simple literal/constant expressions in 
{{isNull}} and {{value}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-23021) AnalysisBarrier should not cut off the explain output for Parsed Logical Plan

2018-01-12 Thread Kris Mok (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324318#comment-16324318
 ] 

Kris Mok commented on SPARK-23021:
--

Thank you very much for explaining the differences between {{children}} and 
{{innerChildren}}, [~viirya] and [~cloud_fan]! TIL :-)

> AnalysisBarrier should not cut off the explain output for Parsed Logical Plan
> -
>
> Key: SPARK-23021
> URL: https://issues.apache.org/jira/browse/SPARK-23021
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kris Mok
>
> In PR#20094 as a follow up to SPARK-20392, there were some fixes to the 
> handling of {{AnalysisBarrier}}, but there seem to be more cases that need to 
> be fixed.
> One such case is that right now the Parsed Logical Plan in explain output 
> would be cutoff by {{AnalysisBarrier}}, e.g.
> {code:none}
> scala> val df1 = spark.range(1).select('id as 'x, 'id + 1 as 
> 'y).repartition(1).select('x === 'y)
> df1: org.apache.spark.sql.DataFrame = [(x = y): boolean]
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [('x = 'y) AS (x = y)#22]
> +- AnalysisBarrier Repartition 1, true
> == Analyzed Logical Plan ==
> (x = y): boolean
> Project [(x#16L = y#17L) AS (x = y)#22]
> +- Repartition 1, true
>+- Project [id#13L AS x#16L, (id#13L + cast(1 as bigint)) AS y#17L]
>   +- Range (0, 1, step=1, splits=Some(8))
> == Optimized Logical Plan ==
> Project [(x#16L = y#17L) AS (x = y)#22]
> +- Repartition 1, true
>+- Project [id#13L AS x#16L, (id#13L + 1) AS y#17L]
>   +- Range (0, 1, step=1, splits=Some(8))
> == Physical Plan ==
> *Project [(x#16L = y#17L) AS (x = y)#22]
> +- Exchange RoundRobinPartitioning(1)
>+- *Project [id#13L AS x#16L, (id#13L + 1) AS y#17L]
>   +- *Range (0, 1, step=1, splits=8)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-23021) AnalysisBarrier should not cut off the explain output for Parsed Logical Plan

2018-01-11 Thread Kris Mok (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323672#comment-16323672
 ] 

Kris Mok commented on SPARK-23021:
--

Hi [~maropu]-san,

Thanks! I'm not familiar with this part of the code, but IIUC there's two 
points to this:
1. The explain output shouldn't be changed because of the newly added 
{{AnalysisBarrier}}. The current behavior (with the cutoff) should be 
considered a regression, although it's not a major behavioral one.
2. The reason why {{AnalysisBarrier}} didn't override {{innerChildren}} was by 
design: it wanted to cut off the recursion to help avoid stack overflows when 
the analyzer needs to walk the plan tree.

[~cloud_fan] may have recently fixed a similar issue where the "analyzed 
logical plan" was cut off by {{AnalysisBarrier}} in the same way. Is that the 
case, [~cloud_fan]?

> AnalysisBarrier should not cut off the explain output for Parsed Logical Plan
> -
>
> Key: SPARK-23021
> URL: https://issues.apache.org/jira/browse/SPARK-23021
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kris Mok
>
> In PR#20094 as a follow up to SPARK-20392, there were some fixes to the 
> handling of {{AnalysisBarrier}}, but there seem to be more cases that need to 
> be fixed.
> One such case is that right now the Parsed Logical Plan in explain output 
> would be cutoff by {{AnalysisBarrier}}, e.g.
> {code:none}
> scala> val df1 = spark.range(1).select('id as 'x, 'id + 1 as 
> 'y).repartition(1).select('x === 'y)
> df1: org.apache.spark.sql.DataFrame = [(x = y): boolean]
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [('x = 'y) AS (x = y)#22]
> +- AnalysisBarrier Repartition 1, true
> == Analyzed Logical Plan ==
> (x = y): boolean
> Project [(x#16L = y#17L) AS (x = y)#22]
> +- Repartition 1, true
>+- Project [id#13L AS x#16L, (id#13L + cast(1 as bigint)) AS y#17L]
>   +- Range (0, 1, step=1, splits=Some(8))
> == Optimized Logical Plan ==
> Project [(x#16L = y#17L) AS (x = y)#22]
> +- Repartition 1, true
>+- Project [id#13L AS x#16L, (id#13L + 1) AS y#17L]
>   +- Range (0, 1, step=1, splits=Some(8))
> == Physical Plan ==
> *Project [(x#16L = y#17L) AS (x = y)#22]
> +- Exchange RoundRobinPartitioning(1)
>+- *Project [id#13L AS x#16L, (id#13L + 1) AS y#17L]
>   +- *Range (0, 1, step=1, splits=8)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-23021) AnalysisBarrier should not cut off the explain output for Parsed Logical Plan

2018-01-11 Thread Kris Mok (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323612#comment-16323612
 ] 

Kris Mok commented on SPARK-23021:
--

Hi [~maropu]-san,

Thanks for looking at it! No, I only noticed this behavior while working on 
something else. I wasn't planning on working on this ticket, unless no one 
takes it and I get too curious ;-)

Would you like to submit that patch as a PR?

Thanks,
Kris

> AnalysisBarrier should not cut off the explain output for Parsed Logical Plan
> -
>
> Key: SPARK-23021
> URL: https://issues.apache.org/jira/browse/SPARK-23021
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kris Mok
>
> In PR#20094 as a follow up to SPARK-20392, there were some fixes to the 
> handling of {{AnalysisBarrier}}, but there seem to be more cases that need to 
> be fixed.
> One such case is that right now the Parsed Logical Plan in explain output 
> would be cutoff by {{AnalysisBarrier}}, e.g.
> {code:none}
> scala> val df1 = spark.range(1).select('id as 'x, 'id + 1 as 
> 'y).repartition(1).select('x === 'y)
> df1: org.apache.spark.sql.DataFrame = [(x = y): boolean]
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [('x = 'y) AS (x = y)#22]
> +- AnalysisBarrier Repartition 1, true
> == Analyzed Logical Plan ==
> (x = y): boolean
> Project [(x#16L = y#17L) AS (x = y)#22]
> +- Repartition 1, true
>+- Project [id#13L AS x#16L, (id#13L + cast(1 as bigint)) AS y#17L]
>   +- Range (0, 1, step=1, splits=Some(8))
> == Optimized Logical Plan ==
> Project [(x#16L = y#17L) AS (x = y)#22]
> +- Repartition 1, true
>+- Project [id#13L AS x#16L, (id#13L + 1) AS y#17L]
>   +- Range (0, 1, step=1, splits=Some(8))
> == Physical Plan ==
> *Project [(x#16L = y#17L) AS (x = y)#22]
> +- Exchange RoundRobinPartitioning(1)
>+- *Project [id#13L AS x#16L, (id#13L + 1) AS y#17L]
>   +- *Range (0, 1, step=1, splits=8)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec

2018-01-10 Thread Kris Mok (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kris Mok updated SPARK-23032:
-
Description: 
Proposing to add a per-query ID to the codegen stages as represented by 
{{WholeStageCodegenExec}} operators. This ID will be used in
* the explain output of the physical plan, and in
* the generated class name.

Specifically, this ID will be stable within a query, counting up from 1 in 
depth-first post-order for all the {{WholeStageCodegenExec}} inserted into a 
plan.
The ID value 0 is reserved for "free-floating" {{WholeStageCodegenExec}} 
objects, which may have been created for one-off purposes, e.g. for fallback 
handling of codegen stages that failed to codegen the whole stage and wishes to 
codegen a subset of the children operators.

Example: for the following query:
{code:none}
scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1)

scala> val df1 = spark.range(10).select('id as 'x, 'id + 1 as 
'y).orderBy('x).select('x + 1 as 'z, 'y)
df1: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint]

scala> val df2 = spark.range(5)
df2: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val query = df1.join(df2, 'z === 'id)
query: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint ... 1 more field]
{code}

The explain output before the change is:
{code:none}
scala> query.explain
== Physical Plan ==
*SortMergeJoin [z#9L], [id#13L], Inner
:- *Sort [z#9L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(z#9L, 200)
: +- *Project [(x#3L + 1) AS z#9L, y#4L]
:+- *Sort [x#3L ASC NULLS FIRST], true, 0
:   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
:  +- *Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
: +- *Range (0, 10, step=1, splits=8)
+- *Sort [id#13L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#13L, 200)
  +- *Range (0, 5, step=1, splits=8)
{code}
Note how codegen'd operators are annotated with a prefix {{"*"}}.

and after this change it'll be:
{code:none}
scala> query.explain
== Physical Plan ==
*(6) SortMergeJoin [z#9L], [id#13L], Inner
:- *(3) Sort [z#9L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(z#9L, 200)
: +- *(2) Project [(x#3L + 1) AS z#9L, y#4L]
:+- *(2) Sort [x#3L ASC NULLS FIRST], true, 0
:   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
:  +- *(1) Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
: +- *(1) Range (0, 10, step=1, splits=8)
+- *(5) Sort [id#13L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#13L, 200)
  +- *(4) Range (0, 5, step=1, splits=8)
{code}
Note that the annotated prefix becomes {{"*(id) "}}

It'll also show up in the name of the generated class, as a suffix in the 
format of
{code:none}
GeneratedClass$GeneratedIterator$id
{code}

for example, note how {{GeneratedClass$GeneratedIteratorForCodegenStage3}} and 
{{GeneratedClass$GeneratedIteratorForCodegenStage6}} in the following stack 
trace corresponds to the IDs shown in the explain output above:
{code:none}
"Executor task launch worker for task 424@12957" daemon prio=5 tid=0x58 nid=NA 
runnable
  java.lang.Thread.State: RUNNABLE
  at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.sort_addToSorter$(generated.java:32)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(generated.java:41)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$9$$anon$1.hasNext(WholeStageCodegenExec.scala:494)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.findNextInnerJoinRows$(generated.java:42)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(generated.java:101)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$2.hasNext(WholeStageCodegenExec.scala:513)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828)
  at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at

[jira] [Created] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec

2018-01-10 Thread Kris Mok (JIRA)

Kris Mok created SPARK-23032:


 Summary: Add a per-query codegenStageId to WholeStageCodegenExec
 Key: SPARK-23032
 URL: https://issues.apache.org/jira/browse/SPARK-23032
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kris Mok


Proposing to add a per-query ID to the codegen stages as represented by 
{{WholeStageCodegenExec}} operators. This ID will be used in
* the explain output of the physical plan, and in
* the generated class name.

Specifically, this ID will be stable within a query, counting up from 1 in 
depth-first post-order for all the {{WholeStageCodegenExec}} inserted into a 
plan.
The ID value 0 is reserved for "free-floating" {{WholeStageCodegenExec}} 
objects, which may have been created for one-off purposes, e.g. for fallback 
handling of codegen stages that failed to codegen the whole stage and wishes to 
codegen a subset of the children operators.

Example: for the following query:
{code:none}
scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1)

scala> val df1 = spark.range(10).select('id as 'x, 'id + 1 as 
'y).orderBy('x).select('x + 1 as 'z, 'y)
df1: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint]

scala> val df2 = spark.range(5)
df2: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val query = df1.join(df2, 'z === 'id)
query: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint ... 1 more field]
{code}

The explain output before the change is:
{code:none}
scala> query.explain
== Physical Plan ==
*SortMergeJoin [z#9L], [id#13L], Inner
:- *Sort [z#9L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(z#9L, 200)
: +- *Project [(x#3L + 1) AS z#9L, y#4L]
:+- *Sort [x#3L ASC NULLS FIRST], true, 0
:   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
:  +- *Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
: +- *Range (0, 10, step=1, splits=8)
+- *Sort [id#13L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#13L, 200)
  +- *Range (0, 5, step=1, splits=8)
{code}
Note how codegen'd operators are annotated with a prefix {{"*"}}.

and after this change it'll be:
{code:none}
scala> query.explain
== Physical Plan ==
*(6) SortMergeJoin [z#9L], [id#13L], Inner
:- *(3) Sort [z#9L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(z#9L, 200)
: +- *(2) Project [(x#3L + 1) AS z#9L, y#4L]
:+- *(2) Sort [x#3L ASC NULLS FIRST], true, 0
:   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
:  +- *(1) Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
: +- *(1) Range (0, 10, step=1, splits=8)
+- *(5) Sort [id#13L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#13L, 200)
  +- *(4) Range (0, 5, step=1, splits=8)
{code}
Note that the annotated prefix becomes {{"*(id) "}}

It'll also show up in the name of the generated class, as a suffix in the 
format of
{code:none}
GeneratedClass$GeneratedIterator$id
{code}

for example, note how {{GeneratedClass$GeneratedIterator$3}} and 
{{GeneratedClass$GeneratedIterator$6}} in the following stack trace corresponds 
to the IDs shown in the explain output above:
{code:none}
"Executor task launch worker for task 424@12957" daemon prio=5 tid=0x58 nid=NA 
runnable
  java.lang.Thread.State: RUNNABLE
  at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$3.sort_addToSorter$(generated.java:32)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$3.processNext(generated.java:41)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$9$$anon$1.hasNext(WholeStageCodegenExec.scala:494)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$6.findNextInnerJoinRows$(generated.java:42)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$6.processNext(generated.java:101)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$2.hasNext(WholeStageCodegenExec.scala:513)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828)
  at

[jira] [Created] (SPARK-23021) AnalysisBarrier should not cut off the explain output for Parsed Logical Plan

2018-01-09 Thread Kris Mok (JIRA)

Kris Mok created SPARK-23021:


 Summary: AnalysisBarrier should not cut off the explain output for 
Parsed Logical Plan
 Key: SPARK-23021
 URL: https://issues.apache.org/jira/browse/SPARK-23021
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kris Mok
Priority: Minor


In PR#20094 as a follow up to SPARK-20392, there were some fixes to the 
handling of {{AnalysisBarrier}}, but there seem to be more cases that need to 
be fixed.

One such case is that right now the Parsed Logical Plan in explain output would 
be cutoff by {{AnalysisBarrier}}, e.g.
{code:none}
scala> val df1 = spark.range(1).select('id as 'x, 'id + 1 as 
'y).repartition(1).select('x === 'y)
df1: org.apache.spark.sql.DataFrame = [(x = y): boolean]

scala> df1.explain(true)
== Parsed Logical Plan ==
'Project [('x = 'y) AS (x = y)#22]
+- AnalysisBarrier Repartition 1, true

== Analyzed Logical Plan ==
(x = y): boolean
Project [(x#16L = y#17L) AS (x = y)#22]
+- Repartition 1, true
   +- Project [id#13L AS x#16L, (id#13L + cast(1 as bigint)) AS y#17L]
  +- Range (0, 1, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [(x#16L = y#17L) AS (x = y)#22]
+- Repartition 1, true
   +- Project [id#13L AS x#16L, (id#13L + 1) AS y#17L]
  +- Range (0, 1, step=1, splits=Some(8))

== Physical Plan ==
*Project [(x#16L = y#17L) AS (x = y)#22]
+- Exchange RoundRobinPartitioning(1)
   +- *Project [id#13L AS x#16L, (id#13L + 1) AS y#17L]
  +- *Range (0, 1, step=1, splits=8)
{code}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-22966) Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime

2018-01-04 Thread Kris Mok (JIRA)

Kris Mok created SPARK-22966:


 Summary: Spark SQL should handle Python UDFs that return a 
datetime.date or datetime.datetime
 Key: SPARK-22966
 URL: https://issues.apache.org/jira/browse/SPARK-22966
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.1, 2.2.0
Reporter: Kris Mok


Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which 
should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} (which 
should correspond to a Spark SQL {{timestamp}} type), it gets unpickled into a 
{{java.util.Calendar}} which Spark SQL doesn't understand internally, and will 
thus give incorrect results.

e.g.
{code:python}
>>> import datetime
>>> from pyspark.sql import *
>>> py_date = udf(datetime.date)
>>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == 
>>> lit(datetime.date(2017, 10, 30))).show()
++
|(date(2017, 10, 30) = DATE '2017-10-30')|
++
|   false|
++
{code}
(changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 
'date')}} doesn't work either)

We should correctly handle Python UDFs that return objects of such types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-22966) Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime

2018-01-04 Thread Kris Mok (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kris Mok updated SPARK-22966:
-
Description: 
Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which 
should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} (which 
should correspond to a Spark SQL {{timestamp}} type), it gets unpickled into a 
{{java.util.Calendar}} which Spark SQL doesn't understand internally, and will 
thus give incorrect results.

e.g.
{code:none}
>>> import datetime
>>> from pyspark.sql import *
>>> py_date = udf(datetime.date)
>>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == 
>>> lit(datetime.date(2017, 10, 30))).show()
++
|(date(2017, 10, 30) = DATE '2017-10-30')|
++
|   false|
++
{code}
(changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 
'date')}} doesn't work either)

We should correctly handle Python UDFs that return objects of such types.

  was:
Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which 
should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} (which 
should correspond to a Spark SQL {{timestamp}} type), it gets unpickled into a 
{{java.util.Calendar}} which Spark SQL doesn't understand internally, and will 
thus give incorrect results.

e.g.
{code:python}
>>> import datetime
>>> from pyspark.sql import *
>>> py_date = udf(datetime.date)
>>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == 
>>> lit(datetime.date(2017, 10, 30))).show()
++
|(date(2017, 10, 30) = DATE '2017-10-30')|
++
|   false|
++
{code}
(changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 
'date')}} doesn't work either)

We should correctly handle Python UDFs that return objects of such types.


> Spark SQL should handle Python UDFs that return a datetime.date or 
> datetime.datetime
> 
>
> Key: SPARK-22966
> URL: https://issues.apache.org/jira/browse/SPARK-22966
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Kris Mok
>
> Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which 
> should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} 
> (which should correspond to a Spark SQL {{timestamp}} type), it gets 
> unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand 
> internally, and will thus give incorrect results.
> e.g.
> {code:none}
> >>> import datetime
> >>> from pyspark.sql import *
> >>> py_date = udf(datetime.date)
> >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == 
> >>> lit(datetime.date(2017, 10, 30))).show()
> ++
> |(date(2017, 10, 30) = DATE '2017-10-30')|
> ++
> |   false|
> ++
> {code}
> (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 
> 'date')}} doesn't work either)
> We should correctly handle Python UDFs that return objects of such types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-21927) Spark pom.xml's dependency management is broken

2017-09-05 Thread Kris Mok (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154426#comment-16154426
 ] 

Kris Mok commented on SPARK-21927:
--

[~sowen] Could you please take a look and see if recent change to POM (e.g. 
SPARK-14280) would have caused this issue? Thanks!

> Spark pom.xml's dependency management is broken
> ---
>
> Key: SPARK-21927
> URL: https://issues.apache.org/jira/browse/SPARK-21927
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
> Environment: Apache Spark current master (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d)
>Reporter: Kris Mok
>
> When building the current Spark master just now (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d), I noticed the build prints a lot 
> of warning messages such as the following. Looks like the dependency 
> management in the POMs are somehow broken recently.
> {code:none}
> .../workspace/apache-spark/master (master) $ build/sbt clean package
> Attempting to fetch sbt
> Launching sbt from build/sbt-launch-0.13.16.jar
> [info] Loading project definition from 
> .../workspace/apache-spark/master/project
> [info] Updating 
> {file:.../workspace/apache-spark/master/project/}master-build...
> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
> [info] downloading 
> https://repo1.maven.org/maven2/org/scalastyle/scalastyle-sbt-plugin_2.10_0.13/1.0.0/scalastyle-sbt-plugin-1.0.0.jar
>  ...
> [info] [SUCCESSFUL ] 
> org.scalastyle#scalastyle-sbt-plugin;1.0.0!scalastyle-sbt-plugin.jar (239ms)
> [info] downloading 
> https://repo1.maven.org/maven2/org/scalastyle/scalastyle_2.10/1.0.0/scalastyle_2.10-1.0.0.jar
>  ...
> [info] [SUCCESSFUL ] 
> org.scalastyle#scalastyle_2.10;1.0.0!scalastyle_2.10.jar (465ms)
> [info] Done updating.
> [warn] Found version conflict(s) in library dependencies; some are suspected 
> to be binary incompatible:
> [warn] 
> [warn] * org.apache.maven.wagon:wagon-provider-api:2.2 is selected over 
> 1.0-beta-6
> [warn] +- org.apache.maven:maven-compat:3.0.4(depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-file:2.2  (depends 
> on 2.2)
> [warn] +- org.spark-project:sbt-pom-reader:1.0.0-spark 
> (scalaVersion=2.10, sbtVersion=0.13) (depends on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http-shared4:2.2  (depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http:2.2  (depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http-lightweight:2.2  (depends 
> on 2.2)
> [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
> on 1.0-beta-6)
> [warn] 
> [warn] * org.codehaus.plexus:plexus-utils:3.0 is selected over {2.0.7, 
> 2.0.6, 2.1, 1.5.5}
> [warn] +- org.apache.maven.wagon:wagon-provider-api:2.2  (depends 
> on 3.0)
> [warn] +- org.apache.maven:maven-compat:3.0.4(depends 
> on 2.0.6)
> [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.3.0 (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-artifact:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-core:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.sonatype.plexus:plexus-sec-dispatcher:1.3  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-embedder:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-settings:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-settings-builder:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-model-builder:3.0.4 (depends 
> on 2.0.7)
> [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
> on 2.0.7)
> [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.2.3 (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-model:3.0.4 (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-aether-provider:3.0.4   (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-repository-metadata:3.0.4   (depends 
> on 2.0.7)
> [warn] 
> [warn] * cglib:cglib is evicted completely
> [warn] +- org.sonatype.sisu:sisu-guice:3.0.3 (depends 
> on 2.2.2)
> [warn] 
> [warn] * asm:asm is evicted completely
> [warn] +- cglib:cglib:2.2.2  (depends 
> on 3.3.1)
> [warn] 
> [warn] Run 'evicted' to see detailed eviction warnings
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-21927) Spark pom.xml's dependency management is broken

2017-09-05 Thread Kris Mok (JIRA)

Kris Mok created SPARK-21927:


 Summary: Spark pom.xml's dependency management is broken
 Key: SPARK-21927
 URL: https://issues.apache.org/jira/browse/SPARK-21927
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.2.1
 Environment: Apache Spark current master (commit 
12ab7f7e89ec9e102859ab3b710815d3058a2e8d)
Reporter: Kris Mok


When building the current Spark master just now (commit 
12ab7f7e89ec9e102859ab3b710815d3058a2e8d), I noticed the build prints a lot of 
warning messages such as the following. Looks like the dependency management in 
the POMs are somehow broken recently.

{code:none}
.../workspace/apache-spark/master (master) $ build/sbt clean package
Attempting to fetch sbt
Launching sbt from build/sbt-launch-0.13.16.jar
[info] Loading project definition from .../workspace/apache-spark/master/project
[info] Updating {file:.../workspace/apache-spark/master/project/}master-build...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] downloading 
https://repo1.maven.org/maven2/org/scalastyle/scalastyle-sbt-plugin_2.10_0.13/1.0.0/scalastyle-sbt-plugin-1.0.0.jar
 ...
[info] [SUCCESSFUL ] 
org.scalastyle#scalastyle-sbt-plugin;1.0.0!scalastyle-sbt-plugin.jar (239ms)
[info] downloading 
https://repo1.maven.org/maven2/org/scalastyle/scalastyle_2.10/1.0.0/scalastyle_2.10-1.0.0.jar
 ...
[info] [SUCCESSFUL ] 
org.scalastyle#scalastyle_2.10;1.0.0!scalastyle_2.10.jar (465ms)
[info] Done updating.
[warn] Found version conflict(s) in library dependencies; some are suspected to 
be binary incompatible:
[warn] 
[warn] * org.apache.maven.wagon:wagon-provider-api:2.2 is selected over 
1.0-beta-6
[warn] +- org.apache.maven:maven-compat:3.0.4(depends 
on 2.2)
[warn] +- org.apache.maven.wagon:wagon-file:2.2  (depends 
on 2.2)
[warn] +- org.spark-project:sbt-pom-reader:1.0.0-spark 
(scalaVersion=2.10, sbtVersion=0.13) (depends on 2.2)
[warn] +- org.apache.maven.wagon:wagon-http-shared4:2.2  (depends 
on 2.2)
[warn] +- org.apache.maven.wagon:wagon-http:2.2  (depends 
on 2.2)
[warn] +- org.apache.maven.wagon:wagon-http-lightweight:2.2  (depends 
on 2.2)
[warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
on 1.0-beta-6)
[warn] 
[warn] * org.codehaus.plexus:plexus-utils:3.0 is selected over {2.0.7, 
2.0.6, 2.1, 1.5.5}
[warn] +- org.apache.maven.wagon:wagon-provider-api:2.2  (depends 
on 3.0)
[warn] +- org.apache.maven:maven-compat:3.0.4(depends 
on 2.0.6)
[warn] +- org.sonatype.sisu:sisu-inject-plexus:2.3.0 (depends 
on 2.0.6)
[warn] +- org.apache.maven:maven-artifact:3.0.4  (depends 
on 2.0.6)
[warn] +- org.apache.maven:maven-core:3.0.4  (depends 
on 2.0.6)
[warn] +- org.sonatype.plexus:plexus-sec-dispatcher:1.3  (depends 
on 2.0.6)
[warn] +- org.apache.maven:maven-embedder:3.0.4  (depends 
on 2.0.6)
[warn] +- org.apache.maven:maven-settings:3.0.4  (depends 
on 2.0.6)
[warn] +- org.apache.maven:maven-settings-builder:3.0.4  (depends 
on 2.0.6)
[warn] +- org.apache.maven:maven-model-builder:3.0.4 (depends 
on 2.0.7)
[warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
on 2.0.7)
[warn] +- org.sonatype.sisu:sisu-inject-plexus:2.2.3 (depends 
on 2.0.7)
[warn] +- org.apache.maven:maven-model:3.0.4 (depends 
on 2.0.7)
[warn] +- org.apache.maven:maven-aether-provider:3.0.4   (depends 
on 2.0.7)
[warn] +- org.apache.maven:maven-repository-metadata:3.0.4   (depends 
on 2.0.7)
[warn] 
[warn] * cglib:cglib is evicted completely
[warn] +- org.sonatype.sisu:sisu-guice:3.0.3 (depends 
on 2.2.2)
[warn] 
[warn] * asm:asm is evicted completely
[warn] +- cglib:cglib:2.2.2  (depends 
on 3.3.1)
[warn] 
[warn] Run 'evicted' to see detailed eviction warnings
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-21041) With whole-stage codegen, SparkSession.range()'s behavior is inconsistent with SparkContext.range()

2017-06-09 Thread Kris Mok (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kris Mok updated SPARK-21041:
-
Description: 
When whole-stage codegen is enabled, in face of integer overflow, 
SparkSession.range()'s behavior is inconsistent with when codegen is turned 
off, while the latter is consistent with SparkContext.range()'s behavior.

The following Spark Shell session shows the inconsistency:
{code:java}
scala> sc.range
   def range(start: Long,end: Long,step: Long,numSlices: Int): 
org.apache.spark.rdd.RDD[Long]

scala> spark.range

 
def range(start: Long,end: Long,step: Long,numPartitions: Int): 
org.apache.spark.sql.Dataset[Long]   
def range(start: Long,end: Long,step: Long): org.apache.spark.sql.Dataset[Long] 
 
def range(start: Long,end: Long): org.apache.spark.sql.Dataset[Long]
 
def range(end: Long): org.apache.spark.sql.Dataset[Long] 

scala> sc.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 
1).collect
res1: Array[Long] = Array()

scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 
1).collect
res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 
9223372036854775806)

scala> spark.conf.set("spark.sql.codegen.wholeStage", false)

scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 
1).collect
res5: Array[Long] = Array()
{code}

  was:
When whole-stage codegen is enabled, in face of integer overflow, 
SparkSession.range()'s behavior is inconsistent with when codegen is turned 
off, while the latter is consistent with SparkContext.range()'s behavior.

The following Spark Shell session shows the inconsistency:
{code:scala}
scala> sc.range
   def range(start: Long,end: Long,step: Long,numSlices: Int): 
org.apache.spark.rdd.RDD[Long]

scala> spark.range

 
def range(start: Long,end: Long,step: Long,numPartitions: Int): 
org.apache.spark.sql.Dataset[Long]   
def range(start: Long,end: Long,step: Long): org.apache.spark.sql.Dataset[Long] 
 
def range(start: Long,end: Long): org.apache.spark.sql.Dataset[Long]
 
def range(end: Long): org.apache.spark.sql.Dataset[Long] 

scala> sc.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 
1).collect
res1: Array[Long] = Array()

scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 
1).collect
res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 
9223372036854775806)

scala> spark.conf.set("spark.sql.codegen.wholeStage", false)

scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 
1).collect
res5: Array[Long] = Array()
{code}


> With whole-stage codegen, SparkSession.range()'s behavior is inconsistent 
> with SparkContext.range()
> ---
>
> Key: SPARK-21041
> URL: https://issues.apache.org/jira/browse/SPARK-21041
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kris Mok
>
> When whole-stage codegen is enabled, in face of integer overflow, 
> SparkSession.range()'s behavior is inconsistent with when codegen is turned 
> off, while the latter is consistent with SparkContext.range()'s behavior.
> The following Spark Shell session shows the inconsistency:
> {code:java}
> scala> sc.range
>def range(start: Long,end: Long,step: Long,numSlices: Int): 
> org.apache.spark.rdd.RDD[Long]
> scala> spark.range
>   
>
> def range(start: Long,end: Long,step: Long,numPartitions: Int): 
> org.apache.spark.sql.Dataset[Long]   
> def range(start: Long,end: Long,step: Long): 
> org.apache.spark.sql.Dataset[Long]  
> def range(start: Long,end: Long): org.apache.spark.sql.Dataset[Long]  
>
> def range(end: Long): org.apache.spark.sql.Dataset[Long] 
> scala> sc.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 
> 1).collect
> res1: Array[Long] = Array()
> scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 
> 2, 1).collect
> res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 
> 9223372036854775806)
> scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
> scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 
> 2, 1).collect
> res5: Array[Long] = Array()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Created] (SPARK-21041) With whole-stage codegen, SparkSession.range()'s behavior is inconsistent with SparkContext.range()

2017-06-09 Thread Kris Mok (JIRA)

Kris Mok created SPARK-21041:


 Summary: With whole-stage codegen, SparkSession.range()'s behavior 
is inconsistent with SparkContext.range()
 Key: SPARK-21041
 URL: https://issues.apache.org/jira/browse/SPARK-21041
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kris Mok


When whole-stage codegen is enabled, in face of integer overflow, 
SparkSession.range()'s behavior is inconsistent with when codegen is turned 
off, while the latter is consistent with SparkContext.range()'s behavior.

The following Spark Shell session shows the inconsistency:
{code:scala}
scala> sc.range
   def range(start: Long,end: Long,step: Long,numSlices: Int): 
org.apache.spark.rdd.RDD[Long]

scala> spark.range

 
def range(start: Long,end: Long,step: Long,numPartitions: Int): 
org.apache.spark.sql.Dataset[Long]   
def range(start: Long,end: Long,step: Long): org.apache.spark.sql.Dataset[Long] 
 
def range(start: Long,end: Long): org.apache.spark.sql.Dataset[Long]
 
def range(end: Long): org.apache.spark.sql.Dataset[Long] 

scala> sc.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 
1).collect
res1: Array[Long] = Array()

scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 
1).collect
res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 
9223372036854775806)

scala> spark.conf.set("spark.sql.codegen.wholeStage", false)

scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 
1).collect
res5: Array[Long] = Array()
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-21038) Reduce redundant generated init code in Catalyst codegen

2017-06-09 Thread Kris Mok (JIRA)

Kris Mok created SPARK-21038:


 Summary: Reduce redundant generated init code in Catalyst codegen
 Key: SPARK-21038
 URL: https://issues.apache.org/jira/browse/SPARK-21038
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kris Mok
Priority: Minor


In Java, instance fields are guaranteed to be first initialized to their 
corresponding default values (zero values) before the constructor is invoked. 
Thus, explicitly code to initialize fields to their zero values is redundant 
and should be avoided.

It's usually harmless to have such code in hand-written constructors, but in 
the case of mechanically generating code, such code could contribute to a 
significant portion of the code size and cause issues.

This ticket is a step in reducing the likelihood of hitting the 64KB bytecode 
method size limit in the Java Class files. The proposal is to use some simple 
heuristics to filter out redundant code of initializing mutable state to their 
default (zero) values in CodegenContext.addMutableState. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-20872) ShuffleExchange.nodeName should handle null coordinator

2017-05-24 Thread Kris Mok (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kris Mok updated SPARK-20872:
-
Description: 
A ShuffleExchange's coordinator can be null sometimes, and when we need to do a 
toString() on it, it'll go to ShuffleExchange.nodeName() and throw a MatchError 
there because of inexhaustive match -- the match only handles Some and None, 
but doesn't handle null.

An example of this issue is when trying to inspect a Catalyst physical operator 
on the Executor side in an IDE:
{code:none}
child = {WholeStageCodegenExec@13881} Method threw 'scala.MatchError' 
exception. Cannot evaluate 
org.apache.spark.sql.execution.WholeStageCodegenExec.toString()
{code}
where this WholeStageCodegenExec transitively references a ShuffleExchange.
On the Executor side, this ShuffleExchange instance is deserialized from the 
data sent over from the Driver, and because the coordinator field is marked 
transient, it's not carried over to the Executor, that's why it can be null 
upon inspection.

  was:
A ShuffleExchange's coordinator can be null sometimes, and when we need to do a 
toString() on it, it'll go to ShuffleExchange.nodeName() and throw a MatchError 
there because of inexhaustive match -- the match only handles Some and None, 
but doesn't handle null.

An example of this issue is when trying to inspect a Catalyst physical operator 
in an IDE:
{code:none}
child = {WholeStageCodegenExec@13881} Method threw 'scala.MatchError' 
exception. Cannot evaluate 
org.apache.spark.sql.execution.WholeStageCodegenExec.toString()
{code}
where this WholeStageCodegenExec transitively references a ShuffleExchange.


> ShuffleExchange.nodeName should handle null coordinator
> ---
>
> Key: SPARK-20872
> URL: https://issues.apache.org/jira/browse/SPARK-20872
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kris Mok
>Priority: Trivial
>  Labels: easyfix
>
> A ShuffleExchange's coordinator can be null sometimes, and when we need to do 
> a toString() on it, it'll go to ShuffleExchange.nodeName() and throw a 
> MatchError there because of inexhaustive match -- the match only handles Some 
> and None, but doesn't handle null.
> An example of this issue is when trying to inspect a Catalyst physical 
> operator on the Executor side in an IDE:
> {code:none}
> child = {WholeStageCodegenExec@13881} Method threw 'scala.MatchError' 
> exception. Cannot evaluate 
> org.apache.spark.sql.execution.WholeStageCodegenExec.toString()
> {code}
> where this WholeStageCodegenExec transitively references a ShuffleExchange.
> On the Executor side, this ShuffleExchange instance is deserialized from the 
> data sent over from the Driver, and because the coordinator field is marked 
> transient, it's not carried over to the Executor, that's why it can be null 
> upon inspection.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-20872) ShuffleExchange.nodeName should handle null coordinator

2017-05-24 Thread Kris Mok (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023421#comment-16023421
 ] 

Kris Mok commented on SPARK-20872:
--

The said matching logic in ShuffleExchange.nodeName() is introduced from 
SPARK-9858.

> ShuffleExchange.nodeName should handle null coordinator
> ---
>
> Key: SPARK-20872
> URL: https://issues.apache.org/jira/browse/SPARK-20872
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kris Mok
>Priority: Trivial
>  Labels: easyfix
>
> A ShuffleExchange's coordinator can be null sometimes, and when we need to do 
> a toString() on it, it'll go to ShuffleExchange.nodeName() and throw a 
> MatchError there because of inexhaustive match -- the match only handles Some 
> and None, but doesn't handle null.
> An example of this issue is when trying to inspect a Catalyst physical 
> operator in an IDE:
> {code:none}
> child = {WholeStageCodegenExec@13881} Method threw 'scala.MatchError' 
> exception. Cannot evaluate 
> org.apache.spark.sql.execution.WholeStageCodegenExec.toString()
> {code}
> where this WholeStageCodegenExec transitively references a ShuffleExchange.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-20872) ShuffleExchange.nodeName should handle null coordinator

2017-05-24 Thread Kris Mok (JIRA)

Kris Mok created SPARK-20872:


 Summary: ShuffleExchange.nodeName should handle null coordinator
 Key: SPARK-20872
 URL: https://issues.apache.org/jira/browse/SPARK-20872
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kris Mok
Priority: Trivial


A ShuffleExchange's coordinator can be null sometimes, and when we need to do a 
toString() on it, it'll go to ShuffleExchange.nodeName() and throw a MatchError 
there because of inexhaustive match -- the match only handles Some and None, 
but doesn't handle null.

An example of this issue is when trying to inspect a Catalyst physical operator 
in an IDE:
{code:none}
child = {WholeStageCodegenExec@13881} Method threw 'scala.MatchError' 
exception. Cannot evaluate 
org.apache.spark.sql.execution.WholeStageCodegenExec.toString()
{code}
where this WholeStageCodegenExec transitively references a ShuffleExchange.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-20482) Resolving Casts is too strict on having time zone set

2017-04-26 Thread Kris Mok (JIRA)

Kris Mok created SPARK-20482:


 Summary: Resolving Casts is too strict on having time zone set
 Key: SPARK-20482
 URL: https://issues.apache.org/jira/browse/SPARK-20482
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kris Mok






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

62 matches

Mail list logo