[jira] [Updated] (SPARK-17617) Remainder(%) expression.eval returns incorrect result
[ https://issues.apache.org/jira/browse/SPARK-17617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Zhong updated SPARK-17617:
-------------------------------
    Description: 
h2. Problem

The Remainder(%) expression returns an incorrect result when expression.eval is used to calculate the result. expression.eval is called in cases like constant folding.

{code}
scala> -5083676433652386516D % 10
res19: Double = -6.0

// Wrong answer with eval!!!
scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show
+------------+
|(value % 10)|
+------------+
|         0.0|
+------------+

// Triggers codegen, will not do constant folding
scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % 10).show
+------------+
|(value % 10)|
+------------+
|        -6.0|
+------------+
{code}

Behavior of postgres:

{code}
seanzhong=# select -5083676433652386516.0 % 10;
 ?column?
----------
     -6.0
(1 row)
{code}

  was:
h2. Problem

The Remainder(%) expression returns an incorrect result when eval is called to do constant folding.

{code}
scala> -5083676433652386516D % 10
res19: Double = -6.0

// Wrong answer because of constant folding!!!
scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show
+------------+
|(value % 10)|
+------------+
|         0.0|
+------------+

// Triggers codegen, will not do constant folding
scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % 10).show
+------------+
|(value % 10)|
+------------+
|        -6.0|
+------------+
{code}

Behavior of postgres:

{code}
seanzhong=# select -5083676433652386516.0 % 10;
 ?column?
----------
     -6.0
(1 row)
{code}


> Remainder(%) expression.eval returns incorrect result
> ------------------------------------------------------
>
>                 Key: SPARK-17617
>                 URL: https://issues.apache.org/jira/browse/SPARK-17617
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Sean Zhong
>
> h2. Problem
> The Remainder(%) expression returns an incorrect result when using expression.eval to calculate the result. expression.eval is called in cases like constant folding.
> {code}
> scala> -5083676433652386516D % 10
> res19: Double = -6.0
>
> // Wrong answer with eval!!!
> scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show
> +------------+
> |(value % 10)|
> +------------+
> |         0.0|
> +------------+
>
> // Triggers codegen, will not do constant folding
> scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % 10).show
> +------------+
> |(value % 10)|
> +------------+
> |        -6.0|
> +------------+
> {code}
> Behavior of postgres:
> {code}
> seanzhong=# select -5083676433652386516.0 % 10;
>  ?column?
> ----------
>      -6.0
> (1 row)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
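For reference, -6.0 is the expected answer here: the literal -5083676433652386516 is larger in magnitude than 2^53, so it cannot be represented exactly as a Double and is rounded to the nearest representable value, -5083676433652386816, whose remainder modulo 10 is -6. A minimal plain-Scala sketch (not Spark code) illustrating the rounding:

{code}
// The ULP of a Double in this range is 1024, so the literal is rounded
// to the nearest multiple of 1024 before any arithmetic happens.
val x: Double = -5083676433652386516D
x.toLong      // -5083676433652386816: the value actually stored
x % 10        // -6.0: matches plain Scala, codegen, and PostgreSQL
x.toLong % 10 // -6
{code}

The 0.0 produced by the eval-based path is therefore the odd one out.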
[jira] [Created] (SPARK-17617) Remainder(%) expression.eval returns incorrect result
Sean Zhong created SPARK-17617:
-----------------------------------

             Summary: Remainder(%) expression.eval returns incorrect result
                 Key: SPARK-17617
                 URL: https://issues.apache.org/jira/browse/SPARK-17617
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Sean Zhong

h2. Problem

The Remainder(%) expression returns an incorrect result when eval is called to do constant folding.

{code}
scala> -5083676433652386516D % 10
res19: Double = -6.0

// Wrong answer because of constant folding!!!
scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show
+------------+
|(value % 10)|
+------------+
|         0.0|
+------------+

// Triggers codegen, will not do constant folding
scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % 10).show
+------------+
|(value % 10)|
+------------+
|        -6.0|
+------------+
{code}

Behavior of postgres:

{code}
seanzhong=# select -5083676433652386516.0 % 10;
 ?column?
----------
     -6.0
(1 row)
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
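As a side note on when the interpreted path is taken: Catalyst's constant-folding rule replaces foldable expressions (for example, ones whose operands are all literals) with the result of expression.eval at optimization time. The sketch below is only an illustration of that kind of query, not taken from the ticket:

{code}
import org.apache.spark.sql.functions.lit

// Both operands are foldable literals, so the optimizer can evaluate the
// Remainder expression with eval() instead of generating code for it.
spark.range(1).select(lit(-5083676433652386516D) % 10).show()
{code}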
[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17503: --- Attachment: Screen Shot 2016-09-12 at 4.34.19 PM.png Screen Shot 2016-09-12 at 4.16.15 PM.png > Memory leak in Memory store when unable to cache the whole RDD in memory > > > Key: SPARK-17503 > URL: https://issues.apache.org/jira/browse/SPARK-17503 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Sean Zhong > Attachments: Screen Shot 2016-09-12 at 4.16.15 PM.png, Screen Shot > 2016-09-12 at 4.34.19 PM.png > > > h2.Problem description: > The following query triggers out of memory error. > {code} > sc.parallelize(1 to 10, 100).map(x => new > Array[Long](1000)).cache().count() > {code} > This is not expected, we should fallback to use disk instead if there is not > enough memory for cache. > Stacktrace: > {code} > scala> sc.parallelize(1 to 10, 100).map(x => new > Array[Long](1000)).cache().count() > [Stage 0:> (0 + 5) / > 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed > 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in > memory! (computed 947.3 MB so far) > 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed > 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in > memory! (computed 1423.7 MB so far) > 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed > java.lang.OutOfMemoryError: Java heap space > Dumping heap to java_pid26528.hprof ... > Heap dump file created [6551021666 bytes in 9.876 secs] > 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) > 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, > 55360))] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. 
This timeout is controlled by spark.executor.heartbeatInterval > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:190) > at
[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17503: --- Description: h2.Problem description: The following query triggers out of memory error. {code} sc.parallelize(1 to 10, 100).map(x => new Array[Long](1000)).cache().count() {code} This is not expected, we should fallback to use disk instead if there is not enough memory for cache. Stacktrace: {code} scala> sc.parallelize(1 to 10, 100).map(x => new Array[Long](1000)).cache().count() [Stage 0:> (0 + 5) / 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in memory! (computed 631.5 MB so far) 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in memory! (computed 631.5 MB so far) 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in memory! (computed 947.3 MB so far) 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in memory! (computed 1423.7 MB so far) 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed java.lang.OutOfMemoryError: Java heap space Dumping heap to java_pid26528.hprof ... Heap dump file created [6551021666 bytes in 9.876 secs] 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 55360))] in 1 attempts org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds] at 
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) ... 14 more 16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3) java.lang.OutOfMemoryError: Java heap space at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:24) at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:23) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232) at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683) at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
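A possible mitigation while this is open (my suggestion, not from the ticket, and not verified against this particular leak) is to persist with a storage level that is allowed to spill to disk, since cache() on an RDD means MEMORY_ONLY and partitions that do not fit are never written to disk:

{code}
import org.apache.spark.storage.StorageLevel

// Sketch only: the sizes are illustrative, loosely following the repro above.
val rdd = sc.parallelize(1 to 1000, 5).map(x => new Array[Long](1000))
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()
{code}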
[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17503: --- Affects Version/s: (was: 1.6.2) (was: 2.0.0) Target Version/s: 2.1.0 (was: 2.0.0, 2.1.0) > Memory leak in Memory store when unable to cache the whole RDD in memory > > > Key: SPARK-17503 > URL: https://issues.apache.org/jira/browse/SPARK-17503 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Sean Zhong > > h2.Problem description: > The following query triggers out of memory error. > {code} > sc.parallelize(1 to 1000, 5).map(x => new Array[Long](1000)).cache().count > {code} > This is not expected, we should fallback to use disk instead if there is not > enough memory for cache. > Stacktrace: > {code} > scala> sc.parallelize(1 to 1000, 5).map(x => new > Array[Long](1000)).cache().count > [Stage 0:> (0 + 5) / > 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed > 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in > memory! (computed 947.3 MB so far) > 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed > 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in > memory! (computed 1423.7 MB so far) > 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed > java.lang.OutOfMemoryError: Java heap space > Dumping heap to java_pid26528.hprof ... > Heap dump file created [6551021666 bytes in 9.876 secs] > 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) > 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, > 55360))] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. 
This timeout is controlled by spark.executor.heartbeatInterval > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:190) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) > ... 14 more > 16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3) >
[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17503: --- Description: h2.Problem description: The following query triggers out of memory error. {code} sc.parallelize(1 to 1000, 5).map(x => new Array[Long](1000)).cache().count {code} This is not expected, we should fallback to use disk instead if there is not enough memory for cache. Stacktrace: {code} scala> sc.parallelize(1 to 1000, 5).map(x => new Array[Long](1000)).cache().count [Stage 0:> (0 + 5) / 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in memory! (computed 631.5 MB so far) 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in memory! (computed 631.5 MB so far) 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in memory! (computed 947.3 MB so far) 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in memory! (computed 1423.7 MB so far) 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed java.lang.OutOfMemoryError: Java heap space Dumping heap to java_pid26528.hprof ... Heap dump file created [6551021666 bytes in 9.876 secs] 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 55360))] in 1 attempts org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds] at 
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) ... 14 more 16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3) java.lang.OutOfMemoryError: Java heap space at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:24) at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:23) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232) at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683) at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684) at
[jira] [Commented] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15483389#comment-15483389 ] Sean Zhong commented on SPARK-17503: [~sowen] I have modified the title to mean "cache in memory" > Memory leak in Memory store when unable to cache the whole RDD > -- > > Key: SPARK-17503 > URL: https://issues.apache.org/jira/browse/SPARK-17503 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0, 2.1.0 >Reporter: Sean Zhong > > h2.Problem description: > The following query triggers out of memory error. > {code} > sc.parallelize(1 to 1000, 5).map(new Array[Long](1000)).cache().count > {code} > This is not expected, we should fallback to use disk instead if there is not > enough memory for cache. > Stacktrace: > {code} > scala> sc.parallelize(1 to 1000, 5).map(f).cache().count > [Stage 0:> (0 + 5) / > 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed > 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in > memory! (computed 947.3 MB so far) > 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed > 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in > memory! (computed 1423.7 MB so far) > 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed > java.lang.OutOfMemoryError: Java heap space > Dumping heap to java_pid26528.hprof ... > Heap dump file created [6551021666 bytes in 9.876 secs] > 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) > 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, > 55360))] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. 
This timeout is controlled by spark.executor.heartbeatInterval > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:190) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) > ... 14 more > 16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3) > java.lang.OutOfMemoryError: Java heap space > at >
[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17503: --- Summary: Memory leak in Memory store when unable to cache the whole RDD in memory (was: Memory leak in Memory store when unable to cache the whole RDD) > Memory leak in Memory store when unable to cache the whole RDD in memory > > > Key: SPARK-17503 > URL: https://issues.apache.org/jira/browse/SPARK-17503 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0, 2.1.0 >Reporter: Sean Zhong > > h2.Problem description: > The following query triggers out of memory error. > {code} > sc.parallelize(1 to 1000, 5).map(new Array[Long](1000)).cache().count > {code} > This is not expected, we should fallback to use disk instead if there is not > enough memory for cache. > Stacktrace: > {code} > scala> sc.parallelize(1 to 1000, 5).map(f).cache().count > [Stage 0:> (0 + 5) / > 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed > 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in > memory! (computed 947.3 MB so far) > 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed > 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in > memory! (computed 1423.7 MB so far) > 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed > java.lang.OutOfMemoryError: Java heap space > Dumping heap to java_pid26528.hprof ... > Heap dump file created [6551021666 bytes in 9.876 secs] > 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) > 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, > 55360))] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. 
This timeout is controlled by spark.executor.heartbeatInterval > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:190) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) > ... 14 more > 16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID
[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17503: --- Description: h2.Problem description: The following query triggers out of memory error. {code} sc.parallelize(1 to 1000, 5).map(new Array[Long](1000)).cache().count {code} This is not expected, we should fallback to use disk instead if there is not enough memory for cache. Stacktrace: {code} scala> sc.parallelize(1 to 1000, 5).map(f).cache().count [Stage 0:> (0 + 5) / 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in memory! (computed 631.5 MB so far) 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in memory! (computed 631.5 MB so far) 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in memory! (computed 947.3 MB so far) 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in memory! (computed 1423.7 MB so far) 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed java.lang.OutOfMemoryError: Java heap space Dumping heap to java_pid26528.hprof ... Heap dump file created [6551021666 bytes in 9.876 secs] 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 55360))] in 1 attempts org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) ... 14 more 16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3) java.lang.OutOfMemoryError: Java heap space at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:24) at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:23) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232) at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683) at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684) at
[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17503: --- Description: h2.Problem description: The following query triggers out of memory error. {code} sc.parallelize(1 to 1000, 5).map(new Array[Long](1000)).cache().count {code} This is not expected, we should fallback to use disk instead if there is not enough memory for cache. Stacktrace: {code} scala> sc.parallelize(1 to 1000, 5).map(f).cache().count [Stage 0:> (0 + 5) / 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in memory! (computed 631.5 MB so far) 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in memory! (computed 631.5 MB so far) 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in memory! (computed 947.3 MB so far) 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in memory! (computed 1423.7 MB so far) 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed java.lang.OutOfMemoryError: Java heap space Dumping heap to java_pid26528.hprof ... Heap dump file created [6551021666 bytes in 9.876 secs] 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 55360))] in 1 attempts org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) ... 14 more 16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3) java.lang.OutOfMemoryError: Java heap space at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:24) at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:23) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232) at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683) at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684) at
[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17503: --- Description: h2.Problem description: The following query triggers out of memory error. {code} sc.parallelize(1 to 1000, 5).map(new Array[Long](1000)).cache().count {code} This is not expected, we should fallback to use disk instead if there is not enough memory for cache. Stacktrace: {code} scala> sc.parallelize(1 to 1000, 5).map(f).cache().count [Stage 0:> (0 + 5) / 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in memory! (computed 631.5 MB so far) 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in memory! (computed 631.5 MB so far) 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in memory! (computed 947.3 MB so far) 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in memory! (computed 1423.7 MB so far) 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed java.lang.OutOfMemoryError: Java heap space Dumping heap to java_pid26528.hprof ... Heap dump file created [6551021666 bytes in 9.876 secs] 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 55360))] in 1 attempts org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) ... 14 more 16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3) java.lang.OutOfMemoryError: Java heap space at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:24) at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:23) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232) at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683) at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684) at
[jira] [Created] (SPARK-17503) Memory leak in Memory store which unable to cache whole RDD
Sean Zhong created SPARK-17503: -- Summary: Memory leak in Memory store which unable to cache whole RDD Key: SPARK-17503 URL: https://issues.apache.org/jira/browse/SPARK-17503 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0, 1.6.2, 2.1.0 Reporter: Sean Zhong h2.Problem description The following query triggers out of memory error. {code} sc.parallelize(1 to 1000, 5).map(new Array[Long](1000)).cache().count {code} This is not expected, we should fallback to use disk instead if there is not enough memory for cache. Stacktrace: {code} scala> sc.parallelize(1 to 1000, 5).map(f).cache().count [Stage 0:> (0 + 5) / 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in memory! (computed 631.5 MB so far) 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in memory! (computed 631.5 MB so far) 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in memory! (computed 947.3 MB so far) 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in memory! (computed 1423.7 MB so far) 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed java.lang.OutOfMemoryError: Java heap space Dumping heap to java_pid26528.hprof ... Heap dump file created [6551021666 bytes in 9.876 secs] 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 55360))] in 1 attempts org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. 
This timeout is controlled by spark.executor.heartbeatInterval at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) ... 14 more 16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3) java.lang.OutOfMemoryError: Java heap space at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:24) at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:23) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232) at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
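Note that the snippet in this original report is missing the lambda: map(new Array[Long](1000)) passes an array value where a per-element function is expected (the REPL transcript in the stack trace shows map(f), suggesting a predefined function was actually used). The later description edits earlier in this digest use the corrected form:

{code}
// Corrected shape of the repro, as in the later description updates
// (the element and partition counts may have been truncated in this digest).
sc.parallelize(1 to 1000, 5).map(x => new Array[Long](1000)).cache().count()
{code}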
[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17503: --- Summary: Memory leak in Memory store when unable to cache the whole RDD (was: Memory leak in Memory store which unable to cache whole RDD) > Memory leak in Memory store when unable to cache the whole RDD > -- > > Key: SPARK-17503 > URL: https://issues.apache.org/jira/browse/SPARK-17503 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0, 2.1.0 >Reporter: Sean Zhong > > h2.Problem description > The following query triggers out of memory error. > {code} > sc.parallelize(1 to 1000, 5).map(new Array[Long](1000)).cache().count > {code} > This is not expected, we should fallback to use disk instead if there is not > enough memory for cache. > Stacktrace: > {code} > scala> sc.parallelize(1 to 1000, 5).map(f).cache().count > [Stage 0:> (0 + 5) / > 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed > 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in > memory! (computed 947.3 MB so far) > 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed > 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in > memory! (computed 1423.7 MB so far) > 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed > java.lang.OutOfMemoryError: Java heap space > Dumping heap to java_pid26528.hprof ... > Heap dump file created [6551021666 bytes in 9.876 secs] > 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) > 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, > 55360))] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. 
This timeout is controlled by spark.executor.heartbeatInterval > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:190) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) > ... 14 more > 16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3) > java.lang.OutOfMemoryError:
[jira] [Commented] (SPARK-17364) Can not query hive table starting with number
[ https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472189#comment-15472189 ] Sean Zhong commented on SPARK-17364: I have a trial fix at https://github.com/apache/spark/pull/15006 Which will tokenize temp.20160826_ip_list as: {code} temp // Matches the IDENTIFIER lexer rule . // Matches single dot 20160826_ip_list // Matches the IDENTIFIER lexer rule. {code} > Can not query hive table starting with number > - > > Key: SPARK-17364 > URL: https://issues.apache.org/jira/browse/SPARK-17364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Egor Pahomov > > I can do it with spark-1.6.2 > {code} > SELECT * from temp.20160826_ip_list limit 100 > {code} > {code} > Error: org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', > 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', > 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', > 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', > 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', > 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', > 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', > 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', > 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', > 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', > 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', > 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', > 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', > 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', > 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', > 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', > 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', > 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, > DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', > 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', > 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', > 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', > 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', > IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19) > == SQL == > SELECT * from temp.20160826_ip_list limit 100 > ---^^^ > SQLState: null > ErrorCode: 0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
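Until a parser-side fix lands, a workaround (my suggestion, not from the ticket) is to backquote the table name so the whole name is lexed as a single BACKQUOTED_IDENTIFIER instead of IDENTIFIER + DECIMAL_VALUE + IDENTIFIER:

{code}
// Assumes the table exists in the Hive metastore as described in the report.
spark.sql("SELECT * FROM temp.`20160826_ip_list` LIMIT 100").show()
{code}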
[jira] [Comment Edited] (SPARK-17364) Can not query hive table starting with number
[ https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472182#comment-15472182 ] Sean Zhong edited comment on SPARK-17364 at 9/7/16 11:56 PM: - [~hvanhovell] That is because the antlr4 lexer breaks temp.20160826_ip_list to tokens like {code} // temp.20160826_ip_list is break downs to: temp // Matches the IDENTIFIER lexer rule .20160826 // Matches the DECIMAL_VALUE lexer rule. _ip_list // Matches the IDENTIFIER lexer rule. {code} was (Author: clockfly): [~hvanhovell] That is because the antlr4 lexer breaks temp.20160826_ip_list to tokens like {code} // temp.20160826_ip_list is break downs to: temp .20160826 _ip_list {code} > Can not query hive table starting with number > - > > Key: SPARK-17364 > URL: https://issues.apache.org/jira/browse/SPARK-17364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Egor Pahomov > > I can do it with spark-1.6.2 > {code} > SELECT * from temp.20160826_ip_list limit 100 > {code} > {code} > Error: org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', > 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', > 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', > 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', > 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', > 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', > 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', > 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', > 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', > 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', > 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', > 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', > 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', > 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', > 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', > 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', > 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', > 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, > DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', > 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', > 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', > 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', > 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', > IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19) > == SQL == > SELECT * from temp.20160826_ip_list limit 100 > ---^^^ > SQLState: null > ErrorCode: 0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
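For reference, a small sketch of the failure this lexing causes on the released 2.0 parser (assuming the CatalystSqlParser entry point; illustrative only):

{code}
import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}

// '.20160826' is consumed as a single DECIMAL_VALUE token, so the parser never
// sees the expected IDENTIFIER '.' IDENTIFIER sequence and rejects the name.
try {
  CatalystSqlParser.parseTableIdentifier("temp.20160826_ip_list")
} catch {
  case e: ParseException => println("rejected, as described above: " + e.getMessage)
}
{code}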
[jira] [Comment Edited] (SPARK-17364) Can not query hive table starting with number
[ https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472182#comment-15472182 ] Sean Zhong edited comment on SPARK-17364 at 9/7/16 11:55 PM: - [~hvanhovell] That is because the antlr4 lexer breaks temp.20160826_ip_list to tokens like {code} // temp.20160826_ip_list is break downs to: temp .20160826 _ip_list {code} was (Author: clockfly): [~hvanhovell] That is because the antlr4 lexer breaks temp.20160826_ip_list to token like {code} // temp.20160826_ip_list is break downs to: temp .20160826 _ip_list {code} > Can not query hive table starting with number > - > > Key: SPARK-17364 > URL: https://issues.apache.org/jira/browse/SPARK-17364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Egor Pahomov > > I can do it with spark-1.6.2 > {code} > SELECT * from temp.20160826_ip_list limit 100 > {code} > {code} > Error: org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', > 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', > 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', > 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', > 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', > 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', > 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', > 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', > 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', > 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', > 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', > 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', > 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', > 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', > 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', > 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', > 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', > 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, > DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', > 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', > 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', > 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', > 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', > IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19) > == SQL == > SELECT * from temp.20160826_ip_list limit 100 > ---^^^ > SQLState: null > ErrorCode: 0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17364) Can not query hive table starting with number
[ https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472182#comment-15472182 ] Sean Zhong commented on SPARK-17364: [~hvanhovell] That is because the antlr4 lexer breaks temp.20160826_ip_list to token like {code} // temp.20160826_ip_list is break downs to: temp .20160826 _ip_list {code} > Can not query hive table starting with number > - > > Key: SPARK-17364 > URL: https://issues.apache.org/jira/browse/SPARK-17364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Egor Pahomov > > I can do it with spark-1.6.2 > {code} > SELECT * from temp.20160826_ip_list limit 100 > {code} > {code} > Error: org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', > 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', > 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', > 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', > 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', > 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', > 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', > 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', > 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', > 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', > 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', > 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', > 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', > 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', > 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', > 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', > 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', > 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, > DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', > 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', > 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', > 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', > 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', > IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19) > == SQL == > SELECT * from temp.20160826_ip_list limit 100 > ---^^^ > SQLState: null > ErrorCode: 0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases
[ https://issues.apache.org/jira/browse/SPARK-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17426: --- Target Version/s: 2.1.0 Description: In SPARK-17356, we fix the OOM issue when Metadata is super big. There are other cases that may also trigger OOM. Current implementation of TreeNode.toJSON will recursively search and print all fields of current TreeNode, even if the field's type is of type Seq or type Map. This is not safe because: 1. the Seq or Map can be very big. Converting them to JSON make take huge memory, which may trigger out of memory error. 2. Some user space input may also be propagated to the Plan. The input can be of arbitrary type, and may also be self-referencing. Trying to print user space to JSON input is very risky. The following example triggers a StackOverflowError when calling toJSON on a plan with user defined UDF. {code} case class SelfReferenceUDF( var config: Map[String, Any] = Map.empty[String, Any]) extends Function1[String, Boolean] { config += "self" -> this def apply(key: String): Boolean = config.contains(key) } test("toJSON should not throws java.lang.StackOverflowError") { val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr)) // triggers java.lang.StackOverflowError udf.toJSON } {code} was: In SPARK-17356, we fix the OOM issue when {monospace}Metadata{monospace} is super big. There are other cases that may also trigger OOM. Current implementation of TreeNode.toJSON will recursively search and print all fields of current TreeNode, even if the field's type is of type Seq or type Map. This is not safe because: 1. the Seq or Map can be very big. Converting them to JSON make take huge memory, which may trigger out of memory error. 2. Some user space input may also be propagated to the Plan. The input can be of arbitrary type, and may also be self-referencing. Trying to print user space to JSON input is very risky. The following example triggers a StackOverflowError when calling toJSON on a plan with user defined UDF. {code} case class SelfReferenceUDF( var config: Map[String, Any] = Map.empty[String, Any]) extends Function1[String, Boolean] { config += "self" -> this def apply(key: String): Boolean = config.contains(key) } test("toJSON should not throws java.lang.StackOverflowError") { val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr)) // triggers java.lang.StackOverflowError udf.toJSON } {code} > Current TreeNode.toJSON may trigger OOM under some corner cases > --- > > Key: SPARK-17426 > URL: https://issues.apache.org/jira/browse/SPARK-17426 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > > In SPARK-17356, we fix the OOM issue when Metadata is super big. There are > other cases that may also trigger OOM. Current implementation of > TreeNode.toJSON will recursively search and print all fields of current > TreeNode, even if the field's type is of type Seq or type Map. > This is not safe because: > 1. the Seq or Map can be very big. Converting them to JSON make take huge > memory, which may trigger out of memory error. > 2. Some user space input may also be propagated to the Plan. The input can be > of arbitrary type, and may also be self-referencing. Trying to print user > space to JSON input is very risky. > The following example triggers a StackOverflowError when calling toJSON on a > plan with user defined UDF. 
> {code} > case class SelfReferenceUDF( > var config: Map[String, Any] = Map.empty[String, Any]) extends > Function1[String, Boolean] { > config += "self" -> this > def apply(key: String): Boolean = config.contains(key) > } > test("toJSON should not throws java.lang.StackOverflowError") { > val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr)) > // triggers java.lang.StackOverflowError > udf.toJSON > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
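To isolate the second point (arbitrary, possibly self-referencing user-space values), here is a hedged, Spark-independent sketch of how reflective JSON conversion of a self-referencing value can blow the stack; json4s is used purely for illustration, and the Loop class is made up:

{code}
import org.json4s.{DefaultFormats, Extraction}

// A self-referencing value, standing in for arbitrary state captured by a UDF.
case class Loop(var self: Option[Loop] = None)

object Demo {
  def main(args: Array[String]): Unit = {
    implicit val formats = DefaultFormats
    val loop = Loop()
    loop.self = Some(loop)
    // Reflective decomposition follows the cycle forever and is expected to
    // throw java.lang.StackOverflowError, mirroring udf.toJSON above.
    Extraction.decompose(loop)
  }
}
{code}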
[jira] [Updated] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases
[ https://issues.apache.org/jira/browse/SPARK-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17426: --- Description: In SPARK-17356, we fix the OOM issue when Metadata is super big. There are other cases that may also trigger OOM. Current implementation of TreeNode.toJSON will recursively search and print all fields of current TreeNode, even if the field's type is of type Seq or type Map. This is not safe because: 1. the Seq or Map can be very big. Converting them to JSON make take huge memory, which may trigger out of memory error. 2. Some user space input may also be propagated to the Plan. The user space input can be of arbitrary type, and may also be self-referencing. Trying to print user space input to JSON is very risky. The following example triggers a StackOverflowError when calling toJSON on a plan with user defined UDF. {code} case class SelfReferenceUDF( var config: Map[String, Any] = Map.empty[String, Any]) extends Function1[String, Boolean] { config += "self" -> this def apply(key: String): Boolean = config.contains(key) } test("toJSON should not throws java.lang.StackOverflowError") { val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr)) // triggers java.lang.StackOverflowError udf.toJSON } {code} was: In SPARK-17356, we fix the OOM issue when Metadata is super big. There are other cases that may also trigger OOM. Current implementation of TreeNode.toJSON will recursively search and print all fields of current TreeNode, even if the field's type is of type Seq or type Map. This is not safe because: 1. the Seq or Map can be very big. Converting them to JSON make take huge memory, which may trigger out of memory error. 2. Some user space input may also be propagated to the Plan. The input can be of arbitrary type, and may also be self-referencing. Trying to print user space to JSON input is very risky. The following example triggers a StackOverflowError when calling toJSON on a plan with user defined UDF. {code} case class SelfReferenceUDF( var config: Map[String, Any] = Map.empty[String, Any]) extends Function1[String, Boolean] { config += "self" -> this def apply(key: String): Boolean = config.contains(key) } test("toJSON should not throws java.lang.StackOverflowError") { val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr)) // triggers java.lang.StackOverflowError udf.toJSON } {code} > Current TreeNode.toJSON may trigger OOM under some corner cases > --- > > Key: SPARK-17426 > URL: https://issues.apache.org/jira/browse/SPARK-17426 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > > In SPARK-17356, we fix the OOM issue when Metadata is super big. There are > other cases that may also trigger OOM. Current implementation of > TreeNode.toJSON will recursively search and print all fields of current > TreeNode, even if the field's type is of type Seq or type Map. > This is not safe because: > 1. the Seq or Map can be very big. Converting them to JSON make take huge > memory, which may trigger out of memory error. > 2. Some user space input may also be propagated to the Plan. The user space > input can be of arbitrary type, and may also be self-referencing. Trying to > print user space input to JSON is very risky. > The following example triggers a StackOverflowError when calling toJSON on a > plan with user defined UDF. 
> {code} > case class SelfReferenceUDF( > var config: Map[String, Any] = Map.empty[String, Any]) extends > Function1[String, Boolean] { > config += "self" -> this > def apply(key: String): Boolean = config.contains(key) > } > test("toJSON should not throws java.lang.StackOverflowError") { > val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr)) > // triggers java.lang.StackOverflowError > udf.toJSON > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases
[ https://issues.apache.org/jira/browse/SPARK-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17426: --- Component/s: SQL > Current TreeNode.toJSON may trigger OOM under some corner cases > --- > > Key: SPARK-17426 > URL: https://issues.apache.org/jira/browse/SPARK-17426 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > > In SPARK-17356, we fix the OOM issue when {monospace}Metadata{monospace} is > super big. There are other cases that may also trigger OOM. Current > implementation of TreeNode.toJSON will recursively search and print all > fields of current TreeNode, even if the field's type is of type Seq or type > Map. > This is not safe because: > 1. the Seq or Map can be very big. Converting them to JSON make take huge > memory, which may trigger out of memory error. > 2. Some user space input may also be propagated to the Plan. The input can be > of arbitrary type, and may also be self-referencing. Trying to print user > space to JSON input is very risky. > The following example triggers a StackOverflowError when calling toJSON on a > plan with user defined UDF. > {code} > case class SelfReferenceUDF( > var config: Map[String, Any] = Map.empty[String, Any]) extends > Function1[String, Boolean] { > config += "self" -> this > def apply(key: String): Boolean = config.contains(key) > } > test("toJSON should not throws java.lang.StackOverflowError") { > val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr)) > // triggers java.lang.StackOverflowError > udf.toJSON > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases
Sean Zhong created SPARK-17426: -- Summary: Current TreeNode.toJSON may trigger OOM under some corner cases Key: SPARK-17426 URL: https://issues.apache.org/jira/browse/SPARK-17426 Project: Spark Issue Type: Bug Reporter: Sean Zhong In SPARK-17356, we fix the OOM issue when {monospace}Metadata{monospace} is super big. There are other cases that may also trigger OOM. Current implementation of TreeNode.toJSON will recursively search and print all fields of current TreeNode, even if the field's type is of type Seq or type Map. This is not safe because: 1. the Seq or Map can be very big. Converting them to JSON make take huge memory, which may trigger out of memory error. 2. Some user space input may also be propagated to the Plan. The input can be of arbitrary type, and may also be self-referencing. Trying to print user space to JSON input is very risky. The following example triggers a StackOverflowError when calling toJSON on a plan with user defined UDF. {code} case class SelfReferenceUDF( var config: Map[String, Any] = Map.empty[String, Any]) extends Function1[String, Boolean] { config += "self" -> this def apply(key: String): Boolean = config.contains(key) } test("toJSON should not throws java.lang.StackOverflowError") { val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr)) // triggers java.lang.StackOverflowError udf.toJSON } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17302) Cannot set non-Spark SQL session variables in hive-site.xml, spark-defaults.conf, or using --conf
[ https://issues.apache.org/jira/browse/SPARK-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15465818#comment-15465818 ] Sean Zhong commented on SPARK-17302: [~rdblue] Can you write a reproducer code sample that describes your problem, elaborating on the behavior you expect and what you currently observe? I am not sure whether {monospace}spark.conf.set("key", "value"){monospace} is what you expect or not. > Cannot set non-Spark SQL session variables in hive-site.xml, > spark-defaults.conf, or using --conf > - > > Key: SPARK-17302 > URL: https://issues.apache.org/jira/browse/SPARK-17302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ryan Blue > > When configuration changed for 2.0 to the new SparkSession structure, Spark > stopped using Hive's internal HiveConf for session state and now uses > HiveSessionState and an associated SQLConf. Now, session options like > hive.exec.compress.output and hive.exec.dynamic.partition.mode are pulled > from this SQLConf. This doesn't include session properties from hive-site.xml > (including hive.exec.compress.output), and no longer contains Spark-specific > overrides from spark-defaults.conf that used the spark.hadoop.hive... pattern. > Also, setting these variables on the command-line no longer works because > settings must start with "spark.". > Is there a recommended way to set Hive session properties? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
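For concreteness, this is the kind of call the comment refers to; a minimal sketch assuming Spark 2.0's SparkSession and runtime-config API, with the Hive properties picked purely as examples:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("session-conf-demo")
  .enableHiveSupport()
  .getOrCreate()

// Set a Hive session property through the runtime conf instead of hive-site.xml.
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

// The SQL form goes through the same session SQLConf.
spark.sql("SET hive.exec.compress.output=true")
{code}

Whether this covers the hive-site.xml, spark-defaults.conf, and --conf cases described in the issue is exactly what the question above asks the reporter to clarify.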
[jira] [Comment Edited] (SPARK-17364) Can not query hive table starting with number
[ https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15465754#comment-15465754 ] Sean Zhong edited comment on SPARK-17364 at 9/5/16 8:47 PM: [~epahomov] Spark 2.0 rewrote the Sql parser with antlr. You can use the backticked version like Yuming's example. was (Author: clockfly): [~epahomov] Spark 2.0 rewrites the Sql parser with antlr. You can use the backticked version like Yuming's example. > Can not query hive table starting with number > - > > Key: SPARK-17364 > URL: https://issues.apache.org/jira/browse/SPARK-17364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Egor Pahomov > > I can do it with spark-1.6.2 > {code} > SELECT * from temp.20160826_ip_list limit 100 > {code} > {code} > Error: org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', > 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', > 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', > 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', > 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', > 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', > 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', > 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', > 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', > 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', > 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', > 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', > 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', > 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', > 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', > 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', > 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', > 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, > DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', > 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', > 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', > 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', > 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', > IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19) > == SQL == > SELECT * from temp.20160826_ip_list limit 100 > ---^^^ > SQLState: null > ErrorCode: 0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17364) Can not query hive table starting with number
[ https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15465754#comment-15465754 ] Sean Zhong commented on SPARK-17364: [~epahomov] Spark 2.0 rewrites the Sql parser with antlr. You can use the backticked version like Yuming's example. > Can not query hive table starting with number > - > > Key: SPARK-17364 > URL: https://issues.apache.org/jira/browse/SPARK-17364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Egor Pahomov > > I can do it with spark-1.6.2 > {code} > SELECT * from temp.20160826_ip_list limit 100 > {code} > {code} > Error: org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', > 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', > 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', > 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', > 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', > 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', > 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', > 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', > 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', > 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', > 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', > 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', > 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', > 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', > 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', > 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', > 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', > 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, > DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', > 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', > 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', > 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', > 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', > IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19) > == SQL == > SELECT * from temp.20160826_ip_list limit 100 > ---^^^ > SQLState: null > ErrorCode: 0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
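A minimal sketch of the backticked form referred to above, assuming the table exists in the Hive metastore and spark is the usual SparkSession:

{code}
// The backquotes keep the digit-leading table name inside a single identifier token.
spark.sql("SELECT * FROM temp.`20160826_ip_list` LIMIT 100").show()
{code}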
[jira] [Created] (SPARK-17374) Improves the error message when DataFrameReader fails to parse some json file lines
Sean Zhong created SPARK-17374: -- Summary: Improves the error message when DataFrameReader fails to parse some json file lines Key: SPARK-17374 URL: https://issues.apache.org/jira/browse/SPARK-17374 Project: Spark Issue Type: Bug Components: SQL Reporter: Sean Zhong Demo case: We silently replace the corrupted line with null, without any error message. {code} import org.apache.spark.sql.types._ val corruptRecords = spark.sparkContext.parallelize("""{"a":{, b:3}""" :: Nil) val schema = StructType(StructField("a", StringType, true) :: Nil) val jsonDF = spark.read.schema(schema).json(corruptRecords) scala> jsonDF.show ++ | a| ++ |null| ++ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
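One way to make the bad line visible today, sketched under the assumption that the 2.0 JSON reader's default _corrupt_record column behaves as documented (the issue itself is about improving the error message, not this workaround):

{code}
import org.apache.spark.sql.types._

val corruptRecords = spark.sparkContext.parallelize("""{"a":{, b:3}""" :: Nil)

// Reserve a column for the raw corrupt line instead of silently getting null.
val schema = StructType(
  StructField("a", StringType, true) ::
  StructField("_corrupt_record", StringType, true) :: Nil)

val jsonDF = spark.read.schema(schema).json(corruptRecords)
jsonDF.show(false)  // the unparseable line appears under _corrupt_record
{code}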
[jira] [Created] (SPARK-17369) MetastoreRelation toJSON throws exception
Sean Zhong created SPARK-17369: -- Summary: MetastoreRelation toJSON throws exception Key: SPARK-17369 URL: https://issues.apache.org/jira/browse/SPARK-17369 Project: Spark Issue Type: Bug Components: SQL Reporter: Sean Zhong Priority: Minor MetastoreRelation.toJSON currently throws an exception. Test case: {code} package org.apache.spark.sql.hive import org.apache.spark.SparkFunSuite import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType} import org.apache.spark.sql.types.{IntegerType, StructField, StructType} class MetastoreRelationSuite extends SparkFunSuite { test("makeCopy and toJSON should work") { val table = CatalogTable( identifier = TableIdentifier("test", Some("db")), tableType = CatalogTableType.VIEW, storage = CatalogStorageFormat.empty, schema = StructType(StructField("a", IntegerType, true) :: Nil)) val relation = MetastoreRelation("db", "test")(table, null, null) // No exception should be thrown relation.makeCopy(Array("db", "test")) // No exception should be thrown relation.toJSON } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17356) Out of memory when calling TreeNode.toJSON
[ https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454553#comment-15454553 ] Sean Zhong commented on SPARK-17356: Reproducer: {code} # Trigger OOM scala> :paste -raw // Entering paste mode (ctrl-D to finish) package org.apache.spark.ml.attribute import org.apache.spark.ml.attribute._ import org.apache.spark.sql.catalyst.expressions.{Alias, Literal} import org.apache.spark.sql.catalyst.plans.logical.LocalRelation import org.apache.spark.sql.catalyst.dsl.plans._ object Test { def main(args: Array[String]): Unit = { val rand = new java.util.Random() val attr: Attribute = new BinaryAttribute(Some("a"), Some(rand.nextInt(10)), Some(Array("value1", "value2"))) val attributeGroup = new AttributeGroup("group", Array.fill(100)(attr)) val alias = Alias(Literal(0), "alias")(explicitMetadata = Some(attributeGroup.toMetadata())) val testRelation = LocalRelation() val query = testRelation.select((0 to 100).toSeq.map(_ => alias): _*) System.out.print(query.toJSON.length) } } // Exiting paste mode, now interpreting. scala> org.apache.spark.ml.attribute.Test.main(null) {code} > Out of memory when calling TreeNode.toJSON > -- > > Key: SPARK-17356 > URL: https://issues.apache.org/jira/browse/SPARK-17356 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > Attachments: jmap.txt, jstack.txt, queryplan.txt > > > When using MLLib, when calling toJSON on a plan with many level of > sub-queries, it may cause out of memory exception with stack trace like this > {code} > java.lang.OutOfMemoryError: GC overhead limit exceeded > at scala.collection.mutable.AbstractSeq.(Seq.scala:47) > at scala.collection.mutable.AbstractBuffer.(Buffer.scala:48) > at scala.collection.mutable.ListBuffer.(ListBuffer.scala:46) > at scala.collection.immutable.List$.newBuilder(List.scala:396) > at > scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64) > at > scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105) > at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:262) > at scala.collection.AbstractTraversable.filter(Traversable.scala:105) > at > scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274) > at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7) > at > com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128) > at > com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881) > at > 
com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338) > at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34) > at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566) > {code} > The query plan, stack trace, and jmap distribution is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17356) Out of memory when calling TreeNode.toJSON
[ https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454495#comment-15454495 ] Sean Zhong edited comment on SPARK-17356 at 9/1/16 6:38 AM: *Root cause:* 1. MLLib heavily leverage MetaData to store a lot of attribute information, in the case here, the metadata may contains tens of thousands of Attribute information. And the meta data may be stored to Alias expression like this: {code} case class Alias(child: Expression, name: String)( val exprId: ExprId = NamedExpression.newExprId, val qualifier: Option[String] = None, val explicitMetadata: Option[Metadata] = None, override val isGenerated: java.lang.Boolean = false) {code} If we serialize the meta data to JSON, it will take a huge amount of memory. was (Author: clockfly): Root cause: 1. MLLib heavily leverage MetaData to store a lot of attribute information, in the case here, the metadata may contains tens of thousands of Attribute information. And the meta data may be stored to Alias expression like this: {code} case class Alias(child: Expression, name: String)( val exprId: ExprId = NamedExpression.newExprId, val qualifier: Option[String] = None, val explicitMetadata: Option[Metadata] = None, override val isGenerated: java.lang.Boolean = false) {code} If we serialize the meta data to JSON, it will take a huge amount of memory. > Out of memory when calling TreeNode.toJSON > -- > > Key: SPARK-17356 > URL: https://issues.apache.org/jira/browse/SPARK-17356 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > Attachments: jmap.txt, jstack.txt, queryplan.txt > > > When using MLLib, when calling toJSON on a plan with many level of > sub-queries, it may cause out of memory exception with stack trace like this > {code} > java.lang.OutOfMemoryError: GC overhead limit exceeded > at scala.collection.mutable.AbstractSeq.(Seq.scala:47) > at scala.collection.mutable.AbstractBuffer.(Buffer.scala:48) > at scala.collection.mutable.ListBuffer.(ListBuffer.scala:46) > at scala.collection.immutable.List$.newBuilder(List.scala:396) > at > scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64) > at > scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105) > at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:262) > at scala.collection.AbstractTraversable.filter(Traversable.scala:105) > at > scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274) > at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7) > at > 
com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128) > at > com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881) > at > com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338) > at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34) > at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566) > {code} > The query plan, stack trace, and jmap distribution is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17356) Out of memory when calling TreeNode.toJSON
[ https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454495#comment-15454495 ] Sean Zhong commented on SPARK-17356: Root cause: 1. MLLib heavily leverage MetaData to store a lot of attribute information, in the case here, the metadata may contains tens of thousands of Attribute information. And the meta data may be stored to Alias expression like this: {code} case class Alias(child: Expression, name: String)( val exprId: ExprId = NamedExpression.newExprId, val qualifier: Option[String] = None, val explicitMetadata: Option[Metadata] = None, override val isGenerated: java.lang.Boolean = false) {code} If we serialize the meta data to JSON, it will take a huge amount of memory. > Out of memory when calling TreeNode.toJSON > -- > > Key: SPARK-17356 > URL: https://issues.apache.org/jira/browse/SPARK-17356 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > Attachments: jmap.txt, jstack.txt, queryplan.txt > > > When using MLLib, when calling toJSON on a plan with many level of > sub-queries, it may cause out of memory exception with stack trace like this > {code} > java.lang.OutOfMemoryError: GC overhead limit exceeded > at scala.collection.mutable.AbstractSeq.(Seq.scala:47) > at scala.collection.mutable.AbstractBuffer.(Buffer.scala:48) > at scala.collection.mutable.ListBuffer.(ListBuffer.scala:46) > at scala.collection.immutable.List$.newBuilder(List.scala:396) > at > scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64) > at > scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105) > at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:262) > at scala.collection.AbstractTraversable.filter(Traversable.scala:105) > at > scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274) > at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7) > at > com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128) > at > com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881) > at > com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338) > at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34) > at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566) > {code} > The query plan, stack trace, and jmap distribution is attached. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
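A rough, hypothetical illustration of that last point; the field names and count below are made up, and it only shows that the JSON form of a Metadata value grows with the number of attributes stored in it:

{code}
import org.apache.spark.sql.types.MetadataBuilder

// Build a Metadata value with many entries, standing in for the attribute
// group MLlib attaches to an Alias, and look at the size of its JSON form.
val builder = new MetadataBuilder()
(0 until 100000).foreach(i => builder.putLong(s"attr_$i", i.toLong))
val meta = builder.build()

// Megabytes of text for a single metadata value; the plan in this issue
// carries such metadata on many Alias expressions.
println(meta.json.length)
{code}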
[jira] [Commented] (SPARK-17356) Out of memory when calling TreeNode.toJSON
[ https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454488#comment-15454488 ] Sean Zhong commented on SPARK-17356: *Analysis* After looking at the mmap, there is a suspicious line {code} 20: 18388624 [Lorg.apache.spark.ml.attribute.Attribute; {code} This means a single Attribute array takes more than 8388624 bytes, and if each reference takes 8 bytes, it means there are 1 million attributes. The array probably is used in AttributeGroup, whose signature is: {code} class AttributeGroup private ( val name: String, val numAttributes: Option[Int], attrs: Option[Array[Attribute]]) {code} And, in AttributeGroup, there is a toMetaData function which will convert the Attribute array to Meta data {code} def toMetadata(): Metadata = toMetadata(Metadata.empty) {code} Finally, the metadata are saved to expression Attribute. For example, in org.apache.spark.ml.feature.Interaction transform function, the meta data is set to attribute of Alias expression, when aliasing the udf function like this: {code} override def transform(dataset: Dataset[_]): DataFrame = { ... // !NOTE!: This is an attribute group val featureAttrs = getFeatureAttrs(inputFeatures) def interactFunc = udf { row: Row => ... } val featureCols = inputFeatures.map { f => f.dataType match { case DoubleType => dataset(f.name) case _: VectorUDT => dataset(f.name) case _: NumericType | BooleanType => dataset(f.name).cast(DoubleType) } } // !NOTE!: The meta data i stored in Alias expresion by function call .as(..., featureAttrs.toMetadata()) dataset.select( col("*"), interactFunc(struct(featureCols: _*)).as($(outputCol), featureAttrs.toMetadata())) } {code} And, when calling toJSON, the metaData will be converted to JSON. > Out of memory when calling TreeNode.toJSON > -- > > Key: SPARK-17356 > URL: https://issues.apache.org/jira/browse/SPARK-17356 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > Attachments: jmap.txt, jstack.txt, queryplan.txt > > > When using MLLib, when calling toJSON on a plan with many level of > sub-queries, it may cause out of memory exception with stack trace like this > {code} > java.lang.OutOfMemoryError: GC overhead limit exceeded > at scala.collection.mutable.AbstractSeq.(Seq.scala:47) > at scala.collection.mutable.AbstractBuffer.(Buffer.scala:48) > at scala.collection.mutable.ListBuffer.(ListBuffer.scala:46) > at scala.collection.immutable.List$.newBuilder(List.scala:396) > at > scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64) > at > scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105) > at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:262) > at scala.collection.AbstractTraversable.filter(Traversable.scala:105) > at > scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274) > at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7) > at > com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128) > at > com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881) > at > com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338) > at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34) > at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566) > {code} > The query plan, stack trace, and jmap distribution is attached. -- This message was sent
[jira] [Updated] (SPARK-17356) Out of memory when calling TreeNode.toJSON
[ https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17356: --- Description: When using MLLib, when calling toJSON on a plan with many level of sub-queries, it may cause out of memory exception with stack trace like this {code} java.lang.OutOfMemoryError: GC overhead limit exceeded at scala.collection.mutable.AbstractSeq.(Seq.scala:47) at scala.collection.mutable.AbstractBuffer.(Buffer.scala:48) at scala.collection.mutable.ListBuffer.(ListBuffer.scala:46) at scala.collection.immutable.List$.newBuilder(List.scala:396) at scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64) at scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105) at scala.collection.TraversableLike$class.filter(TraversableLike.scala:262) at scala.collection.AbstractTraversable.filter(Traversable.scala:105) at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274) at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7) at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128) at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881) at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338) at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34) at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566) {code} The query plan, stack trace, and jmap distribution is attached. 
was: When using MLLib, TreeNode.toJSON may cause out of memory exception with stack trace like this {code} java.lang.OutOfMemoryError: GC overhead limit exceeded at scala.collection.mutable.AbstractSeq.(Seq.scala:47) at scala.collection.mutable.AbstractBuffer.(Buffer.scala:48) at scala.collection.mutable.ListBuffer.(ListBuffer.scala:46) at scala.collection.immutable.List$.newBuilder(List.scala:396) at scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64) at scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105) at scala.collection.TraversableLike$class.filter(TraversableLike.scala:262) at scala.collection.AbstractTraversable.filter(Traversable.scala:105) at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274) at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7) at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128) at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881) at
[jira] [Updated] (SPARK-17356) Out of memory when calling TreeNode.toJSON
[ https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17356: --- Attachment: jstack.txt > Out of memory when calling TreeNode.toJSON > -- > > Key: SPARK-17356 > URL: https://issues.apache.org/jira/browse/SPARK-17356 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > Attachments: jmap.txt, jstack.txt, queryplan.txt > > > When using MLLib, TreeNode.toJSON may cause out of memory exception with > stack trace like this > {code} > java.lang.OutOfMemoryError: GC overhead limit exceeded > at scala.collection.mutable.AbstractSeq.(Seq.scala:47) > at scala.collection.mutable.AbstractBuffer.(Buffer.scala:48) > at scala.collection.mutable.ListBuffer.(ListBuffer.scala:46) > at scala.collection.immutable.List$.newBuilder(List.scala:396) > at > scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64) > at > scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105) > at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:262) > at scala.collection.AbstractTraversable.filter(Traversable.scala:105) > at > scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274) > at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7) > at > com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128) > at > com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881) > at > com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338) > at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34) > at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566){code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17356) Out of memory when calling TreeNode.toJSON
[ https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17356: --- Attachment: jmap.txt > Out of memory when calling TreeNode.toJSON > -- > > Key: SPARK-17356 > URL: https://issues.apache.org/jira/browse/SPARK-17356 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > Attachments: jmap.txt, queryplan.txt > > > When using MLLib, TreeNode.toJSON may cause out of memory exception with > stack trace like this > {code} > java.lang.OutOfMemoryError: GC overhead limit exceeded > at scala.collection.mutable.AbstractSeq.(Seq.scala:47) > at scala.collection.mutable.AbstractBuffer.(Buffer.scala:48) > at scala.collection.mutable.ListBuffer.(ListBuffer.scala:46) > at scala.collection.immutable.List$.newBuilder(List.scala:396) > at > scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64) > at > scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105) > at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:262) > at scala.collection.AbstractTraversable.filter(Traversable.scala:105) > at > scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274) > at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7) > at > com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128) > at > com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881) > at > com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338) > at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34) > at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566){code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17356) Out of memory when calling TreeNode.toJSON
[ https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17356: --- Attachment: queryplan.txt > Out of memory when calling TreeNode.toJSON > -- > > Key: SPARK-17356 > URL: https://issues.apache.org/jira/browse/SPARK-17356 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > Attachments: queryplan.txt > > > When using MLLib, TreeNode.toJSON may cause out of memory exception with > stack trace like this > {code} > java.lang.OutOfMemoryError: GC overhead limit exceeded > at scala.collection.mutable.AbstractSeq.(Seq.scala:47) > at scala.collection.mutable.AbstractBuffer.(Buffer.scala:48) > at scala.collection.mutable.ListBuffer.(ListBuffer.scala:46) > at scala.collection.immutable.List$.newBuilder(List.scala:396) > at > scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64) > at > scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105) > at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:262) > at scala.collection.AbstractTraversable.filter(Traversable.scala:105) > at > scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274) > at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) > at > org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7) > at > com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128) > at > com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881) > at > com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338) > at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34) > at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566){code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17356) Out of memory when calling TreeNode.toJSON
Sean Zhong created SPARK-17356: -- Summary: Out of memory when calling TreeNode.toJSON Key: SPARK-17356 URL: https://issues.apache.org/jira/browse/SPARK-17356 Project: Spark Issue Type: Bug Components: SQL Reporter: Sean Zhong When using MLLib, TreeNode.toJSON may cause out of memory exception with stack trace like this {code} java.lang.OutOfMemoryError: GC overhead limit exceeded at scala.collection.mutable.AbstractSeq.(Seq.scala:47) at scala.collection.mutable.AbstractBuffer.(Buffer.scala:48) at scala.collection.mutable.ListBuffer.(ListBuffer.scala:46) at scala.collection.immutable.List$.newBuilder(List.scala:396) at scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64) at scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105) at scala.collection.TraversableLike$class.filter(TraversableLike.scala:262) at scala.collection.AbstractTraversable.filter(Traversable.scala:105) at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274) at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20) at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7) at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128) at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881) at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338) at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34) at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566){code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
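For reference, a rough way to see why serializing a whole plan tree with TreeNode.toJSON can blow up memory is to build a plan with very many columns and serialize it from a Spark shell. This is illustrative only: the wide projection below is an assumption standing in for the large MLlib-generated plan from the original report, and it shows where the cost comes from rather than reproducing the exact OOM.
{code}
// Assumes a Spark 2.x shell where `spark` is available.
// A plan with thousands of columns is used as a stand-in for the large MLlib plan.
val wide = spark.range(10).selectExpr((0 until 5000).map(i => s"id AS c$i"): _*)

// TreeNode.toJSON (the method at the bottom of the stack trace above) renders the
// whole plan into a single json4s JValue tree before writing the string, so the
// memory needed grows with the size of the plan.
val planJson = wide.queryExecution.analyzed.toJSON
println(planJson.length)
{code}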
[jira] [Updated] (SPARK-17306) Memory leak in QuantileSummaries
[ https://issues.apache.org/jira/browse/SPARK-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17306: --- Component/s: SQL > Memory leak in QuantileSummaries > > > Key: SPARK-17306 > URL: https://issues.apache.org/jira/browse/SPARK-17306 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > > compressThreshold was not referenced anywhere > {code} > class QuantileSummaries( > val compressThreshold: Int, > val relativeError: Double, > val sampled: ArrayBuffer[Stats] = ArrayBuffer.empty, > private[stat] var count: Long = 0L, > val headSampled: ArrayBuffer[Double] = ArrayBuffer.empty) extends > Serializable > {code} > And, it causes memory leak, QuantileSummaries takes unbounded memory > {code} > val summary = new QuantileSummaries(1, relativeError = 0.001) > // Results in creating an array of size 1 !!! > (1 to 1).foreach(summary.insert(_)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17306) Memory leak in QuantileSummaries
Sean Zhong created SPARK-17306: -- Summary: Memory leak in QuantileSummaries Key: SPARK-17306 URL: https://issues.apache.org/jira/browse/SPARK-17306 Project: Spark Issue Type: Bug Reporter: Sean Zhong compressThreshold was not referenced anywhere {code} class QuantileSummaries( val compressThreshold: Int, val relativeError: Double, val sampled: ArrayBuffer[Stats] = ArrayBuffer.empty, private[stat] var count: Long = 0L, val headSampled: ArrayBuffer[Double] = ArrayBuffer.empty) extends Serializable {code} And it causes a memory leak: QuantileSummaries takes unbounded memory {code} val summary = new QuantileSummaries(1, relativeError = 0.001) // Results in creating an array of size 1 !!! (1 to 1).foreach(summary.insert(_)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
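To make the leak pattern concrete, here is a small self-contained sketch that mirrors the problem described above without depending on Spark internals: a threshold field that is stored but never consulted, so the sampling buffer grows with every insert. Class and method names are made up for illustration.
{code}
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for the bug pattern (not the real QuantileSummaries):
// compressThreshold is kept as a field but never used to trigger compression.
class LeakySummaries(val compressThreshold: Int) {
  private val headSampled = ArrayBuffer.empty[Double]
  def insert(x: Double): Unit = {
    headSampled += x
    // Missing: something like `if (headSampled.size >= compressThreshold) compress()`
  }
  def bufferSize: Int = headSampled.size
}

val s = new LeakySummaries(compressThreshold = 10000)
(1 to 1000000).foreach(i => s.insert(i.toDouble))
println(s.bufferSize)  // 1000000 entries retained, regardless of the threshold
{code}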
[jira] [Updated] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978
[ https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17289: --- Description: For the following query: {code} val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", "b").createOrReplaceTempView("t2") spark.sql("select max(b) from t2 group by a").explain(true) {code} Now, the SortAggregator won't insert Sort operator before partial aggregation, this will break sort-based partial aggregation. {code} == Physical Plan == SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) +- *Sort [a#5 ASC], false, 0 +- Exchange hashpartitioning(a#5, 200) +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19]) +- LocalTableScan [a#5, b#6] {code} In Spark 2.0 branch, the plan is: {code} == Physical Plan == SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) +- *Sort [a#5 ASC], false, 0 +- Exchange hashpartitioning(a#5, 200) +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19]) +- *Sort [a#5 ASC], false, 0 +- LocalTableScan [a#5, b#6] {code} This is related to SPARK-12978 was: For the following query: {code} val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", "b").createOrReplaceTempView("t2") spark.sql("select max(b) from t2 group by a").explain(true) {code} Now, the SortAggregator won't insert Sort operator before partial aggregation, this will break sort-based partial aggregation. {code} == Physical Plan == SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) +- *Sort [a#5 ASC], false, 0 +- Exchange hashpartitioning(a#5, 200) +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19]) +- LocalTableScan [a#5, b#6] {code} In Spark 2.0 branch, the plan is: {code} == Physical Plan == SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) +- *Sort [a#5 ASC], false, 0 +- Exchange hashpartitioning(a#5, 200) +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19]) +- *Sort [a#5 ASC], false, 0 +- LocalTableScan [a#5, b#6] {code} This is related with SPARK-12978 > Sort based partial aggregation breaks due to SPARK-12978 > > > Key: SPARK-17289 > URL: https://issues.apache.org/jira/browse/SPARK-17289 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong > > For the following query: > {code} > val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", > "b").createOrReplaceTempView("t2") > spark.sql("select max(b) from t2 group by a").explain(true) > {code} > Now, the SortAggregator won't insert Sort operator before partial > aggregation, this will break sort-based partial aggregation. > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- LocalTableScan [a#5, b#6] > {code} > In Spark 2.0 branch, the plan is: > {code} > == Physical Plan == > SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) > +- *Sort [a#5 ASC], false, 0 >+- Exchange hashpartitioning(a#5, 200) > +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, > max#19]) > +- *Sort [a#5 ASC], false, 0 > +- LocalTableScan [a#5, b#6] > {code} > This is related to SPARK-12978 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978
Sean Zhong created SPARK-17289: -- Summary: Sort based partial aggregation breaks due to SPARK-12978 Key: SPARK-17289 URL: https://issues.apache.org/jira/browse/SPARK-17289 Project: Spark Issue Type: Bug Reporter: Sean Zhong For the following query: {code} val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", "b").createOrReplaceTempView("t2") spark.sql("select max(b) from t2 group by a").explain(true) {code} Now, the SortAggregator won't insert a Sort operator before the partial aggregation, which breaks sort-based partial aggregation. {code} == Physical Plan == SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) +- *Sort [a#5 ASC], false, 0 +- Exchange hashpartitioning(a#5, 200) +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19]) +- LocalTableScan [a#5, b#6] {code} In the Spark 2.0 branch, the plan is: {code} == Physical Plan == SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17]) +- *Sort [a#5 ASC], false, 0 +- Exchange hashpartitioning(a#5, 200) +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19]) +- *Sort [a#5 ASC], false, 0 +- LocalTableScan [a#5, b#6] {code} This is related to SPARK-12978 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
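A quick way to check whether a given build shows this behavior is to look for the extra Sort directly below the partial SortAggregate in the physical plan string. A rough sketch, assuming a Spark shell with `spark` available:
{code}
import spark.implicits._

(0 to 1000).map(x => (x % 2, x.toString)).toDF("a", "b").createOrReplaceTempView("t2")

val physicalPlan = spark.sql("select max(b) from t2 group by a")
  .queryExecution.executedPlan.toString
println(physicalPlan)

// A healthy plan has a Sort node between the partial SortAggregate and the
// LocalTableScan (two Sort nodes in total); the regressed plan has only one.
println("Sort nodes: " + physicalPlan.split("\n").count(_.contains("Sort [")))
{code}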
[jira] [Updated] (SPARK-17189) [MINOR] Loosens the interface from UnsafeRow to InternalRow in AggregationIterator if no UnsafeRow-specific method is used
[ https://issues.apache.org/jira/browse/SPARK-17189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17189: --- Component/s: SQL > [MINOR] Loosens the interface from UnsafeRow to InternalRow in > AggregationIterator if no UnsafeRow-specific method is used > -- > > Key: SPARK-17189 > URL: https://issues.apache.org/jira/browse/SPARK-17189 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17189) [MINOR] Loosens the interface from UnsafeRow to InternalRow in AggregationIterator if no UnsafeRow-specific method is used
Sean Zhong created SPARK-17189: -- Summary: [MINOR] Loosens the interface from UnsafeRow to InternalRow in AggregationIterator if no UnsafeRow-specific method is used Key: SPARK-17189 URL: https://issues.apache.org/jira/browse/SPARK-17189 Project: Spark Issue Type: Bug Reporter: Sean Zhong Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431136#comment-15431136 ] Sean Zhong edited comment on SPARK-16283 at 8/22/16 4:35 PM: - Created a sub-task SPARK-17188 to move QuantileSummaries to package org.apache.spark.sql.util of catalyst project was (Author: clockfly): Created a sub-task to move QuantileSummaries to package org.apache.spark.sql.util of catalyst project > Implement percentile_approx SQL function > > > Key: SPARK-16283 > URL: https://issues.apache.org/jira/browse/SPARK-16283 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431136#comment-15431136 ] Sean Zhong commented on SPARK-16283: Created a sub-task to move QuantileSummaries to package org.apache.spark.sql.util of catalyst project > Implement percentile_approx SQL function > > > Key: SPARK-16283 > URL: https://issues.apache.org/jira/browse/SPARK-16283 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17188: --- Description: QuantileSummaries is a useful utility class to do statistics. It can be used by aggregation functions like percentile_approx. Currently, QuantileSummaries is located in the sql project, in package org.apache.spark.sql.execution.stat; we should probably move it to the catalyst project, package org.apache.spark.sql.util. was:org.apache.spark.sql.execution.stat > Moves QuantileSummaries to project catalyst from sql so that it can be used > to implement percentile_approx > -- > > Key: SPARK-17188 > URL: https://issues.apache.org/jira/browse/SPARK-17188 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sean Zhong > > QuantileSummaries is a useful utility class to do statistics. It can be used > by aggregation functions like percentile_approx. > Currently, QuantileSummaries is located in the sql project, in package > org.apache.spark.sql.execution.stat; we should probably move it to the > catalyst project, package org.apache.spark.sql.util. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
Sean Zhong created SPARK-17188: -- Summary: Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx Key: SPARK-17188 URL: https://issues.apache.org/jira/browse/SPARK-17188 Project: Spark Issue Type: Sub-task Reporter: Sean Zhong -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17188: --- Description: org.apache.spark.sql.execution.stat > Moves QuantileSummaries to project catalyst from sql so that it can be used > to implement percentile_approx > -- > > Key: SPARK-17188 > URL: https://issues.apache.org/jira/browse/SPARK-17188 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sean Zhong > > org.apache.spark.sql.execution.stat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17187: --- Description: *Background* For aggregation functions like sum and count, Spark-Sql internally use an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of aggregation buffer. *Problem* Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types stored in aggregation buffer, which is not very convenient or performant, there are several typical cases like: 1. If the aggregation has a complex model CountMinSketch, it is not very easy to convert the complex model so that it can be stored with limited Spark-sql supported data types. 2. It is hard to reuse aggregation class definition defined in existing libraries like algebird. 3. It may introduces heavy serialization/deserialization cost when converting a domain model to Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge. *Proposal* We propose: 1. Introduces a TypedImperativeAggregate which allows using arbitrary java object as aggregation buffer, with requirements like: - It is flexible enough that the API allows using any java object as aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird. - We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee theperformance. 2. Refactors `TypedAggregateExpression` to use this new interface, to get higher performance. 3. Implements Appro-Percentile and other aggregation functions which has a complex aggregation object with this new interface. was: *Background* For aggregation functions like sum and count, Spark-Sql internally use an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of aggregation buffer. *Problem* Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types stored in aggregation buffer, which is not very convenient or performant, there are several typical cases like: 1. If the aggregation has a complex model CountMinSketch, it is not very easy to convert the complex model so that it can be stored with limited Spark-sql supported data types. 2. It is hard to reuse aggregation class definition defined in existing libraries like algebird. 3. It may introduces heavy serialization/deserialization cost when converting a domain model to Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge. *Proposal* We propose: 1. Introduces a TypedImperativeAggregate which allows using arbitrary java object as aggregation buffer, with requirements like: a. It is flexible enough that the API allows using any java object as aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird. b. We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee the performance. 2. Refactors `TypedAggregateExpression` to use this new interface, to get higher performance. 3. 
Implements Appro-Percentile and other aggregation functions which has a complex aggregation object with this new interface. > Support using arbitrary Java object as internal aggregation buffer object > - > > Key: SPARK-17187 > URL: https://issues.apache.org/jira/browse/SPARK-17187 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sean Zhong > > *Background* > For aggregation functions like sum and count, Spark-Sql internally use an > aggregation buffer to store the intermediate aggregation result for all > aggregation functions. Each aggregation function will occupy a section of > aggregation buffer. > *Problem* > Currently, Spark-sql only allows a small set of Spark-Sql supported storage > data types stored in aggregation buffer, which is not very convenient or > performant, there are several typical cases like: > 1. If the aggregation has a complex model CountMinSketch, it is not very easy > to convert the complex model so that it can be stored with limited Spark-sql > supported data types. > 2. It is hard to reuse aggregation class definition defined in existing > libraries like algebird. > 3. It may introduces heavy serialization/deserialization cost
[jira] [Updated] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17187: --- Description: *Background* For aggregation functions like sum and count, Spark-Sql internally use an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of aggregation buffer. *Problem* Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types stored in aggregation buffer, which is not very convenient or performant, there are several typical cases like: 1. If the aggregation has a complex model CountMinSketch, it is not very easy to convert the complex model so that it can be stored with limited Spark-sql supported data types. 2. It is hard to reuse aggregation class definition defined in existing libraries like algebird. 3. It may introduces heavy serialization/deserialization cost when converting a domain model to Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge. *Proposal* We propose: 1. Introduces a TypedImperativeAggregate which allows using arbitrary java object as aggregation buffer, with requirements like: - It is flexible enough that the API allows using any java object as aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird. - We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee theperformance. 2. Refactors `TypedAggregateExpression` to use this new interface, to get higher performance. 3. Implements Appro-Percentile and other aggregation functions which has a complex aggregation object with this new interface. was: *Background* For aggregation functions like sum and count, Spark-Sql internally use an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of aggregation buffer. *Problem* Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types stored in aggregation buffer, which is not very convenient or performant, there are several typical cases like: 1. If the aggregation has a complex model CountMinSketch, it is not very easy to convert the complex model so that it can be stored with limited Spark-sql supported data types. 2. It is hard to reuse aggregation class definition defined in existing libraries like algebird. 3. It may introduces heavy serialization/deserialization cost when converting a domain model to Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge. *Proposal* We propose: 1. Introduces a TypedImperativeAggregate which allows using arbitrary java object as aggregation buffer, with requirements like: - It is flexible enough that the API allows using any java object as aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird. - We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee theperformance. 2. Refactors `TypedAggregateExpression` to use this new interface, to get higher performance. 3. 
Implements Appro-Percentile and other aggregation functions which has a complex aggregation object with this new interface. > Support using arbitrary Java object as internal aggregation buffer object > - > > Key: SPARK-17187 > URL: https://issues.apache.org/jira/browse/SPARK-17187 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sean Zhong > > *Background* > For aggregation functions like sum and count, Spark-Sql internally use an > aggregation buffer to store the intermediate aggregation result for all > aggregation functions. Each aggregation function will occupy a section of > aggregation buffer. > *Problem* > Currently, Spark-sql only allows a small set of Spark-Sql supported storage > data types stored in aggregation buffer, which is not very convenient or > performant, there are several typical cases like: > 1. If the aggregation has a complex model CountMinSketch, it is not very easy > to convert the complex model so that it can be stored with limited Spark-sql > supported data types. > 2. It is hard to reuse aggregation class definition defined in existing > libraries like algebird. > 3. It may introduces heavy
[jira] [Created] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
Sean Zhong created SPARK-17187: -- Summary: Support using arbitrary Java object as internal aggregation buffer object Key: SPARK-17187 URL: https://issues.apache.org/jira/browse/SPARK-17187 Project: Spark Issue Type: New Feature Components: SQL Reporter: Sean Zhong *Background* For aggregation functions like sum and count, Spark-Sql internally use an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of aggregation buffer. *Problem* Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types stored in aggregation buffer, which is not very convenient or performant, there are several typical cases like: 1. If the aggregation has a complex model CountMinSketch, it is not very easy to convert the complex model so that it can be stored with limited Spark-sql supported data types. 2. It is hard to reuse aggregation class definition defined in existing libraries like algebird. 3. It may introduces heavy serialization/deserialization cost when converting a domain model to Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge. *Proposal* We propose: 1. Introduces a TypedImperativeAggregate which allows using arbitrary java object as aggregation buffer, with requirements like: a. It is flexible enough that the API allows using any java object as aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird. b. We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee the performance. 2. Refactors `TypedAggregateExpression` to use this new interface, to get higher performance. 3. Implements Appro-Percentile and other aggregation functions which has a complex aggregation object with this new interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
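As a rough illustration of the kind of interface the proposal describes, the sketch below keeps an arbitrary object as the aggregation buffer and confines serialization to explicit serialize/deserialize hooks. The trait name, method names, and the example aggregate are illustrative assumptions, not the final Spark API.
{code}
// Illustrative only: a trait in the spirit of the proposed TypedImperativeAggregate.
// BUF is an arbitrary Java/Scala object used as the aggregation buffer.
trait ObjectAggregateSketch[IN, BUF, OUT] {
  def createAggregationBuffer(): BUF          // fresh buffer per group
  def update(buffer: BUF, input: IN): BUF     // per-row update, no serialization
  def merge(buffer: BUF, other: BUF): BUF     // combine partial buffers across partitions
  def eval(buffer: BUF): OUT                  // produce the final result
  def serialize(buffer: BUF): Array[Byte]     // called only at shuffle/exchange boundaries
  def deserialize(bytes: Array[Byte]): BUF
}

// Example instance: a max aggregate whose buffer is a plain mutable object.
class MaxAgg extends ObjectAggregateSketch[Int, Array[Int], Int] {
  def createAggregationBuffer(): Array[Int] = Array(Int.MinValue)
  def update(buffer: Array[Int], input: Int): Array[Int] = { buffer(0) = math.max(buffer(0), input); buffer }
  def merge(buffer: Array[Int], other: Array[Int]): Array[Int] = { buffer(0) = math.max(buffer(0), other(0)); buffer }
  def eval(buffer: Array[Int]): Int = buffer(0)
  def serialize(buffer: Array[Int]): Array[Byte] = java.nio.ByteBuffer.allocate(4).putInt(buffer(0)).array()
  def deserialize(bytes: Array[Byte]): Array[Int] = Array(java.nio.ByteBuffer.wrap(bytes).getInt())
}
{code}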
[jira] [Created] (SPARK-17034) Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression
Sean Zhong created SPARK-17034: -- Summary: Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression Key: SPARK-17034 URL: https://issues.apache.org/jira/browse/SPARK-17034 Project: Spark Issue Type: Bug Reporter: Sean Zhong Ordinals in GROUP BY or ORDER BY, like "1" in "order by 1" or "group by 1", should be considered unresolved before analysis. But in the current code, a "Literal" expression is used to store the ordinal. This is inappropriate because "Literal" itself is a resolved expression, which gives the user the wrong impression that the ordinal has already been resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
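For context, an ordinal in GROUP BY or ORDER BY refers to a position in the select list, so it still needs to be resolved against that list during analysis; the table name below is hypothetical.
{code}
// Hypothetical example; assumes a table or temp view `t` with a column `a`.
// "GROUP BY 1" / "ORDER BY 1" mean "group/order by the first select-list item (a)",
// not "group/order by the constant 1", which is why storing the ordinal as an
// already-resolved Literal is misleading.
spark.sql("SELECT a, count(*) AS cnt FROM t GROUP BY 1 ORDER BY 1")
{code}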
[jira] [Commented] (SPARK-16666) Kryo encoder for custom complex classes
[ https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408855#comment-15408855 ] Sean Zhong commented on SPARK-1: This issue has been fixed in Spark 2.0 and trunk. Can you use latest code instead? {code} scala> import org.apache.spark.sql.{Encoder, Encoders, SQLContext} scala> import org.apache.spark.{SparkConf, SparkContext} scala> case class Point(a: Int, b: Int) scala> case class Foo(position: Point, name: String) scala> implicit val PointEncoder: Encoder[Point] = Encoders.kryo[Point] scala> implicit val FooEncoder: Encoder[Foo] = Encoders.kryo[Foo] scala> Seq(new Foo(new Point(0, 0), "bar")).toDS.show ++ | value| ++ |[01 00 24 6C 69 6...| ++ {code} > Kryo encoder for custom complex classes > --- > > Key: SPARK-1 > URL: https://issues.apache.org/jira/browse/SPARK-1 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.6.2 >Reporter: Sam > > I'm trying to create a dataset with some geo data using spark and esri. If > `Foo` only have `Point` field, it'll work but if I add some other fields > beyond a `Point`, I get ArrayIndexOutOfBoundsException. > {code:scala} > import com.esri.core.geometry.Point > import org.apache.spark.sql.{Encoder, Encoders, SQLContext} > import org.apache.spark.{SparkConf, SparkContext} > > object Main { > > case class Foo(position: Point, name: String) > > object MyEncoders { > implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point] > > implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo] > } > > def main(args: Array[String]): Unit = { > val sc = new SparkContext(new > SparkConf().setAppName("app").setMaster("local")) > val sqlContext = new SQLContext(sc) > import MyEncoders.{FooEncoder, PointEncoder} > import sqlContext.implicits._ > Seq(new Foo(new Point(0, 0), "bar")).toDS.show > } > } > {code} > {noformat} > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71) > at > org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70) > > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > > at > org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70) > > at > org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69) > > at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) > at > org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69) > > at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:263) > at org.apache.spark.sql.Dataset.show(Dataset.scala:230) > at org.apache.spark.sql.Dataset.show(Dataset.scala:193) > at org.apache.spark.sql.Dataset.show(Dataset.scala:201) > at Main$.main(Main.scala:24) > at Main.main(Main.scala) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16666) Kryo encoder for custom complex classes
[ https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong resolved SPARK-1. Resolution: Not A Problem > Kryo encoder for custom complex classes > --- > > Key: SPARK-1 > URL: https://issues.apache.org/jira/browse/SPARK-1 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.6.2 >Reporter: Sam > > I'm trying to create a dataset with some geo data using spark and esri. If > `Foo` only have `Point` field, it'll work but if I add some other fields > beyond a `Point`, I get ArrayIndexOutOfBoundsException. > {code:scala} > import com.esri.core.geometry.Point > import org.apache.spark.sql.{Encoder, Encoders, SQLContext} > import org.apache.spark.{SparkConf, SparkContext} > > object Main { > > case class Foo(position: Point, name: String) > > object MyEncoders { > implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point] > > implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo] > } > > def main(args: Array[String]): Unit = { > val sc = new SparkContext(new > SparkConf().setAppName("app").setMaster("local")) > val sqlContext = new SQLContext(sc) > import MyEncoders.{FooEncoder, PointEncoder} > import sqlContext.implicits._ > Seq(new Foo(new Point(0, 0), "bar")).toDS.show > } > } > {code} > {noformat} > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71) > at > org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70) > > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > > at > org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70) > > at > org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69) > > at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) > at > org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69) > > at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:263) > at org.apache.spark.sql.Dataset.show(Dataset.scala:230) > at org.apache.spark.sql.Dataset.show(Dataset.scala:193) > at org.apache.spark.sql.Dataset.show(Dataset.scala:201) > at Main$.main(Main.scala:24) > at Main.main(Main.scala) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16907) Parquet table reading performance regression when vectorized record reader is not used
Sean Zhong created SPARK-16907: -- Summary: Parquet table reading performance regression when vectorized record reader is not used Key: SPARK-16907 URL: https://issues.apache.org/jira/browse/SPARK-16907 Project: Spark Issue Type: Bug Components: SQL Reporter: Sean Zhong For this parquet reading benchmark, Spark 2.0 is 20%-30% slower than Spark 1.6. {code} // Test Env: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, Intel SSD SC2KW24 // Generates parquet table with nested columns spark.range(1).select(struct($"id").as("nc")).write.parquet("/tmp/data4") def time[R](block: => R): Long = { val t0 = System.nanoTime() val result = block // call-by-name val t1 = System.nanoTime() println("Elapsed time: " + (t1 - t0)/100 + "ms") (t1 - t0)/100 } val x = ((0 until 20).toList.map(x => time(spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()))).sum/20 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16906) Adds more input type information for TypedAggregateExpression
Sean Zhong created SPARK-16906: -- Summary: Adds more input type information for TypedAggregateExpression Key: SPARK-16906 URL: https://issues.apache.org/jira/browse/SPARK-16906 Project: Spark Issue Type: Bug Reporter: Sean Zhong Priority: Minor For TypedAggregateExpression {code} case class TypedAggregateExpression( aggregator: Aggregator[Any, Any, Any], inputDeserializer: Option[Expression], bufferSerializer: Seq[NamedExpression], bufferDeserializer: Expression, outputSerializer: Seq[Expression], outputExternalType: DataType, dataType: DataType, nullable: Boolean) extends DeclarativeAggregate with NonSQLExpression {code} Aggregator of TypedAggregateExpression usually contains a closure like: {code} class TypedSumDouble[IN](f: IN => Double) extends Aggregator[IN, Double, Double] {code} It will be great if we can add more info in TypedAggregateExpression to describe the closure input type {{IN}}, like class, and schema, so that we can use this info to make customized optimization like SPARK-14083 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16898) Adds argument type information for typed logical plan like MapElements, TypedFilter, and AppendColumn
[ https://issues.apache.org/jira/browse/SPARK-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-16898: --- Summary: Adds argument type information for typed logical plan like MapElements, TypedFilter, and AppendColumn (was: Adds argument type information for typed logical plan likMapElements, TypedFilter, and AppendColumn) > Adds argument type information for typed logical plan like MapElements, > TypedFilter, and AppendColumn > - > > Key: SPARK-16898 > URL: https://issues.apache.org/jira/browse/SPARK-16898 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong >Priority: Minor > > Typed logical plan like MapElements, TypedFilter, and AppendColumn contains a > closure field: {{func: (T) => Boolean}}. For example class TypedFilter's > signature is: > {code} > case class TypedFilter( > func: AnyRef, > deserializer: Expression, > child: LogicalPlan) extends UnaryNode > {code} > From the above class signature, we cannot easily find: > 1. What is the input argument's type of the closure {{func}}? How do we know > which apply method to pick if there are multiple overloaded apply methods? > 2. What is the input argument's schema? > With this info, it is easier for us to define some custom optimizer rule to > translate these typed logical plan to more efficient implementation, like the > closure optimization idea in SPARK-14083. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16898) Adds argument type information for typed logical plan likMapElements, TypedFilter, and AppendColumn
Sean Zhong created SPARK-16898: -- Summary: Adds argument type information for typed logical plan likMapElements, TypedFilter, and AppendColumn Key: SPARK-16898 URL: https://issues.apache.org/jira/browse/SPARK-16898 Project: Spark Issue Type: Bug Reporter: Sean Zhong Priority: Minor Typed logical plan like MapElements, TypedFilter, and AppendColumn contains a closure field: {{func: (T) => Boolean}}. For example class TypedFilter's signature is: {code} case class TypedFilter( func: AnyRef, deserializer: Expression, child: LogicalPlan) extends UnaryNode {code} >From the above class signature, we cannot easily find: 1. What is the input argument's type of the closure {{func}}? How do we know which apply method to pick if there are multiple overloaded apply methods? 2. What is the input argument's schema? With this info, it is easier for us to define some custom optimizer rule to translate these typed logical plan to more efficient implementation, like the closure optimization idea in SPARK-14083. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
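To see why the plan alone is not enough, note that once the closure is stored as AnyRef the element type is erased; a self-contained illustration in plain Scala, without any Spark types:
{code}
// Once a typed closure is kept as AnyRef (as TypedFilter does with `func: AnyRef`),
// the input type is erased, so nothing in the plan says which apply() to call or what
// schema the deserialized input object should have.
val f: Int => Boolean = _ > 10
val stored: AnyRef = f
// stored(5)  // does not compile: the function type is gone

// Recovering it requires the input class to be recorded somewhere alongside the closure.
val recovered = stored.asInstanceOf[Int => Boolean]
println(recovered(42))  // true
{code}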
[jira] [Created] (SPARK-16888) Implements eval method for expression AssertNotNull
Sean Zhong created SPARK-16888: -- Summary: Implements eval method for expression AssertNotNull Key: SPARK-16888 URL: https://issues.apache.org/jira/browse/SPARK-16888 Project: Spark Issue Type: Improvement Reporter: Sean Zhong Priority: Minor We should support eval() method for expression AssertNotNull Currently, it reports UnsupportedOperationException when used in projection of LocalRelation. {code} scala> import org.apache.spark.sql.catalyst.dsl.expressions._ scala> import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull scala> import org.apache.spark.sql.Column scala> case class A(a: Int) scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, Nil))).explain java.lang.UnsupportedOperationException: Only code-generated evaluation is supported. at org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull.eval(objects.scala:850) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
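A minimal sketch of the eval semantics being asked for, written as a standalone function rather than the actual Catalyst expression; the function name and message text are assumptions for illustration.
{code}
// Standalone illustration of the desired interpreted (non-codegen) behavior:
// return the child's value unchanged, or fail loudly when it is null.
def assertNotNullEval(childValue: Any, walkedTypePath: Seq[String] = Nil): Any = {
  if (childValue == null) {
    throw new NullPointerException(
      "Null value appeared in non-nullable field: " + walkedTypePath.mkString(", "))
  }
  childValue
}

assertNotNullEval("ok")      // returns "ok"
// assertNotNullEval(null)   // would throw NullPointerException
{code}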
[jira] [Closed] (SPARK-16841) Improves the row level metrics performance when reading Parquet table
[ https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong closed SPARK-16841. -- Resolution: Not A Problem > Improves the row level metrics performance when reading Parquet table > - > > Key: SPARK-16841 > URL: https://issues.apache.org/jira/browse/SPARK-16841 > Project: Spark > Issue Type: Improvement >Reporter: Sean Zhong > > When reading Parquet table, Spark adds row level metrics like recordsRead, > bytesRead > (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93). > The implementation is not very efficient. When parquet vectorized reader is > not used, it may take 20% of read time to update these metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16841) Improves the row level metrics performance when reading Parquet table
[ https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406306#comment-15406306 ] Sean Zhong edited comment on SPARK-16841 at 8/3/16 5:54 PM: This jira is created after analyzing the performance impact of https://github.com/apache/spark/pull/12352, which added the row level metrics and caused 15% performance regression. And I can verify the performance regression consistently by comparing code before and after https://github.com/apache/spark/pull/12352. But the problem is that I cannot reproduce the same performance regression consistently on Spark trunk code, the performance improvement after the fix on trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The phenomenon I observed is that when running the same benchmark code repeatedly in same spark shell for 100 times, the time it takes for each run doesn't converge, and I cannot get an exact performance number. For example, if we run the below code for 100 times, {code} spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect())) {code} I observed: 1. For the first run, it may take > 9000 ms 2. Then for the next few runs, it is much faster, around 4700ms 3. After that, the performance suddenly becomes worse. It may take around 8500 ms for each run. I guess the phenomenon has something to do with Java JIT and our codegen logic (Because of codegen, we are creating new class type for each run in spark-shell, which may impact code cache). Since I cannot verify this improvement consistently on trunk, I am going to close this jira. was (Author: clockfly): This jira is created after analyzing the performance impact of https://github.com/apache/spark/pull/12352, which added the row level metrics and caused 15% performance regression. And I can verify the performance regression consistently by comparing performance code before and after https://github.com/apache/spark/pull/12352. But the problem is that I cannot reproduce the same performance regression consistently on Spark trunk code, the performance improvement after the fix on trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The phenomenon I observed is that when running the same benchmark code repeatedly in same spark shell for 100 times, the time it takes for each run doesn't converge, and I cannot get an exact performance number. For example, if we run the below code for 100 times, {code} spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect())) {code} I observed: 1. For the first run, it may take > 9000 ms 2. Then for the next few runs, it is much faster, around 4700ms 3. After that, the performance suddenly becomes worse. It may take around 8500 ms for each run. I guess the phenomenon has something to do with Java JIT and our codegen logic (Because of codegen, we are creating new class type for each run in spark-shell, which may impact code cache). Since I cannot verify this improvement consistently on trunk, I am going to close this jira. > Improves the row level metrics performance when reading Parquet table > - > > Key: SPARK-16841 > URL: https://issues.apache.org/jira/browse/SPARK-16841 > Project: Spark > Issue Type: Improvement >Reporter: Sean Zhong > > When reading Parquet table, Spark adds row level metrics like recordsRead, > bytesRead > (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93). > The implementation is not very efficient. 
When parquet vectorized reader is > not used, it may take 20% of read time to update these metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16841) Improves the row level metrics performance when reading Parquet table
[ https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406306#comment-15406306 ] Sean Zhong commented on SPARK-16841: This PR is created after analyzing the performance impact of https://github.com/apache/spark/pull/12352, which added the row level metrics and caused 15% performance regression. And I can verify the performance regression consistently by comparing performance code before and after https://github.com/apache/spark/pull/12352. But the problem is that I cannot reproduce the same performance regression consistently on Spark trunk code, the performance improvement after the fix on trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The phenomenon I observed is that when running the same benchmark code repeatedly in same spark shell for 100 times, the time it takes for each run doesn't converge, and I cannot get an exact performance number. For example, if we run the below code for 100 times, {code} spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect())) {code} I observed: 1. For the first run, it may take > 9000 ms 2. Then for the next few runs, it is much faster, around 4700ms 3. After that, the performance suddenly becomes worse. It may take around 8500 ms for each run. I guess the phenomenon has something to do with Java JIT and our codegen logic (Because of codegen, we are creating new class type for each run in spark-shell, which may impact code cache). Since I cannot verify this improvement consistently on trunk, I am going to close this jira. > Improves the row level metrics performance when reading Parquet table > - > > Key: SPARK-16841 > URL: https://issues.apache.org/jira/browse/SPARK-16841 > Project: Spark > Issue Type: Improvement >Reporter: Sean Zhong > > When reading Parquet table, Spark adds row level metrics like recordsRead, > bytesRead > (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93). > The implementation is not very efficient. When parquet vectorized reader is > not used, it may take 20% of read time to update these metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16841) Improves the row level metrics performance when reading Parquet table
[ https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406306#comment-15406306 ] Sean Zhong edited comment on SPARK-16841 at 8/3/16 5:54 PM: This jira is created after analyzing the performance impact of https://github.com/apache/spark/pull/12352, which added the row level metrics and caused 15% performance regression. And I can verify the performance regression consistently by comparing performance code before and after https://github.com/apache/spark/pull/12352. But the problem is that I cannot reproduce the same performance regression consistently on Spark trunk code, the performance improvement after the fix on trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The phenomenon I observed is that when running the same benchmark code repeatedly in same spark shell for 100 times, the time it takes for each run doesn't converge, and I cannot get an exact performance number. For example, if we run the below code for 100 times, {code} spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect())) {code} I observed: 1. For the first run, it may take > 9000 ms 2. Then for the next few runs, it is much faster, around 4700ms 3. After that, the performance suddenly becomes worse. It may take around 8500 ms for each run. I guess the phenomenon has something to do with Java JIT and our codegen logic (Because of codegen, we are creating new class type for each run in spark-shell, which may impact code cache). Since I cannot verify this improvement consistently on trunk, I am going to close this jira. was (Author: clockfly): This PR is created after analyzing the performance impact of https://github.com/apache/spark/pull/12352, which added the row level metrics and caused 15% performance regression. And I can verify the performance regression consistently by comparing performance code before and after https://github.com/apache/spark/pull/12352. But the problem is that I cannot reproduce the same performance regression consistently on Spark trunk code, the performance improvement after the fix on trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The phenomenon I observed is that when running the same benchmark code repeatedly in same spark shell for 100 times, the time it takes for each run doesn't converge, and I cannot get an exact performance number. For example, if we run the below code for 100 times, {code} spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect())) {code} I observed: 1. For the first run, it may take > 9000 ms 2. Then for the next few runs, it is much faster, around 4700ms 3. After that, the performance suddenly becomes worse. It may take around 8500 ms for each run. I guess the phenomenon has something to do with Java JIT and our codegen logic (Because of codegen, we are creating new class type for each run in spark-shell, which may impact code cache). Since I cannot verify this improvement consistently on trunk, I am going to close this jira. > Improves the row level metrics performance when reading Parquet table > - > > Key: SPARK-16841 > URL: https://issues.apache.org/jira/browse/SPARK-16841 > Project: Spark > Issue Type: Improvement >Reporter: Sean Zhong > > When reading Parquet table, Spark adds row level metrics like recordsRead, > bytesRead > (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93). > The implementation is not very efficient. 
When parquet vectorized reader is not used, it may take 20% of read time to update these metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
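A note on the benchmark line quoted in the comment above: it has unbalanced trailing parentheses; the intended call is presumably {{spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()}}. A minimal sketch of running it repeatedly with explicit warm-up iterations, so that the JIT/codegen warm-up mentioned in the comment is excluded from the measured runs (assuming a Spark 2.x spark-shell and the same {{/tmp/data4}} data):
{code}
// Sketch only: time the query after a few warm-up runs; the warm-up runs absorb
// JIT compilation and whole-stage-codegen class loading.
def time(body: => Unit): Long = {
  val start = System.nanoTime()
  body
  (System.nanoTime() - start) / 1000000  // elapsed milliseconds
}

val query = () => spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()

(1 to 5).foreach(_ => query())                       // warm-up, not measured
val timings = (1 to 20).map(_ => time(query()))      // measured runs
println(s"min=${timings.min} ms  max=${timings.max} ms  avg=${timings.sum / timings.size} ms")
{code}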
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405057#comment-15405057 ] Sean Zhong commented on SPARK-16320: [~maver1ck] Did you use the test case in this jira {code} select count(*) where nested_column.id > some_id {code} Or the test case in SPARK-16321? {code} df = sqlctx.read.parquet(path) df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 else []).collect() {code} There is a slight difference. In the first test case, the filter condition uses nested fields, the filter push down is not supported. So, your PR 14465 should not be effective. Basically SPARK-16320 and SPARK-16321 are different problems. > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. > {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. 
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
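To see the pushdown difference described in the comment above, one can inspect the physical plan of the nested-column query; a rough sketch, assuming a Spark 2.x spark-shell and the {{/mnt/mfs/parquet_nested}} data generated by the script in the issue:
{code}
// Sketch only: the Parquet scan node printed by explain() lists PushedFilters.
// For a predicate on a nested field the list should stay empty, which is why the
// filter-pushdown PR is not expected to help this particular test case.
val df = spark.read.parquet("/mnt/mfs/parquet_nested")
df.where("column1.column5.column50 > 0").explain()
{code}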
[jira] [Created] (SPARK-16853) Analysis error for DataSet typed selection
Sean Zhong created SPARK-16853: -- Summary: Analysis error for DataSet typed selection Key: SPARK-16853 URL: https://issues.apache.org/jira/browse/SPARK-16853 Project: Spark Issue Type: Bug Reporter: Sean Zhong For DataSet typed selection {code} def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1] {code} If U1 contains sub-fields, then it reports AnalysisException Reproducer: {code} scala> case class A(a: Int, b: Int) scala> Seq((0, A(1,2))).toDS.select($"_2".as[A]) org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input columns: [_2]; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
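Until this is fixed, one possible workaround (a sketch, not from the report above) is a typed {{map}} instead of the typed {{select}}, which sidesteps resolving the sub-fields of {{A}}:
{code}
// Sketch only: extract the A values with a typed map; no column resolution on
// A's sub-fields is needed, so the AnalysisException above is avoided.
scala> case class A(a: Int, b: Int)
scala> val ds = Seq((0, A(1, 2))).toDS.map(_._2)   // Dataset[A]
scala> ds.show()
{code}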
[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403238#comment-15403238 ] Sean Zhong edited comment on SPARK-16320 at 8/2/16 2:22 AM: [~maver1ck] Can you check whether the PR works for you? was (Author: clockfly): [~loziniak] Can you check whether the PR works for you? > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. > {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403238#comment-15403238 ] Sean Zhong commented on SPARK-16320: [~loziniak] Can you check whether the PR works for you? > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. > {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16841) Improves the row level metrics performance when reading Parquet table
[ https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-16841: --- Summary: Improves the row level metrics performance when reading Parquet table (was: Improve the row level metrics performance when reading Parquet table) > Improves the row level metrics performance when reading Parquet table > - > > Key: SPARK-16841 > URL: https://issues.apache.org/jira/browse/SPARK-16841 > Project: Spark > Issue Type: Improvement >Reporter: Sean Zhong > > When reading Parquet table, Spark adds row level metrics like recordsRead, > bytesRead > (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93). > The implementation is not very efficient. When parquet vectorized reader is > not used, it may take 20% of read time to update these metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16841) Improve the row level metrics performance when reading Parquet table
Sean Zhong created SPARK-16841: -- Summary: Improve the row level metrics performance when reading Parquet table Key: SPARK-16841 URL: https://issues.apache.org/jira/browse/SPARK-16841 Project: Spark Issue Type: Improvement Reporter: Sean Zhong When reading Parquet table, Spark adds row level metrics like recordsRead, bytesRead (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93). The implementation is not very efficient. When parquet vectorized reader is not used, it may take 20% of read time to update these metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
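A common way to reduce this kind of per-row overhead (shown here only as a sketch of the general technique, not necessarily the change made for this ticket) is to batch the metric updates, so the hot read loop does a cheap local increment and only periodically touches the shared metrics object:
{code}
// Sketch only: a small helper that flushes to the real metrics sink every `period`
// records; `update` stands in for a call like incRecordsRead on the task metrics.
class BatchedRecordCounter(update: Long => Unit, period: Int = 1000) {
  private var pending = 0L
  def record(): Unit = {
    pending += 1
    if (pending >= period) { update(pending); pending = 0 }
  }
  def flush(): Unit = if (pending > 0) { update(pending); pending = 0 }
}

// Usage with a stand-in sink: 2500 records read, but the shared counter is
// updated only three times (twice from record(), once from flush()).
var recordsRead = 0L
val counter = new BatchedRecordCounter(n => recordsRead += n)
(1 to 2500).foreach(_ => counter.record())
counter.flush()
{code}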
[jira] [Resolved] (SPARK-12437) Reserved words (like table) throws error when writing a data frame to JDBC
[ https://issues.apache.org/jira/browse/SPARK-12437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong resolved SPARK-12437. Resolution: Duplicate > Reserved words (like table) throws error when writing a data frame to JDBC > -- > > Key: SPARK-12437 > URL: https://issues.apache.org/jira/browse/SPARK-12437 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > Labels: starter > > From: A Spark user > If you have a DataFrame column name that contains a SQL reserved word, it > will not write to a JDBC source. This is somewhat similar to an error found > in the redshift adapter: > https://github.com/databricks/spark-redshift/issues/80 > I have produced this on a MySQL (AWS Aurora) database > Steps to reproduce: > {code} > val connectionProperties = new java.util.Properties() > sqlContext.table("diamonds").write.jdbc(jdbcUrl, "diamonds", > connectionProperties) > {code} > Exception: > {code} > com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error > in your SQL syntax; check the manual that corresponds to your MySQL server > version for the right syntax to use near 'table DOUBLE PRECISION , price > INTEGER , x DOUBLE PRECISION , y DOUBLE PRECISION' at line 1 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at com.mysql.jdbc.Util.handleNewInstance(Util.java:411) > at com.mysql.jdbc.Util.getInstance(Util.java:386) > at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1054) > at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4237) > at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4169) > at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2617) > at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2778) > at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2825) > at > com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2156) > at > com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2459) > at > com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2376) > at > com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2360) > at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:275) > {code} > You can workaround this by renaming the column on the dataframe before > writing, but ideally we should be able to do something like encapsulate the > name in quotes which is allowed. Example: > {code} > CREATE TABLE `test_table_column` ( > `id` int(11) DEFAULT NULL, > `table` varchar(100) DEFAULT NULL > ) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12437) Reserved words (like table) throws error when writing a data frame to JDBC
[ https://issues.apache.org/jira/browse/SPARK-12437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385148#comment-15385148 ] Sean Zhong commented on SPARK-12437: This issue is fixed by SPARK-16387. The column names are quoted automatically. The jdbc table name can be quoted manually by the end user like this: spark.table("good2").write.jdbc(url, "`table`") The following example can run successfully: {code} // Here we have a special column name "table" Seq((1,2),(2,3)).toDF("table", "b").createOrReplaceTempView("good2") val connectionProperties = new java.util.Properties() val url = "jdbc:mysql://127.0.0.1:3306/test?password=xx=root=false" // Here we have a special table name "table" spark.table("good2").write.jdbc(url, "`table`", connectionProperties) {code} > Reserved words (like table) throws error when writing a data frame to JDBC > -- > > Key: SPARK-12437 > URL: https://issues.apache.org/jira/browse/SPARK-12437 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > Labels: starter > > From: A Spark user > If you have a DataFrame column name that contains a SQL reserved word, it > will not write to a JDBC source. This is somewhat similar to an error found > in the redshift adapter: > https://github.com/databricks/spark-redshift/issues/80 > I have produced this on a MySQL (AWS Aurora) database > Steps to reproduce: > {code} > val connectionProperties = new java.util.Properties() > sqlContext.table("diamonds").write.jdbc(jdbcUrl, "diamonds", > connectionProperties) > {code} > Exception: > {code} > com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error > in your SQL syntax; check the manual that corresponds to your MySQL server > version for the right syntax to use near 'table DOUBLE PRECISION , price > INTEGER , x DOUBLE PRECISION , y DOUBLE PRECISION' at line 1 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at com.mysql.jdbc.Util.handleNewInstance(Util.java:411) > at com.mysql.jdbc.Util.getInstance(Util.java:386) > at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1054) > at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4237) > at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4169) > at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2617) > at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2778) > at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2825) > at > com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2156) > at > com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2459) > at > com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2376) > at > com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2360) > at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:275) > {code} > You can workaround this by renaming the column on the dataframe before > writing, but ideally we should be able to do something like encapsulate the > name in quotes which is allowed. 
Example: > {code} > CREATE TABLE `test_table_column` ( > `id` int(11) DEFAULT NULL, > `table` varchar(100) DEFAULT NULL > ) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
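For reference, a cleaned-up version of the workaround from the comment above (the connection details are illustrative; column names are quoted automatically after SPARK-16387, while the reserved table name is quoted by hand):
{code}
// Sketch only: both the column "table" and the target table name "table" are
// SQL reserved words. The table name is back-quoted manually in the jdbc() call.
Seq((1, 2), (2, 3)).toDF("table", "b").createOrReplaceTempView("good2")

val connectionProperties = new java.util.Properties()
connectionProperties.put("user", "root")       // illustrative credentials
connectionProperties.put("password", "xx")

val url = "jdbc:mysql://127.0.0.1:3306/test"   // illustrative JDBC URL
spark.table("good2").write.jdbc(url, "`table`", connectionProperties)
{code}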
[jira] [Created] (SPARK-16323) Avoid unnecessary cast when doing integral divide
Sean Zhong created SPARK-16323: -- Summary: Avoid unnecessary cast when doing integral divide Key: SPARK-16323 URL: https://issues.apache.org/jira/browse/SPARK-16323 Project: Spark Issue Type: Improvement Components: SQL Reporter: Sean Zhong Priority: Minor This is a follow up of issue SPARK-15776 *Problem:* For Integer divide operator div: {code} scala> spark.sql("select 6 div 3").explain(true) ... == Analyzed Logical Plan == CAST((6 / 3) AS BIGINT): bigint Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / 3) AS BIGINT)#5L] +- OneRowRelation$ ... {code} For performance reason, we should not do unnecessary cast {{cast(xx as double)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
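One way to make the extra work visible is to count the {{Cast}} nodes in the analyzed plan; a rough sketch, assuming a Spark 2.x spark-shell:
{code}
// Sketch only: with the unnecessary casts, the analyzed plan of `6 div 3` casts
// both operands to double and the result back to bigint; after the improvement
// the number of Cast nodes should drop.
import org.apache.spark.sql.catalyst.expressions.Cast

val analyzed = spark.sql("select 6 div 3").queryExecution.analyzed
val casts = analyzed.expressions.flatMap(_.collect { case c: Cast => c })
println(s"Cast nodes in the analyzed plan: ${casts.size}")
{code}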
[jira] [Created] (SPARK-16322) Avoids unnecessary cast when doing integral divide
Sean Zhong created SPARK-16322: -- Summary: Avoids unnecessary cast when doing integral divide Key: SPARK-16322 URL: https://issues.apache.org/jira/browse/SPARK-16322 Project: Spark Issue Type: Improvement Components: SQL Reporter: Sean Zhong Priority: Minor This is a follow up of https://databricks.atlassian.net/browse/SC-3588 *Problem:* For Integer divide operator {{div}}: {code} scala> spark.sql("select 6 div 3").explain(true) ... == Analyzed Logical Plan == CAST((6 / 3) AS BIGINT): bigint Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / 3) AS BIGINT)#5L] +- OneRowRelation$ ... {code} For performance reason, we should not do unnecessary {{cast(xx as double)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353969#comment-15353969 ] Sean Zhong edited comment on SPARK-14083 at 6/29/16 12:17 AM: -- For typed operation like map, it will first de-serialize InternalRow to type T, apply the operation, and then serialize T back to InternalRow. For un-typed operation like select("column"), it directly operates on InternalRow. If end user defines a custom serializer like Kryo, then it is not possible to map typed operation to un-typed operation. {code} scala> case class C(c: Int) scala> val ds: Dataset[C] = Seq(C(1)).toDS scala> ds.select("c") // <- Return correct result when using default encoder. res1: org.apache.spark.sql.DataFrame = [c: int] scala> implicit val encoder: Encoder[C] = Encoders.kryo[C] // <- Define a Kryo encoder scala> val ds2: Dataset[C] = Seq(C(1)).toDS ds2: org.apache.spark.sql.Dataset[C] = [value: binary] // <- Row is encoded as binary by using Kryo encoder scala> ds2.select("c") // <- Fails even if "c" is an existing field in class C! org.apache.spark.sql.AnalysisException: cannot resolve '`c`' given input columns: [value]; ... {code} was (Author: clockfly): For typed operation like map, it will first de-serialize InternalRow to type T, apply the operation, and then serialize T back to InternalRow. For un-typed operation like select("column"), it directly operates on InternalRow. If end user defines a custom serializer like Kryo, then it is not possible to map typed operation to un-typed operation. ``` scala> case class C(c: Int) scala> val ds: Dataset[C] = Seq(C(1)).toDS scala> ds.select("c") // <- Return correct result when using default encoder. res1: org.apache.spark.sql.DataFrame = [c: int] scala> implicit val encoder: Encoder[C] = Encoders.kryo[C] // <- Define a Kryo encoder scala> val ds2: Dataset[C] = Seq(C(1)).toDS ds2: org.apache.spark.sql.Dataset[C] = [value: binary] // <- Row is encoded as binary by using Kryo encoder scala> ds2.select("c") // <- Fails even if "c" is an existing field in class C! org.apache.spark.sql.AnalysisException: cannot resolve '`c`' given input columns: [value]; ... ``` > Analyze JVM bytecode and turn closures into Catalyst expressions > > > Key: SPARK-14083 > URL: https://issues.apache.org/jira/browse/SPARK-14083 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > One big advantage of the Dataset API is the type safety, at the cost of > performance due to heavy reliance on user-defined closures/lambdas. These > closures are typically slower than expressions because we have more > flexibility to optimize expressions (known data types, no virtual function > calls, etc). In many cases, it's actually not going to be very difficult to > look into the byte code of these closures and figure out what they are trying > to do. If we can understand them, then we can turn them directly into > Catalyst expressions for more optimized executions. > Some examples are: > {code} > df.map(_.name) // equivalent to expression col("name") > ds.groupBy(_.gender) // equivalent to expression col("gender") > df.filter(_.age > 18) // equivalent to expression GreaterThan(col("age"), > lit(18) > df.map(_.id + 1) // equivalent to Add(col("age"), lit(1)) > {code} > The goal of this ticket is to design a small framework for byte code analysis > and use that to convert closures/lambdas into Catalyst expressions in order > to speed up Dataset execution. It is a little bit futuristic, but I believe > it is very doable. 
The framework should be easy to reason about (e.g. similar > to Catalyst). > Note that a big emphasis on "small" and "easy to reason about". A patch > should be rejected if it is too complicated or difficult to reason about. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
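A practical consequence of the comment above: with a Kryo encoder only the typed operations remain usable on such a Dataset. Continuing the snippet, a short sketch:
{code}
// Sketch only: the typed map deserializes the whole C object before applying the
// function, so field access works; the untyped select cannot resolve `c` because
// the Kryo-encoded row exposes only a single binary `value` column.
scala> ds2.map(_.c).show()   // works
scala> ds2.select("c")       // fails with the AnalysisException shown above
{code}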
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353969#comment-15353969 ] Sean Zhong commented on SPARK-14083: For typed operation like map, it will first de-serialize InternalRow to type T, apply the operation, and then serialize T back to InternalRow. For un-typed operation like select("column"), it directly operates on InternalRow. If end user defines a custom serializer like Kryo, then it is not possible to map typed operation to un-typed operation. ``` scala> case class C(c: Int) scala> val ds: Dataset[C] = Seq(C(1)).toDS scala> ds.select("c") // <- Return correct result when using default encoder. res1: org.apache.spark.sql.DataFrame = [c: int] scala> implicit val encoder: Encoder[C] = Encoders.kryo[C] // <- Define a Kryo encoder scala> val ds2: Dataset[C] = Seq(C(1)).toDS ds2: org.apache.spark.sql.Dataset[C] = [value: binary] // <- Row is encoded as binary by using Kryo encoder scala> ds2.select("c") // <- Fails even if "c" is an existing field in class C! org.apache.spark.sql.AnalysisException: cannot resolve '`c`' given input columns: [value]; ... ``` > Analyze JVM bytecode and turn closures into Catalyst expressions > > > Key: SPARK-14083 > URL: https://issues.apache.org/jira/browse/SPARK-14083 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > One big advantage of the Dataset API is the type safety, at the cost of > performance due to heavy reliance on user-defined closures/lambdas. These > closures are typically slower than expressions because we have more > flexibility to optimize expressions (known data types, no virtual function > calls, etc). In many cases, it's actually not going to be very difficult to > look into the byte code of these closures and figure out what they are trying > to do. If we can understand them, then we can turn them directly into > Catalyst expressions for more optimized executions. > Some examples are: > {code} > df.map(_.name) // equivalent to expression col("name") > ds.groupBy(_.gender) // equivalent to expression col("gender") > df.filter(_.age > 18) // equivalent to expression GreaterThan(col("age"), > lit(18) > df.map(_.id + 1) // equivalent to Add(col("age"), lit(1)) > {code} > The goal of this ticket is to design a small framework for byte code analysis > and use that to convert closures/lambdas into Catalyst expressions in order > to speed up Dataset execution. It is a little bit futuristic, but I believe > it is very doable. The framework should be easy to reason about (e.g. similar > to Catalyst). > Note that a big emphasis on "small" and "easy to reason about". A patch > should be rejected if it is too complicated or difficult to reason about. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16034) Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-16034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-16034: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-16032 > Checks the partition columns when calling > dataFrame.write.mode("append").saveAsTable > > > Key: SPARK-16034 > URL: https://issues.apache.org/jira/browse/SPARK-16034 > Project: Spark > Issue Type: Sub-task >Reporter: Sean Zhong > > Suppose we have defined a partitioned table: > {code} > CREATE TABLE src (a INT, b INT, c INT) > USING PARQUET > PARTITIONED BY (a, b); > {code} > We should check the partition columns when appending DataFrame data to > existing table: > {code} > val df = Seq((1, 2, 3)).toDF("a", "b", "c") > df.write.partitionBy("b", "a").mode("append").saveAsTable("src") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16034) Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable
Sean Zhong created SPARK-16034: -- Summary: Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable Key: SPARK-16034 URL: https://issues.apache.org/jira/browse/SPARK-16034 Project: Spark Issue Type: Bug Reporter: Sean Zhong Suppose we have defined a partitioned table: {code} CREATE TABLE src (a INT, b INT, c INT) USING PARQUET PARTITIONED BY (a, b); {code} We should check the partition columns when appending DataFrame data to existing table: {code} val df = Seq((1, 2, 3)).toDF("a", "b", "c") df.write.partitionBy("b", "a").mode("append").saveAsTable("src") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
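For contrast, a minimal sketch of the append that should be accepted, where the DataFrame's partition columns match the table definition ({{a}}, {{b}}) in both name and order:
{code}
// Sketch only: same data as above, but partitioned consistently with the table.
val df = Seq((1, 2, 3)).toDF("a", "b", "c")
df.write.partitionBy("a", "b").mode("append").saveAsTable("src")
{code}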
[jira] [Commented] (SPARK-15340) Limit the size of the map used to cache JobConfs to void OOM
[ https://issues.apache.org/jira/browse/SPARK-15340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336511#comment-15336511 ] Sean Zhong commented on SPARK-15340: [~DoingDone9] I did some tests, and didn't see the OOM you observed. Can you elaborate more on how the OOM happens? 1. What is the error message and stack trace when the OOM happens? 2. Are you running Apache Spark in single machine mode or cluster mode? 3. What is the driver configuration settings? 4. Is it a Permgen (Metaspace) OOM or heap space OOM? 5. Which client are you using to connect to the thrift JDBC server? 6. Do you have a reproducible script so that we can try to reproduce it on our environment? 7. Is your Spark deployment running on YARN? > Limit the size of the map used to cache JobConfs to void OOM > > > Key: SPARK-15340 > URL: https://issues.apache.org/jira/browse/SPARK-15340 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Zhongshuai Pei >Priority: Critical > > when i run tpcds (orc) by using JDBCServer, driver always OOM. > i find tens of thousands of Jobconf from dump file and these JobConf can not > be recycled, So we should limit the size of the map used to cache JobConfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
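On the "limit the size of the map" idea itself, such fixes usually take the form of a size-bounded, access-ordered map that evicts the eldest entry; a minimal sketch of the technique (illustrative only, not the code used in Spark):
{code}
// Sketch only: an access-ordered LinkedHashMap that drops the eldest entry once
// the cache exceeds maxEntries, so cached JobConf-like objects cannot accumulate
// without bound on the driver.
import java.util.{LinkedHashMap, Map => JMap}

class BoundedCache[K, V](maxEntries: Int)
  extends LinkedHashMap[K, V](16, 0.75f, /* accessOrder = */ true) {
  override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean = size() > maxEntries
}

val cache = new BoundedCache[String, String](maxEntries = 2)
cache.put("conf1", "a"); cache.put("conf2", "b"); cache.put("conf3", "c")
println(cache.keySet())   // only the two most recently used keys remain
{code}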
[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters
[ https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335174#comment-15335174 ] Sean Zhong commented on SPARK-14048: [~simeons] You can use {{sqlContext.sql("query")}} instead of {{%sql}} to get around this issue as a workaround. > Aggregation operations on structs fail when the structs have fields with > special characters > --- > > Key: SPARK-14048 > URL: https://issues.apache.org/jira/browse/SPARK-14048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Databricks w/ 1.6.0 >Reporter: Simeon Simeonov > Labels: sql > Attachments: bug_structs_with_backticks.html > > > Consider a schema where a struct has field names with special characters, > e.g., > {code} > |-- st: struct (nullable = true) > ||-- x.y: long (nullable = true) > {code} > Schema such as these are frequently generated by the JSON schema generator, > which seems to never want to map JSON data to {{MapType}} always preferring > to use {{StructType}}. > In SparkSQL, referring to these fields requires backticks, e.g., > {{st.`x.y`}}. There is no problem manipulating these structs unless one is > using an aggregation function. It seems that, under the covers, the code is > not escaping fields with special characters correctly. > For example, > {code} > select first(st) as st from tbl group by something > {code} > generates > {code} > org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: > struct. If you have a struct and a field name of it has any > special characters, please use backticks (`) to quote that field name, e.g. > `x+y`. Please note that backtick itself is not supported in a field name. > at > org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394) > at > com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42) > at > com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306) > at > com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at 
scala.util.Try$.apply(Try.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464) > at > com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365) > at > com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
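Spelled out, the workaround suggested in the comment above looks like the following (a sketch; {{tbl}} and {{something}} are the placeholder names from the issue description):
{code}
// Sketch only: run the aggregation through sqlContext.sql(...) rather than the
// %sql cell magic; the stack trace above fails inside the notebook's output
// handling (OutputAggregator), which this path does not go through.
val result = sqlContext.sql("select first(st) as st from tbl group by something")
result.show()
{code}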
[jira] [Commented] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow
[ https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334743#comment-15334743 ] Sean Zhong commented on SPARK-15786: [~yhuai] Sure, we definitely can improve it. > joinWith bytecode generation calling ByteBuffer.wrap with InternalRow > - > > Key: SPARK-15786 > URL: https://issues.apache.org/jira/browse/SPARK-15786 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Richard Marscher >Assignee: Sean Zhong > Fix For: 2.0.0 > > > {code}java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 36, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates > are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", > "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, > int)"{code} > I have been trying to use joinWith along with Option data types to get an > approximation of the RDD semantics for outer joins with Dataset to have a > nicer API for Scala. However, using the Dataset.as[] syntax leads to bytecode > generation trying to pass an InternalRow object into the ByteBuffer.wrap > function which expects byte[] with or without a couple int qualifiers. > I have a notebook reproducing this against 2.0 preview in Databricks > Community Edition: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/160347920874755/1039589581260901/673639177603143/latest.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters
[ https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334657#comment-15334657 ] Sean Zhong commented on SPARK-14048: [~simeons] I can now reproduce this on Databricks community edition by changing above notebook script to: {code} val rdd = sc.makeRDD( """{"st": {"x.y": 1}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 20}""" :: Nil) sqlContext.read.json(rdd).registerTempTable("test") %sql select first(st) as st from test group by age {code} Thanks! I will post the updates later. > Aggregation operations on structs fail when the structs have fields with > special characters > --- > > Key: SPARK-14048 > URL: https://issues.apache.org/jira/browse/SPARK-14048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Databricks w/ 1.6.0 >Reporter: Simeon Simeonov > Labels: sql > Attachments: bug_structs_with_backticks.html > > > Consider a schema where a struct has field names with special characters, > e.g., > {code} > |-- st: struct (nullable = true) > ||-- x.y: long (nullable = true) > {code} > Schema such as these are frequently generated by the JSON schema generator, > which seems to never want to map JSON data to {{MapType}} always preferring > to use {{StructType}}. > In SparkSQL, referring to these fields requires backticks, e.g., > {{st.`x.y`}}. There is no problem manipulating these structs unless one is > using an aggregation function. It seems that, under the covers, the code is > not escaping fields with special characters correctly. > For example, > {code} > select first(st) as st from tbl group by something > {code} > generates > {code} > org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: > struct. If you have a struct and a field name of it has any > special characters, please use backticks (`) to quote that field name, e.g. > `x+y`. Please note that backtick itself is not supported in a field name. 
> at > org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394) > at > com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42) > at > com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306) > at > com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at scala.util.Try$.apply(Try.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464) > at > com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365) > at > com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters
[ https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334644#comment-15334644 ] Sean Zhong commented on SPARK-14048: [~simeons] Can you share a complete notebook which we can run and reproduce the problem you saw? For example, the file {{include/init_scala}} is missed in your notebook. > Aggregation operations on structs fail when the structs have fields with > special characters > --- > > Key: SPARK-14048 > URL: https://issues.apache.org/jira/browse/SPARK-14048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Databricks w/ 1.6.0 >Reporter: Simeon Simeonov > Labels: sql > Attachments: bug_structs_with_backticks.html > > > Consider a schema where a struct has field names with special characters, > e.g., > {code} > |-- st: struct (nullable = true) > ||-- x.y: long (nullable = true) > {code} > Schema such as these are frequently generated by the JSON schema generator, > which seems to never want to map JSON data to {{MapType}} always preferring > to use {{StructType}}. > In SparkSQL, referring to these fields requires backticks, e.g., > {{st.`x.y`}}. There is no problem manipulating these structs unless one is > using an aggregation function. It seems that, under the covers, the code is > not escaping fields with special characters correctly. > For example, > {code} > select first(st) as st from tbl group by something > {code} > generates > {code} > org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: > struct. If you have a struct and a field name of it has any > special characters, please use backticks (`) to quote that field name, e.g. > `x+y`. Please note that backtick itself is not supported in a field name. > at > org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394) > at > com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42) > at > com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306) > at > com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at > 
com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at scala.util.Try$.apply(Try.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464) > at > com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365) > at > com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters
[ https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333221#comment-15333221 ] Sean Zhong commented on SPARK-14048: [~simeons] Can you try the following script on your environment and see if it reports error? {code} val rdd = sc.makeRDD( """{"st": {"x.y": 1}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 20}""" :: Nil) sqlContext.read.json(rdd).registerTempTable("test") sqlContext.sql("select first(st) as st from test group by age").show() {code} > Aggregation operations on structs fail when the structs have fields with > special characters > --- > > Key: SPARK-14048 > URL: https://issues.apache.org/jira/browse/SPARK-14048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Databricks w/ 1.6.0 >Reporter: Simeon Simeonov > Labels: sql > Attachments: bug_structs_with_backticks.html > > > Consider a schema where a struct has field names with special characters, > e.g., > {code} > |-- st: struct (nullable = true) > ||-- x.y: long (nullable = true) > {code} > Schema such as these are frequently generated by the JSON schema generator, > which seems to never want to map JSON data to {{MapType}} always preferring > to use {{StructType}}. > In SparkSQL, referring to these fields requires backticks, e.g., > {{st.`x.y`}}. There is no problem manipulating these structs unless one is > using an aggregation function. It seems that, under the covers, the code is > not escaping fields with special characters correctly. > For example, > {code} > select first(st) as st from tbl group by something > {code} > generates > {code} > org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: > struct. If you have a struct and a field name of it has any > special characters, please use backticks (`) to quote that field name, e.g. > `x+y`. Please note that backtick itself is not supported in a field name. 
> at > org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394) > at > com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42) > at > com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306) > at > com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at scala.util.Try$.apply(Try.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464) > at > com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365) > at > com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow
[ https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333205#comment-15333205 ] Sean Zhong commented on SPARK-15786: Hi [~rmarscher] The reason is that you use the Kryo encoder in a wrong way, you cannot cast {{Dataset\[A\]}} to {{Dataset\[B\]}} using a Kryo encoder. In Spark 2.0, we now have a stricter rule to detect this kind of cast failure after PR https://github.com/apache/spark/pull/13632. Now, it reports error like this: {code} scala> ds.as[(Option[(Int, Int)], Option[(Int, Int)])] org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`_1` AS BINARY)' due to data type mismatch: cannot cast StructType(StructField(_1,IntegerType,false), StructField(_2,IntegerType,false)) to BinaryType; {code} If you remove all related Kryo lines in your test code, then no exception should be thrown, your code should work fine. > joinWith bytecode generation calling ByteBuffer.wrap with InternalRow > - > > Key: SPARK-15786 > URL: https://issues.apache.org/jira/browse/SPARK-15786 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Richard Marscher > > {code}java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 36, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates > are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", > "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, > int)"{code} > I have been trying to use joinWith along with Option data types to get an > approximation of the RDD semantics for outer joins with Dataset to have a > nicer API for Scala. However, using the Dataset.as[] syntax leads to bytecode > generation trying to pass an InternalRow object into the ByteBuffer.wrap > function which expects byte[] with or without a couple int qualifiers. > I have a notebook reproducing this against 2.0 preview in Databricks > Community Edition: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/160347920874755/1039589581260901/673639177603143/latest.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
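For reference, a minimal sketch of the pattern the comment describes as working: {{joinWith}} with the default product encoders and no Kryo encoder in scope (column names and values are illustrative):
{code}
// Sketch only: a full outer joinWith on two tuple Datasets; with the built-in
// encoders the generated code compiles without the ByteBuffer.wrap error.
val left  = Seq((1, 10), (2, 20)).toDS
val right = Seq((1, 100), (3, 300)).toDS
val joined = left.joinWith(right, left("_1") === right("_1"), "full_outer")
joined.show()
{code}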
[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters
[ https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332775#comment-15332775 ] Sean Zhong commented on SPARK-14048: [~simeons] Are you able to reproduce this case any longer? I cannot reproduce this on 1.6 by using the following script on databricks cloud community edition. {code} val rdd = sc.makeRDD( """{"st": {"x.y": 1}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 20}""" :: Nil) sqlContext.read.json(rdd).registerTempTable("test") sqlContext.sql("select first(st) as st from test group by age").show() {code} > Aggregation operations on structs fail when the structs have fields with > special characters > --- > > Key: SPARK-14048 > URL: https://issues.apache.org/jira/browse/SPARK-14048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Databricks w/ 1.6.0 >Reporter: Simeon Simeonov > Labels: sql > Attachments: bug_structs_with_backticks.html > > > Consider a schema where a struct has field names with special characters, > e.g., > {code} > |-- st: struct (nullable = true) > ||-- x.y: long (nullable = true) > {code} > Schema such as these are frequently generated by the JSON schema generator, > which seems to never want to map JSON data to {{MapType}} always preferring > to use {{StructType}}. > In SparkSQL, referring to these fields requires backticks, e.g., > {{st.`x.y`}}. There is no problem manipulating these structs unless one is > using an aggregation function. It seems that, under the covers, the code is > not escaping fields with special characters correctly. > For example, > {code} > select first(st) as st from tbl group by something > {code} > generates > {code} > org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: > struct. If you have a struct and a field name of it has any > special characters, please use backticks (`) to quote that field name, e.g. > `x+y`. Please note that backtick itself is not supported in a field name. 
> at > org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394) > at > com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42) > at > com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306) > at > com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at scala.util.Try$.apply(Try.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464) > at > com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365) > at > com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
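For reference, a small hedged example of the backtick quoting mentioned in the description, run against the {{test}} temp table registered by the script above (the alias names are made up):
{code}
// Hypothetical follow-up queries: field names containing dots must be quoted
// with backticks when referenced in Spark SQL.
sqlContext.sql("select st.`x.y` as xy from test").show()
sqlContext.sql("select age, first(st.`x.y`) as xy from test group by age").show()
{code}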
[jira] [Commented] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow
[ https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332532#comment-15332532 ] Sean Zhong commented on SPARK-15786: The exception stack is: {code} scala> res4.as[(Option[(Int, Int)], Option[(Int, Int)])].collect() 16/06/15 13:46:18 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 37, Column 109: No applicable constructor/method found for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, int)" /* 001 */ public java.lang.Object generate(Object[] references) { /* 002 */ return new SpecificSafeProjection(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private MutableRow mutableRow; /* 009 */ private org.apache.spark.serializer.KryoSerializerInstance serializer; /* 010 */ private org.apache.spark.serializer.KryoSerializerInstance serializer1; /* 011 */ /* 012 */ /* 013 */ public SpecificSafeProjection(Object[] references) { /* 014 */ this.references = references; /* 015 */ mutableRow = (MutableRow) references[references.length - 1]; /* 016 */ /* 017 */ if (org.apache.spark.SparkEnv.get() == null) { /* 018 */ serializer = (org.apache.spark.serializer.KryoSerializerInstance) new org.apache.spark.serializer.KryoSerializer(new org.apache.spark.SparkConf()).newInstance(); /* 019 */ } else { /* 020 */ serializer = (org.apache.spark.serializer.KryoSerializerInstance) new org.apache.spark.serializer.KryoSerializer(org.apache.spark.SparkEnv.get().conf()).newInstance(); /* 021 */ } /* 022 */ /* 023 */ /* 024 */ if (org.apache.spark.SparkEnv.get() == null) { /* 025 */ serializer1 = (org.apache.spark.serializer.KryoSerializerInstance) new org.apache.spark.serializer.KryoSerializer(new org.apache.spark.SparkConf()).newInstance(); /* 026 */ } else { /* 027 */ serializer1 = (org.apache.spark.serializer.KryoSerializerInstance) new org.apache.spark.serializer.KryoSerializer(org.apache.spark.SparkEnv.get().conf()).newInstance(); /* 028 */ } /* 029 */ /* 030 */ } /* 031 */ /* 032 */ public java.lang.Object apply(java.lang.Object _i) { /* 033 */ InternalRow i = (InternalRow) _i; /* 034 */ /* 035 */ boolean isNull2 = i.isNullAt(0); /* 036 */ InternalRow value2 = isNull2 ? null : (i.getStruct(0, 2)); /* 037 */ final scala.Option value1 = isNull2 ? null : (scala.Option) serializer.deserialize(java.nio.ByteBuffer.wrap(value2), null); /* 038 */ /* 039 */ boolean isNull4 = i.isNullAt(1); /* 040 */ InternalRow value4 = isNull4 ? null : (i.getStruct(1, 2)); /* 041 */ final scala.Option value3 = isNull4 ? null : (scala.Option) serializer1.deserialize(java.nio.ByteBuffer.wrap(value4), null); /* 042 */ /* 043 */ /* 044 */ final scala.Tuple2 value = false ? 
null : new scala.Tuple2(value1, value3); /* 045 */ if (false) { /* 046 */ mutableRow.setNullAt(0); /* 047 */ } else { /* 048 */ /* 049 */ mutableRow.update(0, value); /* 050 */ } /* 051 */ /* 052 */ return mutableRow; /* 053 */ } /* 054 */ } org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 37, Column 109: No applicable constructor/method found for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, int)" at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174) at org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:7559) at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7429) at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7333) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5663) at org.codehaus.janino.UnitCompiler.access$13800(UnitCompiler.java:185) at org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:5132) at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3971) at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5159) at org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:7533) at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7429) at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7333) at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3873) at
[jira] [Commented] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow
[ https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332527#comment-15332527 ] Sean Zhong commented on SPARK-15786: Basically, what you described can be shorten to: {code} scala> val ds = Seq((1,1) -> (1, 1)).toDS() res4: org.apache.spark.sql.Dataset[((Int, Int), (Int, Int))] = [_1: struct<_1: int, _2: int>, _2: struct<_1: int, _2: int>] scala> implicit val enc = Encoders.tuple(Encoders.kryo[Option[(Int, Int)]], Encoders.kryo[Option[(Int, Int)]]) enc: org.apache.spark.sql.Encoder[(Option[(Int, Int)], Option[(Int, Int)])] = class[_1[0]: binary, _2[0]: binary] scala> ds.as[(Option[(Int, Int)], Option[(Int, Int)])].collect() {code} > joinWith bytecode generation calling ByteBuffer.wrap with InternalRow > - > > Key: SPARK-15786 > URL: https://issues.apache.org/jira/browse/SPARK-15786 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Richard Marscher > > {code}java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 36, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates > are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", > "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, > int)"{code} > I have been trying to use joinWith along with Option data types to get an > approximation of the RDD semantics for outer joins with Dataset to have a > nicer API for Scala. However, using the Dataset.as[] syntax leads to bytecode > generation trying to pass an InternalRow object into the ByteBuffer.wrap > function which expects byte[] with or without a couple int qualifiers. > I have a notebook reproducing this against 2.0 preview in Databricks > Community Edition: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/160347920874755/1039589581260901/673639177603143/latest.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
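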
[jira] [Updated] (SPARK-15914) Add deprecated method back to SQLContext for source code backward compatibility
[ https://issues.apache.org/jira/browse/SPARK-15914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-15914: --- Description: We removed some deprecated methods from SQLContext in the Spark 2.0 branch. For example: {code} @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0") def jsonFile(path: String): DataFrame = { read.json(path) } {code} These deprecated methods may be used by existing third-party data sources. We probably want to add them back to retain source-code-level backward compatibility. was: We removed some deprecated methods from SQLContext in the Spark 2.0 branch. For example: {code} @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0") def jsonFile(path: String): DataFrame = { read.json(path) } {code} These deprecated methods may be used by existing third-party data sources. We probably want to add them back to retain backward compatibility. > Add deprecated method back to SQLContext for source code backward compatibility > -- > > Key: SPARK-15914 > URL: https://issues.apache.org/jira/browse/SPARK-15914 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > > We removed some deprecated methods from SQLContext in the Spark 2.0 branch. > For example: > {code} > @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0") > def jsonFile(path: String): DataFrame = { > read.json(path) > } > {code} > These deprecated methods may be used by existing third-party data sources. We > probably want to add them back to retain source-code-level backward > compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15914) Add deprecated method back to SQLContext for source code backward compatibility
[ https://issues.apache.org/jira/browse/SPARK-15914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-15914: --- Summary: Add deprecated method back to SQLContext for source code backward compatibility (was: Add deprecated method back to SQLContext for backward compatibility) > Add deprecated method back to SQLContext for source code backward compatibility > -- > > Key: SPARK-15914 > URL: https://issues.apache.org/jira/browse/SPARK-15914 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > > We removed some deprecated methods from SQLContext in the Spark 2.0 branch. > For example: > {code} > @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0") > def jsonFile(path: String): DataFrame = { > read.json(path) > } > {code} > These deprecated methods may be used by existing third-party data sources. We > probably want to add them back to retain backward compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15914) Add deprecated method back to SQLContext for backward compatibility
[ https://issues.apache.org/jira/browse/SPARK-15914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-15914: --- Summary: Add deprecated method back to SQLContext for backward compatibility (was: Add deprecated method back to SQLContext for compatibility) > Add deprecated method back to SQLContext for backward compatibility > -- > > Key: SPARK-15914 > URL: https://issues.apache.org/jira/browse/SPARK-15914 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > > We removed some deprecated methods from SQLContext in the Spark 2.0 branch. > For example: > {code} > @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0") > def jsonFile(path: String): DataFrame = { > read.json(path) > } > {code} > These deprecated methods may be used by existing third-party data sources. We > probably want to add them back to retain backward compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15914) Add deprecated method back to SQLContext for compatibility
Sean Zhong created SPARK-15914: -- Summary: Add deprecated method back to SQLContext for compatibility Key: SPARK-15914 URL: https://issues.apache.org/jira/browse/SPARK-15914 Project: Spark Issue Type: Bug Components: SQL Reporter: Sean Zhong We removed some deprecated methods from SQLContext in the Spark 2.0 branch. For example: {code} @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0") def jsonFile(path: String): DataFrame = { read.json(path) } {code} These deprecated methods may be used by existing third-party data sources. We probably want to add them back to retain backward compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
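As an illustration of the impact, here is a hypothetical third-party call site (not from the issue; the path is made up):
{code}
// Code like this compiles against 1.x through the deprecated shim and breaks at
// compile time on 2.0 once jsonFile is removed; read.json is the replacement
// that the shim itself delegates to.
val oldStyle = sqlContext.jsonFile("/tmp/people.json")   // removed in 2.0
val newStyle = sqlContext.read.json("/tmp/people.json")  // non-deprecated API
{code}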
[jira] [Created] (SPARK-15910) Schema is not checked when converting DataFrame to Dataset using Kryo encoder
Sean Zhong created SPARK-15910: -- Summary: Schema is not checked when converting DataFrame to Dataset using Kryo encoder Key: SPARK-15910 URL: https://issues.apache.org/jira/browse/SPARK-15910 Project: Spark Issue Type: Bug Components: SQL Reporter: Sean Zhong Here is the case to reproduce it: {code} scala> import org.apache.spark.sql.Encoders._ scala> import org.apache.spark.sql.Encoders scala> import org.apache.spark.sql.Encoder scala> case class B(b: Int) scala> implicit val encoder = Encoders.kryo[B] encoder: org.apache.spark.sql.Encoder[B] = class[value[0]: binary] scala> val ds = Seq((1)).toDF("b").as[B].map(identity) ds: org.apache.spark.sql.Dataset[B] = [value: binary] scala> ds.show() 16/06/10 13:46:51 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, Column 168: No applicable constructor/method found for actual parameters "int"; candidates are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, int)" ... {code} The expected behavior is to report schema check failure earlier when creating Dataset using {code}dataFrame.as[B]{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
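For contrast, a minimal sketch (assumed usage, not from the issue) where the schema actually matches what the Kryo encoder expects, so no codegen failure is involved:
{code}
// Build the Dataset from typed values so the Kryo encoder controls both the
// schema (a single binary column) and the serialization, instead of
// reinterpreting an existing Int column as Kryo-encoded bytes.
import org.apache.spark.sql.{Dataset, Encoders}

case class B(b: Int)
val encoder = Encoders.kryo[B]

val ds: Dataset[B] = spark.createDataset(Seq(B(1)))(encoder)
// ds schema: [value: binary]
ds.map(identity)(encoder).collect()
{code}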
[jira] [Created] (SPARK-15792) [SQL] Allows operator to change the verbosity in explain output.
Sean Zhong created SPARK-15792: -- Summary: [SQL] Allows operator to change the verbosity in explain output. Key: SPARK-15792 URL: https://issues.apache.org/jira/browse/SPARK-15792 Project: Spark Issue Type: Improvement Reporter: Sean Zhong Priority: Minor We should allow an operator (physical plan or logical plan) to change its verbosity in the explain output. For example, we may not want to display {{output=[count(a)#48L]}} in less-verbose mode. {code} scala> spark.sql("select count(a) from df").explain() == Physical Plan == *HashAggregate(key=[], functions=[count(1)], output=[count(a)#48L]) +- Exchange SinglePartition +- *HashAggregate(key=[], functions=[partial_count(1)], output=[count#50L]) +- LocalTableScan {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
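One possible shape of the API (a hypothetical sketch only, not the design that was eventually merged):
{code}
// Hypothetical sketch: each operator produces a short and a verbose one-line
// description, and explain() chooses between them.
trait PlanDescription {
  /** Short form, e.g. without the output=[...] attribute list. */
  def simpleString: String
  /** Full form, used only by a verbose explain. */
  def verboseString: String = simpleString
}

def explainLine(node: PlanDescription, verbose: Boolean): String =
  if (verbose) node.verboseString else node.simpleString
{code}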
[jira] [Commented] (SPARK-15632) Dataset typed filter operation changes query plan schema
[ https://issues.apache.org/jira/browse/SPARK-15632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316998#comment-15316998 ] Sean Zhong commented on SPARK-15632: *Root cause analysis:* *The root cause is that the implementation of {{dataset.as\[B\]}} is wrong.* {code} scala> val df = Seq((1,2,3)).toDF("a", "b", "c") scala> case class B(b: Int) scala> val b = df.as[B] {code} We expect b to be a "view" of underlying df, which *SHOULD* only returns *ONE* column, but now b contains all the data. *What we expect:* {code} scala> b.show() +---+ | b| +---+ | 2| +---+ {code} *What it behaves now:* {code} scala> b.show() +---+---+---+ | a| b| c| +---+---+---+ | 1| 2| 3| +---+---+---+ {code} > Dataset typed filter operation changes query plan schema > > > Key: SPARK-15632 > URL: https://issues.apache.org/jira/browse/SPARK-15632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Xiang Zhong > > h1. Overview > Filter operations should never change query plan schema. However, Dataset > typed filter operation does introduce schema change in some cases. > Furthermore, all the following aspects of the schema may be changed: > # field order, > # field number, > # field data type, > # field name, and > # field nullability > This is mostly because we wrap the actual {{Filter}} operator with a > {{SerializeFromObject}}/{{DeserializeToObject}} pair (query plan fragment > illustrated as following), which performs a bunch of magic tricks. > {noformat} > SerializeFromObject > Filter > DeserializeToObject > > {noformat} > h1. Reproduction > h2. Field order, field number, and field data type change > {code} > case class A(b: Double, a: String) > val data = Seq( > "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", > "{ 'a': 'bar', 'b': 2, 'c': 'extra' }", > "{ 'a': 'bar', 'c': 'extra' }" > ) > val df1 = spark.read.json(sc.parallelize(data)) > df1.printSchema() > // root > // |-- a: string (nullable = true) > // |-- b: long (nullable = true) > // |-- c: string (nullable = true) > val ds1 = df1.as[A] > ds1.printSchema() > // root > // |-- a: string (nullable = true) > // |-- b: long (nullable = true) > // |-- c: string (nullable = true) > val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker > ds2.printSchema() > // root <- 1. Reordered `a` and `b`, and > // |-- b: double (nullable = true)2. dropped `c`, and > // |-- a: string (nullable = true)3. up-casted `b` from long to double > val df2 = ds2.toDF() > df2.printSchema() > // root <- (Same as above) > // |-- b: double (nullable = true) > // |-- a: string (nullable = true) > {code} > h3. Field order change > {{DeserializeToObject}} resolves the encoder deserializer expression by > *name*. Thus field order in input query plan doesn't matter. > h3. Field number change > Same as above, fields not referred by the encoder are silently dropped while > resolving deserializer expressions by name. > h3. Field data type change > When generating deserializer expressions, we allows "sane" implicit coercions > (e.g. integer to long, and long to double) by inserting {{UpCast}} operators. > Thus actual field data types in input query plan don't matter either as long > as there are valid implicit coercions. > h2. Field name and nullability change > {code} > val ds3 = spark.range(10) > ds3.printSchema() > // root > // |-- id: long (nullable = false) > val ds4 = ds3.filter(_ > 3) > ds4.printSchema() > // root > // |-- value: long (nullable = true) 4. Name changed from `id` to `value`, > and > // 5. 
nullability changed from false to > true > {code} > h3. Field name change > Primitive encoders like {{Encoder\[Long\]}} doesn't have a named field, thus > they always has only a single field with hard-coded name "value". On the > other hand, when serializing domain objects back to rows, schema of > {{SerializeFromObject}} is solely determined by the encoder. Thus the > original name "id" becomes "value". > h3. Nullability change > [PR #11880|https://github.com/apache/spark/pull/11880] updated return type of > {{SparkSession.range}} from {{Dataset\[Long\]}} to > {{Dataset\[java.lang.Long\]}} due to > [SI-4388|https://issues.scala-lang.org/browse/SI-4388]. As a consequence, > although the underlying {{Range}} operator produces non-nullable output, the > result encoder is nullable since {{java.lang.Long}} is nullable. Thus, we > observe nullability change after typed filtering because serializer > expression is derived from encoder rather than the query plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
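Until this is fixed, a hedged workaround sketch (not from the comment) is to project the wanted columns explicitly before converting, so the Dataset only carries the fields of the case class:
{code}
// Explicit projection before .as[B]; only column `b` is kept.
import spark.implicits._

case class B(b: Int)
val df = Seq((1, 2, 3)).toDF("a", "b", "c")
val b = df.select($"b").as[B]
b.show()   // shows a single column `b`
{code}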
[jira] [Commented] (SPARK-15632) Dataset typed filter operation changes query plan schema
[ https://issues.apache.org/jira/browse/SPARK-15632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316996#comment-15316996 ] Sean Zhong commented on SPARK-15632: There are more issues linked with this bug which we may to fix altogether. CASE1: MAP(identity) changed the table content and schema: As you know, identity is a scala function which basically means doing nothing. {code} scala> val df = Seq((1,2,3)).toDF("a", "b", "c") scala> case class B(b: Int) scala> val b = df.as[B] scala> b.show() +---+---+---+ | a| b| c| +---+---+---+ | 1| 2| 3| +---+---+---+ scala> b.map(identity).show() +---+ | b| +---+ | 2| +---+ {code} CASE2: Filter changed the column name: Filter changed the column name from id to value. {code} scala> val df = spark.range(1,3) df: org.apache.spark.sql.Dataset[Long] = [id: bigint] scala> df.show() +---+ | id| +---+ | 1| | 2| +---+ scala> df.filter(_ => true).show() +-+ |value| +-+ |1| |2| +-+ {code} CASE3: Able to select a non-existing field: {code} scala> val df = Seq((1,2,3)).toDF("a", "b", "c") scala> case class B(b: Int) scala> val b = df.as[B] scala> b.getClass res7: Class[_ <: org.apache.spark.sql.Dataset[B]] = class org.apache.spark.sql.Dataset scala> b.select("c") // "c" is NOT a valid field name of class B. scala> b.select("c").show() +---+ | c| +---+ | 3| +---+ {code} > Dataset typed filter operation changes query plan schema > > > Key: SPARK-15632 > URL: https://issues.apache.org/jira/browse/SPARK-15632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Xiang Zhong > > h1. Overview > Filter operations should never change query plan schema. However, Dataset > typed filter operation does introduce schema change in some cases. > Furthermore, all the following aspects of the schema may be changed: > # field order, > # field number, > # field data type, > # field name, and > # field nullability > This is mostly because we wrap the actual {{Filter}} operator with a > {{SerializeFromObject}}/{{DeserializeToObject}} pair (query plan fragment > illustrated as following), which performs a bunch of magic tricks. > {noformat} > SerializeFromObject > Filter > DeserializeToObject > > {noformat} > h1. Reproduction > h2. Field order, field number, and field data type change > {code} > case class A(b: Double, a: String) > val data = Seq( > "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", > "{ 'a': 'bar', 'b': 2, 'c': 'extra' }", > "{ 'a': 'bar', 'c': 'extra' }" > ) > val df1 = spark.read.json(sc.parallelize(data)) > df1.printSchema() > // root > // |-- a: string (nullable = true) > // |-- b: long (nullable = true) > // |-- c: string (nullable = true) > val ds1 = df1.as[A] > ds1.printSchema() > // root > // |-- a: string (nullable = true) > // |-- b: long (nullable = true) > // |-- c: string (nullable = true) > val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker > ds2.printSchema() > // root <- 1. Reordered `a` and `b`, and > // |-- b: double (nullable = true)2. dropped `c`, and > // |-- a: string (nullable = true)3. up-casted `b` from long to double > val df2 = ds2.toDF() > df2.printSchema() > // root <- (Same as above) > // |-- b: double (nullable = true) > // |-- a: string (nullable = true) > {code} > h3. Field order change > {{DeserializeToObject}} resolves the encoder deserializer expression by > *name*. Thus field order in input query plan doesn't matter. > h3. 
Field number change > Same as above, fields not referred by the encoder are silently dropped while > resolving deserializer expressions by name. > h3. Field data type change > When generating deserializer expressions, we allows "sane" implicit coercions > (e.g. integer to long, and long to double) by inserting {{UpCast}} operators. > Thus actual field data types in input query plan don't matter either as long > as there are valid implicit coercions. > h2. Field name and nullability change > {code} > val ds3 = spark.range(10) > ds3.printSchema() > // root > // |-- id: long (nullable = false) > val ds4 = ds3.filter(_ > 3) > ds4.printSchema() > // root > // |-- value: long (nullable = true) 4. Name changed from `id` to `value`, > and > // 5. nullability changed from false to > true > {code} > h3. Field name change > Primitive encoders like {{Encoder\[Long\]}} doesn't have a named field, thus > they always has only a single field with hard-coded name "value". On the > other hand, when serializing domain objects back to rows, schema of > {{SerializeFromObject}} is solely determined by the encoder. Thus the > original name "id" becomes "value". > h3. Nullability
[jira] [Created] (SPARK-15734) Avoids printing internal row in explain output
Sean Zhong created SPARK-15734: -- Summary: Avoids printing internal row in explain output Key: SPARK-15734 URL: https://issues.apache.org/jira/browse/SPARK-15734 Project: Spark Issue Type: Improvement Components: SQL Reporter: Sean Zhong Priority: Minor Avoid printing internal rows in the explain output. In the explain output, some operators may print internal rows. For example, for LocalRelation, the explain output may look like this: {code} scala> (1 to 10).toSeq.map(_ => (1,2,3)).toDF().createTempView("df3") scala> spark.sql("select * from df3 where 1=2").explain(true) ... == Analyzed Logical Plan == _1: int, _2: int, _3: int Project [_1#37,_2#38,_3#39] +- Filter (1 = 2) +- SubqueryAlias df3 +- LocalRelation [_1#37,_2#38,_3#39], [[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3]] ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
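A tiny standalone sketch of the intent (hypothetical helper, not Spark code): build the one-line description from the output attributes only, never from the row data:
{code}
// Hypothetical helper illustrating the proposed output format.
def describeLocalRelation(attributeNames: Seq[String]): String =
  s"LocalRelation [${attributeNames.mkString(",")}]"

describeLocalRelation(Seq("_1#37", "_2#38", "_3#39"))
// "LocalRelation [_1#37,_2#38,_3#39]"  -- no [[0,1,2,3],...] row dump
{code}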
[jira] [Created] (SPARK-15733) Makes the explain output less verbose by hiding some verbose output like None, null, empty List, etc.
Sean Zhong created SPARK-15733: -- Summary: Makes the explain output less verbose by hiding some verbose output like None, null, empty List, etc. Key: SPARK-15733 URL: https://issues.apache.org/jira/browse/SPARK-15733 Project: Spark Issue Type: Improvement Components: SQL Reporter: Sean Zhong Priority: Minor In the output of {{dataframe.explain()}}, we often see verbose string patterns such as {{None, [], {}, null, Some(tableName)}}. For example, for {{spark.sql("show tables").explain()}}, we get: {code} == Physical Plan == ExecutedCommand : +- ShowTablesCommand None, None {code} It would be easier to read if we simplified it to: {code} == Physical Plan == ExecutedCommand : +- ShowTablesCommand {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
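A rough sketch of the filtering idea (hypothetical helper, not the actual Spark implementation):
{code}
// Drop "empty-looking" arguments (None, null, empty collections) before
// assembling an operator's argument string for explain output.
def argString(args: Seq[Any]): String =
  args.filterNot {
    case null | None  => true
    case s: Seq[_]    => s.isEmpty
    case m: Map[_, _] => m.isEmpty
    case _            => false
  }.mkString(", ")

argString(Seq(None, None))                 // ""   (the ShowTablesCommand case)
argString(Seq("testView2", None, false))   // "testView2, false"
{code}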
[jira] [Updated] (SPARK-15495) Improve the output of explain for aggregate operator
[ https://issues.apache.org/jira/browse/SPARK-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-15495: --- Description: We should improve the explain output of the aggregate operator to make it more readable. For example: rename TungstenAggregate to Aggregate was: The current output of explain has several problems: 1. The output format is not consistent. 2. Some key information is missing. 3. For some nodes, the output message is too verbose. 4. We don't know the output fields of each physical operator. and ... I'll add more comments later. > Improve the output of explain for aggregate operator > > > Key: SPARK-15495 > URL: https://issues.apache.org/jira/browse/SPARK-15495 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sean Zhong >Assignee: Sean Zhong >Priority: Minor > > We should improve the explain output of the aggregate operator to make it more > readable. > For example: > rename TungstenAggregate to Aggregate -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15692) Improves the explain output of several physical plans by displaying embedded logical plan in tree style
Sean Zhong created SPARK-15692: -- Summary: Improves the explain output of several physical plans by displaying embedded logical plan in tree style Key: SPARK-15692 URL: https://issues.apache.org/jira/browse/SPARK-15692 Project: Spark Issue Type: Bug Components: SQL Reporter: Sean Zhong Priority: Minor Some physical plans contain an embedded logical plan. For example, {code}cache tableName query{code} maps to: {code} case class CacheTableCommand( tableName: String, plan: Option[LogicalPlan], isLazy: Boolean) extends RunnableCommand {code} The explain output is easier to read if we display the `plan` in tree style. Before the change: {code} scala> Seq((1,2)).toDF().createOrReplaceTempView("testView") scala> spark.sql("cache table testView2 select * from testView").explain() == Physical Plan == ExecutedCommand CacheTableCommand testView2, Some('Project [*] +- 'UnresolvedRelation `testView`, None ), false {code} After the change: {code} scala> spark.sql("cache table testView2 select * from testView").explain() == Physical Plan == ExecutedCommand : +- CacheTableCommand testView2, false : : +- 'Project [*] : : +- 'UnresolvedRelation `testView`, None {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
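One way the mechanism could look (a hypothetical, simplified sketch; Spark's real tree classes are more involved):
{code}
// A node exposes its embedded logical plan as an "inner child", and the generic
// tree printer indents inner children under the node instead of inlining
// plan.toString on a single line.
trait TreeLike {
  def nodeName: String
  def innerChildren: Seq[TreeLike] = Nil
  def treeString(indent: Int = 0): String =
    (" " * indent) + nodeName + "\n" +
      innerChildren.map(_.treeString(indent + 2)).mkString
}
{code}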
[jira] [Created] (SPARK-15674) Deprecates "CREATE TEMPORARY TABLE USING...", use "CREATE TEMPORARY VIEW USING..." instead.
Sean Zhong created SPARK-15674: -- Summary: Deprecates "CREATE TEMPORARY TABLE USING...", use "CREATE TEMPORARY VIEW USING..." instead. Key: SPARK-15674 URL: https://issues.apache.org/jira/browse/SPARK-15674 Project: Spark Issue Type: Improvement Components: SQL Reporter: Sean Zhong Priority: Minor The current implementation of "CREATE TEMPORARY TABLE USING..." actually creates a temporary VIEW behind the scenes. We should probably just use "CREATE TEMPORARY VIEW USING..." instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
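For illustration, the two statements side by side (a hedged example; the data source, view name, and path are made up, and the VIEW form assumes the proposed syntax is accepted):
{code}
// Deprecated form: despite the name, it registers a temporary view.
spark.sql("CREATE TEMPORARY TABLE tmp_t USING parquet OPTIONS (path '/tmp/data')")

// Proposed form, naming what actually happens:
spark.sql("CREATE TEMPORARY VIEW tmp_t USING parquet OPTIONS (path '/tmp/data')")
{code}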
[jira] [Created] (SPARK-15495) Improve the output of explain for aggregate operator
Sean Zhong created SPARK-15495: -- Summary: Improve the output of explain for aggregate operator Key: SPARK-15495 URL: https://issues.apache.org/jira/browse/SPARK-15495 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.2, 2.0.0 Reporter: Sean Zhong Priority: Minor The current output of explain has several problems: 1. The output format is not consistent. 2. Some key information is missing. 3. For some nodes, the output message is too verbose. 4. We don't know the output fields of each physical operator. and ... I'll add more comments later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15334) HiveClient facade not compatible with Hive 0.12
[ https://issues.apache.org/jira/browse/SPARK-15334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-15334: --- Summary: HiveClient facade not compatible with Hive 0.12 (was: [SPARK][SQL] HiveClient facade not compatible with Hive 0.12) > HiveClient facade not compatible with Hive 0.12 > --- > > Key: SPARK-15334 > URL: https://issues.apache.org/jira/browse/SPARK-15334 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong >Priority: Minor > > Compatibility issues: > 1. org.apache.spark.sql.hive.client.HiveClientImpl uses AddPartitionDesc(db, > table, ignoreIfExists) to create partitions, while Hive 0.12 doesn't have > this constructor for AddPartitionDesc. > 2. HiveClientImpl uses PartitionDropOptions when dropping partitions; however, > PartitionDropOptions doesn't exist in Hive 0.12. > 3. Hive 0.12 doesn't support adding permanent functions. It is not valid to > call "org.apache.hadoop.hive.ql.metadata.Hive.createFunction" or > "org.apache.hadoop.hive.ql.metadata.Hive.alterFunction". > 4. org.apache.spark.sql.hive.client.HiveClient doesn't have enough test > coverage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org