[jira] [Updated] (SPARK-17617) Remainder(%) expression.eval returns incorrect result

2016-09-20 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17617:
---
Description: 
h2.Problem

Remainder(%) expression returns an incorrect result when expression.eval is used 
to calculate the result. expression.eval is called in cases like constant folding.

{code}
scala> -5083676433652386516D  % 10
res19: Double = -6.0

// Wrong answer with eval!!!
scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show
+------------+
|(value % 10)|
+------------+
|         0.0|
+------------+

// Triggers codegen, will not do constant folding
scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % 10).show
+------------+
|(value % 10)|
+------------+
|        -6.0|
+------------+
{code}

Behavior of postgres:
{code}
seanzhong=# select -5083676433652386516.0  % 10;
 ?column? 
----------
      -6.0
(1 row)
{code}
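
For reference, the mismatch can also be checked directly at the Catalyst 
expression level. This is only a minimal sketch, assuming a Spark 2.x 
spark-shell where the catalyst classes can be imported; it contrasts the 
interpreted eval path (the one constant folding relies on) with plain JVM 
arithmetic, and the variable names are illustrative.

{code}
import org.apache.spark.sql.catalyst.expressions.{Literal, Remainder}

// Same remainder over double literals that constant folding would evaluate.
val expr = Remainder(Literal(-5083676433652386516D), Literal(10D))

// Interpreted path (expression.eval), the one this ticket reports as wrong.
println(expr.eval(null))

// Plain JVM double remainder, which agrees with the codegen result (-6.0).
println(-5083676433652386516D % 10)
{code}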



  was:
h2.Problem

Remainder(%) expression returns an incorrect result when eval is called to do 
constant folding.

{code}
scala> -5083676433652386516D  % 10
res19: Double = -6.0

// Wrong answer because of constant folding!!!
scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show
+------------+
|(value % 10)|
+------------+
|         0.0|
+------------+

// Triggers codegen, will not do constant folding
scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % 10).show
+------------+
|(value % 10)|
+------------+
|        -6.0|
+------------+
{code}

Behavior of postgres:
{code}
seanzhong=# select -5083676433652386516.0  % 10;
 ?column? 
----------
      -6.0
(1 row)
{code}




> Remainder(%) expression.eval returns incorrect result
> -
>
> Key: SPARK-17617
> URL: https://issues.apache.org/jira/browse/SPARK-17617
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> h2.Problem
> Remainder(%) expression returns an incorrect result when expression.eval is 
> used to calculate the result. expression.eval is called in cases like constant 
> folding.
> {code}
> scala> -5083676433652386516D  % 10
> res19: Double = -6.0
> // Wrong answer with eval!!!
> scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show
> +------------+
> |(value % 10)|
> +------------+
> |         0.0|
> +------------+
> // Triggers codegen, will not do constant folding
> scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % 
> 10).show
> +------------+
> |(value % 10)|
> +------------+
> |        -6.0|
> +------------+
> {code}
> Behavior of postgres:
> {code}
> seanzhong=# select -5083676433652386516.0  % 10;
>  ?column? 
> ----------
>       -6.0
> (1 row)
> {code}






[jira] [Created] (SPARK-17617) Remainder(%) expression.eval returns incorrect result

2016-09-20 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17617:
--

 Summary: Remainder(%) expression.eval returns incorrect result
 Key: SPARK-17617
 URL: https://issues.apache.org/jira/browse/SPARK-17617
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Sean Zhong


h2.Problem

Remainder(%) expression returns an incorrect result when eval is called to do 
constant folding.

{code}
scala> -5083676433652386516D  % 10
res19: Double = -6.0

// Wrong answer because of constant folding!!!
scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show
+------------+
|(value % 10)|
+------------+
|         0.0|
+------------+

// Triggers codegen, will not do constant folding
scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % 10).show
+------------+
|(value % 10)|
+------------+
|        -6.0|
+------------+
{code}

Behavior of postgres:
{code}
seanzhong=# select -5083676433652386516.0  % 10;
 ?column? 
----------
      -6.0
(1 row)
{code}








[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory

2016-09-12 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17503:
---
Attachment: Screen Shot 2016-09-12 at 4.34.19 PM.png
Screen Shot 2016-09-12 at 4.16.15 PM.png

> Memory leak in Memory store when unable to cache the whole RDD in memory
> 
>
> Key: SPARK-17503
> URL: https://issues.apache.org/jira/browse/SPARK-17503
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Sean Zhong
> Attachments: Screen Shot 2016-09-12 at 4.16.15 PM.png, Screen Shot 
> 2016-09-12 at 4.34.19 PM.png
>
>
> h2.Problem description:
> The following query triggers an out-of-memory error.
> {code}
> sc.parallelize(1 to 10, 100).map(x => new 
> Array[Long](1000)).cache().count()
> {code}
> This is not expected; we should fall back to disk instead when there is not 
> enough memory to cache the data.
> Stacktrace:
> {code}
> scala> sc.parallelize(1 to 10, 100).map(x => new 
> Array[Long](1000)).cache().count()
> [Stage 0:>  (0 + 5) / 
> 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in 
> memory! (computed 631.5 MB so far)
> 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in 
> memory! (computed 631.5 MB so far)
> 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed
> 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed
> 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in 
> memory! (computed 947.3 MB so far)
> 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed
> 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in 
> memory! (computed 1423.7 MB so far)
> 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed
> java.lang.OutOfMemoryError: Java heap space
> Dumping heap to java_pid26528.hprof ...
> Heap dump file created [6551021666 bytes in 9.876 secs]
> 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
> 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 
> 55360))] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.executor.heartbeatInterval
>   at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
> seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:190)
>   at 

[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory

2016-09-12 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17503:
---
Description: 
h2.Problem description:

The following query triggers an out-of-memory error.

{code}
sc.parallelize(1 to 10, 100).map(x => new 
Array[Long](1000)).cache().count()
{code}

This is not expected; we should fall back to disk instead when there is not 
enough memory to cache the data.
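
As a side note (not part of the original report), one way to probe the intended 
fallback behaviour is to persist the same RDD with a storage level that is 
explicitly allowed to spill to disk. The snippet below is only a hypothetical 
sketch along those lines, reusing the pipeline from the reproduction above.

{code}
import org.apache.spark.storage.StorageLevel

// Hypothetical check, not from this ticket: same pipeline as above, but with an
// explicit MEMORY_AND_DISK storage level so partitions that do not fit in
// memory may be written to disk instead of failing the job.
val rdd = sc.parallelize(1 to 10, 100).map(x => new Array[Long](1000))
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()
{code}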

Stacktrace:
{code}
scala> sc.parallelize(1 to 10, 100).map(x => new 
Array[Long](1000)).cache().count()
[Stage 0:>  (0 + 5) / 
5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in 
memory! (computed 631.5 MB so far)
16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in 
memory! (computed 631.5 MB so far)
16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed
16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed
16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in 
memory! (computed 947.3 MB so far)
16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed
16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in 
memory! (computed 1423.7 MB so far)
16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed

java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid26528.hprof ...
Heap dump file created [6551021666 bytes in 9.876 secs]
16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = 
Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 
55360))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. 
This timeout is controlled by spark.executor.heartbeatInterval
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 14 more
16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
java.lang.OutOfMemoryError: Java heap space
at 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:24)
at 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:23)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
at 
org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)

[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory

2016-09-12 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17503:
---
Affects Version/s: (was: 1.6.2)
   (was: 2.0.0)
 Target Version/s: 2.1.0  (was: 2.0.0, 2.1.0)

> Memory leak in Memory store when unable to cache the whole RDD in memory
> 
>
> Key: SPARK-17503
> URL: https://issues.apache.org/jira/browse/SPARK-17503
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Sean Zhong
>
> h2.Problem description:
> The following query triggers an out-of-memory error.
> {code}
> sc.parallelize(1 to 1000, 5).map(x => new Array[Long](1000)).cache().count
> {code}
> This is not expected; we should fall back to disk instead when there is not 
> enough memory to cache the data.
> Stacktrace:
> {code}
> scala> sc.parallelize(1 to 1000, 5).map(x => new 
> Array[Long](1000)).cache().count
> [Stage 0:>  (0 + 5) / 
> 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in 
> memory! (computed 631.5 MB so far)
> 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in 
> memory! (computed 631.5 MB so far)
> 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed
> 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed
> 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in 
> memory! (computed 947.3 MB so far)
> 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed
> 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in 
> memory! (computed 1423.7 MB so far)
> 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed
> java.lang.OutOfMemoryError: Java heap space
> Dumping heap to java_pid26528.hprof ...
> Heap dump file created [6551021666 bytes in 9.876 secs]
> 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
> 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 
> 55360))] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.executor.heartbeatInterval
>   at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
> seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:190)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
>   ... 14 more
> 16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
> 

[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory

2016-09-12 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17503:
---
Description: 
h2.Problem description:

The following query triggers an out-of-memory error.

{code}
sc.parallelize(1 to 1000, 5).map(x => new Array[Long](1000)).cache().count
{code}

This is not expected; we should fall back to disk instead when there is not 
enough memory to cache the data.

Stacktrace:
{code}
scala> sc.parallelize(1 to 1000, 5).map(x => new 
Array[Long](1000)).cache().count
[Stage 0:>  (0 + 5) / 
5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in 
memory! (computed 631.5 MB so far)
16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in 
memory! (computed 631.5 MB so far)
16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed
16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed
16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in 
memory! (computed 947.3 MB so far)
16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed
16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in 
memory! (computed 1423.7 MB so far)
16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed

java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid26528.hprof ...
Heap dump file created [6551021666 bytes in 9.876 secs]
16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = 
Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 
55360))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. 
This timeout is controlled by spark.executor.heartbeatInterval
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 14 more
16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
java.lang.OutOfMemoryError: Java heap space
at 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:24)
at 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:23)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
at 
org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
at 

[jira] [Commented] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD

2016-09-12 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15483389#comment-15483389
 ] 

Sean Zhong commented on SPARK-17503:


[~sowen] I have modified the title to mean "cache in memory"

> Memory leak in Memory store when unable to cache the whole RDD
> --
>
> Key: SPARK-17503
> URL: https://issues.apache.org/jira/browse/SPARK-17503
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0, 2.1.0
>Reporter: Sean Zhong
>
> h2.Problem description:
> The following query triggers an out-of-memory error.
> {code}
> sc.parallelize(1 to 1000, 5).map(new Array[Long](1000)).cache().count
> {code}
> This is not expected; we should fall back to disk instead when there is not 
> enough memory to cache the data.
> Stacktrace:
> {code}
> scala> sc.parallelize(1 to 1000, 5).map(f).cache().count
> [Stage 0:>  (0 + 5) / 
> 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in 
> memory! (computed 631.5 MB so far)
> 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in 
> memory! (computed 631.5 MB so far)
> 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed
> 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed
> 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in 
> memory! (computed 947.3 MB so far)
> 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed
> 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in 
> memory! (computed 1423.7 MB so far)
> 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed
> java.lang.OutOfMemoryError: Java heap space
> Dumping heap to java_pid26528.hprof ...
> Heap dump file created [6551021666 bytes in 9.876 secs]
> 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
> 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 
> 55360))] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.executor.heartbeatInterval
>   at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
> seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:190)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
>   ... 14 more
> 16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
> java.lang.OutOfMemoryError: Java heap space
>   at 
> 

[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory

2016-09-12 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17503:
---
Summary: Memory leak in Memory store when unable to cache the whole RDD in 
memory  (was: Memory leak in Memory store when unable to cache the whole RDD)

> Memory leak in Memory store when unable to cache the whole RDD in memory
> 
>
> Key: SPARK-17503
> URL: https://issues.apache.org/jira/browse/SPARK-17503
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0, 2.1.0
>Reporter: Sean Zhong
>
> h2.Problem description:
> The following query triggers an out-of-memory error.
> {code}
> sc.parallelize(1 to 1000, 5).map(new Array[Long](1000)).cache().count
> {code}
> This is not expected; we should fall back to disk instead when there is not 
> enough memory to cache the data.
> Stacktrace:
> {code}
> scala> sc.parallelize(1 to 1000, 5).map(f).cache().count
> [Stage 0:>  (0 + 5) / 
> 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in 
> memory! (computed 631.5 MB so far)
> 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in 
> memory! (computed 631.5 MB so far)
> 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed
> 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed
> 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in 
> memory! (computed 947.3 MB so far)
> 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed
> 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in 
> memory! (computed 1423.7 MB so far)
> 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed
> java.lang.OutOfMemoryError: Java heap space
> Dumping heap to java_pid26528.hprof ...
> Heap dump file created [6551021666 bytes in 9.876 secs]
> 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
> 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 
> 55360))] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.executor.heartbeatInterval
>   at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
> seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:190)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
>   ... 14 more
> 16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 

[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD

2016-09-12 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17503:
---
Description: 
h2.Problem description:

The following query triggers an out-of-memory error.

{code}
sc.parallelize(1 to 1000, 5).map(new Array[Long](1000)).cache().count
{code}

This is not expected; we should fall back to disk instead when there is not 
enough memory to cache the data.

Stacktrace:
{code}
scala> sc.parallelize(1 to 1000, 5).map(f).cache().count
[Stage 0:>  (0 + 5) / 
5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in 
memory! (computed 631.5 MB so far)
16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in 
memory! (computed 631.5 MB so far)
16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed
16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed
16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in 
memory! (computed 947.3 MB so far)
16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed
16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in 
memory! (computed 1423.7 MB so far)
16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed

java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid26528.hprof ...
Heap dump file created [6551021666 bytes in 9.876 secs]
16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = 
Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 
55360))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. 
This timeout is controlled by spark.executor.heartbeatInterval
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 14 more
16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
java.lang.OutOfMemoryError: Java heap space
at 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:24)
at 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:23)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
at 
org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
at 

[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD

2016-09-12 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17503:
---
Description: 
h2.Problem description:

The following query triggers an out-of-memory error.

{code}
sc.parallelize(1 to 1000, 5).map(new Array[Long](1000)).cache().count
{code}

This is not expected; we should fall back to disk instead when there is not 
enough memory to cache the data.

Stacktrace:
{code}
scala> sc.parallelize(1 to 1000, 5).map(f).cache().count
[Stage 0:>  (0 + 5) / 
5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in 
memory! (computed 631.5 MB so far)
16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in 
memory! (computed 631.5 MB so far)
16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed
16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed
16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in 
memory! (computed 947.3 MB so far)
16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed
16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in 
memory! (computed 1423.7 MB so far)
16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed

java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid26528.hprof ...
Heap dump file created [6551021666 bytes in 9.876 secs]
16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = 
Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 
55360))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. 
This timeout is controlled by spark.executor.heartbeatInterval
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 14 more
16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
java.lang.OutOfMemoryError: Java heap space
at 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:24)
at 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:23)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
at 
org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
at 

[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD

2016-09-12 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17503:
---
Description: 
h2.Problem description:

The following query triggers an out-of-memory error.

{code}
sc.parallelize(1 to 1000, 5).map(new Array[Long](1000)).cache().count
{code}

This is not expected; we should fall back to disk instead when there is not 
enough memory to cache the data.

Stacktrace:
{code}
scala> sc.parallelize(1 to 1000, 5).map(f).cache().count
[Stage 0:>  (0 + 5) / 
5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in 
memory! (computed 631.5 MB so far)
16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in 
memory! (computed 631.5 MB so far)
16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed
16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed
16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in 
memory! (computed 947.3 MB so far)
16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed
16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in 
memory! (computed 1423.7 MB so far)
16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed

java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid26528.hprof ...
Heap dump file created [6551021666 bytes in 9.876 secs]
16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = 
Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 
55360))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. 
This timeout is controlled by spark.executor.heartbeatInterval
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 14 more
16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
java.lang.OutOfMemoryError: Java heap space
at 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:24)
at 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:23)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
at 
org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
at 

[jira] [Created] (SPARK-17503) Memory leak in Memory store which unable to cache whole RDD

2016-09-12 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17503:
--

 Summary: Memory leak in Memory store which unable to cache whole 
RDD
 Key: SPARK-17503
 URL: https://issues.apache.org/jira/browse/SPARK-17503
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0, 1.6.2, 2.1.0
Reporter: Sean Zhong


h2.Problem description

The following query triggers an out-of-memory error.

{code}
sc.parallelize(1 to 1000, 5).map(new Array[Long](1000)).cache().count
{code}

This is not expected; we should fall back to disk instead when there is not 
enough memory to cache the data.

Stacktrace:
{code}
scala> sc.parallelize(1 to 1000, 5).map(f).cache().count
[Stage 0:>  (0 + 5) / 
5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in 
memory! (computed 631.5 MB so far)
16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in 
memory! (computed 631.5 MB so far)
16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed
16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed
16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in 
memory! (computed 947.3 MB so far)
16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed
16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in 
memory! (computed 1423.7 MB so far)
16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed

java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid26528.hprof ...
Heap dump file created [6551021666 bytes in 9.876 secs]
16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = 
Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 
55360))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. 
This timeout is controlled by spark.executor.heartbeatInterval
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 14 more
16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
java.lang.OutOfMemoryError: Java heap space
at 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:24)
at 
$line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:23)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
at 
org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
  

[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD

2016-09-12 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17503:
---
Summary: Memory leak in Memory store when unable to cache the whole RDD  
(was: Memory leak in Memory store which unable to cache whole RDD)

> Memory leak in Memory store when unable to cache the whole RDD
> --
>
> Key: SPARK-17503
> URL: https://issues.apache.org/jira/browse/SPARK-17503
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0, 2.1.0
>Reporter: Sean Zhong
>
> h2.Problem description
> The following query triggers an out-of-memory error.
> {code}
> sc.parallelize(1 to 1000, 5).map(new Array[Long](1000)).cache().count
> {code}
> This is not expected; we should fall back to disk instead when there is not 
> enough memory to cache the data.
> Stacktrace:
> {code}
> scala> sc.parallelize(1 to 1000, 5).map(f).cache().count
> [Stage 0:>  (0 + 5) / 
> 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in 
> memory! (computed 631.5 MB so far)
> 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in 
> memory! (computed 631.5 MB so far)
> 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed
> 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed
> 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in 
> memory! (computed 947.3 MB so far)
> 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed
> 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in 
> memory! (computed 1423.7 MB so far)
> 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed
> java.lang.OutOfMemoryError: Java heap space
> Dumping heap to java_pid26528.hprof ...
> Heap dump file created [6551021666 bytes in 9.876 secs]
> 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
> 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, 
> 55360))] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.executor.heartbeatInterval
>   at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
> seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:190)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
>   ... 14 more
> 16/09/11 17:28:15 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
> java.lang.OutOfMemoryError: 

[jira] [Commented] (SPARK-17364) Can not query hive table starting with number

2016-09-07 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472189#comment-15472189
 ] 

Sean Zhong commented on SPARK-17364:


I have a trial fix at https://github.com/apache/spark/pull/15006, which will 
tokenize temp.20160826_ip_list as:
{code}
temp // Matches the IDENTIFIER lexer rule
. // Matches single dot
20160826_ip_list // Matches the IDENTIFIER lexer rule.
{code}
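
Until a fix lands, backquoting the table name should also act as a workaround, 
since the token list in the error below already includes BACKQUOTED_IDENTIFIER. 
The statement here is only an illustration and has not been verified in this 
ticket.

{code}
// Hedged workaround sketch: backquote the numeric-leading table name so the
// lexer sees a single BACKQUOTED_IDENTIFIER instead of DECIMAL_VALUE + IDENTIFIER.
spark.sql("SELECT * FROM temp.`20160826_ip_list` LIMIT 100").show()
{code}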



> Can not query hive table starting with number
> -
>
> Key: SPARK-17364
> URL: https://issues.apache.org/jira/browse/SPARK-17364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Egor Pahomov
>
> I can do it with spark-1.6.2
> {code}
> SELECT * from  temp.20160826_ip_list limit 100
> {code}
> {code}
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 
> 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
> 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19)
> == SQL ==
> SELECT * from  temp.20160826_ip_list limit 100
> ---^^^
> SQLState:  null
> ErrorCode: 0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17364) Can not query hive table starting with number

2016-09-07 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472182#comment-15472182
 ] 

Sean Zhong edited comment on SPARK-17364 at 9/7/16 11:56 PM:
-

[~hvanhovell] That is because the antlr4 lexer breaks temp.20160826_ip_list 
into tokens like
{code}
// temp.20160826_ip_list is broken down into:
temp  // Matches the IDENTIFIER lexer rule
.20160826 // Matches the DECIMAL_VALUE lexer rule.
_ip_list  // Matches the IDENTIFIER lexer rule.
{code}




was (Author: clockfly):
[~hvanhovell] That is because the antlr4 lexer breaks temp.20160826_ip_list to 
tokens like
{code}
// temp.20160826_ip_list is break downs to:
temp
.20160826
_ip_list
{code}



> Can not query hive table starting with number
> -
>
> Key: SPARK-17364
> URL: https://issues.apache.org/jira/browse/SPARK-17364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Egor Pahomov
>
> I can do it with spark-1.6.2
> {code}
> SELECT * from  temp.20160826_ip_list limit 100
> {code}
> {code}
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 
> 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
> 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19)
> == SQL ==
> SELECT * from  temp.20160826_ip_list limit 100
> ---^^^
> SQLState:  null
> ErrorCode: 0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17364) Can not query hive table starting with number

2016-09-07 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472182#comment-15472182
 ] 

Sean Zhong edited comment on SPARK-17364 at 9/7/16 11:55 PM:
-

[~hvanhovell] That is because the antlr4 lexer breaks temp.20160826_ip_list 
into tokens like
{code}
// temp.20160826_ip_list is broken down into:
temp
.20160826
_ip_list
{code}




was (Author: clockfly):
[~hvanhovell] That is because the antlr4 lexer breaks temp.20160826_ip_list to 
token like
{code}
// temp.20160826_ip_list is break downs to:
temp
.20160826
_ip_list
{code}



> Can not query hive table starting with number
> -
>
> Key: SPARK-17364
> URL: https://issues.apache.org/jira/browse/SPARK-17364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Egor Pahomov
>
> I can do it with spark-1.6.2
> {code}
> SELECT * from  temp.20160826_ip_list limit 100
> {code}
> {code}
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 
> 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
> 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19)
> == SQL ==
> SELECT * from  temp.20160826_ip_list limit 100
> ---^^^
> SQLState:  null
> ErrorCode: 0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17364) Can not query hive table starting with number

2016-09-07 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472182#comment-15472182
 ] 

Sean Zhong commented on SPARK-17364:


[~hvanhovell] That is because the antlr4 lexer breaks temp.20160826_ip_list 
into tokens like
{code}
// temp.20160826_ip_list is broken down into:
temp
.20160826
_ip_list
{code}



> Can not query hive table starting with number
> -
>
> Key: SPARK-17364
> URL: https://issues.apache.org/jira/browse/SPARK-17364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Egor Pahomov
>
> I can do it with spark-1.6.2
> {code}
> SELECT * from  temp.20160826_ip_list limit 100
> {code}
> {code}
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 
> 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
> 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19)
> == SQL ==
> SELECT * from  temp.20160826_ip_list limit 100
> ---^^^
> SQLState:  null
> ErrorCode: 0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases

2016-09-06 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17426:
---
Target Version/s: 2.1.0
 Description: 
In SPARK-17356, we fixed the OOM issue when Metadata is super big. There are 
other cases that may also trigger OOM. The current implementation of 
TreeNode.toJSON recursively searches and prints all fields of the current 
TreeNode, even if a field is of type Seq or type Map. 

This is not safe because:
1. The Seq or Map can be very big. Converting them to JSON may take a huge 
amount of memory, which may trigger an out-of-memory error.
2. Some user space input may also be propagated to the plan. The input can be 
of arbitrary type, and may also be self-referencing. Trying to print user space 
input to JSON is very risky.

The following example triggers a StackOverflowError when calling toJSON on a 
plan with a user-defined UDF.
{code}

case class SelfReferenceUDF(
var config: Map[String, Any] = Map.empty[String, Any]) extends 
Function1[String, Boolean] {
  config += "self" -> this
  def apply(key: String): Boolean = config.contains(key)
}

test("toJSON should not throws java.lang.StackOverflowError") {
  val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
  // triggers java.lang.StackOverflowError
  udf.toJSON
}

{code}

  was:
In SPARK-17356, we fix the OOM issue when {{Metadata}} is 
super big. There are other cases that may also trigger OOM. Current 
implementation of TreeNode.toJSON will recursively search and print all fields 
of current TreeNode, even if the field's type is of type Seq or type Map. 

This is not safe because:
1. the Seq or Map can be very big. Converting them to JSON make take huge 
memory, which may trigger out of memory error.
2. Some user space input may also be propagated to the Plan. The input can be 
of arbitrary type, and may also be self-referencing. Trying to print user space 
to JSON input is very risky.

The following example triggers a StackOverflowError when calling toJSON on a 
plan with user defined UDF.
{code}

case class SelfReferenceUDF(
var config: Map[String, Any] = Map.empty[String, Any]) extends 
Function1[String, Boolean] {
  config += "self" -> this
  def apply(key: String): Boolean = config.contains(key)
}

test("toJSON should not throws java.lang.StackOverflowError") {
  val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
  // triggers java.lang.StackOverflowError
  udf.toJSON
}

{code}


> Current TreeNode.toJSON may trigger OOM under some corner cases
> ---
>
> Key: SPARK-17426
> URL: https://issues.apache.org/jira/browse/SPARK-17426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> In SPARK-17356, we fixed the OOM issue when Metadata is super big. There are 
> other cases that may also trigger OOM. The current implementation of 
> TreeNode.toJSON recursively searches and prints all fields of the current 
> TreeNode, even if a field is of type Seq or type Map. 
> This is not safe because:
> 1. The Seq or Map can be very big. Converting them to JSON may take a huge 
> amount of memory, which may trigger an out-of-memory error.
> 2. Some user space input may also be propagated to the plan. The input can be 
> of arbitrary type, and may also be self-referencing. Trying to print user 
> space input to JSON is very risky.
> The following example triggers a StackOverflowError when calling toJSON on a 
> plan with a user-defined UDF.
> {code}
> case class SelfReferenceUDF(
> var config: Map[String, Any] = Map.empty[String, Any]) extends 
> Function1[String, Boolean] {
>   config += "self" -> this
>   def apply(key: String): Boolean = config.contains(key)
> }
> test("toJSON should not throws java.lang.StackOverflowError") {
>   val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
>   // triggers java.lang.StackOverflowError
>   udf.toJSON
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases

2016-09-06 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17426:
---
Description: 
In SPARK-17356, we fixed the OOM issue when Metadata is super big. There are 
other cases that may also trigger OOM. The current implementation of 
TreeNode.toJSON recursively searches and prints all fields of the current 
TreeNode, even if a field is of type Seq or type Map. 

This is not safe because:
1. The Seq or Map can be very big. Converting them to JSON may take a huge 
amount of memory, which may trigger an out-of-memory error.
2. Some user space input may also be propagated to the plan. The user space 
input can be of arbitrary type, and may also be self-referencing. Trying to 
print user space input to JSON is very risky.

The following example triggers a StackOverflowError when calling toJSON on a 
plan with a user-defined UDF.
{code}

case class SelfReferenceUDF(
var config: Map[String, Any] = Map.empty[String, Any]) extends 
Function1[String, Boolean] {
  config += "self" -> this
  def apply(key: String): Boolean = config.contains(key)
}

test("toJSON should not throws java.lang.StackOverflowError") {
  val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
  // triggers java.lang.StackOverflowError
  udf.toJSON
}

{code}

  was:
In SPARK-17356, we fix the OOM issue when Metadata is super big. There are 
other cases that may also trigger OOM. Current implementation of 
TreeNode.toJSON will recursively search and print all fields of current 
TreeNode, even if the field's type is of type Seq or type Map. 

This is not safe because:
1. the Seq or Map can be very big. Converting them to JSON make take huge 
memory, which may trigger out of memory error.
2. Some user space input may also be propagated to the Plan. The input can be 
of arbitrary type, and may also be self-referencing. Trying to print user space 
to JSON input is very risky.

The following example triggers a StackOverflowError when calling toJSON on a 
plan with user defined UDF.
{code}

case class SelfReferenceUDF(
var config: Map[String, Any] = Map.empty[String, Any]) extends 
Function1[String, Boolean] {
  config += "self" -> this
  def apply(key: String): Boolean = config.contains(key)
}

test("toJSON should not throws java.lang.StackOverflowError") {
  val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
  // triggers java.lang.StackOverflowError
  udf.toJSON
}

{code}


> Current TreeNode.toJSON may trigger OOM under some corner cases
> ---
>
> Key: SPARK-17426
> URL: https://issues.apache.org/jira/browse/SPARK-17426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> In SPARK-17356, we fixed the OOM issue when Metadata is super big. There are 
> other cases that may also trigger OOM. The current implementation of 
> TreeNode.toJSON recursively searches and prints all fields of the current 
> TreeNode, even if a field is of type Seq or type Map. 
> This is not safe because:
> 1. The Seq or Map can be very big. Converting them to JSON may take a huge 
> amount of memory, which may trigger an out-of-memory error.
> 2. Some user space input may also be propagated to the plan. The user space 
> input can be of arbitrary type, and may also be self-referencing. Trying to 
> print user space input to JSON is very risky.
> The following example triggers a StackOverflowError when calling toJSON on a 
> plan with a user-defined UDF.
> {code}
> case class SelfReferenceUDF(
> var config: Map[String, Any] = Map.empty[String, Any]) extends 
> Function1[String, Boolean] {
>   config += "self" -> this
>   def apply(key: String): Boolean = config.contains(key)
> }
> test("toJSON should not throws java.lang.StackOverflowError") {
>   val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
>   // triggers java.lang.StackOverflowError
>   udf.toJSON
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases

2016-09-06 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17426:
---
Component/s: SQL

> Current TreeNode.toJSON may trigger OOM under some corner cases
> ---
>
> Key: SPARK-17426
> URL: https://issues.apache.org/jira/browse/SPARK-17426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> In SPARK-17356, we fixed the OOM issue when {{Metadata}} is super big. There 
> are other cases that may also trigger OOM. The current implementation of 
> TreeNode.toJSON recursively searches and prints all fields of the current 
> TreeNode, even if a field is of type Seq or type Map. 
> This is not safe because:
> 1. The Seq or Map can be very big. Converting them to JSON may take a huge 
> amount of memory, which may trigger an out-of-memory error.
> 2. Some user space input may also be propagated to the plan. The input can be 
> of arbitrary type, and may also be self-referencing. Trying to print user 
> space input to JSON is very risky.
> The following example triggers a StackOverflowError when calling toJSON on a 
> plan with a user-defined UDF.
> {code}
> case class SelfReferenceUDF(
> var config: Map[String, Any] = Map.empty[String, Any]) extends 
> Function1[String, Boolean] {
>   config += "self" -> this
>   def apply(key: String): Boolean = config.contains(key)
> }
> test("toJSON should not throws java.lang.StackOverflowError") {
>   val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
>   // triggers java.lang.StackOverflowError
>   udf.toJSON
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases

2016-09-06 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17426:
--

 Summary: Current TreeNode.toJSON may trigger OOM under some corner 
cases
 Key: SPARK-17426
 URL: https://issues.apache.org/jira/browse/SPARK-17426
 Project: Spark
  Issue Type: Bug
Reporter: Sean Zhong


In SPARK-17356, we fixed the OOM issue when {{Metadata}} is super big. There 
are other cases that may also trigger OOM. The current implementation of 
TreeNode.toJSON recursively searches and prints all fields of the current 
TreeNode, even if a field is of type Seq or type Map. 

This is not safe because:
1. The Seq or Map can be very big. Converting them to JSON may take a huge 
amount of memory, which may trigger an out-of-memory error.
2. Some user space input may also be propagated to the plan. The input can be 
of arbitrary type, and may also be self-referencing. Trying to print user space 
input to JSON is very risky.

The following example triggers a StackOverflowError when calling toJSON on a 
plan with a user-defined UDF.
{code}

case class SelfReferenceUDF(
var config: Map[String, Any] = Map.empty[String, Any]) extends 
Function1[String, Boolean] {
  config += "self" -> this
  def apply(key: String): Boolean = config.contains(key)
}

test("toJSON should not throws java.lang.StackOverflowError") {
  val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
  // triggers java.lang.StackOverflowError
  udf.toJSON
}

{code}
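
For illustration only, a minimal sketch of the kind of guard this issue 
suggests; the helper name shouldConvertToJson is hypothetical and not Spark's 
actual API. The idea is to restrict JSON conversion to values of small, 
known-safe types and skip everything else.
{code}
// Hypothetical helper (illustration only, not Spark's actual change): only
// small, known-safe values are serialized; Seq, Map, and arbitrary user
// objects would be replaced by a placeholder instead of being recursed into.
def shouldConvertToJson(value: Any): Boolean = value match {
  case _: String | _: Number | _: Boolean => true
  case _ => false
}
{code}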



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17302) Cannot set non-Spark SQL session variables in hive-site.xml, spark-defaults.conf, or using --conf

2016-09-05 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15465818#comment-15465818
 ] 

Sean Zhong commented on SPARK-17302:


[~rdblue] Could you write a small reproducer that describes your problem, 
elaborating on what behavior you expect and what you currently observe?

I am not sure whether {{spark.conf.set("key", "value")}} is what you expect or 
not.
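
For concreteness, the suggestion above refers to something like the following 
(a sketch that reuses the property names from this issue; whether it actually 
covers the hive-site.xml / --conf cases is exactly the open question here):
{code}
// Hedged sketch: set Hive session properties on the runtime SQLConf.
// The property names are taken from the issue description below.
spark.conf.set("hive.exec.compress.output", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
{code}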

> Cannot set non-Spark SQL session variables in hive-site.xml, 
> spark-defaults.conf, or using --conf
> -
>
> Key: SPARK-17302
> URL: https://issues.apache.org/jira/browse/SPARK-17302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ryan Blue
>
> When configuration changed for 2.0 to the new SparkSession structure, Spark 
> stopped using Hive's internal HiveConf for session state and now uses 
> HiveSessionState and an associated SQLConf. Now, session options like 
> hive.exec.compress.output and hive.exec.dynamic.partition.mode are pulled 
> from this SQLConf. This doesn't include session properties from hive-site.xml 
> (including hive.exec.compress.output), and no longer contains Spark-specific 
> overrides from spark-defaults.conf that used the spark.hadoop.hive... pattern.
> Also, setting these variables on the command-line no longer works because 
> settings must start with "spark.".
> Is there a recommended way to set Hive session properties?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17364) Can not query hive table starting with number

2016-09-05 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15465754#comment-15465754
 ] 

Sean Zhong edited comment on SPARK-17364 at 9/5/16 8:47 PM:


[~epahomov] 

Spark 2.0 rewrote the SQL parser with ANTLR. 
You can use the backticked version as in Yuming's example.


was (Author: clockfly):
[~epahomov] 

Spark 2.0 rewrites the Sql parser with antlr. 
You can use the backticked version like Yuming's example.

> Can not query hive table starting with number
> -
>
> Key: SPARK-17364
> URL: https://issues.apache.org/jira/browse/SPARK-17364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Egor Pahomov
>
> I can do it with spark-1.6.2
> {code}
> SELECT * from  temp.20160826_ip_list limit 100
> {code}
> {code}
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 
> 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
> 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19)
> == SQL ==
> SELECT * from  temp.20160826_ip_list limit 100
> ---^^^
> SQLState:  null
> ErrorCode: 0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17364) Can not query hive table starting with number

2016-09-05 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15465754#comment-15465754
 ] 

Sean Zhong commented on SPARK-17364:


[~epahomov] 

Spark 2.0 rewrites the SQL parser with ANTLR. 
You can use the backticked version as in Yuming's example.

> Can not query hive table starting with number
> -
>
> Key: SPARK-17364
> URL: https://issues.apache.org/jira/browse/SPARK-17364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Egor Pahomov
>
> I can do it with spark-1.6.2
> {code}
> SELECT * from  temp.20160826_ip_list limit 100
> {code}
> {code}
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 
> 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
> 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19)
> == SQL ==
> SELECT * from  temp.20160826_ip_list limit 100
> ---^^^
> SQLState:  null
> ErrorCode: 0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17374) Improves the error message when fails to parse some json file lines in DataFrameReader

2016-09-02 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17374:
--

 Summary: Improves the error message when fails to parse some json 
file lines in DataFrameReader
 Key: SPARK-17374
 URL: https://issues.apache.org/jira/browse/SPARK-17374
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Sean Zhong


Demo case:

We silently replace a corrupted line with null, without any error message.

{code}
import org.apache.spark.sql.types._
val corruptRecords = spark.sparkContext.parallelize("""{"a":{, b:3}""" :: Nil)
val schema = StructType(StructField("a", StringType, true) :: Nil)
val jsonDF = spark.read.schema(schema).json(corruptRecords)
scala> jsonDF.show
++
|   a|
++
|null|
++
{code}
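
Until the error message is improved, two ways to surface the corrupt line with 
the existing 2.0 JSON reader options (a hedged sketch; option 2 relies on the 
default spark.sql.columnNameOfCorruptRecord name "_corrupt_record"):
{code}
import org.apache.spark.sql.types._

val corruptRecords = spark.sparkContext.parallelize("""{"a":{, b:3}""" :: Nil)

// Option 1: fail loudly instead of silently producing null.
val schema = StructType(StructField("a", StringType, true) :: Nil)
spark.read.schema(schema).option("mode", "FAILFAST").json(corruptRecords).show()

// Option 2: keep the malformed line in the corrupt-record column.
val schemaWithCorrupt = StructType(
  StructField("a", StringType, true) ::
  StructField("_corrupt_record", StringType, true) :: Nil)
spark.read.schema(schemaWithCorrupt).json(corruptRecords).show(false)
{code}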




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17369) MetastoreRelation toJSON throws exception

2016-09-01 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17369:
--

 Summary: MetastoreRelation toJSON throws exception
 Key: SPARK-17369
 URL: https://issues.apache.org/jira/browse/SPARK-17369
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Sean Zhong
Priority: Minor


MetastoreRelation.toJSON currently throws an exception.

Test case:
{code}
package org.apache.spark.sql.hive

import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, 
CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class MetastoreRelationSuite extends SparkFunSuite {
  test("makeCopy and toJSON should work") {
val table = CatalogTable(
  identifier = TableIdentifier("test", Some("db")),
  tableType = CatalogTableType.VIEW,
  storage = CatalogStorageFormat.empty,
  schema = StructType(StructField("a", IntegerType, true) :: Nil))
val relation = MetastoreRelation("db", "test")(table, null, null)

// No exception should be thrown
relation.makeCopy(Array("db", "test"))
// No exception should be thrown
relation.toJSON
  }
}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17356) Out of memory when calling TreeNode.toJSON

2016-09-01 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454553#comment-15454553
 ] 

Sean Zhong commented on SPARK-17356:


Reproducer:

{code}
# Trigger OOM
scala> :paste -raw
// Entering paste mode (ctrl-D to finish)

package org.apache.spark.ml.attribute

import org.apache.spark.ml.attribute._
import org.apache.spark.sql.catalyst.expressions.{Alias, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
import org.apache.spark.sql.catalyst.dsl.plans._

object Test {
  def main(args: Array[String]): Unit = {
val rand = new java.util.Random()
val attr: Attribute = new BinaryAttribute(Some("a"), 
Some(rand.nextInt(10)), Some(Array("value1", "value2")))
val attributeGroup = new AttributeGroup("group", Array.fill(100)(attr))
val alias = Alias(Literal(0), "alias")(explicitMetadata = 
Some(attributeGroup.toMetadata()))
val testRelation = LocalRelation()
val query = testRelation.select((0 to 100).toSeq.map(_ => alias): _*)
System.out.print(query.toJSON.length)
  }
}

// Exiting paste mode, now interpreting.

scala> org.apache.spark.ml.attribute.Test.main(null)
{code}

> Out of memory when calling TreeNode.toJSON
> --
>
> Key: SPARK-17356
> URL: https://issues.apache.org/jira/browse/SPARK-17356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
> Attachments: jmap.txt, jstack.txt, queryplan.txt
>
>
> When using MLlib, calling toJSON on a plan with many levels of 
> sub-queries may cause an out-of-memory exception with a stack trace like this
> {code}
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at scala.collection.mutable.AbstractSeq.<init>(Seq.scala:47)
>   at scala.collection.mutable.AbstractBuffer.<init>(Buffer.scala:48)
>   at scala.collection.mutable.ListBuffer.<init>(ListBuffer.scala:46)
>   at scala.collection.immutable.List$.newBuilder(List.scala:396)
>   at 
> scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64)
>   at 
> scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105)
>   at 
> scala.collection.TraversableLike$class.filter(TraversableLike.scala:262)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
>   at 
> scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274)
>   at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
>   at 
> com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
>   at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34)
>   at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566)
> {code}
> The query plan, stack trace, and jmap distribution are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17356) Out of memory when calling TreeNode.toJSON

2016-09-01 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454495#comment-15454495
 ] 

Sean Zhong edited comment on SPARK-17356 at 9/1/16 6:38 AM:


*Root cause:*

1. MLlib heavily leverages Metadata to store a lot of attribute information; in 
the case here, the metadata may contain tens of thousands of Attribute 
entries. And the metadata may be stored in an Alias expression like this:
{code}
case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)
{code} 

If we serialize the metadata to JSON, it will take a huge amount of memory.



was (Author: clockfly):
Root cause:

1. MLLib heavily leverage MetaData to store a lot of attribute information, in 
the case here, the metadata may contains tens of thousands of Attribute 
information. And the meta data may be stored to Alias expression like this:
{code}
case class Alias(child: Expression, name: String)(
val exprId: ExprId = NamedExpression.newExprId,
val qualifier: Option[String] = None,
val explicitMetadata: Option[Metadata] = None,
override val isGenerated: java.lang.Boolean = false)
{code} 

If we serialize the meta data to JSON, it will take a huge amount of memory.


> Out of memory when calling TreeNode.toJSON
> --
>
> Key: SPARK-17356
> URL: https://issues.apache.org/jira/browse/SPARK-17356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
> Attachments: jmap.txt, jstack.txt, queryplan.txt
>
>
> When using MLlib, calling toJSON on a plan with many levels of 
> sub-queries may cause an out-of-memory exception with a stack trace like this
> {code}
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at scala.collection.mutable.AbstractSeq.<init>(Seq.scala:47)
>   at scala.collection.mutable.AbstractBuffer.<init>(Buffer.scala:48)
>   at scala.collection.mutable.ListBuffer.<init>(ListBuffer.scala:46)
>   at scala.collection.immutable.List$.newBuilder(List.scala:396)
>   at 
> scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64)
>   at 
> scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105)
>   at 
> scala.collection.TraversableLike$class.filter(TraversableLike.scala:262)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
>   at 
> scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274)
>   at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
>   at 
> com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
>   at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34)
>   at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566)
> {code}
> The query plan, stack trace, and jmap distribution are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17356) Out of memory when calling TreeNode.toJSON

2016-09-01 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454495#comment-15454495
 ] 

Sean Zhong commented on SPARK-17356:


Root cause:

1. MLlib heavily leverages Metadata to store a lot of attribute information; in 
the case here, the metadata may contain tens of thousands of Attribute 
entries. And the metadata may be stored in an Alias expression like this:
{code}
case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)
{code} 

If we serialize the metadata to JSON, it will take a huge amount of memory.
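
As a hedged workaround sketch (not the actual fix): strip the explicit 
metadata from Alias expressions before serializing a plan, so that toJSON 
stays small. {{query}} stands for any LogicalPlan, e.g. the one built in the 
reproducer earlier in this thread.
{code}
import org.apache.spark.sql.catalyst.expressions.Alias

// Workaround sketch only: rebuild each Alias without its explicitMetadata
// (keeping the same exprId and qualifier), then serialize the lighter plan.
val lightPlan = query.transformExpressions {
  case a: Alias if a.explicitMetadata.isDefined =>
    Alias(a.child, a.name)(a.exprId, a.qualifier, None, a.isGenerated)
}
println(lightPlan.toJSON.length)
{code}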


> Out of memory when calling TreeNode.toJSON
> --
>
> Key: SPARK-17356
> URL: https://issues.apache.org/jira/browse/SPARK-17356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
> Attachments: jmap.txt, jstack.txt, queryplan.txt
>
>
> When using MLlib, calling toJSON on a plan with many levels of 
> sub-queries may cause an out-of-memory exception with a stack trace like this
> {code}
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at scala.collection.mutable.AbstractSeq.<init>(Seq.scala:47)
>   at scala.collection.mutable.AbstractBuffer.<init>(Buffer.scala:48)
>   at scala.collection.mutable.ListBuffer.<init>(ListBuffer.scala:46)
>   at scala.collection.immutable.List$.newBuilder(List.scala:396)
>   at 
> scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64)
>   at 
> scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105)
>   at 
> scala.collection.TraversableLike$class.filter(TraversableLike.scala:262)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
>   at 
> scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274)
>   at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
>   at 
> com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
>   at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34)
>   at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566)
> {code}
> The query plan, stack trace, and jmap distribution are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17356) Out of memory when calling TreeNode.toJSON

2016-09-01 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454488#comment-15454488
 ] 

Sean Zhong commented on SPARK-17356:


*Analysis*

After looking at the jmap output, there is a suspicious line
{code}
  20: 18388624  [Lorg.apache.spark.ml.attribute.Attribute;
{code}

This means a single Attribute array takes more than 8388624 bytes, and if each 
reference takes 8 bytes, there are roughly 1 million attribute references in it.

The array is probably used in AttributeGroup, whose signature is:
{code}
class AttributeGroup private (
    val name: String,
    val numAttributes: Option[Int],
    attrs: Option[Array[Attribute]])
{code}

And, in AttributeGroup, there is a toMetadata function which converts the 
Attribute array to Metadata:
{code}
  def toMetadata(): Metadata = toMetadata(Metadata.empty)
{code}

Finally, the metadata is saved into expression attributes.
 
For example, in the org.apache.spark.ml.feature.Interaction transform function, 
the metadata is attached to the Alias expression when aliasing the UDF output, 
like this:

{code}
  override def transform(dataset: Dataset[_]): DataFrame = {
...
// !NOTE!: This is an attribute group
val featureAttrs = getFeatureAttrs(inputFeatures)

def interactFunc = udf { row: Row =>
  ...
}

val featureCols = inputFeatures.map { f =>
  f.dataType match {
case DoubleType => dataset(f.name)
case _: VectorUDT => dataset(f.name)
case _: NumericType | BooleanType => dataset(f.name).cast(DoubleType)
  }
}

// !NOTE!: The metadata is stored in the Alias expression by the call
// .as(..., featureAttrs.toMetadata())
dataset.select(
  col("*"),
  interactFunc(struct(featureCols: _*)).as($(outputCol), 
featureAttrs.toMetadata()))
  }

{code}

And when calling toJSON, the metadata will be converted to JSON.
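
A quick way to see how much of that JSON comes from a single attribute group 
(a sketch reusing {{attributeGroup}} from the reproducer earlier in this 
thread; {{Metadata.json}} gives the serialized form):
{code}
// Rough size check: the metadata JSON attached to every such Alias.
val metadataJson = attributeGroup.toMetadata().json
println(s"metadata JSON size per Alias: ${metadataJson.length} bytes")
{code}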

> Out of memory when calling TreeNode.toJSON
> --
>
> Key: SPARK-17356
> URL: https://issues.apache.org/jira/browse/SPARK-17356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
> Attachments: jmap.txt, jstack.txt, queryplan.txt
>
>
> When using MLlib, calling toJSON on a plan with many levels of 
> sub-queries may cause an out-of-memory exception with a stack trace like this
> {code}
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at scala.collection.mutable.AbstractSeq.<init>(Seq.scala:47)
>   at scala.collection.mutable.AbstractBuffer.<init>(Buffer.scala:48)
>   at scala.collection.mutable.ListBuffer.<init>(ListBuffer.scala:46)
>   at scala.collection.immutable.List$.newBuilder(List.scala:396)
>   at 
> scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64)
>   at 
> scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105)
>   at 
> scala.collection.TraversableLike$class.filter(TraversableLike.scala:262)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
>   at 
> scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274)
>   at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
>   at 
> com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
>   at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34)
>   at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566)
> {code}
> The query plan, stack trace, and jmap distribution are attached.



--
This message was sent 

[jira] [Updated] (SPARK-17356) Out of memory when calling TreeNode.toJSON

2016-09-01 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17356:
---
Description: 
When using MLlib, calling toJSON on a plan with many levels of sub-queries 
may cause an out-of-memory exception with a stack trace like this
{code}
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.mutable.AbstractSeq.<init>(Seq.scala:47)
at scala.collection.mutable.AbstractBuffer.<init>(Buffer.scala:48)
at scala.collection.mutable.ListBuffer.<init>(ListBuffer.scala:46)
at scala.collection.immutable.List$.newBuilder(List.scala:396)
at 
scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64)
at 
scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105)
at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:262)
at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
at 
scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274)
at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at 
com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at 
com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
at 
com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34)
at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566)
{code}

The query plan, stack trace, and jmap distribution are attached.



  was:
When using MLLib, TreeNode.toJSON may cause out of memory exception with stack 
trace like this
{code}
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.mutable.AbstractSeq.<init>(Seq.scala:47)
at scala.collection.mutable.AbstractBuffer.<init>(Buffer.scala:48)
at scala.collection.mutable.ListBuffer.<init>(ListBuffer.scala:46)
at scala.collection.immutable.List$.newBuilder(List.scala:396)
at 
scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64)
at 
scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105)
at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:262)
at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
at 
scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274)
at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at 
com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at 
com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
at 
com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34)
at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566)
{code}

[jira] [Updated] (SPARK-17356) Out of memory when calling TreeNode.toJSON

2016-09-01 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17356:
---
Attachment: jstack.txt

> Out of memory when calling TreeNode.toJSON
> --
>
> Key: SPARK-17356
> URL: https://issues.apache.org/jira/browse/SPARK-17356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
> Attachments: jmap.txt, jstack.txt, queryplan.txt
>
>
> When using MLLib, TreeNode.toJSON may cause out of memory exception with 
> stack trace like this
> {code}
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at scala.collection.mutable.AbstractSeq.(Seq.scala:47)
>   at scala.collection.mutable.AbstractBuffer.(Buffer.scala:48)
>   at scala.collection.mutable.ListBuffer.(ListBuffer.scala:46)
>   at scala.collection.immutable.List$.newBuilder(List.scala:396)
>   at 
> scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64)
>   at 
> scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105)
>   at 
> scala.collection.TraversableLike$class.filter(TraversableLike.scala:262)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
>   at 
> scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274)
>   at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
>   at 
> com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
>   at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34)
>   at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566){code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17356) Out of memory when calling TreeNode.toJSON

2016-09-01 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17356:
---
Attachment: jmap.txt

> Out of memory when calling TreeNode.toJSON
> --
>
> Key: SPARK-17356
> URL: https://issues.apache.org/jira/browse/SPARK-17356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
> Attachments: jmap.txt, queryplan.txt
>
>
> When using MLLib, TreeNode.toJSON may cause out of memory exception with 
> stack trace like this
> {code}
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at scala.collection.mutable.AbstractSeq.(Seq.scala:47)
>   at scala.collection.mutable.AbstractBuffer.(Buffer.scala:48)
>   at scala.collection.mutable.ListBuffer.(ListBuffer.scala:46)
>   at scala.collection.immutable.List$.newBuilder(List.scala:396)
>   at 
> scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64)
>   at 
> scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105)
>   at 
> scala.collection.TraversableLike$class.filter(TraversableLike.scala:262)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
>   at 
> scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274)
>   at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
>   at 
> com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
>   at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34)
>   at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566){code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17356) Out of memory when calling TreeNode.toJSON

2016-09-01 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17356:
---
Attachment: queryplan.txt

> Out of memory when calling TreeNode.toJSON
> --
>
> Key: SPARK-17356
> URL: https://issues.apache.org/jira/browse/SPARK-17356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
> Attachments: queryplan.txt
>
>
> When using MLLib, TreeNode.toJSON may cause out of memory exception with 
> stack trace like this
> {code}
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at scala.collection.mutable.AbstractSeq.(Seq.scala:47)
>   at scala.collection.mutable.AbstractBuffer.(Buffer.scala:48)
>   at scala.collection.mutable.ListBuffer.(ListBuffer.scala:46)
>   at scala.collection.immutable.List$.newBuilder(List.scala:396)
>   at 
> scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64)
>   at 
> scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105)
>   at 
> scala.collection.TraversableLike$class.filter(TraversableLike.scala:262)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
>   at 
> scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274)
>   at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at 
> org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
>   at 
> com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
>   at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34)
>   at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566){code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17356) Out of memory when calling TreeNode.toJSON

2016-08-31 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17356:
--

 Summary: Out of memory when calling TreeNode.toJSON
 Key: SPARK-17356
 URL: https://issues.apache.org/jira/browse/SPARK-17356
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Sean Zhong


When using MLLib, TreeNode.toJSON may cause out of memory exception with stack 
trace like this
{code}
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.mutable.AbstractSeq.<init>(Seq.scala:47)
at scala.collection.mutable.AbstractBuffer.<init>(Buffer.scala:48)
at scala.collection.mutable.ListBuffer.<init>(ListBuffer.scala:46)
at scala.collection.immutable.List$.newBuilder(List.scala:396)
at 
scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64)
at 
scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105)
at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:262)
at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
at 
scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274)
at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at 
com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at 
com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
at 
com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34)
at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566){code}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17306) Memory leak in QuantileSummaries

2016-08-29 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17306:
---
Component/s: SQL

> Memory leak in QuantileSummaries
> 
>
> Key: SPARK-17306
> URL: https://issues.apache.org/jira/browse/SPARK-17306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> compressThreshold was not referenced anywhere
> {code}
> class QuantileSummaries(
> val compressThreshold: Int,
> val relativeError: Double,
> val sampled: ArrayBuffer[Stats] = ArrayBuffer.empty,
> private[stat] var count: Long = 0L,
> val headSampled: ArrayBuffer[Double] = ArrayBuffer.empty) extends 
> Serializable
> {code}
> And, it causes memory leak, QuantileSummaries takes unbounded memory
> {code}
> val summary = new QuantileSummaries(1, relativeError = 0.001)
> // Results in creating an array of size 1 !!! 
> (1 to 1).foreach(summary.insert(_))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17306) Memory leak in QuantileSummaries

2016-08-29 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17306:
--

 Summary: Memory leak in QuantileSummaries
 Key: SPARK-17306
 URL: https://issues.apache.org/jira/browse/SPARK-17306
 Project: Spark
  Issue Type: Bug
Reporter: Sean Zhong


compressThreshold was not referenced anywhere

{code}
class QuantileSummaries(
val compressThreshold: Int,
val relativeError: Double,
val sampled: ArrayBuffer[Stats] = ArrayBuffer.empty,
private[stat] var count: Long = 0L,
val headSampled: ArrayBuffer[Double] = ArrayBuffer.empty) extends 
Serializable
{code}

As a result, it causes a memory leak: QuantileSummaries takes unbounded memory
{code}
val summary = new QuantileSummaries(1, relativeError = 0.001)
// Results in creating an array of size 1 !!! 
(1 to 1).foreach(summary.insert(_))
{code}
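
As a caller-side workaround sketch (an assumption on my part, relying only on the 
public insert/compress methods rather than on the eventual fix), compressing 
periodically keeps the buffers bounded:

{code}
import org.apache.spark.sql.execution.stat.QuantileSummaries

// Because compressThreshold is ignored, every inserted value is retained; calling
// compress() manually every `threshold` inserts keeps sampled/headSampled bounded.
val threshold = 1000
var summary = new QuantileSummaries(threshold, relativeError = 0.001)
(1 to 1000000).foreach { i =>
  summary = summary.insert(i.toDouble)
  if (i % threshold == 0) summary = summary.compress()
}
{code}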



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-29 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17289:
---
Description: 
For the following query:

{code}
val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
"b").createOrReplaceTempView("t2")
spark.sql("select max(b) from t2 group by a").explain(true)
{code}

Now, the SortAggregator won't insert a Sort operator before the partial 
aggregation, which breaks sort-based partial aggregation.

{code}
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
  +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
max#19])
 +- LocalTableScan [a#5, b#6]
{code}

In Spark 2.0 branch, the plan is:
{code}
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
  +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
max#19])
 +- *Sort [a#5 ASC], false, 0
+- LocalTableScan [a#5, b#6]
{code}

This is related to SPARK-12978
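
One way to read the regression is that the partial SortAggregate no longer 
requires its child to be sorted on the grouping keys; a rough sketch of that 
requirement (a simplified assumption for illustration, not the actual patch):

{code}
import org.apache.spark.sql.catalyst.expressions.{Ascending, NamedExpression, SortOrder}

// Sketch: the child ordering a sort-based (partial) aggregate should require, so
// that the planner places a Sort [a#5 ASC] below the partial SortAggregate again.
def requiredOrderingForGroups(groupingExpressions: Seq[NamedExpression]): Seq[SortOrder] =
  groupingExpressions.map(SortOrder(_, Ascending))
{code}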

  was:
For the following query:

{code}
val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
"b").createOrReplaceTempView("t2")
spark.sql("select max(b) from t2 group by a").explain(true)
{code}

Now, the SortAggregator won't insert Sort operator before partial aggregation, 
this will break sort-based partial aggregation.

{code}
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
  +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
max#19])
 +- LocalTableScan [a#5, b#6]
{code}

In Spark 2.0 branch, the plan is:
{code}
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
  +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
max#19])
 +- *Sort [a#5 ASC], false, 0
+- LocalTableScan [a#5, b#6]
{code}

This is related with SPARK-12978


> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now, the SortAggregator won't insert Sort operator before partial 
> aggregation, this will break sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-29 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17289:
--

 Summary: Sort based partial aggregation breaks due to SPARK-12978
 Key: SPARK-17289
 URL: https://issues.apache.org/jira/browse/SPARK-17289
 Project: Spark
  Issue Type: Bug
Reporter: Sean Zhong


For the following query:

{code}
val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
"b").createOrReplaceTempView("t2")
spark.sql("select max(b) from t2 group by a").explain(true)
{code}

Now, the SortAggregator won't insert a Sort operator before the partial 
aggregation, which breaks sort-based partial aggregation.

{code}
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
  +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
max#19])
 +- LocalTableScan [a#5, b#6]
{code}

In Spark 2.0 branch, the plan is:
{code}
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
  +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
max#19])
 +- *Sort [a#5 ASC], false, 0
+- LocalTableScan [a#5, b#6]
{code}

This is related to SPARK-12978



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17189) [MINOR] Looses the interface from UnsafeRow to InternalRow in AggregationIterator if UnsafeRow specific method is not used

2016-08-22 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17189:
---
Component/s: SQL

> [MINOR] Looses the interface from UnsafeRow to InternalRow in 
> AggregationIterator if UnsafeRow specific method is not used
> --
>
> Key: SPARK-17189
> URL: https://issues.apache.org/jira/browse/SPARK-17189
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17189) [MINOR] Looses the interface from UnsafeRow to InternalRow in AggregationIterator if UnsafeRow specific method is not used

2016-08-22 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17189:
--

 Summary: [MINOR] Looses the interface from UnsafeRow to 
InternalRow in AggregationIterator if UnsafeRow specific method is not used
 Key: SPARK-17189
 URL: https://issues.apache.org/jira/browse/SPARK-17189
 Project: Spark
  Issue Type: Bug
Reporter: Sean Zhong
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function

2016-08-22 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431136#comment-15431136
 ] 

Sean Zhong edited comment on SPARK-16283 at 8/22/16 4:35 PM:
-

Created a sub-task SPARK-17188 to move QuantileSummaries to package 
org.apache.spark.sql.util of catalyst project


was (Author: clockfly):
Created a sub-task to move QuantileSummaries to package 
org.apache.spark.sql.util of catalyst project

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2016-08-22 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431136#comment-15431136
 ] 

Sean Zhong commented on SPARK-16283:


Created a sub-task to move QuantileSummaries to package 
org.apache.spark.sql.util of catalyst project

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx

2016-08-22 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17188:
---
Description: 
QuantileSummaries is a useful utility class for statistics. It can be used by 
aggregation functions like percentile_approx.

Currently, QuantileSummaries is located in the sql project, in package 
org.apache.spark.sql.execution.stat; we should probably move it to the catalyst 
project, package org.apache.spark.sql.util.

  was:org.apache.spark.sql.execution.stat


> Moves QuantileSummaries to project catalyst from sql so that it can be used 
> to implement percentile_approx
> --
>
> Key: SPARK-17188
> URL: https://issues.apache.org/jira/browse/SPARK-17188
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sean Zhong
>
> QuantileSummaries is a useful utility class to do statistics. It can be used 
> by aggregation function like percentile_approx.
> Currently, QuantileSummaries is located at project catalyst in package 
> org.apache.spark.sql.execution.stat, probably, we should move it to project 
> catalyst package org.apache.spark.sql.util.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx

2016-08-22 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17188:
--

 Summary: Moves QuantileSummaries to project catalyst from sql so 
that it can be used to implement percentile_approx
 Key: SPARK-17188
 URL: https://issues.apache.org/jira/browse/SPARK-17188
 Project: Spark
  Issue Type: Sub-task
Reporter: Sean Zhong






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx

2016-08-22 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17188:
---
Description: org.apache.spark.sql.execution.stat

> Moves QuantileSummaries to project catalyst from sql so that it can be used 
> to implement percentile_approx
> --
>
> Key: SPARK-17188
> URL: https://issues.apache.org/jira/browse/SPARK-17188
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sean Zhong
>
> org.apache.spark.sql.execution.stat



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object

2016-08-22 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17187:
---
Description: 
*Background*

For aggregation functions like sum and count, Spark SQL internally uses an 
aggregation buffer to store the intermediate aggregation results of all 
aggregation functions. Each aggregation function occupies a section of the 
aggregation buffer.

*Problem*

Currently, Spark SQL only allows a small set of supported storage data types to 
be stored in the aggregation buffer, which is neither convenient nor performant. 
There are several typical cases:

1. If the aggregation uses a complex model like CountMinSketch, it is not easy 
to convert that model so that it can be stored with the limited set of Spark SQL 
supported data types.

2. It is hard to reuse aggregation class definitions from existing 
libraries like Algebird.

3. It may introduce heavy serialization/deserialization costs when converting a 
domain model to a Spark SQL supported data type. For example, the current 
implementation of `TypedAggregateExpression` requires 
serialization/deserialization for each call of update or merge.

*Proposal*

We propose:

1. Introduce a TypedImperativeAggregate which allows using an arbitrary Java 
object as the aggregation buffer, with requirements like:
 -  It is flexible enough that the API allows using any Java object as the 
aggregation buffer, so that it is easier to integrate with existing Monoid 
libraries like Algebird.
 -  We don't need to call serialize/deserialize for each call of 
update/merge; instead, only a few serialization/deserialization operations are 
needed. This is to guarantee the performance.
2. Refactor `TypedAggregateExpression` to use this new interface, to get 
higher performance.
3. Implement approx-percentile (percentile_approx) and other aggregation 
functions that have complex aggregation objects with this new interface.

  was:
*Background*

For aggregation functions like sum and count, Spark-Sql internally use an 
aggregation buffer to store the intermediate aggregation result for all 
aggregation functions. Each aggregation function will occupy a section of 
aggregation buffer.

*Problem*

Currently, Spark-sql only allows a small set of Spark-Sql supported storage 
data types stored in aggregation buffer, which is not very convenient or 
performant, there are several typical cases like:

1. If the aggregation has a complex model CountMinSketch, it is not very easy 
to convert the complex model so that it can be stored with limited Spark-sql 
supported data types.
2. It is hard to reuse aggregation class definition defined in existing 
libraries like algebird.
3. It may introduces heavy serialization/deserialization cost when converting a 
domain model to Spark sql supported data type. For example, the current 
implementation of `TypedAggregateExpression` requires 
serialization/de-serialization for each call of update or merge.

*Proposal*

We propose:
1. Introduces a TypedImperativeAggregate which allows using arbitrary java 
object as aggregation buffer, with requirements like:
a. It is flexible enough that the API allows using any java object as 
aggregation buffer, so that it is easier to integrate with existing Monoid 
libraries like algebird.
b. We don't need to call serialize/deserialize for each call of 
update/merge. Instead, only a few serialization/deserialization operations are 
needed. This is to guarantee the performance.
2.  Refactors `TypedAggregateExpression` to use this new interface, to get 
higher performance.
3. Implements Appro-Percentile and other aggregation functions which has a 
complex aggregation object with this new interface.


> Support using arbitrary Java object as internal aggregation buffer object
> -
>
> Key: SPARK-17187
> URL: https://issues.apache.org/jira/browse/SPARK-17187
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Sean Zhong
>
> *Background*
> For aggregation functions like sum and count, Spark-Sql internally use an 
> aggregation buffer to store the intermediate aggregation result for all 
> aggregation functions. Each aggregation function will occupy a section of 
> aggregation buffer.
> *Problem*
> Currently, Spark-sql only allows a small set of Spark-Sql supported storage 
> data types stored in aggregation buffer, which is not very convenient or 
> performant, there are several typical cases like:
> 1. If the aggregation has a complex model CountMinSketch, it is not very easy 
> to convert the complex model so that it can be stored with limited Spark-sql 
> supported data types.
> 2. It is hard to reuse aggregation class definition defined in existing 
> libraries like algebird.
> 3. It may introduces heavy serialization/deserialization cost 

[jira] [Updated] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object

2016-08-22 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17187:
---
Description: 
*Background*

For aggregation functions like sum and count, Spark SQL internally uses an 
aggregation buffer to store the intermediate aggregation results of all 
aggregation functions. Each aggregation function occupies a section of the 
aggregation buffer.

*Problem*

Currently, Spark SQL only allows a small set of supported storage data types to 
be stored in the aggregation buffer, which is neither convenient nor performant. 
There are several typical cases:

1. If the aggregation uses a complex model like CountMinSketch, it is not easy 
to convert that model so that it can be stored with the limited set of Spark SQL 
supported data types.

2. It is hard to reuse aggregation class definitions from existing 
libraries like Algebird.

3. It may introduce heavy serialization/deserialization costs when converting a 
domain model to a Spark SQL supported data type. For example, the current 
implementation of `TypedAggregateExpression` requires 
serialization/deserialization for each call of update or merge.

*Proposal*

We propose:

1. Introduce a TypedImperativeAggregate which allows using an arbitrary Java 
object as the aggregation buffer, with requirements like:
 -  It is flexible enough that the API allows using any Java object as the 
aggregation buffer, so that it is easier to integrate with existing Monoid 
libraries like Algebird.
 -  We don't need to call serialize/deserialize for each call of 
update/merge; instead, only a few serialization/deserialization operations are 
needed. This is to guarantee the performance.

2. Refactor `TypedAggregateExpression` to use this new interface, to get 
higher performance.

3. Implement approx-percentile (percentile_approx) and other aggregation 
functions that have complex aggregation objects with this new interface.

  was:
*Background*

For aggregation functions like sum and count, Spark-Sql internally use an 
aggregation buffer to store the intermediate aggregation result for all 
aggregation functions. Each aggregation function will occupy a section of 
aggregation buffer.

*Problem*

Currently, Spark-sql only allows a small set of Spark-Sql supported storage 
data types stored in aggregation buffer, which is not very convenient or 
performant, there are several typical cases like:

1. If the aggregation has a complex model CountMinSketch, it is not very easy 
to convert the complex model so that it can be stored with limited Spark-sql 
supported data types.

2. It is hard to reuse aggregation class definition defined in existing 
libraries like algebird.

3. It may introduces heavy serialization/deserialization cost when converting a 
domain model to Spark sql supported data type. For example, the current 
implementation of `TypedAggregateExpression` requires 
serialization/de-serialization for each call of update or merge.

*Proposal*

We propose:

1. Introduces a TypedImperativeAggregate which allows using arbitrary java 
object as aggregation buffer, with requirements like:
 -  It is flexible enough that the API allows using any java object as 
aggregation buffer, so that it is easier to integrate with existing Monoid 
libraries like algebird.
 -  We don't need to call serialize/deserialize for each call of 
update/merge. Instead, only a few serialization/deserialization operations are 
needed. This is to guarantee theperformance.
2.  Refactors `TypedAggregateExpression` to use this new interface, to get 
higher performance.
3. Implements Appro-Percentile and other aggregation functions which has a 
complex aggregation object with this new interface.


> Support using arbitrary Java object as internal aggregation buffer object
> -
>
> Key: SPARK-17187
> URL: https://issues.apache.org/jira/browse/SPARK-17187
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Sean Zhong
>
> *Background*
> For aggregation functions like sum and count, Spark-Sql internally use an 
> aggregation buffer to store the intermediate aggregation result for all 
> aggregation functions. Each aggregation function will occupy a section of 
> aggregation buffer.
> *Problem*
> Currently, Spark-sql only allows a small set of Spark-Sql supported storage 
> data types stored in aggregation buffer, which is not very convenient or 
> performant, there are several typical cases like:
> 1. If the aggregation has a complex model CountMinSketch, it is not very easy 
> to convert the complex model so that it can be stored with limited Spark-sql 
> supported data types.
> 2. It is hard to reuse aggregation class definition defined in existing 
> libraries like algebird.
> 3. It may introduces heavy 

[jira] [Created] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object

2016-08-22 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17187:
--

 Summary: Support using arbitrary Java object as internal 
aggregation buffer object
 Key: SPARK-17187
 URL: https://issues.apache.org/jira/browse/SPARK-17187
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Sean Zhong


*Background*

For aggregation functions like sum and count, Spark SQL internally uses an 
aggregation buffer to store the intermediate aggregation results of all 
aggregation functions. Each aggregation function occupies a section of the 
aggregation buffer.

*Problem*

Currently, Spark SQL only allows a small set of supported storage data types to 
be stored in the aggregation buffer, which is neither convenient nor performant. 
There are several typical cases:

1. If the aggregation uses a complex model like CountMinSketch, it is not easy 
to convert that model so that it can be stored with the limited set of Spark SQL 
supported data types.
2. It is hard to reuse aggregation class definitions from existing libraries 
like Algebird.
3. It may introduce heavy serialization/deserialization costs when converting a 
domain model to a Spark SQL supported data type. For example, the current 
implementation of `TypedAggregateExpression` requires 
serialization/deserialization for each call of update or merge.

*Proposal*

We propose:
1. Introduce a TypedImperativeAggregate which allows using an arbitrary Java 
object as the aggregation buffer, with requirements like:
a. It is flexible enough that the API allows using any Java object as the 
aggregation buffer, so that it is easier to integrate with existing Monoid 
libraries like Algebird.
b. We don't need to call serialize/deserialize for each call of 
update/merge; instead, only a few serialization/deserialization operations are 
needed. This is to guarantee the performance.
2. Refactor `TypedAggregateExpression` to use this new interface, to get 
higher performance.
3. Implement approx-percentile (percentile_approx) and other aggregation 
functions that have complex aggregation objects with this new interface.
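
To make the proposal concrete, a minimal sketch of what such an interface could 
look like (my own illustration of the requirements above; names and signatures 
are assumptions, not the final Spark API):

{code}
import org.apache.spark.sql.catalyst.InternalRow

// Sketch: the buffer is an arbitrary Java object T (e.g. a CountMinSketch or an
// Algebird monoid value). serialize/deserialize are only needed when the buffer is
// shipped across the wire, not on every update/merge call.
abstract class TypedImperativeAggregateSketch[T] {
  def createAggregationBuffer(): T              // fresh buffer for a new group
  def update(buffer: T, input: InternalRow): T  // per-row update, no serialization
  def merge(buffer: T, other: T): T             // merge two partial buffers
  def eval(buffer: T): Any                      // produce the final result
  def serialize(buffer: T): Array[Byte]         // only at shuffle/spill boundaries
  def deserialize(bytes: Array[Byte]): T
}
{code}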



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17034) Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression

2016-08-12 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17034:
--

 Summary: Ordinal in ORDER BY or GROUP BY should be treated as an 
unresolved expression
 Key: SPARK-17034
 URL: https://issues.apache.org/jira/browse/SPARK-17034
 Project: Spark
  Issue Type: Bug
Reporter: Sean Zhong


Ordinals in GROUP BY or ORDER BY, like "1" in "order by 1" or "group by 1", 
should be considered unresolved before analysis. But in the current code, a 
"Literal" expression is used to store the ordinal. This is inappropriate because 
"Literal" is itself a resolved expression, which gives the user the wrong 
impression that the ordinal has already been resolved.
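
A rough sketch of the kind of placeholder the description argues for (an 
illustration only; the actual class used in Spark may have a different name and 
shape):

{code}
import org.apache.spark.sql.catalyst.analysis.UnresolvedException
import org.apache.spark.sql.catalyst.expressions.{LeafExpression, Unevaluable}
import org.apache.spark.sql.types.DataType

// Sketch: an ordinal in GROUP BY / ORDER BY stays unresolved until the analyzer
// substitutes the real output column it refers to.
case class UnresolvedOrdinal(ordinal: Int) extends LeafExpression with Unevaluable {
  override def dataType: DataType = throw new UnresolvedException(this, "dataType")
  override def nullable: Boolean = throw new UnresolvedException(this, "nullable")
  override lazy val resolved: Boolean = false   // unlike Literal, this is not resolved
}
{code}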



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16666) Kryo encoder for custom complex classes

2016-08-04 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408855#comment-15408855
 ] 

Sean Zhong commented on SPARK-1:


This issue has been fixed in Spark 2.0 and trunk. Can you use the latest code 
instead?
{code}
scala> import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
scala> import org.apache.spark.{SparkConf, SparkContext}
scala> case class Point(a: Int, b: Int)
scala> case class Foo(position: Point, name: String)
scala> implicit val PointEncoder: Encoder[Point] = Encoders.kryo[Point]
scala> implicit val FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
scala> Seq(new Foo(new Point(0, 0), "bar")).toDS.show
++
|   value|
++
|[01 00 24 6C 69 6...|
++
{code}

> Kryo encoder for custom complex classes
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Sam
>
> I'm trying to create a dataset with some geo data using spark and esri. If 
> `Foo` only have `Point` field, it'll work but if I add some other fields 
> beyond a `Point`, I get ArrayIndexOutOfBoundsException.
> {code:scala}
> import com.esri.core.geometry.Point
> import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
> import org.apache.spark.{SparkConf, SparkContext}
> 
> object Main {
> 
>   case class Foo(position: Point, name: String)
> 
>   object MyEncoders {
> implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point]
> 
> implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
>   }
> 
>   def main(args: Array[String]): Unit = {
> val sc = new SparkContext(new 
> SparkConf().setAppName("app").setMaster("local"))
> val sqlContext = new SQLContext(sc)
> import MyEncoders.{FooEncoder, PointEncoder}
> import sqlContext.implicits._
> Seq(new Foo(new Point(0, 0), "bar")).toDS.show
>   }
> }
> {code}
> {noformat}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
> at 
> org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71)
> at 
> org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70)
>  
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>  
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>  
> at 
> org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70)
>  
> at 
> org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69)
>  
> at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) 
> at 
> org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69)
>  
> at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65) 
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:263) 
> at org.apache.spark.sql.Dataset.show(Dataset.scala:230) 
> at org.apache.spark.sql.Dataset.show(Dataset.scala:193) 
> at org.apache.spark.sql.Dataset.show(Dataset.scala:201) 
> at Main$.main(Main.scala:24) 
> at Main.main(Main.scala)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16666) Kryo encoder for custom complex classes

2016-08-04 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong resolved SPARK-1.

Resolution: Not A Problem

> Kryo encoder for custom complex classes
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Sam
>
> I'm trying to create a dataset with some geo data using spark and esri. If 
> `Foo` only have `Point` field, it'll work but if I add some other fields 
> beyond a `Point`, I get ArrayIndexOutOfBoundsException.
> {code:scala}
> import com.esri.core.geometry.Point
> import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
> import org.apache.spark.{SparkConf, SparkContext}
> 
> object Main {
> 
>   case class Foo(position: Point, name: String)
> 
>   object MyEncoders {
> implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point]
> 
> implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
>   }
> 
>   def main(args: Array[String]): Unit = {
> val sc = new SparkContext(new 
> SparkConf().setAppName("app").setMaster("local"))
> val sqlContext = new SQLContext(sc)
> import MyEncoders.{FooEncoder, PointEncoder}
> import sqlContext.implicits._
> Seq(new Foo(new Point(0, 0), "bar")).toDS.show
>   }
> }
> {code}
> {noformat}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
> at 
> org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71)
> at 
> org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70)
>  
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>  
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>  
> at 
> org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70)
>  
> at 
> org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69)
>  
> at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) 
> at 
> org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69)
>  
> at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65) 
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:263) 
> at org.apache.spark.sql.Dataset.show(Dataset.scala:230) 
> at org.apache.spark.sql.Dataset.show(Dataset.scala:193) 
> at org.apache.spark.sql.Dataset.show(Dataset.scala:201) 
> at Main$.main(Main.scala:24) 
> at Main.main(Main.scala)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16907) Parquet table reading performance regression when vectorized record reader is not used

2016-08-04 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-16907:
--

 Summary: Parquet table reading performance regression when 
vectorized record reader is not used
 Key: SPARK-16907
 URL: https://issues.apache.org/jira/browse/SPARK-16907
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Sean Zhong


For this parquet reading benchmark, Spark 2.0 is 20%-30% slower than Spark 1.6.

{code}
// Test Env: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, Intel SSD SC2KW24
// Generates parquet table with nested columns
spark.range(1).select(struct($"id").as("nc")).write.parquet("/tmp/data4")

def time[R](block: => R): Long = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) / 1000000 + "ms")
  (t1 - t0) / 1000000
}

val x = ((0 until 20).toList.map(x =>
  time(spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()))).sum / 20
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16906) Adds more input type information for TypedAggregateExpression

2016-08-04 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-16906:
--

 Summary: Adds more input type information for 
TypedAggregateExpression
 Key: SPARK-16906
 URL: https://issues.apache.org/jira/browse/SPARK-16906
 Project: Spark
  Issue Type: Bug
Reporter: Sean Zhong
Priority: Minor


For TypedAggregateExpression 
{code}
case class TypedAggregateExpression(
aggregator: Aggregator[Any, Any, Any],
inputDeserializer: Option[Expression],
bufferSerializer: Seq[NamedExpression],
bufferDeserializer: Expression,
outputSerializer: Seq[Expression],
outputExternalType: DataType,
dataType: DataType,
nullable: Boolean) extends DeclarativeAggregate with NonSQLExpression
{code}

The Aggregator of a TypedAggregateExpression usually contains a closure like: 
{code}
class TypedSumDouble[IN](f: IN => Double) extends Aggregator[IN, Double, Double]
{code}

It would be great if we could add more info to TypedAggregateExpression 
describing the closure input type {{IN}}, such as its class and schema, so that 
we can use this info to make customized optimizations like the one proposed in 
SPARK-14083.
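
For example, the extra information could be carried as two more fields (a 
hypothetical sketch; the field names are assumptions, not the committed change):

{code}
// Sketch: carry the closure input's runtime class and schema next to the existing
// fields, so optimizer rules (like the SPARK-14083 idea) can inspect them.
case class TypedAggregateExpression(
    aggregator: Aggregator[Any, Any, Any],
    inputDeserializer: Option[Expression],
    inputClass: Option[Class[_]],        // runtime class of the closure input IN (assumed name)
    inputSchema: Option[StructType],     // schema of IN when it is a product type (assumed name)
    bufferSerializer: Seq[NamedExpression],
    bufferDeserializer: Expression,
    outputSerializer: Seq[Expression],
    outputExternalType: DataType,
    dataType: DataType,
    nullable: Boolean) extends DeclarativeAggregate with NonSQLExpression
{code}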






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16898) Adds argument type information for typed logical plan like MapElements, TypedFilter, and AppendColumn

2016-08-04 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-16898:
---
Summary: Adds argument type information for typed logical plan like 
MapElements, TypedFilter, and AppendColumn  (was: Adds argument type 
information for typed logical plan likMapElements, TypedFilter, and 
AppendColumn)

> Adds argument type information for typed logical plan like MapElements, 
> TypedFilter, and AppendColumn
> -
>
> Key: SPARK-16898
> URL: https://issues.apache.org/jira/browse/SPARK-16898
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Priority: Minor
>
> Typed logical plan like MapElements, TypedFilter, and AppendColumn contains a 
> closure field: {{func: (T) => Boolean}}. For example class TypedFilter's 
> signature is:
> {code}
> case class TypedFilter(
> func: AnyRef,
> deserializer: Expression,
> child: LogicalPlan) extends UnaryNode
> {code} 
> From the above class signature, we cannot easily find:
> 1. What is the input argument's type of the closure {{func}}? How do we know 
> which apply method to pick if there are multiple overloaded apply methods?
> 2. What is the input argument's schema? 
> With this info, it is easier for us to define some custom optimizer rule to 
> translate these typed logical plan to more efficient implementation, like the 
> closure optimization idea in SPARK-14083.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16898) Adds argument type information for typed logical plan likMapElements, TypedFilter, and AppendColumn

2016-08-04 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-16898:
--

 Summary: Adds argument type information for typed logical plan 
likMapElements, TypedFilter, and AppendColumn
 Key: SPARK-16898
 URL: https://issues.apache.org/jira/browse/SPARK-16898
 Project: Spark
  Issue Type: Bug
Reporter: Sean Zhong
Priority: Minor


Typed logical plans like MapElements, TypedFilter, and AppendColumn contain a 
closure field such as {{func: (T) => Boolean}}. For example, class TypedFilter's 
signature is:

{code}
case class TypedFilter(
func: AnyRef,
deserializer: Expression,
child: LogicalPlan) extends UnaryNode
{code} 

From the above class signature, we cannot easily find:
1. What is the type of the closure {{func}}'s input argument? How do we know 
which apply method to pick if there are multiple overloaded apply methods?
2. What is the input argument's schema?

With this info, it is easier to define custom optimizer rules that translate 
these typed logical plans into more efficient implementations, like the 
closure optimization idea in SPARK-14083.
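
One possible shape (a hypothetical sketch; the field names are assumptions, not 
necessarily the committed change) is to carry the argument class and schema 
explicitly:

{code}
// Sketch: record the closure argument's runtime class (used to pick the right
// overloaded apply method) and its schema, next to TypedFilter's existing fields.
case class TypedFilter(
    func: AnyRef,
    argumentClass: Class[_],        // runtime class of T (assumed field name)
    argumentSchema: StructType,     // schema of the deserialized argument (assumed field name)
    deserializer: Expression,
    child: LogicalPlan) extends UnaryNode
{code}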



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16888) Implements eval method for expression AssertNotNull

2016-08-03 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-16888:
--

 Summary: Implements eval method for expression AssertNotNull
 Key: SPARK-16888
 URL: https://issues.apache.org/jira/browse/SPARK-16888
 Project: Spark
  Issue Type: Improvement
Reporter: Sean Zhong
Priority: Minor


We should support the eval() method for the expression AssertNotNull.

Currently, it throws UnsupportedOperationException when used in a projection 
over a LocalRelation.

{code}
scala> import org.apache.spark.sql.catalyst.dsl.expressions._
scala> import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
scala> import org.apache.spark.sql.Column
scala> case class A(a: Int)
scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, 
Nil))).explain

java.lang.UnsupportedOperationException: Only code-generated evaluation is 
supported.
  at 
org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull.eval(objects.scala:850)
  ...

{code}
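
A minimal sketch of what an interpreted eval could look like (an assumption for 
illustration, not the committed patch; the real error message would be built 
from the walked type path):

{code}
// Sketch: evaluate the child; if the result is null, fail the same way the
// generated code does, otherwise pass the value through unchanged.
override def eval(input: InternalRow): Any = {
  val result = child.eval(input)
  if (result == null) {
    throw new NullPointerException("Null value appeared in non-nullable field")
  }
  result
}
{code}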



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16841) Improves the row level metrics performance when reading Parquet table

2016-08-03 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong closed SPARK-16841.
--
Resolution: Not A Problem

> Improves the row level metrics performance when reading Parquet table
> -
>
> Key: SPARK-16841
> URL: https://issues.apache.org/jira/browse/SPARK-16841
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>
> When reading Parquet table, Spark adds row level metrics like recordsRead, 
> bytesRead 
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The implementation is not very efficient. When parquet vectorized reader is 
> not used, it may take 20% of read time to update these metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16841) Improves the row level metrics performance when reading Parquet table

2016-08-03 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406306#comment-15406306
 ] 

Sean Zhong edited comment on SPARK-16841 at 8/3/16 5:54 PM:


This jira was created after analyzing the performance impact of 
https://github.com/apache/spark/pull/12352, which added the row level metrics 
and caused a 15% performance regression. I can verify the performance 
regression consistently by comparing the code before and after 
https://github.com/apache/spark/pull/12352.

But the problem is that I cannot reproduce the same performance regression 
consistently on the Spark trunk code; the performance improvement after the fix 
on trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). What 
I observed is that when running the same benchmark code repeatedly in the same 
spark-shell 100 times, the time each run takes doesn't converge, so I cannot 
get an exact performance number.

For example, if we run the below code for 100 times,
{code}
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
{code}

I observed:
1. For the first run, it may take > 9000 ms
2. Then for the next few runs, it is much faster, around 4700ms
3. After that, the performance suddenly becomes worse. It may take around 8500 
ms for each run.

I guess the phenomenon has something to do with Java JIT and our codegen logic 
(Because of codegen, we are creating new class type for each run in 
spark-shell, which may impact code cache).

Since I cannot verify this improvement consistently on trunk, I am going to 
close this jira.


was (Author: clockfly):
This JIRA was created after analyzing the performance impact of 
https://github.com/apache/spark/pull/12352, which added the row level metrics 
and caused a 15% performance regression. I can verify that regression 
consistently by comparing the performance of the code before and after 
https://github.com/apache/spark/pull/12352.

The problem is that I cannot reproduce the same performance regression 
consistently on the Spark trunk code; the performance improvement after the fix 
on trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The 
phenomenon I observed is that when the same benchmark code is run repeatedly in 
the same spark-shell 100 times, the time each run takes does not converge, so I 
cannot get an exact performance number.

For example, if we run the code below 100 times:
{code}
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
{code}

I observed:
1. The first run may take > 9000 ms.
2. The next few runs are much faster, around 4700 ms.
3. After that, performance suddenly becomes worse, around 8500 ms per run.

I guess this has something to do with the Java JIT and our codegen logic 
(because of codegen, we create a new class type for each run in spark-shell, 
which may affect the code cache).

Since I cannot verify this improvement consistently on trunk, I am going to 
close this JIRA.

> Improves the row level metrics performance when reading Parquet table
> -
>
> Key: SPARK-16841
> URL: https://issues.apache.org/jira/browse/SPARK-16841
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>
> When reading Parquet table, Spark adds row level metrics like recordsRead, 
> bytesRead 
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The implementation is not very efficient. When parquet vectorized reader is 
> not used, it may take 20% of read time to update these metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16841) Improves the row level metrics performance when reading Parquet table

2016-08-03 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406306#comment-15406306
 ] 

Sean Zhong commented on SPARK-16841:


This PR was created after analyzing the performance impact of 
https://github.com/apache/spark/pull/12352, which added the row level metrics 
and caused a 15% performance regression. I can verify that regression 
consistently by comparing the performance of the code before and after 
https://github.com/apache/spark/pull/12352.

The problem is that I cannot reproduce the same performance regression 
consistently on the Spark trunk code; the performance improvement after the fix 
on trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The 
phenomenon I observed is that when the same benchmark code is run repeatedly in 
the same spark-shell 100 times, the time each run takes does not converge, so I 
cannot get an exact performance number.

For example, if we run the code below 100 times:
{code}
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
{code}

I observed:
1. The first run may take > 9000 ms.
2. The next few runs are much faster, around 4700 ms.
3. After that, performance suddenly becomes worse, around 8500 ms per run.

I guess this has something to do with the Java JIT and our codegen logic 
(because of codegen, we create a new class type for each run in spark-shell, 
which may affect the code cache).

Since I cannot verify this improvement consistently on trunk, I am going to 
close this JIRA.
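
For reference, a minimal sketch of the timing loop I mean (the path and predicate 
follow the example above; this is only an illustration, not the exact benchmark 
harness):

{code}
// Run the same query 100 times in the same spark-shell and record wall-clock times.
val timesMs = (1 to 100).map { _ =>
  val start = System.nanoTime()
  spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
  (System.nanoTime() - start) / 1000000
}
println(s"min=${timesMs.min} ms, max=${timesMs.max} ms, mean=${timesMs.sum / timesMs.size} ms")
{code}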

> Improves the row level metrics performance when reading Parquet table
> -
>
> Key: SPARK-16841
> URL: https://issues.apache.org/jira/browse/SPARK-16841
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>
> When reading Parquet table, Spark adds row level metrics like recordsRead, 
> bytesRead 
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The implementation is not very efficient. When parquet vectorized reader is 
> not used, it may take 20% of read time to update these metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16841) Improves the row level metrics performance when reading Parquet table

2016-08-03 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406306#comment-15406306
 ] 

Sean Zhong edited comment on SPARK-16841 at 8/3/16 5:54 PM:


This JIRA was created after analyzing the performance impact of 
https://github.com/apache/spark/pull/12352, which added the row level metrics 
and caused a 15% performance regression. I can verify that regression 
consistently by comparing the performance of the code before and after 
https://github.com/apache/spark/pull/12352.

The problem is that I cannot reproduce the same performance regression 
consistently on the Spark trunk code; the performance improvement after the fix 
on trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The 
phenomenon I observed is that when the same benchmark code is run repeatedly in 
the same spark-shell 100 times, the time each run takes does not converge, so I 
cannot get an exact performance number.

For example, if we run the code below 100 times:
{code}
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
{code}

I observed:
1. The first run may take > 9000 ms.
2. The next few runs are much faster, around 4700 ms.
3. After that, performance suddenly becomes worse, around 8500 ms per run.

I guess this has something to do with the Java JIT and our codegen logic 
(because of codegen, we create a new class type for each run in spark-shell, 
which may affect the code cache).

Since I cannot verify this improvement consistently on trunk, I am going to 
close this JIRA.


was (Author: clockfly):
This PR was created after analyzing the performance impact of 
https://github.com/apache/spark/pull/12352, which added the row level metrics 
and caused a 15% performance regression. I can verify that regression 
consistently by comparing the performance of the code before and after 
https://github.com/apache/spark/pull/12352.

The problem is that I cannot reproduce the same performance regression 
consistently on the Spark trunk code; the performance improvement after the fix 
on trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The 
phenomenon I observed is that when the same benchmark code is run repeatedly in 
the same spark-shell 100 times, the time each run takes does not converge, so I 
cannot get an exact performance number.

For example, if we run the code below 100 times:
{code}
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
{code}

I observed:
1. The first run may take > 9000 ms.
2. The next few runs are much faster, around 4700 ms.
3. After that, performance suddenly becomes worse, around 8500 ms per run.

I guess this has something to do with the Java JIT and our codegen logic 
(because of codegen, we create a new class type for each run in spark-shell, 
which may affect the code cache).

Since I cannot verify this improvement consistently on trunk, I am going to 
close this JIRA.

> Improves the row level metrics performance when reading Parquet table
> -
>
> Key: SPARK-16841
> URL: https://issues.apache.org/jira/browse/SPARK-16841
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>
> When reading Parquet table, Spark adds row level metrics like recordsRead, 
> bytesRead 
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The implementation is not very efficient. When parquet vectorized reader is 
> not used, it may take 20% of read time to update these metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-02 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405057#comment-15405057
 ] 

Sean Zhong commented on SPARK-16320:


[~maver1ck] Did you use the test case in this JIRA:
{code}
select count(*) where nested_column.id > some_id
{code}

Or the test case in SPARK-16321?
{code}
df = sqlctx.read.parquet(path)
df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 else 
[]).collect()
{code}

There is a slight difference: in the first test case the filter condition uses a 
nested field, and filter push-down is not supported for nested fields, so your PR 
14465 should not take effect there.

Basically SPARK-16320 and SPARK-16321 are different problems.
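
As a side note (my assumption about how to check this, not part of the original test 
cases), the PushedFilters entry in the physical plan shows which predicates actually 
reach the Parquet reader:

{code}
// Sketch: the path and column names are placeholders following the examples above.
val df = spark.read.parquet("/path/to/parquet_nested")
df.filter($"id" > 100).explain()                 // top-level column: shows up in PushedFilters
df.filter($"nested_column.id" > 100).explain()   // nested field: not pushed down to Parquet
{code}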




> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16853) Analysis error for DataSet typed selection

2016-08-02 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-16853:
--

 Summary: Analysis error for DataSet typed selection
 Key: SPARK-16853
 URL: https://issues.apache.org/jira/browse/SPARK-16853
 Project: Spark
  Issue Type: Bug
Reporter: Sean Zhong


For the Dataset typed selection
{code}
def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1]
{code}

If U1 contains sub-fields, it reports an AnalysisException.

Reproducer:
{code}
scala> case class A(a: Int, b: Int)
scala> Seq((0, A(1,2))).toDS.select($"_2".as[A])
org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input 
columns: [_2];
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
...
{code}
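
A possible workaround (my assumption, not verified as part of this report) is to use 
a typed map, which goes through the encoder instead of resolving sub-fields of the 
struct column:

{code}
// Hypothetical workaround sketch: map over the tuple instead of using the typed select.
// This should yield a Dataset[A] without resolving column `a` against the input columns.
scala> Seq((0, A(1, 2))).toDS.map(_._2)
{code}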



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-01 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403238#comment-15403238
 ] 

Sean Zhong edited comment on SPARK-16320 at 8/2/16 2:22 AM:


[~maver1ck] Can you check whether the PR works for you?



was (Author: clockfly):
[~loziniak] Can you check whether the PR works for you?


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-01 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403238#comment-15403238
 ] 

Sean Zhong commented on SPARK-16320:


[~loziniak] Can you check whether the PR works for you?


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16841) Improves the row level metrics performance when reading Parquet table

2016-08-01 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-16841:
---
Summary: Improves the row level metrics performance when reading Parquet 
table  (was: Improve the row level metrics performance when reading Parquet 
table)

> Improves the row level metrics performance when reading Parquet table
> -
>
> Key: SPARK-16841
> URL: https://issues.apache.org/jira/browse/SPARK-16841
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>
> When reading Parquet table, Spark adds row level metrics like recordsRead, 
> bytesRead 
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The implementation is not very efficient. When parquet vectorized reader is 
> not used, it may take 20% of read time to update these metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16841) Improve the row level metrics performance when reading Parquet table

2016-08-01 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-16841:
--

 Summary: Improve the row level metrics performance when reading 
Parquet table
 Key: SPARK-16841
 URL: https://issues.apache.org/jira/browse/SPARK-16841
 Project: Spark
  Issue Type: Improvement
Reporter: Sean Zhong


When reading a Parquet table, Spark adds row level metrics like recordsRead and 
bytesRead 
(https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).

The implementation is not very efficient: when the Parquet vectorized reader is not 
used, updating these metrics may take 20% of the read time.







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12437) Reserved words (like table) throws error when writing a data frame to JDBC

2016-07-19 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong resolved SPARK-12437.

Resolution: Duplicate

> Reserved words (like table) throws error when writing a data frame to JDBC
> --
>
> Key: SPARK-12437
> URL: https://issues.apache.org/jira/browse/SPARK-12437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>  Labels: starter
>
> From: A Spark user
> If you have a DataFrame column name that contains a SQL reserved word, it 
> will not write to a JDBC source.  This is somewhat similar to an error found 
> in the redshift adapter:
> https://github.com/databricks/spark-redshift/issues/80
> I have produced this on a MySQL (AWS Aurora) database
> Steps to reproduce:
> {code}
> val connectionProperties = new java.util.Properties()
> sqlContext.table("diamonds").write.jdbc(jdbcUrl, "diamonds", 
> connectionProperties)
> {code}
> Exception:
> {code}
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error 
> in your SQL syntax; check the manual that corresponds to your MySQL server 
> version for the right syntax to use near 'table DOUBLE PRECISION , price 
> INTEGER , x DOUBLE PRECISION , y DOUBLE PRECISION' at line 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
>   at com.mysql.jdbc.Util.getInstance(Util.java:386)
>   at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1054)
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4237)
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4169)
>   at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2617)
>   at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2778)
>   at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2825)
>   at 
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2156)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2459)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2376)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2360)
>   at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:275)
> {code}
> You can workaround this by renaming the column on the dataframe before 
> writing, but ideally we should be able to do something like encapsulate the 
> name in quotes which is allowed.  Example: 
> {code}
> CREATE TABLE `test_table_column` (
>   `id` int(11) DEFAULT NULL,
>   `table` varchar(100) DEFAULT NULL
> ) 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12437) Reserved words (like table) throws error when writing a data frame to JDBC

2016-07-19 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385148#comment-15385148
 ] 

Sean Zhong commented on SPARK-12437:


This issue is fixed by SPARK-16387. Column names are now quoted automatically, and 
the JDBC table name can be quoted manually by the end user, like this: 
spark.table("good2").write.jdbc(url, "`table`", connectionProperties)

The following example can run successfully:
{code}
// Here we have a special column name "table"
Seq((1,2),(2,3)).toDF("table", "b").createOrReplaceTempView("good2")
val connectionProperties = new java.util.Properties()
val url = "jdbc:mysql://127.0.0.1:3306/test?password=xx&user=root&useSSL=false"
// Here we have a special table name "table"
spark.table("good2").write.jdbc(url, "`table`", connectionProperties)
{code}



> Reserved words (like table) throws error when writing a data frame to JDBC
> --
>
> Key: SPARK-12437
> URL: https://issues.apache.org/jira/browse/SPARK-12437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>  Labels: starter
>
> From: A Spark user
> If you have a DataFrame column name that contains a SQL reserved word, it 
> will not write to a JDBC source.  This is somewhat similar to an error found 
> in the redshift adapter:
> https://github.com/databricks/spark-redshift/issues/80
> I have produced this on a MySQL (AWS Aurora) database
> Steps to reproduce:
> {code}
> val connectionProperties = new java.util.Properties()
> sqlContext.table("diamonds").write.jdbc(jdbcUrl, "diamonds", 
> connectionProperties)
> {code}
> Exception:
> {code}
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error 
> in your SQL syntax; check the manual that corresponds to your MySQL server 
> version for the right syntax to use near 'table DOUBLE PRECISION , price 
> INTEGER , x DOUBLE PRECISION , y DOUBLE PRECISION' at line 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
>   at com.mysql.jdbc.Util.getInstance(Util.java:386)
>   at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1054)
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4237)
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4169)
>   at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2617)
>   at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2778)
>   at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2825)
>   at 
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2156)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2459)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2376)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2360)
>   at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:275)
> {code}
> You can workaround this by renaming the column on the dataframe before 
> writing, but ideally we should be able to do something like encapsulate the 
> name in quotes which is allowed.  Example: 
> {code}
> CREATE TABLE `test_table_column` (
>   `id` int(11) DEFAULT NULL,
>   `table` varchar(100) DEFAULT NULL
> ) 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16323) Avoid unnecessary cast when doing integral divide

2016-06-30 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-16323:
--

 Summary: Avoid unnecessary cast when doing integral divide
 Key: SPARK-16323
 URL: https://issues.apache.org/jira/browse/SPARK-16323
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Sean Zhong
Priority: Minor


This is a follow-up of SPARK-15776.

*Problem:*

For the integer divide operator div:
{code}
scala> spark.sql("select 6 div 3").explain(true)
...
== Analyzed Logical Plan ==
CAST((6 / 3) AS BIGINT): bigint
Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / 
3) AS BIGINT)#5L]
+- OneRowRelation$
...
{code}

For performance reasons, we should not do the unnecessary cast {{cast(xx as double)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16322) Avoids unnecessary cast when doing integral divide

2016-06-30 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-16322:
--

 Summary: Avoids unnecessary cast when doing integral divide
 Key: SPARK-16322
 URL: https://issues.apache.org/jira/browse/SPARK-16322
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Sean Zhong
Priority: Minor


This is a follow-up of https://databricks.atlassian.net/browse/SC-3588

*Problem:*

For the integer divide operator {{div}}:
{code}
scala> spark.sql("select 6 div 3").explain(true)
...
== Analyzed Logical Plan ==
CAST((6 / 3) AS BIGINT): bigint
Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / 
3) AS BIGINT)#5L]
+- OneRowRelation$
...
{code}

For performance reasons, we should not do the unnecessary {{cast(xx as double)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

2016-06-28 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353969#comment-15353969
 ] 

Sean Zhong edited comment on SPARK-14083 at 6/29/16 12:17 AM:
--

For a typed operation like map, Spark first de-serializes the InternalRow to type T, 
applies the operation, and then serializes T back to an InternalRow.
An un-typed operation like select("column") operates directly on the InternalRow.

If the end user defines a custom serializer like Kryo, it is not possible to map a 
typed operation to an un-typed operation.


{code}
scala> case class C(c: Int)
scala> val ds: Dataset[C] = Seq(C(1)).toDS
scala> ds.select("c")   // <- Return correct result when using default encoder.
res1: org.apache.spark.sql.DataFrame = [c: int]

scala> implicit val encoder: Encoder[C] = Encoders.kryo[C]   // <- Define a 
Kryo encoder
scala> val ds2: Dataset[C] = Seq(C(1)).toDS
ds2: org.apache.spark.sql.Dataset[C] = [value: binary]  // <- Row is encoded as 
binary by using Kryo encoder

scala> ds2.select("c")  // <- Fails even if "c" is an existing field in class C!
org.apache.spark.sql.AnalysisException: cannot resolve '`c`' given input 
columns: [value];
  ...

{code}


was (Author: clockfly):

For a typed operation like map, Spark first de-serializes the InternalRow to type T, 
applies the operation, and then serializes T back to an InternalRow.
An un-typed operation like select("column") operates directly on the InternalRow.

If the end user defines a custom serializer like Kryo, it is not possible to map a 
typed operation to an un-typed operation.


```
scala> case class C(c: Int)
scala> val ds: Dataset[C] = Seq(C(1)).toDS
scala> ds.select("c")   // <- Return correct result when using default encoder.
res1: org.apache.spark.sql.DataFrame = [c: int]

scala> implicit val encoder: Encoder[C] = Encoders.kryo[C]   // <- Define a 
Kryo encoder
scala> val ds2: Dataset[C] = Seq(C(1)).toDS
ds2: org.apache.spark.sql.Dataset[C] = [value: binary]  // <- Row is encoded as 
binary by using Kryo encoder

scala> ds2.select("c")  // <- Fails even if "c" is an existing field in class C!
org.apache.spark.sql.AnalysisException: cannot resolve '`c`' given input 
columns: [value];
  ...

```

> Analyze JVM bytecode and turn closures into Catalyst expressions
> 
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized executions.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), 
> lit(18)
> df.map(_.id + 1)  // equivalent to Add(col("age"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note that a big emphasis on "small" and "easy to reason about". A patch 
> should be rejected if it is too complicated or difficult to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

2016-06-28 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353969#comment-15353969
 ] 

Sean Zhong commented on SPARK-14083:



For a typed operation like map, Spark first de-serializes the InternalRow to type T, 
applies the operation, and then serializes T back to an InternalRow.
An un-typed operation like select("column") operates directly on the InternalRow.

If the end user defines a custom serializer like Kryo, it is not possible to map a 
typed operation to an un-typed operation.


```
scala> case class C(c: Int)
scala> val ds: Dataset[C] = Seq(C(1)).toDS
scala> ds.select("c")   // <- Return correct result when using default encoder.
res1: org.apache.spark.sql.DataFrame = [c: int]

scala> implicit val encoder: Encoder[C] = Encoders.kryo[C]   // <- Define a 
Kryo encoder
scala> val ds2: Dataset[C] = Seq(C(1)).toDS
ds2: org.apache.spark.sql.Dataset[C] = [value: binary]  // <- Row is encoded as 
binary by using Kryo encoder

scala> ds2.select("c")  // <- Fails even if "c" is an existing field in class C!
org.apache.spark.sql.AnalysisException: cannot resolve '`c`' given input 
columns: [value];
  ...

```
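
To illustrate the same point from the other direction (my addition; a sketch rather 
than tested output): with the Kryo encoder, a typed operation still works because it 
deserializes the binary value back to C before accessing fields, while un-typed 
column resolution does not.

{code}
// Sketch, assuming the ds2 defined above (Kryo-encoded Dataset[C]):
scala> ds2.map(_.c).show()   // typed: deserializes to C first, so field access works
scala> ds2.select("c")       // un-typed: fails to resolve `c`, as shown above
{code}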

> Analyze JVM bytecode and turn closures into Catalyst expressions
> 
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized executions.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), 
> lit(18)
> df.map(_.id + 1)  // equivalent to Add(col("age"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note that a big emphasis on "small" and "easy to reason about". A patch 
> should be rejected if it is too complicated or difficult to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16034) Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable

2016-06-17 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-16034:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-16032

> Checks the partition columns when calling 
> dataFrame.write.mode("append").saveAsTable
> 
>
> Key: SPARK-16034
> URL: https://issues.apache.org/jira/browse/SPARK-16034
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Sean Zhong
>
> Suppose we have defined a partitioned table:
> {code}
> CREATE TABLE src (a INT, b INT, c INT)
> USING PARQUET
> PARTITIONED BY (a, b);
> {code}
> We should check the partition columns when appending DataFrame data to 
> existing table: 
> {code}
> val df = Seq((1, 2, 3)).toDF("a", "b", "c")
> df.write.partitionBy("b", "a").mode("append").saveAsTable("src")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16034) Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable

2016-06-17 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-16034:
--

 Summary: Checks the partition columns when calling 
dataFrame.write.mode("append").saveAsTable
 Key: SPARK-16034
 URL: https://issues.apache.org/jira/browse/SPARK-16034
 Project: Spark
  Issue Type: Bug
Reporter: Sean Zhong


Suppose we have defined a partitioned table:
{code}
CREATE TABLE src (a INT, b INT, c INT)
USING PARQUET
PARTITIONED BY (a, b);
{code}

We should check the partition columns when appending DataFrame data to an existing 
table: 
{code}
val df = Seq((1, 2, 3)).toDF("a", "b", "c")
df.write.partitionBy("b", "a").mode("append").saveAsTable("src")
{code}
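
A minimal sketch of the kind of check this issue asks for (an assumption about the 
intended behavior, not existing Spark code):

{code}
// Sketch: validate the writer's partition columns against the table's PARTITIONED BY (a, b).
val tablePartitionColumns = Seq("a", "b")     // from the table definition above
val writerPartitionColumns = Seq("b", "a")    // from df.write.partitionBy("b", "a")
require(writerPartitionColumns == tablePartitionColumns,
  s"Specified partition columns (${writerPartitionColumns.mkString(", ")}) do not match " +
  s"the existing table's partition columns (${tablePartitionColumns.mkString(", ")})")
{code}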





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15340) Limit the size of the map used to cache JobConfs to avoid OOM

2016-06-17 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336511#comment-15336511
 ] 

Sean Zhong commented on SPARK-15340:


[~DoingDone9]

I did some tests and didn't see the OOM you observed. Can you elaborate on how the 
OOM happens?

1. What is the error message and stack trace when the OOM happens?
2. Are you running Apache Spark in single machine mode or cluster mode?
3. What is the driver configuration settings?
4. Is it a Permgen (Metaspace) OOM or heap space OOM?
5. Which client are you using to connect to the thrift JDBC server?
6. Do you have a reproducible script so that we can try to reproduce it on our 
environment?
7. Is your Spark deployment running on YARN?


 

> Limit the size of the map used to cache JobConfs to avoid OOM
> 
>
> Key: SPARK-15340
> URL: https://issues.apache.org/jira/browse/SPARK-15340
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Zhongshuai Pei
>Priority: Critical
>
> When I run TPC-DS (ORC) by using JDBCServer, the driver always OOMs.
> I found tens of thousands of JobConf objects in the dump file, and these JobConfs 
> cannot be recycled, so we should limit the size of the map used to cache JobConfs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-06-16 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335174#comment-15335174
 ] 

Sean Zhong commented on SPARK-14048:


[~simeons]  

You can use {{sqlContext.sql("query")}} instead of {{%sql}} as a workaround for this 
issue.
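
For example, applying the workaround to the query from the description:

{code}
sqlContext.sql("select first(st) as st from tbl group by something").show()
{code}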

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schema such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}} always preferring 
> to use {{StructType}}. 
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow

2016-06-16 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334743#comment-15334743
 ] 

Sean Zhong commented on SPARK-15786:


[~yhuai] Sure, we can definitely improve it.

> joinWith bytecode generation calling ByteBuffer.wrap with InternalRow
> -
>
> Key: SPARK-15786
> URL: https://issues.apache.org/jira/browse/SPARK-15786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Richard Marscher
>Assignee: Sean Zhong
> Fix For: 2.0.0
>
>
> {code}java.lang.RuntimeException: Error while decoding: 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 36, Column 107: No applicable constructor/method found 
> for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates 
> are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", 
> "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, 
> int)"{code}
> I have been trying to use joinWith along with Option data types to get an 
> approximation of the RDD semantics for outer joins with Dataset to have a 
> nicer API for Scala. However, using the Dataset.as[] syntax leads to bytecode 
> generation trying to pass an InternalRow object into the ByteBuffer.wrap 
> function which expects byte[] with or without a couple int qualifiers.
> I have a notebook reproducing this against 2.0 preview in Databricks 
> Community Edition: 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/160347920874755/1039589581260901/673639177603143/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-06-16 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334657#comment-15334657
 ] 

Sean Zhong commented on SPARK-14048:


[~simeons]  

I can now reproduce this on Databricks Community Edition by changing the above 
notebook script to:

{code}
val rdd = sc.makeRDD(
  """{"st": {"x.y": 1}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 10}""" :: 
"""{"st": {"x.y": 2}, "age": 20}""" :: Nil)
sqlContext.read.json(rdd).registerTempTable("test")
%sql select first(st) as st from test group by age
{code}

Thanks! I will post the updates later.

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schema such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}} always preferring 
> to use {{StructType}}. 
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-06-16 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334644#comment-15334644
 ] 

Sean Zhong commented on SPARK-14048:


[~simeons]  

Can you share a complete notebook that we can run to reproduce the problem you saw? 
For example, the file {{include/init_scala}} is missing from your notebook.

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schema such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}} always preferring 
> to use {{StructType}}. 
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-06-16 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333221#comment-15333221
 ] 

Sean Zhong commented on SPARK-14048:


[~simeons]  

Can you try the following script in your environment and see if it reports an 
error?

{code}
val rdd = sc.makeRDD(
  """{"st": {"x.y": 1}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 10}""" :: 
"""{"st": {"x.y": 2}, "age": 20}""" :: Nil)
sqlContext.read.json(rdd).registerTempTable("test")
sqlContext.sql("select first(st) as st from test group by age").show()
{code}

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schemas such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}}, always preferring 
> to use {{StructType}}. 
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow

2016-06-16 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333205#comment-15333205
 ] 

Sean Zhong commented on SPARK-15786:


Hi [~rmarscher] 

The reason is that the Kryo encoder is used in the wrong way: you cannot cast 
{{Dataset\[A\]}} to {{Dataset\[B\]}} using a Kryo encoder. 
In Spark 2.0 we have a stricter rule that detects this kind of cast failure, 
added in PR https://github.com/apache/spark/pull/13632. 

Now it reports an error like this:
{code}
scala> ds.as[(Option[(Int, Int)], Option[(Int, Int)])]
org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`_1` AS BINARY)' 
due to data type mismatch: cannot cast 
StructType(StructField(_1,IntegerType,false), 
StructField(_2,IntegerType,false)) to BinaryType;
{code}


If you remove all the Kryo-related lines from your test code, no exception 
should be thrown and your code should work fine; see the sketch below.

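For reference, here is a minimal Kryo-free sketch of the same kind of query (my own illustration, not your notebook; the data and column names are made up). With the built-in product encoders, {{joinWith}} with an outer join already produces a matching schema, so no cast and no custom encoder are needed:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("kryo-free").getOrCreate()
import spark.implicits._

val left  = Seq(1 -> "a", 2 -> "b").toDS()
val right = Seq(1 -> "x", 3 -> "z").toDS()

// joinWith with a full outer join yields Dataset[((Int, String), (Int, String))],
// and the unmatched side simply comes back as null; the implicit tuple encoder
// handles the result without any explicit .as[...] cast.
val joined = left.joinWith(right, left("_1") === right("_1"), "full_outer")
joined.collect()
{code}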

> joinWith bytecode generation calling ByteBuffer.wrap with InternalRow
> -
>
> Key: SPARK-15786
> URL: https://issues.apache.org/jira/browse/SPARK-15786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Richard Marscher
>
> {code}java.lang.RuntimeException: Error while decoding: 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 36, Column 107: No applicable constructor/method found 
> for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates 
> are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", 
> "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, 
> int)"{code}
> I have been trying to use joinWith along with Option data types to get an 
> approximation of the RDD semantics for outer joins with Dataset to have a 
> nicer API for Scala. However, using the Dataset.as[] syntax leads to bytecode 
> generation trying to pass an InternalRow object into the ByteBuffer.wrap 
> function which expects byte[] with or without a couple int qualifiers.
> I have a notebook reproducing this against 2.0 preview in Databricks 
> Community Edition: 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/160347920874755/1039589581260901/673639177603143/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-06-15 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332775#comment-15332775
 ] 

Sean Zhong commented on SPARK-14048:


[~simeons]  

Are you still able to reproduce this case? I cannot reproduce it on 1.6 
using the following script on Databricks Cloud Community Edition.

{code}
val rdd = sc.makeRDD(
  """{"st": {"x.y": 1}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 10}""" :: 
"""{"st": {"x.y": 2}, "age": 20}""" :: Nil)
sqlContext.read.json(rdd).registerTempTable("test")
sqlContext.sql("select first(st) as st from test group by age").show()
{code}

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schemas such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}}, always preferring 
> to use {{StructType}}. 
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow

2016-06-15 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332532#comment-15332532
 ] 

Sean Zhong commented on SPARK-15786:


The exception stack is:
{code}
scala> res4.as[(Option[(Int, Int)], Option[(Int, Int)])].collect()
16/06/15 13:46:18 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 37, 
Column 109: No applicable constructor/method found for actual parameters 
"org.apache.spark.sql.catalyst.InternalRow"; candidates are: "public static 
java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", "public static 
java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, int)"
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificSafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificSafeProjection extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private MutableRow mutableRow;
/* 009 */   private org.apache.spark.serializer.KryoSerializerInstance 
serializer;
/* 010 */   private org.apache.spark.serializer.KryoSerializerInstance 
serializer1;
/* 011 */
/* 012 */
/* 013 */   public SpecificSafeProjection(Object[] references) {
/* 014 */ this.references = references;
/* 015 */ mutableRow = (MutableRow) references[references.length - 1];
/* 016 */
/* 017 */ if (org.apache.spark.SparkEnv.get() == null) {
/* 018 */   serializer = 
(org.apache.spark.serializer.KryoSerializerInstance) new 
org.apache.spark.serializer.KryoSerializer(new 
org.apache.spark.SparkConf()).newInstance();
/* 019 */ } else {
/* 020 */   serializer = 
(org.apache.spark.serializer.KryoSerializerInstance) new 
org.apache.spark.serializer.KryoSerializer(org.apache.spark.SparkEnv.get().conf()).newInstance();
/* 021 */ }
/* 022 */
/* 023 */
/* 024 */ if (org.apache.spark.SparkEnv.get() == null) {
/* 025 */   serializer1 = 
(org.apache.spark.serializer.KryoSerializerInstance) new 
org.apache.spark.serializer.KryoSerializer(new 
org.apache.spark.SparkConf()).newInstance();
/* 026 */ } else {
/* 027 */   serializer1 = 
(org.apache.spark.serializer.KryoSerializerInstance) new 
org.apache.spark.serializer.KryoSerializer(org.apache.spark.SparkEnv.get().conf()).newInstance();
/* 028 */ }
/* 029 */
/* 030 */   }
/* 031 */
/* 032 */   public java.lang.Object apply(java.lang.Object _i) {
/* 033 */ InternalRow i = (InternalRow) _i;
/* 034 */
/* 035 */ boolean isNull2 = i.isNullAt(0);
/* 036 */ InternalRow value2 = isNull2 ? null : (i.getStruct(0, 2));
/* 037 */ final scala.Option value1 = isNull2 ? null : (scala.Option) 
serializer.deserialize(java.nio.ByteBuffer.wrap(value2), null);
/* 038 */
/* 039 */ boolean isNull4 = i.isNullAt(1);
/* 040 */ InternalRow value4 = isNull4 ? null : (i.getStruct(1, 2));
/* 041 */ final scala.Option value3 = isNull4 ? null : (scala.Option) 
serializer1.deserialize(java.nio.ByteBuffer.wrap(value4), null);
/* 042 */
/* 043 */
/* 044 */ final scala.Tuple2 value = false ? null : new 
scala.Tuple2(value1, value3);
/* 045 */ if (false) {
/* 046 */   mutableRow.setNullAt(0);
/* 047 */ } else {
/* 048 */
/* 049 */   mutableRow.update(0, value);
/* 050 */ }
/* 051 */
/* 052 */ return mutableRow;
/* 053 */   }
/* 054 */ }

org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 37, 
Column 109: No applicable constructor/method found for actual parameters 
"org.apache.spark.sql.catalyst.InternalRow"; candidates are: "public static 
java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", "public static 
java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, int)"
at 
org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174)
at 
org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:7559)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7429)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7333)
at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5663)
at org.codehaus.janino.UnitCompiler.access$13800(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:5132)
at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3971)
at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5159)
at 
org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:7533)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7429)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7333)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3873)
at ...
{code}

[jira] [Commented] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow

2016-06-15 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332527#comment-15332527
 ] 

Sean Zhong commented on SPARK-15786:


Basically, what you described can be shortened to:

{code}
scala> val ds = Seq((1,1) -> (1, 1)).toDS()
res4: org.apache.spark.sql.Dataset[((Int, Int), (Int, Int))] = [_1: struct<_1: 
int, _2: int>, _2: struct<_1: int, _2: int>]

scala> implicit val enc = Encoders.tuple(Encoders.kryo[Option[(Int, Int)]], 
Encoders.kryo[Option[(Int, Int)]])
enc: org.apache.spark.sql.Encoder[(Option[(Int, Int)], Option[(Int, Int)])] = 
class[_1[0]: binary, _2[0]: binary]

scala> ds.as[(Option[(Int, Int)], Option[(Int, Int)])].collect()
{code}

> joinWith bytecode generation calling ByteBuffer.wrap with InternalRow
> -
>
> Key: SPARK-15786
> URL: https://issues.apache.org/jira/browse/SPARK-15786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Richard Marscher
>
> {code}java.lang.RuntimeException: Error while decoding: 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 36, Column 107: No applicable constructor/method found 
> for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates 
> are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", 
> "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, 
> int)"{code}
> I have been trying to use joinWith along with Option data types to get an 
> approximation of the RDD semantics for outer joins with Dataset to have a 
> nicer API for Scala. However, using the Dataset.as[] syntax leads to bytecode 
> generation trying to pass an InternalRow object into the ByteBuffer.wrap 
> function which expects byte[] with or without a couple int qualifiers.
> I have a notebook reproducing this against 2.0 preview in Databricks 
> Community Edition: 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/160347920874755/1039589581260901/673639177603143/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15914) Add deprecated method back to SQLContext for source code backward compatiblity

2016-06-13 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-15914:
---
Description: 
We removed some deprecated methods from SQLContext in the Spark 2.0 branch.

For example:
{code}
  @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
  def jsonFile(path: String): DataFrame = {
read.json(path)
  }
{code}

These deprecated methods may be used by existing third-party data sources. We 
probably want to add them back to preserve source-code-level backward 
compatibility. 

  was:
We removed some deprecated methods from SQLContext in the Spark 2.0 branch.

For example:
{code}
  @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
  def jsonFile(path: String): DataFrame = {
read.json(path)
  }
{code}

These deprecated methods may be used by existing third-party data sources. We 
probably want to add them back to preserve backward compatibility. 


> Add deprecated method back to SQLContext for source code backward compatiblity
> --
>
> Key: SPARK-15914
> URL: https://issues.apache.org/jira/browse/SPARK-15914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> We removed some deprecated methods from SQLContext in the Spark 2.0 branch.
> For example:
> {code}
>   @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
>   def jsonFile(path: String): DataFrame = {
> read.json(path)
>   }
> {code}
> These deprecated methods may be used by existing third-party data sources. We 
> probably want to add them back to preserve source-code-level backward 
> compatibility. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15914) Add deprecated method back to SQLContext for source code backward compatiblity

2016-06-13 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-15914:
---
Summary: Add deprecated method back to SQLContext for source code backward 
compatiblity  (was: Add deprecated method back to SQLContext for backward 
compatiblity)

> Add deprecated method back to SQLContext for source code backward compatiblity
> --
>
> Key: SPARK-15914
> URL: https://issues.apache.org/jira/browse/SPARK-15914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> We removed some deprecated methods from SQLContext in the Spark 2.0 branch.
> For example:
> {code}
>   @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
>   def jsonFile(path: String): DataFrame = {
> read.json(path)
>   }
> {code}
> These deprecated methods may be used by existing third-party data sources. We 
> probably want to add them back to preserve backward compatibility. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15914) Add deprecated method back to SQLContext for backward compatiblity

2016-06-13 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-15914:
---
Summary: Add deprecated method back to SQLContext for backward compatiblity 
 (was: Add deprecated method back to SQLContext for compatiblity)

> Add deprecated method back to SQLContext for backward compatiblity
> --
>
> Key: SPARK-15914
> URL: https://issues.apache.org/jira/browse/SPARK-15914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> We removed some deprecated methods from SQLContext in the Spark 2.0 branch.
> For example:
> {code}
>   @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
>   def jsonFile(path: String): DataFrame = {
> read.json(path)
>   }
> {code}
> These deprecated methods may be used by existing third-party data sources. We 
> probably want to add them back to preserve backward compatibility. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15914) Add deprecated method back to SQLContext for compatiblity

2016-06-13 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-15914:
--

 Summary: Add deprecated method back to SQLContext for compatiblity
 Key: SPARK-15914
 URL: https://issues.apache.org/jira/browse/SPARK-15914
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Sean Zhong


We removed some deprecated methods from SQLContext in the Spark 2.0 branch.

For example:
{code}
  @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
  def jsonFile(path: String): DataFrame = {
read.json(path)
  }
{code}

These deprecated methods may be used by existing third-party data sources. We 
probably want to add them back to preserve backward compatibility. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15910) Schema is not checked when converting DataFrame to Dataset using Kryo encoder

2016-06-12 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-15910:
--

 Summary: Schema is not checked when converting DataFrame to 
Dataset using Kryo encoder
 Key: SPARK-15910
 URL: https://issues.apache.org/jira/browse/SPARK-15910
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Sean Zhong


Here is a case that reproduces it:

{code}
scala> import org.apache.spark.sql.Encoders._
scala> import org.apache.spark.sql.Encoders
scala> import org.apache.spark.sql.Encoder

scala> case class B(b: Int)

scala> implicit val encoder = Encoders.kryo[B]
encoder: org.apache.spark.sql.Encoder[B] = class[value[0]: binary]

scala> val ds = Seq((1)).toDF("b").as[B].map(identity)
ds: org.apache.spark.sql.Dataset[B] = [value: binary]

scala> ds.show()
16/06/10 13:46:51 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, 
Column 168: No applicable constructor/method found for actual parameters "int"; 
candidates are: "public static java.nio.ByteBuffer 
java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer 
java.nio.ByteBuffer.wrap(byte[], int, int)"
...
{code}

The expected behavior is to report the schema check failure earlier, when the 
Dataset is created with {{dataFrame.as\[B\]}}.
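As a rough illustration of what an earlier check could look like (a hypothetical helper, not the actual fix; the names are made up), the encoder's schema could be compared against the DataFrame's schema before delegating to {{as}}. For a Kryo encoder the expected schema is a single binary {{value}} column, so the mismatch with {{b: int}} would surface immediately:

{code}
import org.apache.spark.sql.{DataFrame, Dataset, Encoder}

// Hypothetical helper, not Spark's API: fail fast if the DataFrame schema
// cannot satisfy the target encoder's schema (by field name and data type).
def checkedAs[T](df: DataFrame)(implicit enc: Encoder[T]): Dataset[T] = {
  val actual = df.schema.fields.map(f => f.name -> f.dataType).toMap
  val mismatched = enc.schema.fields.collect {
    case f if !actual.get(f.name).contains(f.dataType) => s"${f.name}: ${f.dataType}"
  }
  require(mismatched.isEmpty,
    s"DataFrame schema ${df.schema.simpleString} cannot satisfy encoder fields: ${mismatched.mkString(", ")}")
  df.as[T]
}
{code}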




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15792) [SQL] Allows operator to change the verbosity in explain output.

2016-06-06 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-15792:
--

 Summary: [SQL] Allows operator to change the verbosity in explain 
output.
 Key: SPARK-15792
 URL: https://issues.apache.org/jira/browse/SPARK-15792
 Project: Spark
  Issue Type: Improvement
Reporter: Sean Zhong
Priority: Minor


We should allow an operator (physical plan or logical plan) to change its 
verbosity in explain output.

For example, we may not want to display {{output=[count(a)#48L]}} in 
less-verbose mode.

{code}
scala> spark.sql("select count(a) from df").explain()

== Physical Plan ==
*HashAggregate(key=[], functions=[count(1)], output=[count(a)#48L])
+- Exchange SinglePartition
   +- *HashAggregate(key=[], functions=[partial_count(1)], output=[count#50L])
  +- LocalTableScan
{code}
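
A rough sketch of the idea (class and method names here are made up for illustration, loosely modeled on a simpleString/verboseString split; this is not Spark's actual API): each operator exposes both a compact and a verbose string form, and the explain printer picks one depending on the requested verbosity.

{code}
// Illustrative only: a made-up mini plan node, not Spark's TreeNode.
trait ExplainNode {
  def nodeName: String
  def verboseArgs: String                                    // e.g. "output=[count(a)#48L]"
  def simpleString: String  = nodeName                       // normal explain()
  def verboseString: String = s"$nodeName, $verboseArgs"     // explain(verbose = true)
}

case class HashAggregateNode(functions: Seq[String], output: Seq[String]) extends ExplainNode {
  def nodeName    = s"HashAggregate(functions=[${functions.mkString(", ")}])"
  def verboseArgs = s"output=[${output.mkString(", ")}]"
}

val agg = HashAggregateNode(Seq("count(1)"), Seq("count(a)#48L"))
// agg.simpleString  -> HashAggregate(functions=[count(1)])
// agg.verboseString -> HashAggregate(functions=[count(1)]), output=[count(a)#48L]
{code}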



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15632) Dataset typed filter operation changes query plan schema

2016-06-06 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316998#comment-15316998
 ] 

Sean Zhong commented on SPARK-15632:


*Root cause analysis:*

*The root cause is that the implementation of {{dataset.as\[B\]}} is wrong.*

{code}
scala> val df = Seq((1,2,3)).toDF("a", "b", "c")
scala> case class B(b: Int)
scala> val b = df.as[B]
{code}

We expect {{b}} to be a "view" of the underlying {{df}} that *SHOULD* return only *ONE* 
column, but currently {{b}} contains all the columns.

*What we expect:*

{code}
scala> b.show()
+---+
|  b|
+---+
|  2|
+---+
{code}

*What it behaves now:*

{code}
scala> b.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+
{code}

> Dataset typed filter operation changes query plan schema
> 
>
> Key: SPARK-15632
> URL: https://issues.apache.org/jira/browse/SPARK-15632
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiang Zhong
>
> h1. Overview
> Filter operations should never change query plan schema. However, Dataset 
> typed filter operation does introduce schema change in some cases. 
> Furthermore, all the following aspects of the schema may be changed:
> # field order,
> # field number,
> # field data type,
> # field name, and
> # field nullability
> This is mostly because we wrap the actual {{Filter}} operator with a 
> {{SerializeFromObject}}/{{DeserializeToObject}} pair (query plan fragment 
> illustrated as follows), which performs a bunch of magic tricks.
> {noformat}
> SerializeFromObject
>  Filter
>   DeserializeToObject
>
> {noformat}
> h1. Reproduction
> h2. Field order, field number, and field data type change
> {code}
> case class A(b: Double, a: String)
> val data = Seq(
>   "{ 'a': 'foo', 'b': 1, 'c': 'extra' }",
>   "{ 'a': 'bar', 'b': 2, 'c': 'extra' }",
>   "{ 'a': 'bar', 'c': 'extra' }"
> )
> val df1 = spark.read.json(sc.parallelize(data))
> df1.printSchema()
> // root
> //  |-- a: string (nullable = true)
> //  |-- b: long (nullable = true)
> //  |-- c: string (nullable = true)
> val ds1 = df1.as[A]
> ds1.printSchema()
> // root
> //  |-- a: string (nullable = true)
> //  |-- b: long (nullable = true)
> //  |-- c: string (nullable = true)
> val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker
> ds2.printSchema()
> // root <- 1. Reordered `a` and `b`, and
> //  |-- b: double (nullable = true)2. dropped `c`, and
> //  |-- a: string (nullable = true)3. up-casted `b` from long to double
> val df2 = ds2.toDF()
> df2.printSchema()
> // root <- (Same as above)
> //  |-- b: double (nullable = true)
> //  |-- a: string (nullable = true)
> {code}
> h3. Field order change
> {{DeserializeToObject}} resolves the encoder deserializer expression by 
> *name*. Thus field order in input query plan doesn't matter.
> h3. Field number change
> Same as above, fields not referred by the encoder are silently dropped while 
> resolving deserializer expressions by name.
> h3. Field data type change
> When generating deserializer expressions, we allow "sane" implicit coercions 
> (e.g. integer to long, and long to double) by inserting {{UpCast}} operators. 
> Thus actual field data types in input query plan don't matter either as long 
> as there are valid implicit coercions.
> h2. Field name and nullability change
> {code}
> val ds3 = spark.range(10)
> ds3.printSchema()
> // root
> //  |-- id: long (nullable = false)
> val ds4 = ds3.filter(_ > 3)
> ds4.printSchema()
> // root
> //  |-- value: long (nullable = true)  4. Name changed from `id` to `value`, 
> and
> // 5. nullability changed from false to 
> true
> {code}
> h3. Field name change
> Primitive encoders like {{Encoder\[Long\]}} don't have a named field, thus 
> they always have only a single field with the hard-coded name "value". On the 
> other hand, when serializing domain objects back to rows, schema of 
> {{SerializeFromObject}} is solely determined by the encoder. Thus the 
> original name "id" becomes "value".
> h3. Nullability change
> [PR #11880|https://github.com/apache/spark/pull/11880] updated return type of 
> {{SparkSession.range}} from {{Dataset\[Long\]}} to 
> {{Dataset\[java.lang.Long\]}} due to 
> [SI-4388|https://issues.scala-lang.org/browse/SI-4388]. As a consequence, 
> although the underlying {{Range}} operator produces non-nullable output, the 
> result encoder is nullable since {{java.lang.Long}} is nullable. Thus, we 
> observe nullability change after typed filtering because serializer 
> expression is derived from encoder rather than the query plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-15632) Dataset typed filter operation changes query plan schema

2016-06-06 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316996#comment-15316996
 ] 

Sean Zhong commented on SPARK-15632:


There are more issues linked with this bug that we may want to fix altogether. 

CASE 1: {{map(identity)}} changes the table content and schema:

As you know, {{identity}} is a Scala function that simply returns its argument unchanged.

{code}
scala> val df = Seq((1,2,3)).toDF("a", "b", "c")
scala> case class B(b: Int)
scala> val b = df.as[B]
scala> b.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+


scala> b.map(identity).show()
+---+
|  b|
+---+
|  2|
+---+
{code}

CASE 2: {{filter}} changes the column name:

{{filter}} changes the column name from {{id}} to {{value}}.
{code}
scala> val df = spark.range(1,3)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.show()
+---+
| id|
+---+
|  1|
|  2|
+---+


scala> df.filter(_ => true).show()
+-+
|value|
+-+
|1|
|2|
+-+
{code}

CASE 3: Able to select a non-existent field:

{code}
scala> val df = Seq((1,2,3)).toDF("a", "b", "c")
scala> case class B(b: Int)
scala> val b = df.as[B]
scala> b.getClass
res7: Class[_ <: org.apache.spark.sql.Dataset[B]] = class 
org.apache.spark.sql.Dataset

scala> b.select("c")   // "c" is NOT a valid field name of class B.
scala> b.select("c").show()
+---+
|  c|
+---+
|  3|
+---+
{code}


> Dataset typed filter operation changes query plan schema
> 
>
> Key: SPARK-15632
> URL: https://issues.apache.org/jira/browse/SPARK-15632
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiang Zhong
>
> h1. Overview
> Filter operations should never change query plan schema. However, Dataset 
> typed filter operation does introduce schema change in some cases. 
> Furthermore, all the following aspects of the schema may be changed:
> # field order,
> # field number,
> # field data type,
> # field name, and
> # field nullability
> This is mostly because we wrap the actual {{Filter}} operator with a 
> {{SerializeFromObject}}/{{DeserializeToObject}} pair (query plan fragment 
> illustrated as follows), which performs a bunch of magic tricks.
> {noformat}
> SerializeFromObject
>  Filter
>   DeserializeToObject
>
> {noformat}
> h1. Reproduction
> h2. Field order, field number, and field data type change
> {code}
> case class A(b: Double, a: String)
> val data = Seq(
>   "{ 'a': 'foo', 'b': 1, 'c': 'extra' }",
>   "{ 'a': 'bar', 'b': 2, 'c': 'extra' }",
>   "{ 'a': 'bar', 'c': 'extra' }"
> )
> val df1 = spark.read.json(sc.parallelize(data))
> df1.printSchema()
> // root
> //  |-- a: string (nullable = true)
> //  |-- b: long (nullable = true)
> //  |-- c: string (nullable = true)
> val ds1 = df1.as[A]
> ds1.printSchema()
> // root
> //  |-- a: string (nullable = true)
> //  |-- b: long (nullable = true)
> //  |-- c: string (nullable = true)
> val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker
> ds2.printSchema()
> // root <- 1. Reordered `a` and `b`, and
> //  |-- b: double (nullable = true)2. dropped `c`, and
> //  |-- a: string (nullable = true)3. up-casted `b` from long to double
> val df2 = ds2.toDF()
> df2.printSchema()
> // root <- (Same as above)
> //  |-- b: double (nullable = true)
> //  |-- a: string (nullable = true)
> {code}
> h3. Field order change
> {{DeserializeToObject}} resolves the encoder deserializer expression by 
> *name*. Thus field order in input query plan doesn't matter.
> h3. Field number change
> Same as above, fields not referred by the encoder are silently dropped while 
> resolving deserializer expressions by name.
> h3. Field data type change
> When generating deserializer expressions, we allow "sane" implicit coercions 
> (e.g. integer to long, and long to double) by inserting {{UpCast}} operators. 
> Thus actual field data types in input query plan don't matter either as long 
> as there are valid implicit coercions.
> h2. Field name and nullability change
> {code}
> val ds3 = spark.range(10)
> ds3.printSchema()
> // root
> //  |-- id: long (nullable = false)
> val ds4 = ds3.filter(_ > 3)
> ds4.printSchema()
> // root
> //  |-- value: long (nullable = true)  4. Name changed from `id` to `value`, 
> and
> // 5. nullability changed from false to 
> true
> {code}
> h3. Field name change
> Primitive encoders like {{Encoder\[Long\]}} don't have a named field, thus 
> they always have only a single field with the hard-coded name "value". On the 
> other hand, when serializing domain objects back to rows, schema of 
> {{SerializeFromObject}} is solely determined by the encoder. Thus the 
> original name "id" becomes "value".
> h3. Nullability 

[jira] [Created] (SPARK-15734) Avoids printing internal row in explain output

2016-06-02 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-15734:
--

 Summary: Avoids printing internal row in explain output
 Key: SPARK-15734
 URL: https://issues.apache.org/jira/browse/SPARK-15734
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Sean Zhong
Priority: Minor


Avoid printing internal rows in explain output.

In explain output, some operators may print internal rows. For example, for 
{{LocalRelation}}, the explain output may look like this:
{code}
scala> (1 to 10).toSeq.map(_ => (1,2,3)).toDF().createTempView("df3")
scala> spark.sql("select * from df3 where 1=2").explain(true)
...

== Analyzed Logical Plan ==
_1: int, _2: int, _3: int
Project [_1#37,_2#38,_3#39]
+- Filter (1 = 2)
   +- SubqueryAlias df3
  +- LocalRelation [_1#37,_2#38,_3#39], 
[[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3]]
...
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15733) Makes the explain output less verbose by hiding some verbose output like None, null, empty List, and etc..

2016-06-02 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-15733:
--

 Summary: Makes the explain output less verbose by hiding some 
verbose output like None, null, empty List, and etc..
 Key: SPARK-15733
 URL: https://issues.apache.org/jira/browse/SPARK-15733
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Sean Zhong
Priority: Minor


In the output of {{dataframe.explain()}}, we often see verbose string patterns 
like {{None, [], {}, null, Some(tableName)}}.

For example, for {{spark.sql("show tables").explain()}}, we get:

{code}
== Physical Plan ==
ExecutedCommand
:  +- ShowTablesCommand None, None
{code}

It would be easier to read if we optimized it to:
{code}
== Physical Plan ==
ExecutedCommand
:  +- ShowTablesCommand
{code}
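
A minimal sketch of the kind of filtering this implies (my own illustration, not the actual patch): drop arguments that carry no information and unwrap {{Some(...)}} before rendering an operator's argument string.

{code}
// Illustrative helper only: compact an operator's argument list for display.
def compactArgs(args: Seq[Any]): String =
  args.filterNot {
    case None | Nil | null         => true   // hide empty/absent values
    case m: Map[_, _] if m.isEmpty => true
    case s: String if s.isEmpty    => true
    case _                         => false
  }.map {
    case Some(v) => v.toString               // Some(tableName) -> tableName
    case other   => other.toString
  }.mkString(", ")

// compactArgs(Seq(None, None))                    -> ""   (so ShowTablesCommand prints bare)
// compactArgs(Seq(Some("testView2"), Nil, false)) -> "testView2, false"
{code}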



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15495) Improve the output of explain for aggregate operator

2016-06-01 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-15495:
---
Description: 
We should improve the explain output of the aggregate operator to make it more 
readable.

For example:

rename {{TungstenAggregate}} to {{Aggregate}} 


  was:
The current output of explain has several problems:
1. the output format is not consistent.
2. Some key information is missing.
3. For some nodes, the output message is too verbose.
4. We don't know the output fields of each physical operator.
and ...

I'll add more comments later.


> Improve the output of explain for aggregate operator
> 
>
> Key: SPARK-15495
> URL: https://issues.apache.org/jira/browse/SPARK-15495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
>
> We should improve the explain output of the aggregate operator to make it more 
> readable.
> For example:
> rename TungstenAggregate to Aggregate 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15692) Improves the explain output of several physical plans by displaying embedded logical plan in tree style

2016-06-01 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-15692:
--

 Summary: Improves the explain output of several physical plans by 
displaying embedded logical plan in tree style
 Key: SPARK-15692
 URL: https://issues.apache.org/jira/browse/SPARK-15692
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Sean Zhong
Priority: Minor


Some physical plans contain an embedded logical plan. For example, 
{{cache tableName query}} maps to:

{code}
case class CacheTableCommand(
tableName: String,
plan: Option[LogicalPlan],
isLazy: Boolean) 
  extends RunnableCommand
{code}

It would be easier to read the explain output if we displayed the {{plan}} in tree 
style.

Before change:

{code}
scala> Seq((1,2)).toDF().createOrReplaceTempView("testView")
scala> spark.sql("cache table testView2 select * from testView").explain()
== Physical Plan ==
ExecutedCommand CacheTableCommand testView2, Some('Project [*]
+- 'UnresolvedRelation `testView`, None
), false
{code}

After change:

{code}
scala> spark.sql("cache table testView2 select * from testView").explain()
== Physical Plan ==
ExecutedCommand
:  +- CacheTableCommand testView2, false
: :  +- 'Project [*]
: : +- 'UnresolvedRelation `testView`, None
{code}
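
A toy sketch of the rendering idea (made-up node classes, not the actual Spark change): recurse into the embedded plan and indent each level the same way the top-level tree is printed.

{code}
// Illustrative only: a minimal node type plus an indented tree printer.
sealed trait PlanNode { def label: String; def children: Seq[PlanNode] }
case class Leaf(label: String) extends PlanNode { def children = Nil }
case class Branch(label: String, children: PlanNode*) extends PlanNode

def treeString(node: PlanNode, depth: Int = 0): String = {
  val prefix = if (depth == 0) "" else (": " * (depth - 1)) + ":  +- "
  prefix + node.label + "\n" + node.children.map(treeString(_, depth + 1)).mkString
}

// treeString(Branch("ExecutedCommand",
//   Branch("CacheTableCommand testView2, false",
//     Branch("'Project [*]", Leaf("'UnresolvedRelation `testView`, None")))))
{code}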




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15674) Deprecates "CREATE TEMPORARY TABLE USING...", use "CREATE TEMPORARY VIEW USING..." instead.

2016-05-31 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-15674:
--

 Summary: Deprecates "CREATE TEMPORARY TABLE USING...", use "CREATE 
TEMPORARY VIEW USING..." instead.
 Key: SPARK-15674
 URL: https://issues.apache.org/jira/browse/SPARK-15674
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Sean Zhong
Priority: Minor


The current implementation of "CREATE TEMPORARY TABLE USING..." is actually 
creating a temporary VIEW behind the scenes.

We probably should just use "CREATE TEMPORARY VIEW USING..." instead.
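
For illustration (the data source format and path below are placeholders, not from this issue, and a {{spark}} session with {{spark.implicits._}} is assumed in scope as in the other snippets here), both statements register the same kind of session-scoped relation; the second form simply names what actually happens:

{code}
// Write a small parquet file so the statements have something to point at.
Seq(1 -> "a", 2 -> "b").toDF("id", "v").write.mode("overwrite").parquet("/tmp/spark_15674_demo")

// Deprecated spelling: creates a temporary view despite saying TABLE.
spark.sql("CREATE TEMPORARY TABLE tmp_demo USING parquet OPTIONS (path '/tmp/spark_15674_demo')")

// Preferred spelling with the same effect.
spark.sql("CREATE TEMPORARY VIEW tmp_demo2 USING parquet OPTIONS (path '/tmp/spark_15674_demo')")

spark.sql("SELECT * FROM tmp_demo2").show()
{code}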



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15495) Improve the output of explain for aggregate operator

2016-05-23 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-15495:
--

 Summary: Improve the output of explain for aggregate operator
 Key: SPARK-15495
 URL: https://issues.apache.org/jira/browse/SPARK-15495
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2, 2.0.0
Reporter: Sean Zhong
Priority: Minor


The current output of explain has several problems:
1. the output format is not consistent.
2. Some key information is missing.
3. For some nodes, the output message is too verbose.
4. We don't know the output fields of each physical operator.
and ...

I'll add more comments later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15334) HiveClient facade not compatible with Hive 0.12

2016-05-15 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-15334:
---
Summary: HiveClient facade not compatible with Hive 0.12  (was: 
[SPARK][SQL] HiveClient facade not compatible with Hive 0.12)

> HiveClient facade not compatible with Hive 0.12
> ---
>
> Key: SPARK-15334
> URL: https://issues.apache.org/jira/browse/SPARK-15334
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Priority: Minor
>
> Compatibility issues:
> 1. org.apache.spark.sql.hive.client.HiveClientImpl uses AddPartitionDesc(db, 
> table, ignoreIfExists) to create partitions, while Hive 0.12 doesn't have 
> this constructor for AddPartitionDesc.
> 2. HiveClientImpl uses PartitionDropOptions when dropping partitions; however, 
> PartitionDropOptions doesn't exist in Hive 0.12.
> 3. Hive 0.12 doesn't support adding permanent functions. It is not valid to 
> call "org.apache.hadoop.hive.ql.metadata.Hive.createFunction", 
> "org.apache.hadoop.hive.ql.metadata.Hive.alterFunction", and 
> "org.apache.hadoop.hive.ql.metadata.Hive.alterFunction"
> 4. org.apache.spark.sql.hive.client.HiveClient doesn't have enough test 
> coverage. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


