[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-07-19 Thread Jon Chase (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633016#comment-14633016 ]

Jon Chase commented on SPARK-6962:
--

I'll check tomorrow on 1.4.0.

 Netty BlockTransferService hangs in the middle of SQL query
 ---

 Key: SPARK-6962
 URL: https://issues.apache.org/jira/browse/SPARK-6962
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Jon Chase
 Attachments: jstacks.txt


 Spark SQL queries (though this seems to be a Spark Core issue - I'm just 
 using queries in the REPL to surface it, so I mention Spark SQL) hang 
 indefinitely under certain (not fully understood) circumstances.
 This is resolved by setting spark.shuffle.blockTransferService=nio, which 
 seems to point to Netty as the culprit. Netty became the default for the 
 block transport layer in 1.2.0, which is when this issue started. Setting 
 the service to nio allows queries to complete normally.
 I do not see this problem when running queries over smaller datasets (~20 
 5MB files). When I increase the scope to include more data (several hundred 
 ~5MB files), the queries get through several steps but eventually hang 
 indefinitely.
 Here's the email chain regarding this issue, including stack traces:
 http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com
 For context, here's the announcement regarding the block transfer service 
 change: 
 http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com
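 A minimal sketch of how I apply the nio workaround from plain Java (the 
 config key is the one above; the class and app name here are just 
 illustrative):
 {code}
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaSparkContext;

 public class NioWorkaroundSketch {
     public static void main(String[] args) {
         // Fall back to the pre-1.2.0 nio block transfer service instead of netty.
         SparkConf conf = new SparkConf()
                 .setAppName("nio-workaround-sketch")
                 .set("spark.shuffle.blockTransferService", "nio");
         JavaSparkContext sc = new JavaSparkContext(conf);
         // ... run the queries that previously hung ...
         sc.stop();
     }
 }
 {code}
 On spark-shell, the equivalent is passing --conf spark.shuffle.blockTransferService=nio.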






[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-04-17 Thread Jon Chase (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500406#comment-14500406 ]

Jon Chase commented on SPARK-6962:
--

I'm tailing the executor logs before/as this is happening and I don't see 
anything out of the ordinary (errors, etc.).  Here's what the logs look like 
when the lockup occurs (again, nothing out of the ordinary).  I tailed all 
executors' logs, and they all look similar to this.

==> /mnt/var/log/hadoop/yarn-hadoop-nodemanager-ip-XX-XX-XX-XXX.eu-west-1.compute.internal.log <==
2015-04-17 18:27:58,206 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 11216 for container-id container_1429189930421_0012_01_02: 6.7 GB of 10 GB physical memory used; 11.3 GB of 50 GB virtual memory used
2015-04-17 18:28:01,214 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 11216 for container-id container_1429189930421_0012_01_02: 6.7 GB of 10 GB physical memory used; 11.3 GB of 50 GB virtual memory used
2015-04-17 18:28:04,221 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 11216 for container-id container_1429189930421_0012_01_02: 6.7 GB of 10 GB physical memory used; 11.3 GB of 50 GB virtual memory used
2015-04-17 18:28:07,229 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 11216 for container-id container_1429189930421_0012_01_02: 6.7 GB of 10 GB physical memory used; 11.3 GB of 50 GB virtual memory used




[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-04-17 Thread Jon Chase (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500419#comment-14500419 ]

Jon Chase commented on SPARK-6962:
--

Looking at the UI when the lock up occurs, I see that every executor has 4 
active tasks.  It's not the case that, say, only a single executor has a task 
running - they all appear to be busy while locked up.  




[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query

2015-04-17 Thread Jon Chase (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500429#comment-14500429 ]

Jon Chase commented on SPARK-6962:
--

Here's the stderr from the executors at the time of the lockup (there are 3 
executors).

18:26:00 is when the lockup happened; 20+ minutes later, these are still the 
most recent entries in executor 1's log:

15/04/17 18:26:00 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1132
15/04/17 18:26:00 INFO executor.Executor: Running task 110.0 in stage 15.0 (TID 1132)
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 3 ms
15/04/17 18:26:00 INFO executor.Executor: Finished task 107.0 in stage 15.0 (TID 1129). 8325 bytes result sent to driver
15/04/17 18:26:00 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1133
15/04/17 18:26:00 INFO executor.Executor: Running task 111.0 in stage 15.0 (TID 1133)
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 2 ms




Here's executor 2; it shows no activity for about 20 minutes (again, the 
lockup happened at ~18:26:00):

15/04/17 18:25:48 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:25:48 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 11 ms
15/04/17 18:25:49 INFO executor.Executor: Finished task 13.0 in stage 15.0 (TID 1035). 12013 bytes result sent to driver
15/04/17 18:25:49 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1068
15/04/17 18:25:49 INFO executor.Executor: Running task 46.0 in stage 15.0 (TID 1068)
15/04/17 18:25:49 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:25:49 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 16 ms
15/04/17 18:41:19 WARN server.TransportChannelHandler: Exception in connection from /10.106.144.109:49697
java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at java.lang.Thread.run(Thread.java:745)
15/04/17 18:41:27 WARN server.TransportChannelHandler: Exception in connection from /10.106.145.10:38473
java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at java.lang.Thread.run(Thread.java:745)



Same with executor 3:

15/04/17 18:25:52 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1092
15/04/17 18:25:52 INFO 

[jira] [Comment Edited] (SPARK-6962) Spark gets stuck on a step, hangs forever - jobs do not complete

2015-04-16 Thread Jon Chase (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498233#comment-14498233 ]

Jon Chase edited comment on SPARK-6962 at 4/16/15 4:17 PM:
---

Attaching the stack dumps I took when Spark is hanging.


was (Author: jonchase):
Here are the stack dumps I took when Spark is hanging.




[jira] [Commented] (SPARK-6962) Spark gets stuck on a step, hangs forever - jobs do not complete

2015-04-16 Thread Jon Chase (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498232#comment-14498232 ]

Jon Chase commented on SPARK-6962:
--

I think it's different from SPARK-4395, as calling/omitting .cache() doesn't 
have any effect.  Also, once it hangs, I've never seen it finish (even after 
waiting many hours).

It's also different from SPARK-5060, I believe, as the web UI accurately 
reports the remaining tasks as unfinished (or as in progress, for the ones 
running when the hang occurs).

Here's my original post from the email thread:

===

Spark 1.3.0 on YARN (Amazon EMR), cluster of 10 m3.2xlarge (8cpu, 30GB),
executor memory 20GB, driver memory 10GB

I'm using Spark SQL, mainly via spark-shell, to query 15GB of data spread
out over roughly 2,000 Parquet files, and my queries frequently hang. Simple
queries like select count(*) from ... on the entire data set work OK.
Slightly more demanding ones with group by's and some aggregate functions
(percentile_approx, avg, etc.) work OK as well, as long as I have some
criteria in my where clause to keep the number of rows down.

Once I hit some limit on query complexity and rows processed, my queries
start to hang.  I've left them for up to an hour without seeing any
progress.  No OOMs either - the job is just stuck.

I've tried setting spark.sql.shuffle.partitions to 400 and even 800, but
with the same results: usually near the end of the tasks (like 780 of 800
complete), progress just stops:

15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 788.0 in stage 1.0 (TID 1618) in 800 ms on ip-10-209-22-211.eu-west-1.compute.internal (748/800)
15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 793.0 in stage 1.0 (TID 1623) in 622 ms on ip-10-105-12-41.eu-west-1.compute.internal (749/800)
15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 797.0 in stage 1.0 (TID 1627) in 616 ms on ip-10-90-2-201.eu-west-1.compute.internal (750/800)
15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 799.0 in stage 1.0 (TID 1629) in 611 ms on ip-10-90-2-201.eu-west-1.compute.internal (751/800)
15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 795.0 in stage 1.0 (TID 1625) in 669 ms on ip-10-105-12-41.eu-west-1.compute.internal (752/800)

^^^ this is where it stays forever

Looking at the Spark UI, several of the executors still list active tasks.
I do see that the Shuffle Read for executors that don't have any tasks
remaining is around 100MB, whereas it's more like 10MB for the executors
that still have tasks.

The first stage, mapPartitions, always completes fine.  It's the second
stage (takeOrdered) that hangs.

I've had this issue in 1.2.0 and 1.2.1 as well as 1.3.0.  I've also
encountered it when using JSON files (instead of Parquet).
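In case it helps anyone reproduce, here's a sketch of how I've been setting 
spark.sql.shuffle.partitions before running the query (Spark 1.3 Java API; the 
app name and the elided query are placeholders):

{code}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;

public class ShufflePartitionsSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("shuffle-partitions-sketch"));
        SQLContext sqlContext = new SQLContext(sc);
        // Raise the shuffle partition count (I tried 400 and 800) before
        // running the aggregation query that eventually hangs.
        sqlContext.setConf("spark.sql.shuffle.partitions", "800");
        // The actual query is elided above; any query that shuffles will
        // pick up the new partition count.
        sqlContext.sql("select ...");  // placeholder, not a runnable query
        sc.stop();
    }
}
{code}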





[jira] [Updated] (SPARK-6962) Spark gets stuck on a step, hangs forever - jobs do not complete

2015-04-16 Thread Jon Chase (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jon Chase updated SPARK-6962:
-
Attachment: jstacks.txt

Here are the stack dumps I took when Spark is hanging.




[jira] [Updated] (SPARK-6570) Spark SQL arrays: explode() fails and cannot save array type to Parquet

2015-03-27 Thread Jon Chase (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-6570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jon Chase updated SPARK-6570:
-
Summary: Spark SQL arrays: explode() fails and cannot save array type to 
Parquet  (was: Spark SQL explode() fails, assumes underlying SQL array is 
represented by Scala Seq)

 Spark SQL arrays: explode() fails and cannot save array type to Parquet
 -

 Key: SPARK-6570
 URL: https://issues.apache.org/jira/browse/SPARK-6570
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase

 {code}
 @Rule
 public TemporaryFolder tmp = new TemporaryFolder();

 @Test
 public void testPercentileWithExplode() throws Exception {
     StructType schema = DataTypes.createStructType(Lists.newArrayList(
             DataTypes.createStructField("col1", DataTypes.StringType, false),
             DataTypes.createStructField("col2s", DataTypes.createArrayType(DataTypes.IntegerType, true), true)
     ));

     JavaRDD<Row> rowRDD = sc.parallelize(Lists.newArrayList(
             RowFactory.create("test", new int[]{1, 2, 3})
     ));

     DataFrame df = sql.createDataFrame(rowRDD, schema);
     df.registerTempTable("df");
     df.printSchema();

     List<int[]> ints = sql.sql("select col2s from df").javaRDD()
             .map(row -> (int[]) row.get(0)).collect();
     assertEquals(1, ints.size());
     assertArrayEquals(new int[]{1, 2, 3}, ints.get(0));

     // fails: lateral view explode does not work:
     // java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
     List<Integer> explodedInts = sql.sql("select col2 from df lateral view explode(col2s) splode as col2").javaRDD()
             .map(row -> row.getInt(0)).collect();
     assertEquals(3, explodedInts.size());
     assertEquals(Lists.newArrayList(1, 2, 3), explodedInts);

     // fails: java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
     df.saveAsParquetFile(tmp.getRoot().getAbsolutePath() + "/parquet");
     DataFrame loadedDf = sql.load(tmp.getRoot().getAbsolutePath() + "/parquet");
     loadedDf.registerTempTable("loadedDf");

     List<int[]> moreInts = sql.sql("select col2s from loadedDf").javaRDD()
             .map(row -> (int[]) row.get(0)).collect();
     assertEquals(1, moreInts.size());
     assertArrayEquals(new int[]{1, 2, 3}, moreInts.get(0));
 }
 {code}
 {code}
 root
  |-- col1: string (nullable = false)
  |-- col2s: array (nullable = true)
  |    |-- element: integer (containsNull = true)

 ERROR org.apache.spark.executor.Executor Exception in task 7.0 in stage 1.0 (TID 15)
 java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
     at org.apache.spark.sql.catalyst.expressions.Explode.eval(generators.scala:125) ~[spark-catalyst_2.10-1.3.0.jar:1.3.0]
     at org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:70) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
     at org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:69) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) ~[scala-library-2.10.4.jar:na]
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na]
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na]
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na]
     at scala.collection.Iterator$class.foreach(Iterator.scala:727) ~[scala-library-2.10.4.jar:na]
     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ~[scala-library-2.10.4.jar:na]
 {code}






[jira] [Commented] (SPARK-6570) Spark SQL explode() fails, assumes underlying SQL array is represented by Scala Seq

2015-03-27 Thread Jon Chase (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383794#comment-14383794 ]

Jon Chase commented on SPARK-6570:
--

Stack trace for saveAsParquetFile():

{code}
root
 |-- col1: string (nullable = false)
 |-- col2s: array (nullable = true)
 |    |-- element: integer (containsNull = true)

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
ERROR org.apache.spark.executor.Executor Exception in task 7.0 in stage 1.0 (TID 15)
java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
    at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:185) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
    at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
    at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
    at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) ~[parquet-hadoop-1.6.0rc3.jar:na]
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) ~[parquet-hadoop-1.6.0rc3.jar:na]
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) ~[parquet-hadoop-1.6.0rc3.jar:na]
    at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:631) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) ~[spark-core_2.10-1.3.0.jar:1.3.0]
    at org.apache.spark.scheduler.Task.run(Task.scala:64) ~[spark-core_2.10-1.3.0.jar:1.3.0]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) ~[spark-core_2.10-1.3.0.jar:1.3.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_31]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_31]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_31]
WARN  o.a.spark.scheduler.TaskSetManager Lost task 7.0 in stage 1.0 (TID 15, localhost): java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
    at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:185)
    at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
    at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
    at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
    at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:631)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

ERROR o.a.spark.scheduler.TaskSetManager Task 7 in stage 1.0 failed 1 times; aborting job

org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 1.0 failed 1 times, most recent failure: Lost task 7.0 in stage 1.0 (TID 15, localhost): java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
    at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:185)
    at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
    at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
    at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
    at 
{code}

[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Jon Chase (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381720#comment-14381720 ]

Jon Chase commented on SPARK-6554:
--

Here's a test case to reproduce the issue:

{code}
@Test
public void testSpark_6554() {
    // given:
    DataFrame saveDF = sql.jsonRDD(
            sc.parallelize(Lists.newArrayList("{\"col1\": 1}")),
            DataTypes.createStructType(Lists.newArrayList(
                    DataTypes.createStructField("col1", DataTypes.IntegerType, false))));

    // when:
    saveDF.saveAsParquetFile(tmp.getRoot().getAbsolutePath() + "/col2=2");

    // then:
    DataFrame loadedDF = sql.load(tmp.getRoot().getAbsolutePath());
    assertEquals(1, loadedDF.count());

    assertEquals(2, loadedDF.schema().fieldNames().length);
    assertEquals("col1", loadedDF.schema().fieldNames()[0]);
    assertEquals("col2", loadedDF.schema().fieldNames()[1]);

    loadedDF.registerTempTable("df");

    // this query works
    Row[] results = sql.sql("select col1, col2 from df").collect();
    assertEquals(1, results.length);
    assertEquals(2, results[0].size());

    // this query is broken
    results = sql.sql("select col1, col2 from df where col2 > 0").collect();
    assertEquals(1, results.length);
    assertEquals(2, results[0].size());
}
{code}

 

[jira] [Created] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Jon Chase (JIRA)
Jon Chase created SPARK-6554:


 Summary: Cannot use partition columns in where clause
 Key: SPARK-6554
 URL: https://issues.apache.org/jira/browse/SPARK-6554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase


I'm having trouble referencing partition columns in my queries with Parquet.

In the following example, 'probeTypeId' is a partition column; the directory 
structure looks like this:

/mydata
/probeTypeId=1
...files...
/probeTypeId=2
...files...

I see the column when I load a DataFrame from the /mydata directory and 
call df.printSchema():

...
 |-- probeTypeId: integer (nullable = true)
...

Parquet is also aware of the column:
 optional int32 probeTypeId;

And this works fine:

sqlContext.sql("select probeTypeId from df limit 1");

...as does df.show() - it shows the correct values for the partition column.


However, when I try to use a partition column in a where clause, I get an 
exception stating that the column was not found in the schema:

sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");



...
...
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema!
    at parquet.Preconditions.checkArgument(Preconditions.java:47)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
    at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
    at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
    at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
    at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
    at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
    at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
...
...






Here's the full stack trace:

using local[*] for master
06:05:55,675 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]
06:05:55,721 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to INFO
06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]
06:05:55,769 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
06:05:55,770 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current configuration as safe fallback point

INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO  org.apache.spark.SecurityManager Changing view acls to: jon
INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
INFO  org.apache.spark.SecurityManager SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jon); users with modify permissions: Set(jon)
INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
INFO  Remoting Starting remoting
INFO  Remoting Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.134:62493]
INFO  org.apache.spark.util.Utils Successfully started service 'sparkDriver' on port 62493.
INFO  org.apache.spark.SparkEnv Registering MapOutputTracker
INFO  org.apache.spark.SparkEnv Registering BlockManagerMaster
INFO  o.a.spark.storage.DiskBlockManager Created local directory at 

[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Jon Chase (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381831#comment-14381831 ]

Jon Chase commented on SPARK-6554:
--

spark.sql.parquet.filterPushdown was the problem.  Leaving it set to false 
works around the issue for now.

Thanks for jumping on this.
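
For reference, a sketch of the workaround in code (Spark 1.3 Java API; the 
path and table name are from the example above, the class name is 
illustrative):

{code}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class FilterPushdownWorkaroundSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("filter-pushdown-workaround"));
        SQLContext sqlContext = new SQLContext(sc);
        // Workaround, not a fix: keep Parquet filter push-down disabled so
        // predicates on partition columns never reach the Parquet reader.
        sqlContext.setConf("spark.sql.parquet.filterPushdown", "false");
        DataFrame df = sqlContext.load("/mydata");  // partitioned layout from the description
        df.registerTempTable("df");
        sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1").show();
        sc.stop();
    }
}
{code}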

 Cannot use partition columns in where clause when Parquet filter push-down is enabled
 --------------------------------------------------------------------------------------

 Key: SPARK-6554
 URL: https://issues.apache.org/jira/browse/SPARK-6554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase
Assignee: Cheng Lian
Priority: Critical
