[jira] [Commented] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.

2014-04-03 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959664#comment-13959664
 ] 

Davis Shepherd commented on SPARK-1411:
---

I submitted a pull request to fix SPARK-1337 as well.

Davis

On Thu, Apr 3, 2014 at 5:52 PM, Patrick Wendell (JIRA) 



> When using spark.ui.retainedStages=n only the first n stages are kept, not 
> the most recent.
> ---
>
> Key: SPARK-1411
> URL: https://issues.apache.org/jira/browse/SPARK-1411
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.9.0
> Environment: Ubuntu 12.04 LTS Precise
> 4 Nodes, 
> 16 cores each, 
> 196GB RAM each.
>Reporter: Davis Shepherd
> Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png
>
>
> For any long-running job with many stages, the web UI only shows the first n 
> stages of the job (where spark.ui.retainedStages=n). The most recent stages 
> are immediately dropped and are only visible for a brief time. This renders 
> the UI largely useless after a short time for a long-running non-streaming 
> job. I am unsure whether similar results appear for streaming jobs.
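
For context, the expected behaviour is to evict the oldest stages once the retained limit is exceeded, not the newest. A minimal sketch of that eviction policy (hypothetical class and field names, not Spark's actual JobProgressListener):

{code}
import scala.collection.mutable.ListBuffer

// Hypothetical stand-in for the UI's stage bookkeeping.
class RetainedStages(retainedLimit: Int) {
  // Completed stage ids, oldest first, newest appended last.
  private val completedStages = ListBuffer[Int]()

  def onStageCompleted(stageId: Int): Unit = {
    completedStages += stageId
    // Trim from the front (oldest) so the most recent stages stay visible.
    if (completedStages.size > retainedLimit) {
      completedStages.trimStart(completedStages.size - retainedLimit)
    }
  }

  def retained: Seq[Int] = completedStages.toList
}
{code}

The behaviour reported above corresponds to trimming from the end instead, so the first n stages survive and newer ones are dropped.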



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (SPARK-1394) calling system.platform on worker raises IOError

2014-04-03 Thread Idan Zalzberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Idan Zalzberg updated SPARK-1394:
-

Comment: was deleted

(was: It seems that the problem originates from pyspark capturing SIGCHLD in 
daemon.py
as described here: http://stackoverflow.com/a/3837851
)

> calling system.platform on worker raises IOError
> 
>
> Key: SPARK-1394
> URL: https://issues.apache.org/jira/browse/SPARK-1394
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.0
> Environment: Tested on Ubuntu and Linux, local and remote master, 
> python 2.7.*
>Reporter: Idan Zalzberg
>  Labels: pyspark
>
> A simple program that calls system.platform() on the worker fails most of the 
> time (it sometimes works, but very rarely).
> This is critical since many libraries call that method (e.g. boto).
> Here is the trace of the attempt to call that method:
> $ /usr/local/spark/bin/pyspark
> Python 2.7.3 (default, Feb 27 2014, 20:00:17)
> [GCC 4.6.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> 14/04/02 18:18:37 INFO Utils: Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 14/04/02 18:18:37 WARN Utils: Your hostname, qlika-dev resolves to a loopback 
> address: 127.0.1.1; using 10.33.102.46 instead (on interface eth1)
> 14/04/02 18:18:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 14/04/02 18:18:38 INFO Slf4jLogger: Slf4jLogger started
> 14/04/02 18:18:38 INFO Remoting: Starting remoting
> 14/04/02 18:18:39 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://spark@10.33.102.46:36640]
> 14/04/02 18:18:39 INFO Remoting: Remoting now listens on addresses: 
> [akka.tcp://spark@10.33.102.46:36640]
> 14/04/02 18:18:39 INFO SparkEnv: Registering BlockManagerMaster
> 14/04/02 18:18:39 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-local-20140402181839-919f
> 14/04/02 18:18:39 INFO MemoryStore: MemoryStore started with capacity 294.6 
> MB.
> 14/04/02 18:18:39 INFO ConnectionManager: Bound socket to port 43357 with id 
> = ConnectionManagerId(10.33.102.46,43357)
> 14/04/02 18:18:39 INFO BlockManagerMaster: Trying to register BlockManager
> 14/04/02 18:18:39 INFO BlockManagerMasterActor$BlockManagerInfo: Registering 
> block manager 10.33.102.46:43357 with 294.6 MB RAM
> 14/04/02 18:18:39 INFO BlockManagerMaster: Registered BlockManager
> 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
> 14/04/02 18:18:39 INFO HttpBroadcast: Broadcast server started at 
> http://10.33.102.46:51803
> 14/04/02 18:18:39 INFO SparkEnv: Registering MapOutputTracker
> 14/04/02 18:18:39 INFO HttpFileServer: HTTP File server directory is 
> /tmp/spark-9b38acb0-7b01-4463-b0a6-602bfed05a2b
> 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
> 14/04/02 18:18:40 INFO SparkUI: Started Spark Web UI at 
> http://10.33.102.46:4040
> 14/04/02 18:18:40 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 0.9.0
>       /_/
> Using Python version 2.7.3 (default, Feb 27 2014 20:00:17)
> Spark context available as sc.
> >>> import platform
> >>> sc.parallelize([1]).map(lambda x : platform.system()).collect()
> 14/04/02 18:19:17 INFO SparkContext: Starting job: collect at :1
> 14/04/02 18:19:17 INFO DAGScheduler: Got job 0 (collect at :1) with 1 
> output partitions (allowLocal=false)
> 14/04/02 18:19:17 INFO DAGScheduler: Final stage: Stage 0 (collect at 
> :1)
> 14/04/02 18:19:17 INFO DAGScheduler: Parents of final stage: List()
> 14/04/02 18:19:17 INFO DAGScheduler: Missing parents: List()
> 14/04/02 18:19:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at 
> collect at :1), which has no missing parents
> 14/04/02 18:19:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
> (PythonRDD[1] at collect at :1)
> 14/04/02 18:19:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
> 14/04/02 18:19:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on 
> executor localhost: localhost (PROCESS_LOCAL)
> 14/04/02 18:19:17 INFO TaskSetManager: Serialized task 0.0:0 as 2152 bytes in 
> 12 ms
> 14/04/02 18:19:17 INFO Executor: Running task ID 0
> PySpark worker failed with exception:
> Traceback (most recent call last):
>   File "/usr/local/spark/python/pyspark/worker.py", line 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/usr/local/spark/python/pyspark/serializers.py", line 182, in 
> dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/usr/local/spark/pyth

[jira] [Commented] (SPARK-1394) calling system.platform on worker raises IOError

2014-04-03 Thread Idan Zalzberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959661#comment-13959661
 ] 

Idan Zalzberg commented on SPARK-1394:
--

It seems that the problem originates from pyspark capturing SIGCHLD in daemon.py
as described here: http://stackoverflow.com/a/3837851


> calling system.platform on worker raises IOError
> 
>
> Key: SPARK-1394
> URL: https://issues.apache.org/jira/browse/SPARK-1394
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.0
> Environment: Tested on Ubuntu and Linux, local and remote master, 
> python 2.7.*
>Reporter: Idan Zalzberg
>  Labels: pyspark
>
> A simple program that calls system.platform() on the worker fails most of the 
> time (it sometimes works, but very rarely).
> This is critical since many libraries call that method (e.g. boto).
> Here is the trace of the attempt to call that method:
> $ /usr/local/spark/bin/pyspark
> Python 2.7.3 (default, Feb 27 2014, 20:00:17)
> [GCC 4.6.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> 14/04/02 18:18:37 INFO Utils: Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 14/04/02 18:18:37 WARN Utils: Your hostname, qlika-dev resolves to a loopback 
> address: 127.0.1.1; using 10.33.102.46 instead (on interface eth1)
> 14/04/02 18:18:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 14/04/02 18:18:38 INFO Slf4jLogger: Slf4jLogger started
> 14/04/02 18:18:38 INFO Remoting: Starting remoting
> 14/04/02 18:18:39 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://spark@10.33.102.46:36640]
> 14/04/02 18:18:39 INFO Remoting: Remoting now listens on addresses: 
> [akka.tcp://spark@10.33.102.46:36640]
> 14/04/02 18:18:39 INFO SparkEnv: Registering BlockManagerMaster
> 14/04/02 18:18:39 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-local-20140402181839-919f
> 14/04/02 18:18:39 INFO MemoryStore: MemoryStore started with capacity 294.6 
> MB.
> 14/04/02 18:18:39 INFO ConnectionManager: Bound socket to port 43357 with id 
> = ConnectionManagerId(10.33.102.46,43357)
> 14/04/02 18:18:39 INFO BlockManagerMaster: Trying to register BlockManager
> 14/04/02 18:18:39 INFO BlockManagerMasterActor$BlockManagerInfo: Registering 
> block manager 10.33.102.46:43357 with 294.6 MB RAM
> 14/04/02 18:18:39 INFO BlockManagerMaster: Registered BlockManager
> 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
> 14/04/02 18:18:39 INFO HttpBroadcast: Broadcast server started at 
> http://10.33.102.46:51803
> 14/04/02 18:18:39 INFO SparkEnv: Registering MapOutputTracker
> 14/04/02 18:18:39 INFO HttpFileServer: HTTP File server directory is 
> /tmp/spark-9b38acb0-7b01-4463-b0a6-602bfed05a2b
> 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
> 14/04/02 18:18:40 INFO SparkUI: Started Spark Web UI at 
> http://10.33.102.46:4040
> 14/04/02 18:18:40 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 0.9.0
>       /_/
> Using Python version 2.7.3 (default, Feb 27 2014 20:00:17)
> Spark context available as sc.
> >>> import platform
> >>> sc.parallelize([1]).map(lambda x : platform.system()).collect()
> 14/04/02 18:19:17 INFO SparkContext: Starting job: collect at :1
> 14/04/02 18:19:17 INFO DAGScheduler: Got job 0 (collect at :1) with 1 
> output partitions (allowLocal=false)
> 14/04/02 18:19:17 INFO DAGScheduler: Final stage: Stage 0 (collect at 
> :1)
> 14/04/02 18:19:17 INFO DAGScheduler: Parents of final stage: List()
> 14/04/02 18:19:17 INFO DAGScheduler: Missing parents: List()
> 14/04/02 18:19:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at 
> collect at :1), which has no missing parents
> 14/04/02 18:19:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
> (PythonRDD[1] at collect at :1)
> 14/04/02 18:19:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
> 14/04/02 18:19:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on 
> executor localhost: localhost (PROCESS_LOCAL)
> 14/04/02 18:19:17 INFO TaskSetManager: Serialized task 0.0:0 as 2152 bytes in 
> 12 ms
> 14/04/02 18:19:17 INFO Executor: Running task ID 0
> PySpark worker failed with exception:
> Traceback (most recent call last):
>   File "/usr/local/spark/python/pyspark/worker.py", line 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/usr/local/spark/python/pyspark/serializers.py", line 182, in 
> dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File

[jira] [Resolved] (SPARK-1337) Application web UI garbage collects newest stages instead old ones

2014-04-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1337.


Resolution: Fixed

> Application web UI garbage collects newest stages instead old ones
> --
>
> Key: SPARK-1337
> URL: https://issues.apache.org/jira/browse/SPARK-1337
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.9.0
>Reporter: Dmitry Bugaychenko
>Assignee: Patrick Wendell
> Fix For: 1.0.0, 0.9.2
>
>
> When running a task with many stages (e.g. a streaming task) with 
> spark.ui.retainedStages set to a small value (100), the application UI removes 
> the newest stages, keeping 90 old ones... 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1337) Application web UI garbage collects newest stages instead old ones

2014-04-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1337:
---

Fix Version/s: 0.9.2
   1.0.0

> Application web UI garbage collects newest stages instead old ones
> --
>
> Key: SPARK-1337
> URL: https://issues.apache.org/jira/browse/SPARK-1337
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.9.0
>Reporter: Dmitry Bugaychenko
>Assignee: Patrick Wendell
> Fix For: 1.0.0, 0.9.2
>
>
> When running a task with many stages (e.g. a streaming task) with 
> spark.ui.retainedStages set to a small value (100), the application UI removes 
> the newest stages, keeping 90 old ones... 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1413) Parquet messes up stdout and stdin when used in Spark REPL

2014-04-03 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1413:


 Summary: Parquet messes up stdout and stdin when used in Spark REPL
 Key: SPARK-1413
 URL: https://issues.apache.org/jira/browse/SPARK-1413
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Matei Zaharia
Priority: Critical
 Fix For: 1.0.0


I have a simple Parquet file in "foos.parquet", but after I type this code, it 
freezes the shell, to the point where I can't read or write stuff:

{code}
scala> val qc = new org.apache.spark.sql.SQLContext(sc); import qc._
qc: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c0c8826
import qc._

scala> qc.parquetFile("foos.parquet").saveAsTextFile("bar")
{code}

The job itself completes successfully, and "bar" contains the right text, but I 
can no longer see commands I type in, or further log output.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1413) Parquet messes up stdout and stdin when used in Spark REPL

2014-04-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1413:
-

Description: 
I have a simple Parquet file in "foos.parquet", but after I type this code, it 
freezes the shell, to the point where I can't read or write stuff:

scala> val qc = new org.apache.spark.sql.SQLContext(sc); import qc._
qc: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c0c8826
import qc._

scala> qc.parquetFile("foos.parquet").saveAsTextFile("bar")

The job itself completes successfully, and "bar" contains the right text, but I 
can no longer see commands I type in, or further log output.

  was:
I have a simple Parquet file in "foos.parquet", but after I type this code, it 
freezes the shell, to the point where I can't read or write stuff:

{code}
scala> val qc = new org.apache.spark.sql.SQLContext(sc); import qc._
qc: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c0c8826
import qc._

scala> qc.parquetFile("foos.parquet").saveAsTextFile("bar")
{code}

The job itself completes successfully, and "bar" contains the right text, but I 
can no longer see commands I type in, or further log output.


> Parquet messes up stdout and stdin when used in Spark REPL
> --
>
> Key: SPARK-1413
> URL: https://issues.apache.org/jira/browse/SPARK-1413
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Matei Zaharia
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.0.0
>
>
> I have a simple Parquet file in "foos.parquet", but after I type this code, 
> it freezes the shell, to the point where I can't read or write stuff:
> scala> val qc = new org.apache.spark.sql.SQLContext(sc); import qc._
> qc: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c0c8826
> import qc._
> scala> qc.parquetFile("foos.parquet").saveAsTextFile("bar")
> The job itself completes successfully, and "bar" contains the right text, but 
> I can no longer see commands I type in, or further log output.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1413) Parquet messes up stdout and stdin when used in Spark REPL

2014-04-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1413:
-

Assignee: Michael Armbrust

> Parquet messes up stdout and stdin when used in Spark REPL
> --
>
> Key: SPARK-1413
> URL: https://issues.apache.org/jira/browse/SPARK-1413
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Matei Zaharia
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.0.0
>
>
> I have a simple Parquet file in "foos.parquet", but after I type this code, 
> it freezes the shell, to the point where I can't read or write stuff:
> {code}
> scala> val qc = new org.apache.spark.sql.SQLContext(sc); import qc._
> qc: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c0c8826
> import qc._
> scala> qc.parquetFile("foos.parquet").saveAsTextFile("bar")
> {code}
> The job itself completes successfully, and "bar" contains the right text, but 
> I can no longer see commands I type in, or further log output.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1412) Disable partial aggregation automatically when reduction factor is low

2014-04-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1412:


Priority: Minor  (was: Major)

> Disable partial aggregation automatically when reduction factor is low
> --
>
> Key: SPARK-1412
> URL: https://issues.apache.org/jira/browse/SPARK-1412
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>Priority: Minor
> Fix For: 1.1.0
>
>
> Once partial aggregation has seen enough rows without observing any 
> reduction, the aggregate operator should simply turn off partial 
> aggregation. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size

2014-04-03 Thread Min Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959561#comment-13959561
 ] 

Min Zhou edited comment on SPARK-1391 at 4/4/14 2:12 AM:
-

[~shivaram]

It will take a long time to fundamentally solve the problem: we need a 
ByteBuffer and an OutputStream that support more than 2GB of data, or we need to 
change the data structure inside a block, for example replacing ByteBuffer with 
Array[ByteBuffer].

A short-term approach is to make Kryo serialization the default serializer 
instead of Java serialization, which causes data inflation. I am attaching a 
patch following this approach. Since I didn't test it in a real cluster, I am 
not sending a pull request for now.

Shivaram, please apply this patch and test it for me, thanks!


was (Author: coderplay):
[~shivaram]

It should take a long fundamentally solve the problem,  we need a ByteBuffer 
and an OutputStream that support more than 2GB data. Or change the data 
structure inside a block, for example , Array[ByteBuffer] to replace ByteBuffer.

A short term approach is that take kryo serialization as the default ser 
instead of java ser which causes data inflation.  I am attaching a patch 
following this approach.  Due to I didn't test it in a real cluster, I am not 
sending a pull request currently. 

Shivaram, please apply this patch and test it for me, thanks!

> BlockManager cannot transfer blocks larger than 2G in size
> --
>
> Key: SPARK-1391
> URL: https://issues.apache.org/jira/browse/SPARK-1391
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle
>Affects Versions: 1.0.0
>Reporter: Shivaram Venkataraman
>Assignee: Min Zhou
> Attachments: SPARK-1391.diff
>
>
> If a task tries to remotely access a cached RDD block, I get an exception 
> when the block size is > 2G. The exception is pasted below.
> Memory capacities are huge these days (> 60G), and many workflows depend on 
> having large blocks in memory, so it would be good to fix this bug.
> I don't know if the same thing happens on shuffles if one transfer (from 
> mapper to reducer) is > 2G.
> {noformat}
> 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
> message
> java.lang.ArrayIndexOutOfBoundsException
> at 
> it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
> at 
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
> at 
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
> at 
> org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
> at 
> org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
> at 
> org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
> at 
> org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
> at 
> org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
> at 
> org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at 
> org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> a

[jira] [Updated] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size

2014-04-03 Thread Min Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Zhou updated SPARK-1391:


Attachment: SPARK-1391.diff

[~shivaram]

It will take a long time to fundamentally solve the problem: we need a 
ByteBuffer and an OutputStream that support more than 2GB of data, or we need to 
change the data structure inside a block, for example replacing ByteBuffer with 
Array[ByteBuffer].

A short-term approach is to make Kryo serialization the default serializer 
instead of Java serialization, which causes data inflation. I am attaching a 
patch following this approach. Since I didn't test it in a real cluster, I am 
not sending a pull request for now.

Shivaram, please apply this patch and test it for me, thanks!
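
To make the Array[ByteBuffer] idea concrete, here is a minimal sketch (hypothetical helper, not the attached patch) of serializing a block into several buffers so that no single ByteBuffer has to hold more than 2GB:

{code}
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: accumulate serialized records into a chunk, and start a
// new chunk whenever the current one would grow past the per-chunk limit.
object ChunkedSerializer {
  val MaxChunkBytes: Int = 64 * 1024 * 1024 // 64MB per chunk (illustrative)

  def serialize(records: Iterator[Array[Byte]]): Array[ByteBuffer] = {
    val chunks = ArrayBuffer[ByteBuffer]()
    var current = new ByteArrayOutputStream()
    for (rec <- records) {
      if (current.size() > 0 && current.size() + rec.length > MaxChunkBytes) {
        chunks += ByteBuffer.wrap(current.toByteArray)
        current = new ByteArrayOutputStream()
      }
      current.write(rec)
    }
    if (current.size() > 0) chunks += ByteBuffer.wrap(current.toByteArray)
    chunks.toArray
  }
}
{code}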

> BlockManager cannot transfer blocks larger than 2G in size
> --
>
> Key: SPARK-1391
> URL: https://issues.apache.org/jira/browse/SPARK-1391
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle
>Affects Versions: 1.0.0
>Reporter: Shivaram Venkataraman
>Assignee: Min Zhou
> Attachments: SPARK-1391.diff
>
>
> If a task tries to remotely access a cached RDD block, I get an exception 
> when the block size is > 2G. The exception is pasted below.
> Memory capacities are huge these days (> 60G), and many workflows depend on 
> having large blocks in memory, so it would be good to fix this bug.
> I don't know if the same thing happens on shuffles if one transfer (from 
> mapper to reducer) is > 2G.
> {noformat}
> 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
> message
> java.lang.ArrayIndexOutOfBoundsException
> at 
> it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
> at 
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
> at 
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
> at 
> org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
> at 
> org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
> at 
> org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
> at 
> org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
> at 
> org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
> at 
> org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at 
> org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at 
> org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
> at 
> org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
> at 
> org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
> at 
> org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
> at 
> java.util.concur

[jira] [Created] (SPARK-1412) Disable partial aggregation automatically when reduction factor is low

2014-04-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-1412:
--

 Summary: Disable partial aggregation automatically when reduction 
factor is low
 Key: SPARK-1412
 URL: https://issues.apache.org/jira/browse/SPARK-1412
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Michael Armbrust
 Fix For: 1.1.0


Once partial aggregation has seen enough rows without observing any reduction, 
the aggregate operator should simply turn off partial aggregation.
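
As a rough illustration of that heuristic (hypothetical names, not the actual SQL aggregate operator), the map-side aggregation could check its reduction factor after a fixed number of input rows and fall back to simply passing rows through:

{code}
import scala.collection.mutable

// Hypothetical sketch of a partial (map-side) aggregation that disables itself
// when it is not actually reducing the number of rows it sees.
class AdaptivePartialAggregator[K, V](
    merge: (V, V) => V,
    sampleSize: Int = 100000,
    minReduction: Double = 0.5) {

  private val buffer = mutable.HashMap[K, V]()
  private var inputRows = 0L
  private var partialDisabled = false

  // Returns the row to emit immediately once partial aggregation is disabled.
  def update(key: K, value: V): Option[(K, V)] = {
    if (partialDisabled) return Some(key -> value)
    inputRows += 1
    buffer(key) = buffer.get(key).map(merge(_, value)).getOrElse(value)
    // After enough rows, check the reduction factor: if the hash map holds
    // almost as many groups as rows seen, partial aggregation is not helping.
    if (inputRows == sampleSize && buffer.size > inputRows * minReduction) {
      partialDisabled = true
    }
    None
  }

  // Emit whatever was partially aggregated before (or without) disabling.
  def flush(): Iterator[(K, V)] = buffer.iterator
}
{code}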



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1337) Application web UI garbage collects newest stages instead old ones

2014-04-03 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959547#comment-13959547
 ] 

Patrick Wendell edited comment on SPARK-1337 at 4/4/14 1:52 AM:


https://github.com/apache/spark/pull/320

This is a simple fix if people want to pull this in and build Spark on their 
own.


was (Author: pwendell):
https://github.com/apache/spark/pull/320

> Application web UI garbage collects newest stages instead old ones
> --
>
> Key: SPARK-1337
> URL: https://issues.apache.org/jira/browse/SPARK-1337
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.9.0
>Reporter: Dmitry Bugaychenko
>
> When running a task with many stages (e.g. a streaming task) with 
> spark.ui.retainedStages set to a small value (100), the application UI removes 
> the newest stages, keeping 90 old ones... 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1337) Application web UI garbage collects newest stages instead old ones

2014-04-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reassigned SPARK-1337:
--

Assignee: Patrick Wendell

> Application web UI garbage collects newest stages instead old ones
> --
>
> Key: SPARK-1337
> URL: https://issues.apache.org/jira/browse/SPARK-1337
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.9.0
>Reporter: Dmitry Bugaychenko
>Assignee: Patrick Wendell
>
> When running a task with many stages (e.g. a streaming task) with 
> spark.ui.retainedStages set to a small value (100), the application UI removes 
> the newest stages, keeping 90 old ones... 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1337) Application web UI garbage collects newest stages instead old ones

2014-04-03 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959547#comment-13959547
 ] 

Patrick Wendell commented on SPARK-1337:


https://github.com/apache/spark/pull/320

> Application web UI garbage collects newest stages instead old ones
> --
>
> Key: SPARK-1337
> URL: https://issues.apache.org/jira/browse/SPARK-1337
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.9.0
>Reporter: Dmitry Bugaychenko
>
> When running a task with many stages (e.g. a streaming task) with 
> spark.ui.retainedStages set to a small value (100), the application UI removes 
> the newest stages, keeping 90 old ones... 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1410) Class not found exception with application launched from sbt 0.13.x

2014-04-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959501#comment-13959501
 ] 

Michael Armbrust commented on SPARK-1410:
-

So I haven't tracked down why this is the case, but I think changing build.sbt 
to use scalaVersion := "2.10.4" instead might fix this problem.
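
For reference, that suggestion amounts to a build.sbt fragment along these lines (sketch only; the rest of the build definition is unchanged):

{code}
// build.sbt (sketch): pin the Scala version as suggested above
scalaVersion := "2.10.4"
{code}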


> Class not found exception with application launched from sbt 0.13.x
> ---
>
> Key: SPARK-1410
> URL: https://issues.apache.org/jira/browse/SPARK-1410
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Xiangrui Meng
>
> sbt 0.13.x uses its own class loader, which is not available on the worker side:
> org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: 
> sbt.classpath.ClasspathFilter@47ed40d
> A workaround is to switch to sbt 0.12.4.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.

2014-04-03 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959494#comment-13959494
 ] 

Patrick Wendell commented on SPARK-1411:


Ah, got it SPARK-1337

> When using spark.ui.retainedStages=n only the first n stages are kept, not 
> the most recent.
> ---
>
> Key: SPARK-1411
> URL: https://issues.apache.org/jira/browse/SPARK-1411
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.9.0
> Environment: Ubuntu 12.04 LTS Precise
> 4 Nodes, 
> 16 cores each, 
> 196GB RAM each.
>Reporter: Davis Shepherd
> Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png
>
>
> For any long-running job with many stages, the web UI only shows the first n 
> stages of the job (where spark.ui.retainedStages=n). The most recent stages 
> are immediately dropped and are only visible for a brief time. This renders 
> the UI largely useless after a short time for a long-running non-streaming 
> job. I am unsure whether similar results appear for streaming jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.

2014-04-03 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959490#comment-13959490
 ] 

Patrick Wendell commented on SPARK-1411:


[~dgshep] is this a duplicate? What is the other JIRA?

> When using spark.ui.retainedStages=n only the first n stages are kept, not 
> the most recent.
> ---
>
> Key: SPARK-1411
> URL: https://issues.apache.org/jira/browse/SPARK-1411
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.9.0
> Environment: Ubuntu 12.04 LTS Precise
> 4 Nodes, 
> 16 cores each, 
> 196GB RAM each.
>Reporter: Davis Shepherd
> Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png
>
>
> For any long-running job with many stages, the web UI only shows the first n 
> stages of the job (where spark.ui.retainedStages=n). The most recent stages 
> are immediately dropped and are only visible for a brief time. This renders 
> the UI largely useless after a short time for a long-running non-streaming 
> job. I am unsure whether similar results appear for streaming jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.

2014-04-03 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd closed SPARK-1411.
-

Resolution: Duplicate

> When using spark.ui.retainedStages=n only the first n stages are kept, not 
> the most recent.
> ---
>
> Key: SPARK-1411
> URL: https://issues.apache.org/jira/browse/SPARK-1411
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.9.0
> Environment: Ubuntu 12.04 LTS Precise
> 4 Nodes, 
> 16 cores each, 
> 196GB RAM each.
>Reporter: Davis Shepherd
> Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png
>
>
> For any long-running job with many stages, the web UI only shows the first n 
> stages of the job (where spark.ui.retainedStages=n). The most recent stages 
> are immediately dropped and are only visible for a brief time. This renders 
> the UI largely useless after a short time for a long-running non-streaming 
> job. I am unsure whether similar results appear for streaming jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1296) Make RDDs Covariant

2014-04-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1296:
-

Fix Version/s: (was: 1.0.0)

> Make RDDs Covariant
> ---
>
> Key: SPARK-1296
> URL: https://issues.apache.org/jira/browse/SPARK-1296
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>
> First, what is the problem with RDDs not being covariant?
> {code}
> // Consider a function that takes a Seq of some trait.
> scala> trait A { val a = 1 }
> scala> def f(as: Seq[A]) = as.map(_.a)
> // A list of a concrete version of that trait can be used in this function.
> scala> class B extends A
> scala> f(new B :: Nil)
> res0: Seq[Int] = List(1)
> // Now lets try the same thing with RDDs
> scala> def f(as: org.apache.spark.rdd.RDD[A]) = as.map(_.a)
> scala> val rdd = sc.parallelize(new B :: Nil)
> rdd: org.apache.spark.rdd.RDD[B] = ParallelCollectionRDD[2] at parallelize at 
> :42
> // :(
> scala> f(rdd)
> :45: error: type mismatch;
>  found   : org.apache.spark.rdd.RDD[B]
>  required: org.apache.spark.rdd.RDD[A]
> Note: B <: A, but class RDD is invariant in type T.
> You may wish to define T as +T instead. (SLS 4.5)
>   f(rdd)
> {code}
> h2. Is it possible to make RDDs covariant?
> Probably?  In terms of the public user interface, they are *mostly* 
> covariant. (Internally we use the type parameter T in a lot of mutable state 
> that breaks the covariance contract, but I think with casting we can 
> 'promise' the compiler that we are behaving).  There are also a lot of 
> complications with other types that we return which are invariant.
> h2. What will it take to make RDDs covariant?
> As I mention above, all of our mutable internal state is going to require 
> casting to avoid using T.  This seems to be okay; it makes our life only 
> slightly harder. This extra work is required because we are basically promising 
> the compiler that even if an RDD is implicitly upcast, internally we are 
> keeping all the checkpointed data of the correct type. Since an RDD is 
> immutable, we are okay!
> We also need to modify all the places where we use T in function parameters.  
> So for example:
> {code}
> def ++[U >: T : ClassTag](other: RDD[U]): RDD[U] = 
> this.union(other).asInstanceOf[RDD[U]]
> {code}
> We are now allowing you to append an RDD of a less specific type, and then 
> returning a less specific new RDD.  This I would argue is a good change. We 
> are strictly improving the power of the RDD interface, while maintaining 
> reasonable type semantics.
> h2. So, why wouldn't we do it?
> There are a lot of places where we interact with invariant types.  We return 
> both Maps and Arrays from a lot of public functions.  Arrays are invariant 
> (but if we returned immutable sequences instead we would be good), and 
> Maps are invariant in the Key (once again, immutable sequences of tuples 
> would be great here).
> I don't think this is a deal breaker, and we may even be able to get away 
> with it without changing the return types of these functions.  For example, 
> I think that this should work, though once again it requires making promises to 
> the compiler:
> {code}
>   /**
>* Return an array that contains all of the elements in this RDD.
>*/
>   def collect[U >: T](): Array[U] = {
> val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
> Array.concat(results: _*).asInstanceOf[Array[U]]
>   }
> {code}
> I started working on this 
> [here|https://github.com/marmbrus/spark/tree/coveriantRDD].  Thoughts / 
> suggestions are welcome!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.

2014-04-03 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-1411:
--

Attachment: Screen Shot 2014-04-03 at 5.35.00 PM.png

Notice the large gap in stage IDs and submission times between the top two stages.

> When using spark.ui.retainedStages=n only the first n stages are kept, not 
> the most recent.
> ---
>
> Key: SPARK-1411
> URL: https://issues.apache.org/jira/browse/SPARK-1411
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.9.0
> Environment: Ubuntu 12.04 LTS Precise
> 4 Nodes, 
> 16 cores each, 
> 196GB RAM each.
>Reporter: Davis Shepherd
>Priority: Minor
> Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png
>
>
> For any long-running job with many stages, the web UI only shows the first n 
> stages of the job (where spark.ui.retainedStages=n). The most recent stages 
> are immediately dropped and are only visible for a brief time. This renders 
> the UI largely useless after a short time for a long-running non-streaming 
> job. I am unsure whether similar results appear for streaming jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.

2014-04-03 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-1411:
--

Priority: Major  (was: Minor)

> When using spark.ui.retainedStages=n only the first n stages are kept, not 
> the most recent.
> ---
>
> Key: SPARK-1411
> URL: https://issues.apache.org/jira/browse/SPARK-1411
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.9.0
> Environment: Ubuntu 12.04 LTS Precise
> 4 Nodes, 
> 16 cores each, 
> 196GB RAM each.
>Reporter: Davis Shepherd
> Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png
>
>
> For any long-running job with many stages, the web UI only shows the first n 
> stages of the job (where spark.ui.retainedStages=n). The most recent stages 
> are immediately dropped and are only visible for a brief time. This renders 
> the UI largely useless after a short time for a long-running non-streaming 
> job. I am unsure whether similar results appear for streaming jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.

2014-04-03 Thread Davis Shepherd (JIRA)
Davis Shepherd created SPARK-1411:
-

 Summary: When using spark.ui.retainedStages=n only the first n 
stages are kept, not the most recent.
 Key: SPARK-1411
 URL: https://issues.apache.org/jira/browse/SPARK-1411
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 0.9.0
 Environment: Ubuntu 12.04 LTS Precise
4 Nodes, 
16 cores each, 
196GB RAM each.
Reporter: Davis Shepherd
Priority: Minor


For any long-running job with many stages, the web UI only shows the first n 
stages of the job (where spark.ui.retainedStages=n). The most recent stages are 
immediately dropped and are only visible for a brief time. This renders the UI 
largely useless after a short time for a long-running non-streaming job. I am 
unsure whether similar results appear for streaming jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1348) Spark UI's do not bind to localhost interface anymore

2014-04-03 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959466#comment-13959466
 ] 

Kan Zhang commented on SPARK-1348:
--

JettyUtils.startJettyServer() used to bind to all interfaces; however, 
SPARK-1060 changed that to bind only to a specific interface (preferably a 
non-loopback address).

If you want to revert to previous behavior, here's the patch.

https://github.com/apache/spark/pull/318
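
For illustration only (plain Jetty, not the actual JettyUtils code, and with a made-up address), the difference between the two behaviours looks roughly like this:

{code}
import java.net.InetSocketAddress
import org.eclipse.jetty.server.Server

// SPARK-1060 behaviour: bind only to one specific (non-loopback) interface.
val boundToOne = new Server(new InetSocketAddress("192.168.1.10", 4040))

// Previous behaviour: bind to all interfaces, so localhost works too.
val boundToAll = new Server(new InetSocketAddress("0.0.0.0", 4040))
{code}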

> Spark UI's do not bind to localhost interface anymore
> -
>
> Key: SPARK-1348
> URL: https://issues.apache.org/jira/browse/SPARK-1348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When running the shell or standalone master, it no longer binds to localhost. 
> I think this may have been caused by the security patch. We should figure out 
> what caused it and revert to the old behavior. Maybe we want to always bind 
> to `localhost` or just to bind to all interfaces.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (SPARK-1398) Remove FindBugs jsr305 dependency

2014-04-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-1398:



> Remove FindBugs jsr305 dependency
> -
>
> Key: SPARK-1398
> URL: https://issues.apache.org/jira/browse/SPARK-1398
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Minor
>
> We're not making much use of FindBugs at this point, but findbugs-2.0.x is a 
> drop-in replacement for 1.3.9 and does offer significant improvements 
> (http://findbugs.sourceforge.net/findbugs2.html), so it's probably where we 
> want to be for Spark 1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1134) ipython won't run standalone python script

2014-04-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1134.
--

Resolution: Fixed

> ipython won't run standalone python script
> --
>
> Key: SPARK-1134
> URL: https://issues.apache.org/jira/browse/SPARK-1134
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.0, 0.9.1
>Reporter: Diana Carroll
>Assignee: Diana Carroll
>  Labels: pyspark
> Fix For: 1.0.0, 0.9.2
>
>
> Using Spark 0.9.0, python 2.6.6, and ipython 1.1.0.
> The problem: If I want to run a python script as a standalone app, the docs 
> say I should execute the command "pyspark myscript.py".  This works as long 
> as IPYTHON=0.  But if IPYTHON=1 this doesn't work.
> This problem arose for me because I tried to save myself typing by setting 
> IPYTHON=1 in my shell profile script. Which then meant I was unable to 
> execute pyspark standalone scripts.
> My analysis: 
> in the pyspark script, command line arguments are simply ignored if ipython 
> is used:
> {code}if [[ "$IPYTHON" = "1" ]] ; then
>   exec ipython $IPYTHON_OPTS
> else
>   exec "$PYSPARK_PYTHON" "$@"
> fi{code}
> I thought I could get around this by changing the script to pass $@.  
> However, this doesn't work: doing so results in an error saying multiple 
> spark contexts can't be run at once.
> This is because of a feature?/bug? of ipython related to the PYTHONSTARTUP 
> environment variable.  The pyspark script sets this variable to point to the 
> python/shell.py script, which initializes the Spark Context.  In regular 
> python, the PYTHONSTARTUP script runs ONLY if python is invoked in 
> interactive mode; if run with a script, it ignores the variable.  iPython 
> runs that script every time, regardless, which means it will always execute 
> Spark's shell.py script to initialize the spark context even when it was 
> invoked with a script.
> Proposed solution:
> short term: add this information to the Spark docs regarding iPython.  
> Something like "Note, iPython can only be used interactively.  Use regular 
> Python to execute pyspark script files."
> long term: change the pyspark script to tell if arguments are passed in; if 
> so, just call python instead of pyspark, or don't set the PYTHONSTARTUP 
> variable?  Or maybe fix shell.py to detect if it's being invoked 
> non-interactively and not initialize sc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1134) ipython won't run standalone python script

2014-04-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1134:
-

Fix Version/s: 0.9.2
   1.0.0

> ipython won't run standalone python script
> --
>
> Key: SPARK-1134
> URL: https://issues.apache.org/jira/browse/SPARK-1134
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.0, 0.9.1
>Reporter: Diana Carroll
>Assignee: Diana Carroll
>  Labels: pyspark
> Fix For: 1.0.0, 0.9.2
>
>
> Using Spark 0.9.0, python 2.6.6, and ipython 1.1.0.
> The problem: If I want to run a python script as a standalone app, the docs 
> say I should execute the command "pyspark myscript.py".  This works as long 
> as IPYTHON=0.  But if IPYTHON=1 this doesn't work.
> This problem arose for me because I tried to save myself typing by setting 
> IPYTHON=1 in my shell profile script. Which then meant I was unable to 
> execute pyspark standalone scripts.
> My analysis: 
> in the pyspark script, command line arguments are simply ignored if ipython 
> is used:
> {code}if [[ "$IPYTHON" = "1" ]] ; then
>   exec ipython $IPYTHON_OPTS
> else
>   exec "$PYSPARK_PYTHON" "$@"
> fi{code}
> I thought I could get around this by changing the script to pass $@.  
> However, this doesn't work: doing so results in an error saying multiple 
> spark contexts can't be run at once.
> This is because of a feature?/bug? of ipython related to the PYTHONSTARTUP 
> environment variable.  The pyspark script sets this variable to point to the 
> python/shell.py script, which initializes the Spark Context.  In regular 
> python, the PYTHONSTARTUP script runs ONLY if python is invoked in 
> interactive mode; if run with a script, it ignores the variable.  iPython 
> runs that script every time, regardless, which means it will always execute 
> Spark's shell.py script to initialize the spark context even when it was 
> invoked with a script.
> Proposed solution:
> short term: add this information to the Spark docs regarding iPython.  
> Something like "Note, iPython can only be used interactively.  Use regular 
> Python to execute pyspark script files."
> long term: change the pyspark script to tell if arguments are passed in; if 
> so, just call python instead of pyspark, or don't set the PYTHONSTARTUP 
> variable?  Or maybe fix shell.py to detect if it's being invoked 
> non-interactively and not initialize sc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1134) ipython won't run standalone python script

2014-04-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1134:
-

Assignee: Diana Carroll

> ipython won't run standalone python script
> --
>
> Key: SPARK-1134
> URL: https://issues.apache.org/jira/browse/SPARK-1134
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.0, 0.9.1
>Reporter: Diana Carroll
>Assignee: Diana Carroll
>  Labels: pyspark
>
> Using Spark 0.9.0, python 2.6.6, and ipython 1.1.0.
> The problem: If I want to run a python script as a standalone app, the docs 
> say I should execute the command "pyspark myscript.py".  This works as long 
> as IPYTHON=0.  But if IPYTHON=1 this doesn't work.
> This problem arose for me because I tried to save myself typing by setting 
> IPYTHON=1 in my shell profile script. Which then meant I was unable to 
> execute pyspark standalone scripts.
> My analysis: 
> in the pyspark script, command line arguments are simply ignored if ipython 
> is used:
> {code}if [[ "$IPYTHON" = "1" ]] ; then
>   exec ipython $IPYTHON_OPTS
> else
>   exec "$PYSPARK_PYTHON" "$@"
> fi{code}
> I thought I could get around this by changing the script to pass $@.  
> However, this doesn't work: doing so results in an error saying multiple 
> spark contexts can't be run at once.
> This is because of a feature?/bug? of ipython related to the PYTHONSTARTUP 
> environment variable.  The pyspark script sets this variable to point to the 
> python/shell.py script, which initializes the Spark Context.  In regular 
> python, the PYTHONSTARTUP script runs ONLY if python is invoked in 
> interactive mode; if run with a script, it ignores the variable.  iPython 
> runs that script every time, regardless, which means it will always execute 
> Spark's shell.py script to initialize the spark context even when it was 
> invoked with a script.
> Proposed solution:
> short term: add this information to the Spark docs regarding iPython.  
> Something like "Note, iPython can only be used interactively.  Use regular 
> Python to execute pyspark script files."
> long term: change the pyspark script to tell if arguments are passed in; if 
> so, just call python instead of pyspark, or don't set the PYTHONSTARTUP 
> variable?  Or maybe fix shell.py to detect if it's being invoked 
> non-interactively and not initialize sc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1134) ipython won't run standalone python script

2014-04-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1134:
-

Affects Version/s: 0.9.1

> ipython won't run standalone python script
> --
>
> Key: SPARK-1134
> URL: https://issues.apache.org/jira/browse/SPARK-1134
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.0, 0.9.1
>Reporter: Diana Carroll
>Assignee: Diana Carroll
>  Labels: pyspark
>
> Using Spark 0.9.0, python 2.6.6, and ipython 1.1.0.
> The problem: If I want to run a python script as a standalone app, the docs 
> say I should execute the command "pyspark myscript.py".  This works as long 
> as IPYTHON=0.  But if IPYTHON=1 this doesn't work.
> This problem arose for me because I tried to save myself typing by setting 
> IPYTHON=1 in my shell profile script. Which then meant I was unable to 
> execute pyspark standalone scripts.
> My analysis: 
> in the pyspark script, command line arguments are simply ignored if ipython 
> is used:
> {code}if [[ "$IPYTHON" = "1" ]] ; then
>   exec ipython $IPYTHON_OPTS
> else
>   exec "$PYSPARK_PYTHON" "$@"
> fi{code}
> I thought I could get around this by changing the script to pass $@.  
> However, this doesn't work: doing so results in an error saying multiple 
> spark contexts can't be run at once.
> This is because of a feature?/bug? of ipython related to the PYTHONSTARTUP 
> environment variable.  The pyspark script sets this variable to point to the 
> python/shell.py script, which initializes the Spark Context.  In regular 
> python, the PYTHONSTARTUP script runs ONLY if python is invoked in 
> interactive mode; if run with a script, it ignores the variable.  iPython 
> runs that script every time, regardless, which means it will always execute 
> Spark's shell.py script to initialize the spark context even when it was 
> invoked with a script.
> Proposed solution:
> short term: add this information to the Spark docs regarding iPython.  
> Something like "Note, iPython can only be used interactively.  Use regular 
> Python to execute pyspark script files."
> long term: change the pyspark script to tell if arguments are passed in; if 
> so, just call python instead of pyspark, or don't set the PYTHONSTARTUP 
> variable?  Or maybe fix shell.py to detect if it's being invoked 
> non-interactively and not initialize sc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1333) Java API for running SQL queries

2014-04-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1333.
--

Resolution: Fixed

> Java API for running SQL queries
> 
>
> Key: SPARK-1333
> URL: https://issues.apache.org/jira/browse/SPARK-1333
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1162) Add top() and takeOrdered() to PySpark

2014-04-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1162:
-

Assignee: Prashant Sharma  (was: prashant)

> Add top() and takeOrdered() to PySpark
> --
>
> Key: SPARK-1162
> URL: https://issues.apache.org/jira/browse/SPARK-1162
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Matei Zaharia
>Assignee: Prashant Sharma
>Priority: Minor
> Fix For: 1.0.0, 0.9.2
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1162) Add top() and takeOrdered() to PySpark

2014-04-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1162:
-

Fix Version/s: 0.9.2
   1.0.0

> Add top() and takeOrdered() to PySpark
> --
>
> Key: SPARK-1162
> URL: https://issues.apache.org/jira/browse/SPARK-1162
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Matei Zaharia
>Assignee: Prashant Sharma
>Priority: Minor
> Fix For: 1.0.0, 0.9.2
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1162) Add top() and takeOrdered() to PySpark

2014-04-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1162.
--

Resolution: Fixed

> Add top() and takeOrdered() to PySpark
> --
>
> Key: SPARK-1162
> URL: https://issues.apache.org/jira/browse/SPARK-1162
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Matei Zaharia
>Assignee: Prashant Sharma
>Priority: Minor
> Fix For: 1.0.0, 0.9.2
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1410) Class not found exception with application launched from sbt 0.13.x

2014-04-03 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959363#comment-13959363
 ] 

Xiangrui Meng commented on SPARK-1410:
--

The code is available at https://github.com/mengxr/mllib-grid-search

If sbt.version is changed to 0.13.0 or 0.13.1, `sbt/sbt "run ..."` throws the 
following exception:

~~~
[error] (run-main-0) org.apache.spark.SparkException: Job aborted: 
ClassNotFound with classloader: sbt.classpath.ClasspathFilter@45307fb
org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: 
sbt.classpath.ClasspathFilter@45307fb
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1010)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1008)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1008)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:595)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:595)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:595)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:146)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
~~~

> Class not found exception with application launched from sbt 0.13.x
> ---
>
> Key: SPARK-1410
> URL: https://issues.apache.org/jira/browse/SPARK-1410
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Xiangrui Meng
>
> sbt 0.13.x uses its own class loader, but it is not available on the worker side:
> org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: 
> sbt.classpath.ClasspathFilter@47ed40d
> A workaround is to switch to sbt 0.12.4.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1367) NPE when joining Parquet Relations

2014-04-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-1367.
-

Resolution: Fixed

> NPE when joining Parquet Relations
> --
>
> Key: SPARK-1367
> URL: https://issues.apache.org/jira/browse/SPARK-1367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Andre Schumacher
>Priority: Blocker
> Fix For: 1.0.0
>
>
> {code}
>   test("self-join parquet files") {
> val x = ParquetTestData.testData.subquery('x)
> val y = ParquetTestData.testData.newInstance.subquery('y)
> val query = x.join(y).where("x.myint".attr === "y.myint".attr)
> query.collect()
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1360) Add Timestamp Support

2014-04-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-1360.
-

Resolution: Fixed

> Add Timestamp Support
> -
>
> Key: SPARK-1360
> URL: https://issues.apache.org/jira/browse/SPARK-1360
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Add Timestamp Support for Catalyst/SQLParser/HiveQl



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1410) Class not found exception with application launched from sbt 0.13.x

2014-04-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959265#comment-13959265
 ] 

Michael Armbrust commented on SPARK-1410:
-

I don't understand how the classloader is ending up being required on the 
worker side.  Can you share a command to reproduce this?

> Class not found exception with application launched from sbt 0.13.x
> ---
>
> Key: SPARK-1410
> URL: https://issues.apache.org/jira/browse/SPARK-1410
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Xiangrui Meng
>
> sbt 0.13.x uses its own class loader, but it is not available on the worker side:
> org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: 
> sbt.classpath.ClasspathFilter@47ed40d
> A workaround is to switch to sbt 0.12.4.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1398) Remove FindBugs jsr305 dependency

2014-04-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1398.


Resolution: Fixed

> Remove FindBugs jsr305 dependency
> -
>
> Key: SPARK-1398
> URL: https://issues.apache.org/jira/browse/SPARK-1398
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Minor
>
> We're not making much use of FindBugs at this point, but findbugs-2.0.x is a 
> drop-in replacement for 1.3.9 and does offer significant improvements 
> (http://findbugs.sourceforge.net/findbugs2.html), so it's probably where we 
> want to be for Spark 1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1410) Class not found exception with application launched from sbt 0.13.x

2014-04-03 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1410:


 Summary: Class not found exception with application launched from 
sbt 0.13.x
 Key: SPARK-1410
 URL: https://issues.apache.org/jira/browse/SPARK-1410
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Xiangrui Meng


sbt 0.13.x uses its own class loader, but it is not available on the worker side:

org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: 
sbt.classpath.ClasspathFilter@47ed40d

A workaround is to switch to sbt 0.12.4.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size

2014-04-03 Thread Min Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959127#comment-13959127
 ] 

Min Zhou commented on SPARK-1391:
-

Yes. The communication layer uses an array of ByteBuffers to transfer messages, but 
the receiver converts them back into BlockMessages, where each block corresponds 
to one ByteBuffer, which can't be larger than 2GB. Those BlockMessages are then 
consumed by connection callers in places we can't control. 

One approach is to write a CompositeByteBuffer to overcome the 2GB limitation, but 
that still can't get around other limitations of the ByteBuffer interface, such as 
ByteBuffer.position(), ByteBuffer.capacity(), and ByteBuffer.remaining(), whose 
return values are still ints. 
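
For illustration, here is a minimal sketch (hypothetical, not Spark code) of the kind of abstraction being discussed: a logical buffer backed by several ByteBuffer chunks that exposes Long-based capacity and offsets, so a single block can exceed 2GB.

{code}
import java.nio.ByteBuffer

// Hypothetical sketch of a "large" buffer made of several ByteBuffer chunks.
// Capacity and offsets are Longs, unlike ByteBuffer's Int-based API.
class LargeByteBuffer(chunks: Array[ByteBuffer]) {
  // total capacity as a Long, unlike ByteBuffer.capacity(): Int
  def capacity: Long = chunks.map(_.capacity.toLong).sum

  // read one byte at a logical Long offset by locating the owning chunk
  def get(pos: Long): Byte = {
    require(pos >= 0 && pos < capacity, s"offset $pos out of range")
    var remaining = pos
    var i = 0
    while (remaining >= chunks(i).capacity) {
      remaining -= chunks(i).capacity
      i += 1
    }
    chunks(i).get(remaining.toInt)
  }
}
{code}

Even with something like this in place, the block manager's read/write and transfer paths would still need to move off the plain ByteBuffer interface, which is the larger part of the change.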

> BlockManager cannot transfer blocks larger than 2G in size
> --
>
> Key: SPARK-1391
> URL: https://issues.apache.org/jira/browse/SPARK-1391
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle
>Affects Versions: 1.0.0
>Reporter: Shivaram Venkataraman
>Assignee: Min Zhou
>
> If a task tries to remotely access a cached RDD block, I get an exception 
> when the block size is > 2G. The exception is pasted below.
> Memory capacities are huge these days (> 60G), and many workflows depend on 
> having large blocks in memory, so it would be good to fix this bug.
> I don't know if the same thing happens on shuffles if one transfer (from 
> mapper to reducer) is > 2G.
> {noformat}
> 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
> message
> java.lang.ArrayIndexOutOfBoundsException
> at 
> it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
> at 
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
> at 
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
> at 
> org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
> at 
> org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
> at 
> org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
> at 
> org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
> at 
> org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
> at 
> org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at 
> org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at 
> org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
> at 
> org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
> at 
> org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
> at 
> org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
> at 
> java.util.concurrent.ThreadPool

[jira] [Created] (SPARK-1409) Flaky Test: "actor input stream" test in org.apache.spark.streaming.InputStreamsSuite

2014-04-03 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-1409:
---

 Summary: Flaky Test: "actor input stream" test in 
org.apache.spark.streaming.InputStreamsSuite
 Key: SPARK-1409
 URL: https://issues.apache.org/jira/browse/SPARK-1409
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Michael Armbrust
Assignee: Tathagata Das


Here are just a few cases:
https://travis-ci.org/apache/spark/jobs/22151827
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13709/



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1403) java.lang.ClassNotFoundException - spark on mesos

2014-04-03 Thread Bharath Bhushan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959088#comment-13959088
 ] 

Bharath Bhushan commented on SPARK-1403:


In 0.9.0 I see that the classLoader is 
org.apache.spark.repl.ExecutorClassLoader (in SparkEnv.scala). 
In 1.0.0 I see that the classLoader is null. My guess is that it should not 
be null.

>  java.lang.ClassNotFoundException - spark on mesos
> --
>
> Key: SPARK-1403
> URL: https://issues.apache.org/jira/browse/SPARK-1403
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: ubuntu 12.04 on vagrant
>Reporter: Bharath Bhushan
>
> I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark 
> executor on the mesos slave throws a java.lang.ClassNotFoundException for 
> org.apache.spark.serializer.JavaSerializer.
> The lengthy discussion is here: 
> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size

2014-04-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959067#comment-13959067
 ] 

Reynold Xin commented on SPARK-1391:


I took a quick look into this. We are using a bunch of ByteBuffers throughout 
the block manager and the communication layer. We need to replace 
ByteBuffer with a different interface that can handle larger arrays. 

It is fortunate that the underlying communication in Connection.scala actually 
breaks messages down into smaller chunks, so that's one less place to change. 

> BlockManager cannot transfer blocks larger than 2G in size
> --
>
> Key: SPARK-1391
> URL: https://issues.apache.org/jira/browse/SPARK-1391
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle
>Affects Versions: 1.0.0
>Reporter: Shivaram Venkataraman
>Assignee: Min Zhou
>
> If a task tries to remotely access a cached RDD block, I get an exception 
> when the block size is > 2G. The exception is pasted below.
> Memory capacities are huge these days (> 60G), and many workflows depend on 
> having large blocks in memory, so it would be good to fix this bug.
> I don't know if the same thing happens on shuffles if one transfer (from 
> mapper to reducer) is > 2G.
> {noformat}
> 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
> message
> java.lang.ArrayIndexOutOfBoundsException
> at 
> it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
> at 
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
> at 
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
> at 
> org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
> at 
> org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
> at 
> org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
> at 
> org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
> at 
> org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
> at 
> org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at 
> org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at 
> org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
> at 
> org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
> at 
> org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
> at 
> org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread

[jira] [Commented] (SPARK-1398) Remove FindBugs jsr305 dependency

2014-04-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959064#comment-13959064
 ] 

Mark Hamstra commented on SPARK-1398:
-

UPDATE: We can actually get away with not including the jsr305 jar at all, so 
I've changed this JIRA and associated PR to do just that instead of bumping the 
FindBugs version.

> Remove FindBugs jsr305 dependency
> -
>
> Key: SPARK-1398
> URL: https://issues.apache.org/jira/browse/SPARK-1398
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Minor
>
> We're not making much use of FindBugs at this point, but findbugs-2.0.x is a 
> drop-in replacement for 1.3.9 and does offer significant improvements 
> (http://findbugs.sourceforge.net/findbugs2.html), so it's probably where we 
> want to be for Spark 1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1398) Remove FindBugs jsr305 dependency

2014-04-03 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-1398:


Summary: Remove FindBugs jsr305 dependency  (was: FindBugs 1 --> FindBugs 2 
)

> Remove FindBugs jsr305 dependency
> -
>
> Key: SPARK-1398
> URL: https://issues.apache.org/jira/browse/SPARK-1398
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Minor
>
> We're not making much use of FindBugs at this point, but findbugs-2.0.x is a 
> drop-in replacement for 1.3.9 and does offer significant improvements 
> (http://findbugs.sourceforge.net/findbugs2.html), so it's probably where we 
> want to be for Spark 1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size

2014-04-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1391:
---

Description: 
If a task tries to remotely access a cached RDD block, I get an exception when 
the block size is > 2G. The exception is pasted below.

Memory capacities are huge these days (> 60G), and many workflows depend on 
having large blocks in memory, so it would be good to fix this bug.

I don't know if the same thing happens on shuffles if one transfer (from mapper 
to reducer) is > 2G.

{code}
14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
message
java.lang.ArrayIndexOutOfBoundsException
at 
it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
at 
it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
at 
it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
at 
org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
at 
org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
at 
org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
at 
org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
at 
org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
at 
org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
at 
org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at 
org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
at 
org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 
org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
at 
org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{code}


  was:
If a task tries to remotely access a cached RDD block, I get an exception when 
the block size is > 2G. The exception is pasted below.

Memory capacities are huge these days (> 60G), and many workflows depend on 
having large blocks in memory, so it would be good to fix this bug.

I don't know if the same thing happens on shuffles if one transfer (from mapper 
to reducer) is > 2G.

14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
message
java.lang.ArrayIndexOutOfBoundsException
at 
it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
at 
it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
at 
it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.set

[jira] [Updated] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size

2014-04-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1391:
---

Description: 
If a task tries to remotely access a cached RDD block, I get an exception when 
the block size is > 2G. The exception is pasted below.

Memory capacities are huge these days (> 60G), and many workflows depend on 
having large blocks in memory, so it would be good to fix this bug.

I don't know if the same thing happens on shuffles if one transfer (from mapper 
to reducer) is > 2G.

{noformat}
14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
message
java.lang.ArrayIndexOutOfBoundsException
at 
it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
at 
it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
at 
it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
at 
org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
at 
org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
at 
org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
at 
org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
at 
org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
at 
org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
at 
org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at 
org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
at 
org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 
org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
at 
org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{noformat}


  was:
If a task tries to remotely access a cached RDD block, I get an exception when 
the block size is > 2G. The exception is pasted below.

Memory capacities are huge these days (> 60G), and many workflows depend on 
having large blocks in memory, so it would be good to fix this bug.

I don't know if the same thing happens on shuffles if one transfer (from mapper 
to reducer) is > 2G.

{code}
14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
message
java.lang.ArrayIndexOutOfBoundsException
at 
it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
at 
it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
at 
it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at 
java.io.ObjectOutputStream$BlockDataO

[jira] [Updated] (SPARK-1402) 3 more compression algorithms for in-memory columnar storage

2014-04-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1402:


Affects Version/s: (was: 1.0.0)

> 3 more compression algorithms for in-memory columnar storage
> 
>
> Key: SPARK-1402
> URL: https://issues.apache.org/jira/browse/SPARK-1402
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>  Labels: compression
>
> This is a followup of [SPARK-1373: Compression for In-Memory Columnar 
> storage|https://issues.apache.org/jira/browse/SPARK-1373].
> 3 more compression algorithms for in-memory columnar storage should be 
> implemented:
> * {{BooleanBitSet}}
> * {{IntDelta}}
> * {{LongDelta}}
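
For context, the delta idea behind {{IntDelta}} is to store the first value as-is and then only the difference from the previous value, which stays small (and compresses well) for sorted or slowly changing columns. A toy sketch, not the actual columnar codec:

{code}
// Illustrative sketch of the delta idea only; the actual codec works against
// the in-memory columnar byte buffers and is more involved.
object IntDeltaSketch {
  // e.g. Array(3, 5, 6, 10) => Array(3, 2, 1, 4)
  def encode(values: Array[Int]): Array[Int] =
    if (values.isEmpty) Array.empty[Int]
    else values.head +: values.zip(values.tail).map { case (prev, next) => next - prev }

  // e.g. Array(3, 2, 1, 4) => Array(3, 5, 6, 10)
  def decode(deltas: Array[Int]): Array[Int] =
    if (deltas.isEmpty) Array.empty[Int]
    else deltas.tail.scanLeft(deltas.head)(_ + _).toArray
}
{code}

{{LongDelta}} is the same idea for Long columns, and {{BooleanBitSet}} packs boolean values into individual bits.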



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1402) 3 more compression algorithms for in-memory columnar storage

2014-04-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1402:


Fix Version/s: 1.0.0

> 3 more compression algorithms for in-memory columnar storage
> 
>
> Key: SPARK-1402
> URL: https://issues.apache.org/jira/browse/SPARK-1402
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>  Labels: compression
> Fix For: 1.0.0
>
>
> This is a followup of [SPARK-1373: Compression for In-Memory Columnar 
> storage|https://issues.apache.org/jira/browse/SPARK-1373].
> 3 more compression algorithms for in-memory columnar storage should be 
> implemented:
> * {{BooleanBitSet}}
> * {{IntDelta}}
> * {{LongDelta}}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1408) Modify Spark on Yarn to point to the history server when app finishes

2014-04-03 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-1408:


 Summary: Modify Spark on Yarn to point to the history server when 
app finishes
 Key: SPARK-1408
 URL: https://issues.apache.org/jira/browse/SPARK-1408
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves


Once the spark history server is implemented in SPARK-1276, we should modify 
spark on yarn to point to the history server url from the Yarn ResourceManager 
UI when the application finishes. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1408) Modify Spark on Yarn to point to the history server when app finishes

2014-04-03 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-1408:


Assignee: Thomas Graves

> Modify Spark on Yarn to point to the history server when app finishes
> -
>
> Key: SPARK-1408
> URL: https://issues.apache.org/jira/browse/SPARK-1408
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> Once the spark history server is implemented in SPARK-1276, we should modify 
> spark on yarn to point to the history server url from the Yarn 
> ResourceManager UI when the application finishes. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1407) EventLogging to HDFS doesn't work properly on yarn

2014-04-03 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-1407:


 Summary: EventLogging to HDFS doesn't work properly on yarn
 Key: SPARK-1407
 URL: https://issues.apache.org/jira/browse/SPARK-1407
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Thomas Graves
Priority: Blocker


When running Spark on YARN and accessing an HDFS file (like in the 
SparkHdfsLR example) with event logging configured to write logs to 
HDFS, it throws an exception at the end of the application. 

SPARK_JAVA_OPTS=-Dspark.eventLog.enabled=true 
-Dspark.eventLog.dir=hdfs:///history/spark/

14/04/03 13:41:31 INFO yarn.ApplicationMaster$$anon$1: Invoking sc stop from 
shutdown hook
Exception in thread "Thread-41" java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:398)
at 
org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1465)
at 
org.apache.hadoop.hdfs.DFSOutputStream.sync(DFSOutputStream.java:1450)
at 
org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:116)
at 
org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:137)
at 
org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:137)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.util.FileLogger.flush(FileLogger.scala:137)
at 
org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:69)
at 
org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:101)
at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$13.apply(SparkListenerBus.scala:67)
at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$13.apply(SparkListenerBus.scala:67)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:67)
at 
org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:31)
at 
org.apache.spark.scheduler.LiveListenerBus.post(LiveListenerBus.scala:78)
at 
org.apache.spark.SparkContext.postApplicationEnd(SparkContext.scala:1081)
at org.apache.spark.SparkContext.stop(SparkContext.scala:828)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:460)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1406) PMML model evaluation support via MLib

2014-04-03 Thread Thomas Darimont (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958776#comment-13958776
 ] 

Thomas Darimont edited comment on SPARK-1406 at 4/3/14 12:59 PM:
-

Hi Sean,

thanks for responding so quickly :)

Sure you can do that of course (that's what I currently do), but there are IMHO many 
interesting use cases that would benefit from having direct PMML support, e.g.:
1) Initialize an algorithm with a set of prepared parameters by loading a PMML 
file and evaluate the algorithm with Spark's infrastructure.
2) Abstract the configuration or construction of an algorithm via some kind of 
producer that gets the PMML model as an input and returns a fully configured 
Spark representation of the algorithm which is encoded in the PMML.
3) Support hot-replacing an algorithm (configuration) at runtime by just 
providing an updated PMML model to the Spark infrastructure.
4) Use the transformation / normalization for input values or make use of the 
dynamic model selection support built into PMML to select the appropriate 
algorithm (configuration) based on the input.

You could even use JPMML to get the PMML object model as a starting point.


was (Author: thomasd):
Hi Sean,

thanks for responding so quickly :)

Sure you can do that of course (that's what I currently do), but there are IMHO many 
interesting use cases that would benefit from having direct PMML support, e.g.:
1) Initialize an algorithm with a set of prepared parameters by loading a PMML 
file and evaluate the algorithm with Spark's infrastructure.
2) Abstract the configuration or construction of an algorithm via some kind of 
producer that gets the PMML model as an input and returns a fully configured 
Spark representation of the algorithm which is encoded in the PMML.
3) Support hot-replacing an algorithm (configuration) at runtime by just 
providing an updated PMML model to the Spark infrastructure.
4) Use the transformation / normalization or even dynamic model selection 
support built into PMML to select the appropriate algorithm (configuration) 
based on the input.

You could even use JPMML to get the PMML object model as a starting point.

> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>
> It would be useful if Spark provided support for the evaluation of PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would make it possible to use analytical models that were created with a 
> statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLlib), which 
> would perform the actual model evaluation for a given input tuple. The PMML 
> model would then just contain the "parameterization" of an analytical model.
> Other projects like JPMML-Evaluator do a similar thing.
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1406) PMML model evaluation support via MLib

2014-04-03 Thread Thomas Darimont (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958776#comment-13958776
 ] 

Thomas Darimont edited comment on SPARK-1406 at 4/3/14 12:49 PM:
-

Hi Sean,

thanks for responding so quickly :)

Sure you can do that of course (that's what I currently do), but there are IMHO many 
interesting use cases that would benefit from having direct PMML support, e.g.:
1) Initialize an algorithm with a set of prepared parameters by loading a PMML 
file and evaluate the algorithm with Spark's infrastructure.
2) Abstract the configuration or construction of an algorithm via some kind of 
producer that gets the PMML model as an input and returns a fully configured 
Spark representation of the algorithm which is encoded in the PMML.
3) Support hot-replacing an algorithm (configuration) at runtime by just 
providing an updated PMML model to the Spark infrastructure.
4) Use the transformation / normalization or even dynamic model selection 
support built into PMML to select the appropriate algorithm (configuration) 
based on the input.

You could even use JPMML to get the PMML object model as a starting point.


was (Author: thomasd):
Hi Sean,

thanks for responding so quickly :)

Sure you can do that of course (that's what I currently do), but there are IMHO many 
interesting use cases that would benefit from having direct PMML support, e.g.:
1) Initialize an algorithm with a set of prepared parameters by loading a PMML 
file and evaluate the algorithm with Spark's infrastructure.
2) Abstract the configuration or construction of an algorithm via some kind of 
producer that gets the PMML model as an input and returns a fully configured 
Spark representation of the algorithm which is encoded in the PMML.
3) Support hot-replacing an algorithm (configuration) at runtime by just 
providing an updated PMML model to the Spark infrastructure.

You could even use JPMML to get the PMML object model as a starting point.

> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>
> It would be useful if Spark provided support for the evaluation of PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would make it possible to use analytical models that were created with a 
> statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLlib), which 
> would perform the actual model evaluation for a given input tuple. The PMML 
> model would then just contain the "parameterization" of an analytical model.
> Other projects like JPMML-Evaluator do a similar thing.
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib

2014-04-03 Thread Thomas Darimont (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958776#comment-13958776
 ] 

Thomas Darimont commented on SPARK-1406:


Hi Sean,

thanks for responding so quickly :)

Sure you can do that of course (that's what I currently do), but there are IMHO many 
interesting use cases that would benefit from having direct PMML support, e.g.:
1) Initialize an algorithm with a set of prepared parameters by loading a PMML 
file and evaluate the algorithm with Spark's infrastructure.
2) Abstract the configuration or construction of an algorithm via some kind of 
producer that gets the PMML model as an input and returns a fully configured 
Spark representation of the algorithm which is encoded in the PMML.
3) Support hot-replacing an algorithm (configuration) at runtime by just 
providing an updated PMML model to the Spark infrastructure.

You could even use JPMML to get the PMML object model as a starting point.

> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>
> It would be useful if Spark provided support for the evaluation of PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would make it possible to use analytical models that were created with a 
> statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLlib), which 
> would perform the actual model evaluation for a given input tuple. The PMML 
> model would then just contain the "parameterization" of an analytical model.
> Other projects like JPMML-Evaluator do a similar thing.
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib

2014-04-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958742#comment-13958742
 ] 

Sean Owen commented on SPARK-1406:
--

Yes, I was going to say, there's already a pretty good implementation of this. 
Why not simply call pmml-evaluator from within a Spark-based app to do scoring? 
Does it really need any particular support in MLlib?

MLlib does not create PMML right now, which would seem like something to 
tackle before scoring them anyway.

I have a meta-concern about piling on scope at such an early stage.
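
To make that suggestion concrete, here is a rough sketch of scoring a PMML-backed model from a plain Spark application with no MLlib changes. {{PmmlScorer}} is a hypothetical stand-in for whatever JPMML-Evaluator (or a similar library) exposes; only the Spark calls are real API.

{code}
import org.apache.spark.SparkContext

// Hypothetical wrapper around a PMML evaluator such as JPMML-Evaluator.
// Not a real Spark or JPMML class; it only needs to be Serializable so it
// can be shipped to executors.
trait PmmlScorer extends Serializable {
  def score(features: Array[Double]): Double
}

object PmmlScoringExample {
  def scoreAll(sc: SparkContext, scorer: PmmlScorer,
               data: Seq[Array[Double]]): Array[Double] = {
    val broadcastScorer = sc.broadcast(scorer)      // ship the parsed model once
    sc.parallelize(data)
      .map(row => broadcastScorer.value.score(row)) // plain map(); no MLlib support needed
      .collect()
  }
}
{code}

Whether MLlib should also import or export PMML is a separate question, and that is where the scope concern above comes in.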

> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>
> It would be useful if Spark provided support for the evaluation of PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would make it possible to use analytical models that were created with a 
> statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLlib), which 
> would perform the actual model evaluation for a given input tuple. The PMML 
> model would then just contain the "parameterization" of an analytical model.
> Other projects like JPMML-Evaluator do a similar thing.
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1406) PMML model evaluation support via MLib

2014-04-03 Thread Thomas Darimont (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Darimont updated SPARK-1406:
---

Summary: PMML model evaluation support via MLib  (was: PMML Support)

> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>
> It would be useful if Spark provided support for the evaluation of PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would make it possible to use analytical models that were created with a 
> statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLlib), which 
> would perform the actual model evaluation for a given input tuple. The PMML 
> model would then just contain the "parameterization" of an analytical model.
> Other projects like JPMML-Evaluator do a similar thing.
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1406) PMML Support

2014-04-03 Thread Thomas Darimont (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Darimont updated SPARK-1406:
---

Description: 
It would be useful if Spark provided support for the evaluation of PMML models 
(http://www.dmg.org/v4-2/GeneralStructure.html).

This would make it possible to use analytical models that were created with a statistical 
modeling tool like R, SAS, SPSS, etc. with Spark (MLlib), which would perform the 
actual model evaluation for a given input tuple. The PMML model would then just 
contain the "parameterization" of an analytical model.

Other projects like JPMML-Evaluator do a similar thing.
https://github.com/jpmml/jpmml/tree/master/pmml-evaluator

  was:
It would be useful if Spark supported the evaluation of PMML models 
(http://www.dmg.org/v4-2/GeneralStructure.html).

This would make it possible to use an analytical model that was created via a statistical 
modeling tool like R, SAS, SPSS, etc. with Spark (MLlib), which could perform the 
actual model evaluation for a given input tuple.

Other projects like JPMML-Evaluator do a similar thing.
https://github.com/jpmml/jpmml/tree/master/pmml-evaluator


> PMML Support
> 
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>
> It would be useful if Spark provided support for the evaluation of PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would make it possible to use analytical models that were created with a 
> statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLlib), which 
> would perform the actual model evaluation for a given input tuple. The PMML 
> model would then just contain the "parameterization" of an analytical model.
> Other projects like JPMML-Evaluator do a similar thing.
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1406) PMML Support

2014-04-03 Thread Thomas Darimont (JIRA)
Thomas Darimont created SPARK-1406:
--

 Summary: PMML Support
 Key: SPARK-1406
 URL: https://issues.apache.org/jira/browse/SPARK-1406
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Thomas Darimont


It would be useful if Spark supported the evaluation of PMML models 
(http://www.dmg.org/v4-2/GeneralStructure.html).

This would make it possible to use an analytical model that was created via a statistical 
modeling tool like R, SAS, SPSS, etc. with Spark (MLlib), which could perform the 
actual model evaluation for a given input tuple.

Other projects like JPMML-Evaluator do a similar thing.
https://github.com/jpmml/jpmml/tree/master/pmml-evaluator



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1097) ConcurrentModificationException

2014-04-03 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958671#comment-13958671
 ] 

Tsuyoshi OZAWA commented on SPARK-1097:
---

A patch by Nishkam on HADOOP-10456 has already been reviewed and will be 
committed to Hadoop's trunk in a few days.

> ConcurrentModificationException
> ---
>
> Key: SPARK-1097
> URL: https://issues.apache.org/jira/browse/SPARK-1097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Fabrizio Milo
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> {noformat}
> 14/02/16 08:18:45 WARN TaskSetManager: Loss was due to 
> java.util.ConcurrentModificationException
> java.util.ConcurrentModificationException
>   at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
>   at java.util.HashMap$KeyIterator.next(HashMap.java:960)
>   at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
>   at java.util.HashSet.<init>(HashSet.java:117)
>   at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:554)
>   at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:439)
>   at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:154)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:32)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:72)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>   at org.apache.spark.scheduler.Task.run(Task.scala:53)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1212) Support sparse data in MLlib

2014-04-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1212.


Resolution: Fixed

> Support sparse data in MLlib
> 
>
> Key: SPARK-1212
> URL: https://issues.apache.org/jira/browse/SPARK-1212
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 0.9.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Blocker
> Fix For: 1.0.0
>
>
> MLlib's NaiveBayes, SGD, and KMeans accept RDD[LabeledPoint] for training and 
> RDD[Array[Double]] for prediction, where LabeledPoint is a wrapper of 
> (Double, Array[Double]). Using Array[Double] could have good performance, but 
> sparse data appears quite often in practice. So I created this JIRA to 
> discuss the plan of adding sparse data support to MLlib and track its 
> progress.
> The goal is to support sparse data for training and prediction in all 
> existing algorithms in MLlib:
> * Gradient Descent
> * K-Means
> * Naive Bayes
> Previous discussions and pull requests:
> * https://github.com/mesos/spark/pull/736
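
As a minimal illustration of what sparse support means here (a sketch only, not the API that MLlib actually adds), a labeled point can keep its features as parallel index/value arrays instead of a dense Array[Double], so per-example work scales with the number of non-zeros:

{code}
// Illustrative sketch only; MLlib's actual sparse vector types may differ.
case class SparseLabeledPoint(label: Double, size: Int,
                              indices: Array[Int], values: Array[Double]) {
  // dot product against a dense weight vector touches only the non-zero entries,
  // which is the main win for gradient-descent-style updates
  def dot(weights: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < indices.length) {
      sum += values(i) * weights(indices(i))
      i += 1
    }
    sum
  }
}
{code}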



--
This message was sent by Atlassian JIRA
(v6.2#6252)