[jira] [Commented] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.
[ https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959664#comment-13959664 ] Davis Shepherd commented on SPARK-1411: --- I submitted a pull request to fix SPARK-1337 as well. > When using spark.ui.retainedStages=n only the first n stages are kept, not > the most recent. > --- > > Key: SPARK-1411 > URL: https://issues.apache.org/jira/browse/SPARK-1411 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 0.9.0 > Environment: Ubuntu 12.04 LTS Precise > 4 Nodes, > 16 cores each, > 196GB RAM each. >Reporter: Davis Shepherd > Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png > > > For any long running job with many stages, the web ui only shows the first n > stages of the job (where spark.ui.retainedStages=n). The most recent stages > are immediately dropped and are only visible for a brief time. This renders > the UI pretty useless after a pretty short amount of time for a long running > non-streaming job. I am unsure as to whether similar results appear for > streaming jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
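A minimal sketch of the retention policy being requested here, assuming a listener that buffers completed-stage summaries in arrival order; the names StageSummary, BoundedStageLog, and retainedStages are illustrative and not Spark's actual internals.

{code}
import scala.collection.mutable.ListBuffer

// Illustrative stage record; Spark's real listener types carry much more state.
case class StageSummary(stageId: Int, name: String)

// Keeps at most `retainedStages` entries, evicting the *oldest* first so the
// most recent stages stay visible in the UI.
class BoundedStageLog(retainedStages: Int) {
  private val completed = ListBuffer.empty[StageSummary]

  def onStageCompleted(stage: StageSummary): Unit = {
    completed += stage
    if (completed.size > retainedStages) {
      completed.trimStart(completed.size - retainedStages) // drop from the front
    }
  }

  def stages: Seq[StageSummary] = completed.toList
}
{code}

The reported behaviour amounts to eviction happening at the wrong end of the buffer; trimming from the front keeps the newest n stages.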
[jira] [Issue Comment Deleted] (SPARK-1394) calling system.platform on worker raises IOError
[ https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Idan Zalzberg updated SPARK-1394: - Comment: was deleted (was: It seems that the problem originates from pyspark capturing SIGCHLD in daemon.py as described here: http://stackoverflow.com/a/3837851 ) > calling system.platform on worker raises IOError > > > Key: SPARK-1394 > URL: https://issues.apache.org/jira/browse/SPARK-1394 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.0 > Environment: Tested on Ubuntu and Linux, local and remote master, > python 2.7.* >Reporter: Idan Zalzberg > Labels: pyspark > > A simple program that calls system.platform() on the worker fails most of the > time (it works some times but very rarely). > This is critical since many libraries call that method (e.g. boto). > Here is the trace of the attempt to call that method: > $ /usr/local/spark/bin/pyspark > Python 2.7.3 (default, Feb 27 2014, 20:00:17) > [GCC 4.6.3] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > 14/04/02 18:18:37 INFO Utils: Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 14/04/02 18:18:37 WARN Utils: Your hostname, qlika-dev resolves to a loopback > address: 127.0.1.1; using 10.33.102.46 instead (on interface eth1) > 14/04/02 18:18:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 14/04/02 18:18:38 INFO Slf4jLogger: Slf4jLogger started > 14/04/02 18:18:38 INFO Remoting: Starting remoting > 14/04/02 18:18:39 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://spark@10.33.102.46:36640] > 14/04/02 18:18:39 INFO Remoting: Remoting now listens on addresses: > [akka.tcp://spark@10.33.102.46:36640] > 14/04/02 18:18:39 INFO SparkEnv: Registering BlockManagerMaster > 14/04/02 18:18:39 INFO DiskBlockManager: Created local directory at > /tmp/spark-local-20140402181839-919f > 14/04/02 18:18:39 INFO MemoryStore: MemoryStore started with capacity 294.6 > MB. > 14/04/02 18:18:39 INFO ConnectionManager: Bound socket to port 43357 with id > = ConnectionManagerId(10.33.102.46,43357) > 14/04/02 18:18:39 INFO BlockManagerMaster: Trying to register BlockManager > 14/04/02 18:18:39 INFO BlockManagerMasterActor$BlockManagerInfo: Registering > block manager 10.33.102.46:43357 with 294.6 MB RAM > 14/04/02 18:18:39 INFO BlockManagerMaster: Registered BlockManager > 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server > 14/04/02 18:18:39 INFO HttpBroadcast: Broadcast server started at > http://10.33.102.46:51803 > 14/04/02 18:18:39 INFO SparkEnv: Registering MapOutputTracker > 14/04/02 18:18:39 INFO HttpFileServer: HTTP File server directory is > /tmp/spark-9b38acb0-7b01-4463-b0a6-602bfed05a2b > 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server > 14/04/02 18:18:40 INFO SparkUI: Started Spark Web UI at > http://10.33.102.46:4040 > 14/04/02 18:18:40 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 0.9.0 > /_/ > Using Python version 2.7.3 (default, Feb 27 2014 20:00:17) > Spark context available as sc. 
> >>> import platform > >>> sc.parallelize([1]).map(lambda x : platform.system()).collect() > 14/04/02 18:19:17 INFO SparkContext: Starting job: collect at :1 > 14/04/02 18:19:17 INFO DAGScheduler: Got job 0 (collect at :1) with 1 > output partitions (allowLocal=false) > 14/04/02 18:19:17 INFO DAGScheduler: Final stage: Stage 0 (collect at > :1) > 14/04/02 18:19:17 INFO DAGScheduler: Parents of final stage: List() > 14/04/02 18:19:17 INFO DAGScheduler: Missing parents: List() > 14/04/02 18:19:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at > collect at :1), which has no missing parents > 14/04/02 18:19:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 > (PythonRDD[1] at collect at :1) > 14/04/02 18:19:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks > 14/04/02 18:19:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on > executor localhost: localhost (PROCESS_LOCAL) > 14/04/02 18:19:17 INFO TaskSetManager: Serialized task 0.0:0 as 2152 bytes in > 12 ms > 14/04/02 18:19:17 INFO Executor: Running task ID 0 > PySpark worker failed with exception: > Traceback (most recent call last): > File "/usr/local/spark/python/pyspark/worker.py", line 77, in main > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/local/spark/python/pyspark/serializers.py", line 182, in > dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File "/usr/local/spark/pyth
[jira] [Commented] (SPARK-1394) calling system.platform on worker raises IOError
[ https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959661#comment-13959661 ] Idan Zalzberg commented on SPARK-1394: -- It seems that the problem originates from pyspark capturing SIGCHLD in daemon.py as described here: http://stackoverflow.com/a/3837851 > calling system.platform on worker raises IOError > > > Key: SPARK-1394 > URL: https://issues.apache.org/jira/browse/SPARK-1394 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.0 > Environment: Tested on Ubuntu and Linux, local and remote master, > python 2.7.* >Reporter: Idan Zalzberg > Labels: pyspark > > A simple program that calls system.platform() on the worker fails most of the > time (it works some times but very rarely). > This is critical since many libraries call that method (e.g. boto). > Here is the trace of the attempt to call that method: > $ /usr/local/spark/bin/pyspark > Python 2.7.3 (default, Feb 27 2014, 20:00:17) > [GCC 4.6.3] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > 14/04/02 18:18:37 INFO Utils: Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 14/04/02 18:18:37 WARN Utils: Your hostname, qlika-dev resolves to a loopback > address: 127.0.1.1; using 10.33.102.46 instead (on interface eth1) > 14/04/02 18:18:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 14/04/02 18:18:38 INFO Slf4jLogger: Slf4jLogger started > 14/04/02 18:18:38 INFO Remoting: Starting remoting > 14/04/02 18:18:39 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://spark@10.33.102.46:36640] > 14/04/02 18:18:39 INFO Remoting: Remoting now listens on addresses: > [akka.tcp://spark@10.33.102.46:36640] > 14/04/02 18:18:39 INFO SparkEnv: Registering BlockManagerMaster > 14/04/02 18:18:39 INFO DiskBlockManager: Created local directory at > /tmp/spark-local-20140402181839-919f > 14/04/02 18:18:39 INFO MemoryStore: MemoryStore started with capacity 294.6 > MB. > 14/04/02 18:18:39 INFO ConnectionManager: Bound socket to port 43357 with id > = ConnectionManagerId(10.33.102.46,43357) > 14/04/02 18:18:39 INFO BlockManagerMaster: Trying to register BlockManager > 14/04/02 18:18:39 INFO BlockManagerMasterActor$BlockManagerInfo: Registering > block manager 10.33.102.46:43357 with 294.6 MB RAM > 14/04/02 18:18:39 INFO BlockManagerMaster: Registered BlockManager > 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server > 14/04/02 18:18:39 INFO HttpBroadcast: Broadcast server started at > http://10.33.102.46:51803 > 14/04/02 18:18:39 INFO SparkEnv: Registering MapOutputTracker > 14/04/02 18:18:39 INFO HttpFileServer: HTTP File server directory is > /tmp/spark-9b38acb0-7b01-4463-b0a6-602bfed05a2b > 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server > 14/04/02 18:18:40 INFO SparkUI: Started Spark Web UI at > http://10.33.102.46:4040 > 14/04/02 18:18:40 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 0.9.0 > /_/ > Using Python version 2.7.3 (default, Feb 27 2014 20:00:17) > Spark context available as sc. 
> >>> import platform > >>> sc.parallelize([1]).map(lambda x : platform.system()).collect() > 14/04/02 18:19:17 INFO SparkContext: Starting job: collect at :1 > 14/04/02 18:19:17 INFO DAGScheduler: Got job 0 (collect at :1) with 1 > output partitions (allowLocal=false) > 14/04/02 18:19:17 INFO DAGScheduler: Final stage: Stage 0 (collect at > :1) > 14/04/02 18:19:17 INFO DAGScheduler: Parents of final stage: List() > 14/04/02 18:19:17 INFO DAGScheduler: Missing parents: List() > 14/04/02 18:19:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at > collect at :1), which has no missing parents > 14/04/02 18:19:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 > (PythonRDD[1] at collect at :1) > 14/04/02 18:19:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks > 14/04/02 18:19:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on > executor localhost: localhost (PROCESS_LOCAL) > 14/04/02 18:19:17 INFO TaskSetManager: Serialized task 0.0:0 as 2152 bytes in > 12 ms > 14/04/02 18:19:17 INFO Executor: Running task ID 0 > PySpark worker failed with exception: > Traceback (most recent call last): > File "/usr/local/spark/python/pyspark/worker.py", line 77, in main > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/local/spark/python/pyspark/serializers.py", line 182, in > dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File
[jira] [Resolved] (SPARK-1337) Application web UI garbage collects newest stages instead old ones
[ https://issues.apache.org/jira/browse/SPARK-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1337. Resolution: Fixed > Application web UI garbage collects newest stages instead old ones > -- > > Key: SPARK-1337 > URL: https://issues.apache.org/jira/browse/SPARK-1337 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 0.9.0 >Reporter: Dmitry Bugaychenko >Assignee: Patrick Wendell > Fix For: 1.0.0, 0.9.2 > > > When running task with many stages (eg. streaming task) and > spark.ui.retainedStages set to small value (100) application UI removes > newest stages keeping 90 old ones... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1337) Application web UI garbage collects newest stages instead old ones
[ https://issues.apache.org/jira/browse/SPARK-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1337: --- Fix Version/s: 0.9.2 1.0.0 > Application web UI garbage collects newest stages instead old ones > -- > > Key: SPARK-1337 > URL: https://issues.apache.org/jira/browse/SPARK-1337 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 0.9.0 >Reporter: Dmitry Bugaychenko >Assignee: Patrick Wendell > Fix For: 1.0.0, 0.9.2 > > > When running task with many stages (eg. streaming task) and > spark.ui.retainedStages set to small value (100) application UI removes > newest stages keeping 90 old ones... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1413) Parquet messes up stdout and stdin when used in Spark REPL
Matei Zaharia created SPARK-1413: Summary: Parquet messes up stdout and stdin when used in Spark REPL Key: SPARK-1413 URL: https://issues.apache.org/jira/browse/SPARK-1413 Project: Spark Issue Type: Bug Components: SQL Reporter: Matei Zaharia Priority: Critical Fix For: 1.0.0 I have a simple Parquet file in "foos.parquet", but after I type this code, it freezes the shell, to the point where I can't read or write stuff: {code} scala> val qc = new org.apache.spark.sql.SQLContext(sc); import qc._ qc: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c0c8826 import qc._ scala> qc.parquetFile("foos.parquet").saveAsTextFile("bar") {code} The job itself completes successfully, and "bar" contains the right text, but I can no longer see commands I type in, or further log output. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1413) Parquet messes up stdout and stdin when used in Spark REPL
[ https://issues.apache.org/jira/browse/SPARK-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1413: - Description: I have a simple Parquet file in "foos.parquet", but after I type this code, it freezes the shell, to the point where I can't read or write stuff: scala> val qc = new org.apache.spark.sql.SQLContext(sc); import qc._ qc: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c0c8826 import qc._ scala> qc.parquetFile("foos.parquet").saveAsTextFile("bar") The job itself completes successfully, and "bar" contains the right text, but I can no longer see commands I type in, or further log output. was: I have a simple Parquet file in "foos.parquet", but after I type this code, it freezes the shell, to the point where I can't read or write stuff: {code} scala> val qc = new org.apache.spark.sql.SQLContext(sc); import qc._ qc: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c0c8826 import qc._ scala> qc.parquetFile("foos.parquet").saveAsTextFile("bar") {code} The job itself completes successfully, and "bar" contains the right text, but I can no longer see commands I type in, or further log output. > Parquet messes up stdout and stdin when used in Spark REPL > -- > > Key: SPARK-1413 > URL: https://issues.apache.org/jira/browse/SPARK-1413 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Matei Zaharia >Assignee: Michael Armbrust >Priority: Critical > Fix For: 1.0.0 > > > I have a simple Parquet file in "foos.parquet", but after I type this code, > it freezes the shell, to the point where I can't read or write stuff: > scala> val qc = new org.apache.spark.sql.SQLContext(sc); import qc._ > qc: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c0c8826 > import qc._ > scala> qc.parquetFile("foos.parquet").saveAsTextFile("bar") > The job itself completes successfully, and "bar" contains the right text, but > I can no longer see commands I type in, or further log output. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1413) Parquet messes up stdout and stdin when used in Spark REPL
[ https://issues.apache.org/jira/browse/SPARK-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1413: - Assignee: Michael Armbrust > Parquet messes up stdout and stdin when used in Spark REPL > -- > > Key: SPARK-1413 > URL: https://issues.apache.org/jira/browse/SPARK-1413 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Matei Zaharia >Assignee: Michael Armbrust >Priority: Critical > Fix For: 1.0.0 > > > I have a simple Parquet file in "foos.parquet", but after I type this code, > it freezes the shell, to the point where I can't read or write stuff: > {code} > scala> val qc = new org.apache.spark.sql.SQLContext(sc); import qc._ > qc: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c0c8826 > import qc._ > scala> qc.parquetFile("foos.parquet").saveAsTextFile("bar") > {code} > The job itself completes successfully, and "bar" contains the right text, but > I can no longer see commands I type in, or further log output. -- This message was sent by Atlassian JIRA (v6.2#6252)
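Parquet's Java library logs through java.util.logging, which is one plausible way a REPL's console handling can get disturbed. Below is a hedged sketch of detaching whatever console handlers the "parquet" JUL logger installed; it is an illustrative mitigation only, not necessarily what the eventual fix for this issue does.

{code}
import java.util.logging.{Level, Logger}

// Hypothetical mitigation: strip the handlers the "parquet" logger set up and
// let its records flow to the parent/root logger instead, leaving the REPL's
// stdin/stdout alone.
object QuietParquetLogging {
  def silence(): Unit = {
    val parquetLogger = Logger.getLogger("parquet")
    parquetLogger.getHandlers.foreach(parquetLogger.removeHandler)
    parquetLogger.setUseParentHandlers(true)
    parquetLogger.setLevel(Level.INFO)
  }
}
{code}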
[jira] [Updated] (SPARK-1412) Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1412: Priority: Minor (was: Major) > Disable partial aggregation automatically when reduction factor is low > -- > > Key: SPARK-1412 > URL: https://issues.apache.org/jira/browse/SPARK-1412 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Michael Armbrust >Priority: Minor > Fix For: 1.1.0 > > > Once we see enough number of rows in partial aggregation and don't observe > any reduction, the aggregate operator should just turn off partial > aggregation. -- This message was sent by Atlassian JIRA (v6.2#6252)
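A hedged sketch of that idea against plain Scala collections rather than the actual aggregate operator; the threshold names and values (sampleRows, minReduction) are illustrative.

{code}
import scala.collection.mutable

// Partially aggregate (key, value) pairs, but stop building the hash map and
// pass rows through unchanged if, after observing `sampleRows` inputs, the map
// has not shrunk the data by at least `minReduction`.
def partialAggregate[K, V](
    rows: Iterator[(K, V)],
    merge: (V, V) => V,
    sampleRows: Int = 100000,
    minReduction: Double = 0.5): Iterator[(K, V)] = {
  val buffer = mutable.LinkedHashMap.empty[K, V]
  var seen = 0L
  while (rows.hasNext && seen < sampleRows) {
    val (k, v) = rows.next()
    seen += 1
    buffer(k) = buffer.get(k).map(merge(_, v)).getOrElse(v)
  }
  val reduction = 1.0 - buffer.size.toDouble / math.max(seen, 1L)
  if (reduction >= minReduction || !rows.hasNext) {
    // Decent reduction (or input exhausted): keep aggregating the remainder.
    rows.foreach { case (k, v) => buffer(k) = buffer.get(k).map(merge(_, v)).getOrElse(v) }
    buffer.iterator
  } else {
    // Poor reduction: emit what we have and forward the rest untouched; the
    // final (post-shuffle) aggregation still sees every key.
    buffer.iterator ++ rows
  }
}
{code}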
[jira] [Comment Edited] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959561#comment-13959561 ] Min Zhou edited comment on SPARK-1391 at 4/4/14 2:12 AM: - [~shivaram] It should take a long time if we fundamentally solve the problem, we need a ByteBuffer and an OutputStream that support more than 2GB data. Or change the data structure inside a block, for example , Array[ByteBuffer] to replace ByteBuffer. A short term approach is that take kryo serialization as the default ser instead of java ser which causes data inflation. I am attaching a patch following this approach. Due to I didn't test it in a real cluster, I am not sending a pull request currently. Shivaram, please apply this patch and test it for me, thanks! was (Author: coderplay): [~shivaram] It should take a long fundamentally solve the problem, we need a ByteBuffer and an OutputStream that support more than 2GB data. Or change the data structure inside a block, for example , Array[ByteBuffer] to replace ByteBuffer. A short term approach is that take kryo serialization as the default ser instead of java ser which causes data inflation. I am attaching a patch following this approach. Due to I didn't test it in a real cluster, I am not sending a pull request currently. Shivaram, please apply this patch and test it for me, thanks! > BlockManager cannot transfer blocks larger than 2G in size > -- > > Key: SPARK-1391 > URL: https://issues.apache.org/jira/browse/SPARK-1391 > Project: Spark > Issue Type: Bug > Components: Block Manager, Shuffle >Affects Versions: 1.0.0 >Reporter: Shivaram Venkataraman >Assignee: Min Zhou > Attachments: SPARK-1391.diff > > > If a task tries to remotely access a cached RDD block, I get an exception > when the block size is > 2G. The exception is pasted below. > Memory capacities are huge these days (> 60G), and many workflows depend on > having large blocks in memory, so it would be good to fix this bug. > I don't know if the same thing happens on shuffles if one transfer (from > mapper to reducer) is > 2G. 
> {noformat} > 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer > message > java.lang.ArrayIndexOutOfBoundsException > at > it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96) > at > it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134) > at > it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164) > at > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) > at > java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38) > at > org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93) > at > org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26) > at > org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913) > at > org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922) > at > org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102) > at > org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348) > at > org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323) > at > org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90) > at > org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > a
[jira] [Updated] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Zhou updated SPARK-1391: Attachment: SPARK-1391.diff [~shivaram] Fundamentally solving the problem will take a long time: we need a ByteBuffer and an OutputStream that support more than 2GB of data, or a change to the data structure inside a block, for example Array[ByteBuffer] in place of ByteBuffer. A short-term approach is to make Kryo serialization the default serializer instead of Java serialization, which inflates the data. I am attaching a patch following this approach. Since I haven't tested it in a real cluster, I am not sending a pull request yet. Shivaram, please apply this patch and test it for me, thanks! > BlockManager cannot transfer blocks larger than 2G in size > -- > > Key: SPARK-1391 > URL: https://issues.apache.org/jira/browse/SPARK-1391 > Project: Spark > Issue Type: Bug > Components: Block Manager, Shuffle >Affects Versions: 1.0.0 >Reporter: Shivaram Venkataraman >Assignee: Min Zhou > Attachments: SPARK-1391.diff > > > If a task tries to remotely access a cached RDD block, I get an exception > when the block size is > 2G. The exception is pasted below. > Memory capacities are huge these days (> 60G), and many workflows depend on > having large blocks in memory, so it would be good to fix this bug. > I don't know if the same thing happens on shuffles if one transfer (from > mapper to reducer) is > 2G. > {noformat} > 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer > message > java.lang.ArrayIndexOutOfBoundsException > at > it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96) > at > it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134) > at > it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164) > at > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) > at > java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38) > at > org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93) > at > org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26) > at > org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913) > at > org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922) > at > org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102) > at > org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348) > at > org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323) > at > org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90) > at > org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at > org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28) > at > org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) > at > org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661) > at > org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503) > at > java.util.concur
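A hedged sketch of the Array[ByteBuffer] idea floated in the comment above: an OutputStream that rolls over to a fresh buffer every `chunkSize` bytes, so a serialized block never has to fit in a single ByteBuffer with its Integer.MAX_VALUE limit. This is illustrative only and is not the attached SPARK-1391.diff.

{code}
import java.io.OutputStream
import java.nio.ByteBuffer
import scala.collection.mutable.ArrayBuffer

class ChunkedByteBufferOutputStream(chunkSize: Int = 64 * 1024 * 1024) extends OutputStream {
  private val chunks = ArrayBuffer.empty[ByteBuffer]
  private var current: ByteBuffer = null

  override def write(b: Int): Unit = {
    if (current == null || !current.hasRemaining) {
      current = ByteBuffer.allocate(chunkSize)   // start a new chunk
      chunks += current
    }
    current.put(b.toByte)
  }

  /** The finished chunks, flipped for reading; their total size may exceed 2 GB. */
  def toChunks: Array[ByteBuffer] = chunks.map { buf =>
    val dup = buf.duplicate()
    dup.flip()
    dup
  }.toArray
}
{code}

The short-term approach in the comment (defaulting to Kryo) shrinks the serialized data instead of lifting the per-buffer limit; the sketch above is roughly what the longer-term direction might look like.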
[jira] [Created] (SPARK-1412) Disable partial aggregation automatically when reduction factor is low
Reynold Xin created SPARK-1412: -- Summary: Disable partial aggregation automatically when reduction factor is low Key: SPARK-1412 URL: https://issues.apache.org/jira/browse/SPARK-1412 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Michael Armbrust Fix For: 1.1.0 Once we see enough number of rows in partial aggregation and don't observe any reduction, the aggregate operator should just turn off partial aggregation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1337) Application web UI garbage collects newest stages instead old ones
[ https://issues.apache.org/jira/browse/SPARK-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959547#comment-13959547 ] Patrick Wendell edited comment on SPARK-1337 at 4/4/14 1:52 AM: https://github.com/apache/spark/pull/320 This is a simple fix if people want to pull this in and build Spark on their own. was (Author: pwendell): https://github.com/apache/spark/pull/320 > Application web UI garbage collects newest stages instead old ones > -- > > Key: SPARK-1337 > URL: https://issues.apache.org/jira/browse/SPARK-1337 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 0.9.0 >Reporter: Dmitry Bugaychenko > > When running task with many stages (eg. streaming task) and > spark.ui.retainedStages set to small value (100) application UI removes > newest stages keeping 90 old ones... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-1337) Application web UI garbage collects newest stages instead old ones
[ https://issues.apache.org/jira/browse/SPARK-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reassigned SPARK-1337: -- Assignee: Patrick Wendell > Application web UI garbage collects newest stages instead old ones > -- > > Key: SPARK-1337 > URL: https://issues.apache.org/jira/browse/SPARK-1337 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 0.9.0 >Reporter: Dmitry Bugaychenko >Assignee: Patrick Wendell > > When running task with many stages (eg. streaming task) and > spark.ui.retainedStages set to small value (100) application UI removes > newest stages keeping 90 old ones... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1337) Application web UI garbage collects newest stages instead old ones
[ https://issues.apache.org/jira/browse/SPARK-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959547#comment-13959547 ] Patrick Wendell commented on SPARK-1337: https://github.com/apache/spark/pull/320 > Application web UI garbage collects newest stages instead old ones > -- > > Key: SPARK-1337 > URL: https://issues.apache.org/jira/browse/SPARK-1337 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 0.9.0 >Reporter: Dmitry Bugaychenko > > When running task with many stages (eg. streaming task) and > spark.ui.retainedStages set to small value (100) application UI removes > newest stages keeping 90 old ones... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1410) Class not found exception with application launched from sbt 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959501#comment-13959501 ] Michael Armbrust commented on SPARK-1410: - So I haven't tracked down why this is the case, but I think changing build.sbt to: scalaVersion := "2.10.4" instead might fix this problem. > Class not found exception with application launched from sbt 0.13.x > --- > > Key: SPARK-1410 > URL: https://issues.apache.org/jira/browse/SPARK-1410 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.0.0 >Reporter: Xiangrui Meng > > sbt 0.13.x use its own loader but this is not available at worker side: > org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: > sbt.classpath.ClasspathFilter@47ed40d > A workaround is to switch to sbt 0.12.4. -- This message was sent by Atlassian JIRA (v6.2#6252)
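For reference, a hedged sketch of how the two workarounds mentioned in this thread look in a minimal project definition; the project name and the dependency line are illustrative (0.9.0-incubating is simply a released artifact available on Maven Central).

{code}
// project/build.properties -- the reported workaround: stay on the 0.12 line
//   sbt.version=0.12.4

// build.sbt -- or remain on sbt 0.13.x and try the scalaVersion change suggested above
name := "spark-sbt-run-example"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"
{code}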
[jira] [Commented] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.
[ https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959494#comment-13959494 ] Patrick Wendell commented on SPARK-1411: Ah, got it SPARK-1337 > When using spark.ui.retainedStages=n only the first n stages are kept, not > the most recent. > --- > > Key: SPARK-1411 > URL: https://issues.apache.org/jira/browse/SPARK-1411 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 0.9.0 > Environment: Ubuntu 12.04 LTS Precise > 4 Nodes, > 16 cores each, > 196GB RAM each. >Reporter: Davis Shepherd > Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png > > > For any long running job with many stages, the web ui only shows the first n > stages of the job (where spark.ui.retainedStages=n). The most recent stages > are immediately dropped and are only visible for a brief time. This renders > the UI pretty useless after a pretty short amount of time for a long running > non-streaming job. I am unsure as to whether similar results appear for > streaming jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.
[ https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959490#comment-13959490 ] Patrick Wendell commented on SPARK-1411: [~dgshep] is this a duplicate? What is the other JIRA? > When using spark.ui.retainedStages=n only the first n stages are kept, not > the most recent. > --- > > Key: SPARK-1411 > URL: https://issues.apache.org/jira/browse/SPARK-1411 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 0.9.0 > Environment: Ubuntu 12.04 LTS Precise > 4 Nodes, > 16 cores each, > 196GB RAM each. >Reporter: Davis Shepherd > Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png > > > For any long running job with many stages, the web ui only shows the first n > stages of the job (where spark.ui.retainedStages=n). The most recent stages > are immediately dropped and are only visible for a brief time. This renders > the UI pretty useless after a pretty short amount of time for a long running > non-streaming job. I am unsure as to whether similar results appear for > streaming jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.
[ https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd closed SPARK-1411. - Resolution: Duplicate > When using spark.ui.retainedStages=n only the first n stages are kept, not > the most recent. > --- > > Key: SPARK-1411 > URL: https://issues.apache.org/jira/browse/SPARK-1411 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 0.9.0 > Environment: Ubuntu 12.04 LTS Precise > 4 Nodes, > 16 cores each, > 196GB RAM each. >Reporter: Davis Shepherd > Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png > > > For any long running job with many stages, the web ui only shows the first n > stages of the job (where spark.ui.retainedStages=n). The most recent stages > are immediately dropped and are only visible for a brief time. This renders > the UI pretty useless after a pretty short amount of time for a long running > non-streaming job. I am unsure as to whether similar results appear for > streaming jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1296) Make RDDs Covariant
[ https://issues.apache.org/jira/browse/SPARK-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1296: - Fix Version/s: (was: 1.0.0) > Make RDDs Covariant > --- > > Key: SPARK-1296 > URL: https://issues.apache.org/jira/browse/SPARK-1296 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Michael Armbrust >Assignee: Michael Armbrust > > First, what is the problem with RDDs not being covariant > {code} > // Consider a function that takes a Seq of some trait. > scala> trait A { val a = 1 } > scala> def f(as: Seq[A]) = as.map(_.a) > // A list of a concrete version of that trait can be used in this function. > scala> class B extends A > scala> f(new B :: Nil) > res0: Seq[Int] = List(1) > // Now lets try the same thing with RDDs > scala> def f(as: org.apache.spark.rdd.RDD[A]) = as.map(_.a) > scala> val rdd = sc.parallelize(new B :: Nil) > rdd: org.apache.spark.rdd.RDD[B] = ParallelCollectionRDD[2] at parallelize at > :42 > // :( > scala> f(rdd) > :45: error: type mismatch; > found : org.apache.spark.rdd.RDD[B] > required: org.apache.spark.rdd.RDD[A] > Note: B <: A, but class RDD is invariant in type T. > You may wish to define T as +T instead. (SLS 4.5) > f(rdd) > {code} > h2. Is it possible to make RDDs covariant? > Probably? In terms of the public user interface, they are *mostly* > covariant. (Internally we use the type parameter T in a lot of mutable state > that breaks the covariance contract, but I think with casting we can > 'promise' the compiler that we are behaving). There are also a lot of > complications with other types that we return which are invariant. > h2. What will it take to make RDDs covariant? > As I mention above, all of our mutable internal state is going to require > casting to avoid using T. This seems to be okay, it makes our life only > slightly harder. This extra work required because we are basically promising > the compiler that even if an RDD is implicitly upcast, internally we are > keeping all the checkpointed data of the correct type. Since an RDD is > immutable, we are okay! > We also need to modify all the places where we use T in function parameters. > So for example: > {code} > def ++[U >: T : ClassTag](other: RDD[U]): RDD[U] = > this.union(other).asInstanceOf[RDD[U]] > {code} > We are now allowing you to append an RDD of a less specific type, and then > returning a less specific new RDD. This I would argue is a good change. We > are strictly improving the power of the RDD interface, while maintaining > reasonable type semantics. > h2. So, why wouldn't we do it? > There are a lot of places where we interact with invariant types. We return > both Maps and Arrays from a lot of public functions. Arrays are invariant > (but if we returned immutable sequences instead we would be good), and > Maps are invariant in the Key (once again, immutable sequences of tuples > would be great here). > I don't think this is a deal breaker, and we may even be able to get away > with it, without changing the returns types of these functions. For example, > I think that this should work, though once again requires make promises to > the compiler: > {code} > /** >* Return an array that contains all of the elements in this RDD. >*/ > def collect[U >: T](): Array[U] = { > val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray) > Array.concat(results: _*).asInstanceOf[Array[U]] > } > {code} > I started working on this > [here|https://github.com/marmbrus/spark/tree/coveriantRDD]. 
Thoughts / > suggestions are welcome! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.
[ https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-1411: -- Attachment: Screen Shot 2014-04-03 at 5.35.00 PM.png Notice the large gap in stage ids and submitted time between the top two stages. > When using spark.ui.retainedStages=n only the first n stages are kept, not > the most recent. > --- > > Key: SPARK-1411 > URL: https://issues.apache.org/jira/browse/SPARK-1411 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 0.9.0 > Environment: Ubuntu 12.04 LTS Precise > 4 Nodes, > 16 cores each, > 196GB RAM each. >Reporter: Davis Shepherd >Priority: Minor > Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png > > > For any long running job with many stages, the web ui only shows the first n > stages of the job (where spark.ui.retainedStages=n). The most recent stages > are immediately dropped and are only visible for a brief time. This renders > the UI pretty useless after a pretty short amount of time for a long running > non-streaming job. I am unsure as to whether similar results appear for > streaming jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.
[ https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-1411: -- Priority: Major (was: Minor) > When using spark.ui.retainedStages=n only the first n stages are kept, not > the most recent. > --- > > Key: SPARK-1411 > URL: https://issues.apache.org/jira/browse/SPARK-1411 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 0.9.0 > Environment: Ubuntu 12.04 LTS Precise > 4 Nodes, > 16 cores each, > 196GB RAM each. >Reporter: Davis Shepherd > Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png > > > For any long running job with many stages, the web ui only shows the first n > stages of the job (where spark.ui.retainedStages=n). The most recent stages > are immediately dropped and are only visible for a brief time. This renders > the UI pretty useless after a pretty short amount of time for a long running > non-streaming job. I am unsure as to whether similar results appear for > streaming jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.
Davis Shepherd created SPARK-1411: - Summary: When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent. Key: SPARK-1411 URL: https://issues.apache.org/jira/browse/SPARK-1411 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 0.9.0 Environment: Ubuntu 12.04 LTS Precise 4 Nodes, 16 cores each, 196GB RAM each. Reporter: Davis Shepherd Priority: Minor For any long running job with many stages, the web ui only shows the first n stages of the job (where spark.ui.retainedStages=n). The most recent stages are immediately dropped and are only visible for a brief time. This renders the UI pretty useless after a pretty short amount of time for a long running non-streaming job. I am unsure as to whether similar results appear for streaming jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1348) Spark UI's do not bind to localhost interface anymore
[ https://issues.apache.org/jira/browse/SPARK-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959466#comment-13959466 ] Kan Zhang commented on SPARK-1348: -- JettyUtils.startJettyServer() used to bind to all interfaces; however, SPARK-1060 changed that to bind only to a specific interface (preferably a non-loopback address). If you want to revert to the previous behavior, here's the patch: https://github.com/apache/spark/pull/318 > Spark UI's do not bind to localhost interface anymore > - > > Key: SPARK-1348 > URL: https://issues.apache.org/jira/browse/SPARK-1348 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Priority: Blocker > Fix For: 1.0.0 > > > When running the shell or standalone master, it no longer binds to localhost. > I think this may have been caused by the security patch. We should figure out > what caused it and revert to the old behavior. Maybe we want to always bind > to `localhost` or just to bind to all interfaces. -- This message was sent by Atlassian JIRA (v6.2#6252)
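For context, a minimal embedded-Jetty sketch of the two binding behaviours being contrasted here; this is the plain Jetty API with an illustrative address, not Spark's JettyUtils.

{code}
import java.net.InetSocketAddress
import org.eclipse.jetty.server.Server

object JettyBindSketch {
  // Bind to every interface, including loopback (the pre-SPARK-1060 behaviour).
  val allInterfaces = new Server(4040)

  // Bind only to one resolved, non-loopback address (the current behaviour).
  val oneInterface = new Server(new InetSocketAddress("192.168.1.10", 4040))
}
{code}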
[jira] [Reopened] (SPARK-1398) Remove FindBugs jsr305 dependency
[ https://issues.apache.org/jira/browse/SPARK-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-1398: > Remove FindBugs jsr305 dependency > - > > Key: SPARK-1398 > URL: https://issues.apache.org/jira/browse/SPARK-1398 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Mark Hamstra >Assignee: Mark Hamstra >Priority: Minor > > We're not making much use of FindBugs at this point, but findbugs-2.0.x is a > drop-in replacement for 1.3.9 and does offer significant improvements > (http://findbugs.sourceforge.net/findbugs2.html), so it's probably where we > want to be for Spark 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1134) ipython won't run standalone python script
[ https://issues.apache.org/jira/browse/SPARK-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1134. -- Resolution: Fixed > ipython won't run standalone python script > -- > > Key: SPARK-1134 > URL: https://issues.apache.org/jira/browse/SPARK-1134 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.0, 0.9.1 >Reporter: Diana Carroll >Assignee: Diana Carroll > Labels: pyspark > Fix For: 1.0.0, 0.9.2 > > > Using Spark 0.9.0, python 2.6.6, and ipython 1.1.0. > The problem: If I want to run a python script as a standalone app, the docs > say I should execute the command "pyspark myscript.py". This works as long > as IPYTHON=0. But if IPYTHON=1 this doesn't work. > This problem arose for me because I tried to save myself typing by setting > IPYTHON=1 in my shell profile script. Which then meant I was unable to > execute pyspark standalone scripts. > My analysis: > in the pyspark script, command line arguments are simply ignored if ipython > is used: > {code}if [[ "$IPYTHON" = "1" ]] ; then > exec ipython $IPYTHON_OPTS > else > exec "$PYSPARK_PYTHON" "$@" > fi{code} > I thought I could get around this by changing the script to pass $@. > However, this doesn't work: doing so results in an error saying multiple > spark contexts can't be run at once. > This is because of a feature?/bug? of ipython related to the PYTHONSTARTUP > environment variable. the pyspark script sets this variable to point to the > python/shell.py script, which initializes the Spark Context. In regular > python, the PYTHONSTARTUP script runs ONLY if python is invoked in > interactive mode; if run with a script, it ignores the variable. iPython > runs that script every time, regardless. Which means it will always execute > Spark's shell.py script to initialize the spark context even when it was > invoked with a script. > Proposed solution: > short term: add this information to the Spark docs regarding iPython. > Something like "Note, iPython can only be used interactively. Use regular > Python to execute pyspark script files." > long term: change the pyspark script to tell if arguments are passed in; if > so, just call python instead of pyspark, or don't set the PYTHONSTARTUP > variable? Or maybe fix shell.py to detect if it's being invoked in > non-interactively and not initialize sc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1134) ipython won't run standalone python script
[ https://issues.apache.org/jira/browse/SPARK-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1134: - Fix Version/s: 0.9.2 1.0.0 > ipython won't run standalone python script > -- > > Key: SPARK-1134 > URL: https://issues.apache.org/jira/browse/SPARK-1134 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.0, 0.9.1 >Reporter: Diana Carroll >Assignee: Diana Carroll > Labels: pyspark > Fix For: 1.0.0, 0.9.2 > > > Using Spark 0.9.0, python 2.6.6, and ipython 1.1.0. > The problem: If I want to run a python script as a standalone app, the docs > say I should execute the command "pyspark myscript.py". This works as long > as IPYTHON=0. But if IPYTHON=1 this doesn't work. > This problem arose for me because I tried to save myself typing by setting > IPYTHON=1 in my shell profile script. Which then meant I was unable to > execute pyspark standalone scripts. > My analysis: > in the pyspark script, command line arguments are simply ignored if ipython > is used: > {code}if [[ "$IPYTHON" = "1" ]] ; then > exec ipython $IPYTHON_OPTS > else > exec "$PYSPARK_PYTHON" "$@" > fi{code} > I thought I could get around this by changing the script to pass $@. > However, this doesn't work: doing so results in an error saying multiple > spark contexts can't be run at once. > This is because of a feature?/bug? of ipython related to the PYTHONSTARTUP > environment variable. the pyspark script sets this variable to point to the > python/shell.py script, which initializes the Spark Context. In regular > python, the PYTHONSTARTUP script runs ONLY if python is invoked in > interactive mode; if run with a script, it ignores the variable. iPython > runs that script every time, regardless. Which means it will always execute > Spark's shell.py script to initialize the spark context even when it was > invoked with a script. > Proposed solution: > short term: add this information to the Spark docs regarding iPython. > Something like "Note, iPython can only be used interactively. Use regular > Python to execute pyspark script files." > long term: change the pyspark script to tell if arguments are passed in; if > so, just call python instead of pyspark, or don't set the PYTHONSTARTUP > variable? Or maybe fix shell.py to detect if it's being invoked in > non-interactively and not initialize sc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1134) ipython won't run standalone python script
[ https://issues.apache.org/jira/browse/SPARK-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1134: - Assignee: Diana Carroll > ipython won't run standalone python script > -- > > Key: SPARK-1134 > URL: https://issues.apache.org/jira/browse/SPARK-1134 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.0, 0.9.1 >Reporter: Diana Carroll >Assignee: Diana Carroll > Labels: pyspark > > Using Spark 0.9.0, python 2.6.6, and ipython 1.1.0. > The problem: If I want to run a python script as a standalone app, the docs > say I should execute the command "pyspark myscript.py". This works as long > as IPYTHON=0. But if IPYTHON=1 this doesn't work. > This problem arose for me because I tried to save myself typing by setting > IPYTHON=1 in my shell profile script. Which then meant I was unable to > execute pyspark standalone scripts. > My analysis: > in the pyspark script, command line arguments are simply ignored if ipython > is used: > {code}if [[ "$IPYTHON" = "1" ]] ; then > exec ipython $IPYTHON_OPTS > else > exec "$PYSPARK_PYTHON" "$@" > fi{code} > I thought I could get around this by changing the script to pass $@. > However, this doesn't work: doing so results in an error saying multiple > spark contexts can't be run at once. > This is because of a feature?/bug? of ipython related to the PYTHONSTARTUP > environment variable. the pyspark script sets this variable to point to the > python/shell.py script, which initializes the Spark Context. In regular > python, the PYTHONSTARTUP script runs ONLY if python is invoked in > interactive mode; if run with a script, it ignores the variable. iPython > runs that script every time, regardless. Which means it will always execute > Spark's shell.py script to initialize the spark context even when it was > invoked with a script. > Proposed solution: > short term: add this information to the Spark docs regarding iPython. > Something like "Note, iPython can only be used interactively. Use regular > Python to execute pyspark script files." > long term: change the pyspark script to tell if arguments are passed in; if > so, just call python instead of pyspark, or don't set the PYTHONSTARTUP > variable? Or maybe fix shell.py to detect if it's being invoked in > non-interactively and not initialize sc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1134) ipython won't run standalone python script
[ https://issues.apache.org/jira/browse/SPARK-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1134: - Affects Version/s: 0.9.1 > ipython won't run standalone python script > -- > > Key: SPARK-1134 > URL: https://issues.apache.org/jira/browse/SPARK-1134 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.0, 0.9.1 >Reporter: Diana Carroll >Assignee: Diana Carroll > Labels: pyspark > > Using Spark 0.9.0, python 2.6.6, and ipython 1.1.0. > The problem: If I want to run a python script as a standalone app, the docs > say I should execute the command "pyspark myscript.py". This works as long > as IPYTHON=0. But if IPYTHON=1 this doesn't work. > This problem arose for me because I tried to save myself typing by setting > IPYTHON=1 in my shell profile script. Which then meant I was unable to > execute pyspark standalone scripts. > My analysis: > in the pyspark script, command line arguments are simply ignored if ipython > is used: > {code}if [[ "$IPYTHON" = "1" ]] ; then > exec ipython $IPYTHON_OPTS > else > exec "$PYSPARK_PYTHON" "$@" > fi{code} > I thought I could get around this by changing the script to pass $@. > However, this doesn't work: doing so results in an error saying multiple > spark contexts can't be run at once. > This is because of a feature?/bug? of ipython related to the PYTHONSTARTUP > environment variable. the pyspark script sets this variable to point to the > python/shell.py script, which initializes the Spark Context. In regular > python, the PYTHONSTARTUP script runs ONLY if python is invoked in > interactive mode; if run with a script, it ignores the variable. iPython > runs that script every time, regardless. Which means it will always execute > Spark's shell.py script to initialize the spark context even when it was > invoked with a script. > Proposed solution: > short term: add this information to the Spark docs regarding iPython. > Something like "Note, iPython can only be used interactively. Use regular > Python to execute pyspark script files." > long term: change the pyspark script to tell if arguments are passed in; if > so, just call python instead of pyspark, or don't set the PYTHONSTARTUP > variable? Or maybe fix shell.py to detect if it's being invoked in > non-interactively and not initialize sc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1333) Java API for running SQL queries
[ https://issues.apache.org/jira/browse/SPARK-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1333. -- Resolution: Fixed > Java API for running SQL queries > > > Key: SPARK-1333 > URL: https://issues.apache.org/jira/browse/SPARK-1333 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1162) Add top() and takeOrdered() to PySpark
[ https://issues.apache.org/jira/browse/SPARK-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1162: - Assignee: Prashant Sharma (was: prashant) > Add top() and takeOrdered() to PySpark > -- > > Key: SPARK-1162 > URL: https://issues.apache.org/jira/browse/SPARK-1162 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Matei Zaharia >Assignee: Prashant Sharma >Priority: Minor > Fix For: 1.0.0, 0.9.2 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1162) Add top() and takeOrdered() to PySpark
[ https://issues.apache.org/jira/browse/SPARK-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1162: - Fix Version/s: 0.9.2 1.0.0 > Add top() and takeOrdered() to PySpark > -- > > Key: SPARK-1162 > URL: https://issues.apache.org/jira/browse/SPARK-1162 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Matei Zaharia >Assignee: Prashant Sharma >Priority: Minor > Fix For: 1.0.0, 0.9.2 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1162) Add top() and takeOrdered() to PySpark
[ https://issues.apache.org/jira/browse/SPARK-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1162. -- Resolution: Fixed > Add top() and takeOrdered() to PySpark > -- > > Key: SPARK-1162 > URL: https://issues.apache.org/jira/browse/SPARK-1162 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Matei Zaharia >Assignee: Prashant Sharma >Priority: Minor > Fix For: 1.0.0, 0.9.2 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
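For reference, the methods added here mirror the existing Scala RDD API; a minimal local-mode sketch of their semantics (values are illustrative).

{code}
import org.apache.spark.SparkContext

object TopTakeOrderedExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "TopTakeOrderedExample")
    val rdd = sc.parallelize(Seq(10, 4, 2, 12, 3))

    // Largest n elements by the implicit ordering, returned in descending order.
    println(rdd.top(3).mkString(", "))          // 12, 10, 4

    // Smallest n elements, returned in ascending order.
    println(rdd.takeOrdered(3).mkString(", "))  // 2, 3, 4

    sc.stop()
  }
}
{code}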
[jira] [Commented] (SPARK-1410) Class not found exception with application launched from sbt 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959363#comment-13959363 ] Xiangrui Meng commented on SPARK-1410: -- The code is available at https://github.com/mengxr/mllib-grid-search If sbt.version is changed to 0.13.0 or 0.13.1, `sbt/sbt "run ..."` throws the following exception: ~~~ [error] (run-main-0) org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: sbt.classpath.ClasspathFilter@45307fb org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: sbt.classpath.ClasspathFilter@45307fb at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1010) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1008) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1008) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:595) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:595) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:595) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:146) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ~~~ > Class not found exception with application launched from sbt 0.13.x > --- > > Key: SPARK-1410 > URL: https://issues.apache.org/jira/browse/SPARK-1410 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.0.0 >Reporter: Xiangrui Meng > > sbt 0.13.x use its own loader but this is not available at worker side: > org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: > sbt.classpath.ClasspathFilter@47ed40d > A workaround is to switch to sbt 0.12.4. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1367) NPE when joining Parquet Relations
[ https://issues.apache.org/jira/browse/SPARK-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1367. - Resolution: Fixed > NPE when joining Parquet Relations > -- > > Key: SPARK-1367 > URL: https://issues.apache.org/jira/browse/SPARK-1367 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Andre Schumacher >Priority: Blocker > Fix For: 1.0.0 > > > {code} > test("self-join parquet files") { > val x = ParquetTestData.testData.subquery('x) > val y = ParquetTestData.testData.newInstance.subquery('y) > val query = x.join(y).where("x.myint".attr === "y.myint".attr) > query.collect() > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1360) Add Timestamp Support
[ https://issues.apache.org/jira/browse/SPARK-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1360. - Resolution: Fixed > Add Timestamp Support > - > > Key: SPARK-1360 > URL: https://issues.apache.org/jira/browse/SPARK-1360 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.0.0 >Reporter: Cheng Hao >Assignee: Cheng Hao >Priority: Blocker > Fix For: 1.0.0 > > > Add Timestamp Support for Catalyst/SQLParser/HiveQl -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1410) Class not found exception with application launched from sbt 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959265#comment-13959265 ] Michael Armbrust commented on SPARK-1410: - I don't understand how the classloader is ending up being required on the worker side. Can you share a command to reproduce this? > Class not found exception with application launched from sbt 0.13.x > --- > > Key: SPARK-1410 > URL: https://issues.apache.org/jira/browse/SPARK-1410 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.0.0 >Reporter: Xiangrui Meng > > sbt 0.13.x use its own loader but this is not available at worker side: > org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: > sbt.classpath.ClasspathFilter@47ed40d > A workaround is to switch to sbt 0.12.4. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1398) Remove FindBugs jsr305 dependency
[ https://issues.apache.org/jira/browse/SPARK-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1398. Resolution: Fixed > Remove FindBugs jsr305 dependency > - > > Key: SPARK-1398 > URL: https://issues.apache.org/jira/browse/SPARK-1398 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Mark Hamstra >Assignee: Mark Hamstra >Priority: Minor > > We're not making much use of FindBugs at this point, but findbugs-2.0.x is a > drop-in replacement for 1.3.9 and does offer significant improvements > (http://findbugs.sourceforge.net/findbugs2.html), so it's probably where we > want to be for Spark 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1410) Class not found exception with application launched from sbt 0.13.x
Xiangrui Meng created SPARK-1410: Summary: Class not found exception with application launched from sbt 0.13.x Key: SPARK-1410 URL: https://issues.apache.org/jira/browse/SPARK-1410 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Xiangrui Meng sbt 0.13.x uses its own class loader, but this is not available on the worker side: org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: sbt.classpath.ClasspathFilter@47ed40d A workaround is to switch to sbt 0.12.4. -- This message was sent by Atlassian JIRA (v6.2#6252)
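For reference, the workaround described above amounts to a one-line build change. A minimal sketch, assuming a standard sbt project layout; the {{fork}} setting is an assumption about how to avoid sbt's ClasspathFilter loader and is not taken from this ticket:
{code}
// project/build.properties -- pin the launcher back to the 0.12 line,
// per the workaround in this ticket:
//   sbt.version=0.12.4

// build.sbt -- alternatively (an assumption, not from this ticket), run the
// application in a forked JVM so the driver does not inherit sbt's
// sbt.classpath.ClasspathFilter class loader:
fork in run := true
{code}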
[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959127#comment-13959127 ] Min Zhou commented on SPARK-1391: - Yes. The communication layer uses a ByteBuffer array to transfer messages, but the receiver converts them back to BlockMessages, where each block corresponds to one ByteBuffer, which can't be larger than 2GB. Those BlockMessages are consumed by connection callers in places we can't control. One approach is to write a CompositeByteBuffer to overcome the 2GB limitation, but that still can't get around other limitations of the ByteBuffer interface, like ByteBuffer.position(), ByteBuffer.capacity(), and ByteBuffer.remaining(), whose return values are still integers. > BlockManager cannot transfer blocks larger than 2G in size > -- > > Key: SPARK-1391 > URL: https://issues.apache.org/jira/browse/SPARK-1391 > Project: Spark > Issue Type: Bug > Components: Block Manager, Shuffle >Affects Versions: 1.0.0 >Reporter: Shivaram Venkataraman >Assignee: Min Zhou > > If a task tries to remotely access a cached RDD block, I get an exception > when the block size is > 2G. The exception is pasted below. > Memory capacities are huge these days (> 60G), and many workflows depend on > having large blocks in memory, so it would be good to fix this bug. > I don't know if the same thing happens on shuffles if one transfer (from > mapper to reducer) is > 2G. > {noformat} > 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer > message > java.lang.ArrayIndexOutOfBoundsException > at > it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96) > at > it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134) > at > it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164) > at > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) > at > java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38) > at > org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93) > at > org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26) > at > org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913) > at > org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922) > at > org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102) > at > org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348) > at > org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323) > at > org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90) > at > org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at > org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28) > at > org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) > at > org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661) > at > org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503) > at > java.util.concurrent.ThreadPool
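To make the CompositeByteBuffer idea above concrete, here is a rough, hypothetical sketch (not code from the Spark tree; the name {{LargeByteBuffer}} and its methods are invented for illustration) of a Long-addressable view over several ByteBuffers, each kept below the 2GB limit:
{code}
import java.nio.ByteBuffer

// Hypothetical wrapper: exposes Long-based position/capacity/remaining over
// an array of ByteBuffers so that no single chunk exceeds Int.MaxValue bytes.
// Bounds checking is omitted for brevity.
class LargeByteBuffer(chunks: Array[ByteBuffer]) {
  private val chunkSizes: Array[Long] = chunks.map(_.remaining().toLong)
  val capacity: Long = chunkSizes.sum
  private var pos: Long = 0L

  def position: Long = pos
  def remaining: Long = capacity - pos

  // Read one byte at the current logical position.
  def get(): Byte = {
    var i = 0
    var offset = pos
    while (offset >= chunkSizes(i)) { offset -= chunkSizes(i); i += 1 }
    pos += 1
    chunks(i).get(offset.toInt)   // absolute get; does not move the chunk's own position
  }

  // The underlying <2GB chunks can still be handed to the existing
  // ByteBuffer-based transfer code one at a time.
  def underlying: Array[ByteBuffer] = chunks
}
{code}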
[jira] [Created] (SPARK-1409) Flaky Test: "actor input stream" test in org.apache.spark.streaming.InputStreamsSuite
Michael Armbrust created SPARK-1409: --- Summary: Flaky Test: "actor input stream" test in org.apache.spark.streaming.InputStreamsSuite Key: SPARK-1409 URL: https://issues.apache.org/jira/browse/SPARK-1409 Project: Spark Issue Type: Bug Components: Streaming Reporter: Michael Armbrust Assignee: Tathagata Das Here are just a few cases: https://travis-ci.org/apache/spark/jobs/22151827 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13709/ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1403) java.lang.ClassNotFoundException - spark on mesos
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959088#comment-13959088 ] Bharath Bhushan commented on SPARK-1403: In 0.9.0 I see that the classLoader is org.apache.spark.repl.ExecutorClassLoader (in SparkEnv.scala). In 1.0.0 I see that the classLoader is null. My guess is that this should not be null. > java.lang.ClassNotFoundException - spark on mesos > -- > > Key: SPARK-1403 > URL: https://issues.apache.org/jira/browse/SPARK-1403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: ubuntu 12.04 on vagrant >Reporter: Bharath Bhushan > > I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark > executor on mesos slave throws a java.lang.ClassNotFoundException for > org.apache.spark.serializer.JavaSerializer. > The lengthy discussion is here: > http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959067#comment-13959067 ] Reynold Xin commented on SPARK-1391: I took a quick look into this. We are using a bunch of ByteBuffers throughout the block manager and the communication layer. We need to replace that ByteBuffer with a different interface that can handle larger arrays. It is fortunate that the underlying communication in Connection.scala actually breaks messages down into smaller chunks, so that's one less place to change. > BlockManager cannot transfer blocks larger than 2G in size > -- > > Key: SPARK-1391 > URL: https://issues.apache.org/jira/browse/SPARK-1391 > Project: Spark > Issue Type: Bug > Components: Block Manager, Shuffle >Affects Versions: 1.0.0 >Reporter: Shivaram Venkataraman >Assignee: Min Zhou > > If a task tries to remotely access a cached RDD block, I get an exception > when the block size is > 2G. The exception is pasted below. > Memory capacities are huge these days (> 60G), and many workflows depend on > having large blocks in memory, so it would be good to fix this bug. > I don't know if the same thing happens on shuffles if one transfer (from > mapper to reducer) is > 2G. > {noformat} > 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer > message > java.lang.ArrayIndexOutOfBoundsException > at > it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96) > at > it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134) > at > it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164) > at > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) > at > java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38) > at > org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93) > at > org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26) > at > org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913) > at > org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922) > at > org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102) > at > org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348) > at > org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323) > at > org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90) > at > org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > 
org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at > org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28) > at > org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) > at > org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) > at > org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661) > at > org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread
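As a complementary sketch of the chunking that Connection.scala is described as doing above (this assumes the payload already fits in a single ByteBuffer, so it does not by itself solve the >2GB case; it only illustrates the mechanism):
{code}
import java.nio.ByteBuffer
import scala.collection.mutable.ArrayBuffer

// Split one buffer into slices no larger than maxChunk bytes.
// Each slice shares the underlying storage, so no data is copied.
def sliceIntoChunks(buf: ByteBuffer, maxChunk: Int = 64 * 1024 * 1024): Seq[ByteBuffer] = {
  val cursor = buf.duplicate()          // independent position/limit over the same bytes
  val chunks = ArrayBuffer[ByteBuffer]()
  while (cursor.hasRemaining) {
    val size = math.min(maxChunk, cursor.remaining())
    val slice = cursor.slice()          // view starting at the cursor's position
    slice.limit(size)                   // cap the view at `size` bytes
    chunks += slice
    cursor.position(cursor.position() + size)
  }
  chunks
}
{code}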
[jira] [Commented] (SPARK-1398) Remove FindBugs jsr305 dependency
[ https://issues.apache.org/jira/browse/SPARK-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959064#comment-13959064 ] Mark Hamstra commented on SPARK-1398: - UPDATE: We can actually get away with not including the jsr305 jar at all, so I've changed this JIRA and associated PR to do just that instead of bumping the FindBugs version. > Remove FindBugs jsr305 dependency > - > > Key: SPARK-1398 > URL: https://issues.apache.org/jira/browse/SPARK-1398 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Mark Hamstra >Assignee: Mark Hamstra >Priority: Minor > > We're not making much use of FindBugs at this point, but findbugs-2.0.x is a > drop-in replacement for 1.3.9 and does offer significant improvements > (http://findbugs.sourceforge.net/findbugs2.html), so it's probably where we > want to be for Spark 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
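For a downstream build that wants the same effect, keeping the FindBugs jsr305 annotations jar off the classpath, a hedged sbt sketch might look like this (the coordinates below are an assumption about a consumer's build, not the change made in the associated PR):
{code}
// build.sbt -- exclude the transitively pulled-in jsr305 annotations jar.
libraryDependencies += ("org.apache.spark" %% "spark-core" % "1.0.0")
  .exclude("com.google.code.findbugs", "jsr305")
{code}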
[jira] [Updated] (SPARK-1398) Remove FindBugs jsr305 dependency
[ https://issues.apache.org/jira/browse/SPARK-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamstra updated SPARK-1398: Summary: Remove FindBugs jsr305 dependency (was: FindBugs 1 --> FindBugs 2 ) > Remove FindBugs jsr305 dependency > - > > Key: SPARK-1398 > URL: https://issues.apache.org/jira/browse/SPARK-1398 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Mark Hamstra >Assignee: Mark Hamstra >Priority: Minor > > We're not making much use of FindBugs at this point, but findbugs-2.0.x is a > drop-in replacement for 1.3.9 and does offer significant improvements > (http://findbugs.sourceforge.net/findbugs2.html), so it's probably where we > want to be for Spark 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1391: --- Description: If a task tries to remotely access a cached RDD block, I get an exception when the block size is > 2G. The exception is pasted below. Memory capacities are huge these days (> 60G), and many workflows depend on having large blocks in memory, so it would be good to fix this bug. I don't know if the same thing happens on shuffles if one transfer (from mapper to reducer) is > 2G. {code} 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer message java.lang.ArrayIndexOutOfBoundsException at it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38) at org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93) at org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26) at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913) at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922) at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348) at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323) at org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90) at org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28) at org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661) at org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} was: If a task tries to remotely access a cached RDD block, I get an exception when the block size is > 2G. The exception is pasted below. Memory capacities are huge these days (> 60G), and many workflows depend on having large blocks in memory, so it would be good to fix this bug. I don't know if the same thing happens on shuffles if one transfer (from mapper to reducer) is > 2G. 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer message java.lang.ArrayIndexOutOfBoundsException at it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.set
[jira] [Updated] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1391: --- Description: If a task tries to remotely access a cached RDD block, I get an exception when the block size is > 2G. The exception is pasted below. Memory capacities are huge these days (> 60G), and many workflows depend on having large blocks in memory, so it would be good to fix this bug. I don't know if the same thing happens on shuffles if one transfer (from mapper to reducer) is > 2G. {noformat} 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer message java.lang.ArrayIndexOutOfBoundsException at it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38) at org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93) at org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26) at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913) at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922) at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348) at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323) at org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90) at org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28) at org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661) at org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {noformat} was: If a task tries to remotely access a cached RDD block, I get an exception when the block size is > 2G. The exception is pasted below. Memory capacities are huge these days (> 60G), and many workflows depend on having large blocks in memory, so it would be good to fix this bug. I don't know if the same thing happens on shuffles if one transfer (from mapper to reducer) is > 2G. {code} 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer message java.lang.ArrayIndexOutOfBoundsException at it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataO
[jira] [Updated] (SPARK-1402) 3 more compression algorithms for in-memory columnar storage
[ https://issues.apache.org/jira/browse/SPARK-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1402: Affects Version/s: (was: 1.0.0) > 3 more compression algorithms for in-memory columnar storage > > > Key: SPARK-1402 > URL: https://issues.apache.org/jira/browse/SPARK-1402 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > Labels: compression > > This is a followup of [SPARK-1373: Compression for In-Memory Columnar > storage|https://issues.apache.org/jira/browse/SPARK-1373]. > 3 more compression algorithms for in-memory columnar storage should be > implemented: > * {{BooleanBitSet}} > * {{IntDelta}} > * {{LongDelta}} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1402) 3 more compression algorithms for in-memory columnar storage
[ https://issues.apache.org/jira/browse/SPARK-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1402: Fix Version/s: 1.0.0 > 3 more compression algorithms for in-memory columnar storage > > > Key: SPARK-1402 > URL: https://issues.apache.org/jira/browse/SPARK-1402 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > Labels: compression > Fix For: 1.0.0 > > > This is a followup of [SPARK-1373: Compression for In-Memory Columnar > storage|https://issues.apache.org/jira/browse/SPARK-1373]. > 3 more compression algorithms for in-memory columnar storage should be > implemented: > * {{BooleanBitSet}} > * {{IntDelta}} > * {{LongDelta}} -- This message was sent by Atlassian JIRA (v6.2#6252)
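To illustrate the kind of encoding {{IntDelta}} and {{LongDelta}} refer to, here is a simplified sketch (not the columnar-storage code itself, and it ignores the fallback to uncompressed runs that a real scheme needs): consecutive column values are stored as differences from their predecessor, which keeps most stored values small.
{code}
// Delta-encode a column of Ints: keep the first value, then store differences.
def deltaEncode(values: Array[Int]): Array[Int] = {
  if (values.isEmpty) return values
  val out = new Array[Int](values.length)
  out(0) = values(0)
  var i = 1
  while (i < values.length) {
    out(i) = values(i) - values(i - 1)   // small numbers for mostly-ordered data
    i += 1
  }
  out
}

// Inverse transform: a running sum restores the original column.
def deltaDecode(deltas: Array[Int]): Array[Int] = {
  if (deltas.isEmpty) return deltas
  val out = new Array[Int](deltas.length)
  out(0) = deltas(0)
  var i = 1
  while (i < deltas.length) {
    out(i) = out(i - 1) + deltas(i)
    i += 1
  }
  out
}

// deltaDecode(deltaEncode(Array(100, 101, 103, 104))) == Array(100, 101, 103, 104)
{code}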
[jira] [Created] (SPARK-1408) Modify Spark on Yarn to point to the history server when app finishes
Thomas Graves created SPARK-1408: Summary: Modify Spark on Yarn to point to the history server when app finishes Key: SPARK-1408 URL: https://issues.apache.org/jira/browse/SPARK-1408 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.0 Reporter: Thomas Graves Once the spark history server is implemented in SPARK-1276, we should modify spark on yarn to point to the history server url from the Yarn ResourceManager UI when the application finishes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-1408) Modify Spark on Yarn to point to the history server when app finishes
[ https://issues.apache.org/jira/browse/SPARK-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-1408: Assignee: Thomas Graves > Modify Spark on Yarn to point to the history server when app finishes > - > > Key: SPARK-1408 > URL: https://issues.apache.org/jira/browse/SPARK-1408 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > > Once the spark history server is implemented in SPARK-1276, we should modify > spark on yarn to point to the history server url from the Yarn > ResourceManager UI when the application finishes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1407) EventLogging to HDFS doesn't work properly on yarn
Thomas Graves created SPARK-1407: Summary: EventLogging to HDFS doesn't work properly on yarn Key: SPARK-1407 URL: https://issues.apache.org/jira/browse/SPARK-1407 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Thomas Graves Priority: Blocker When running on spark on yarn and accessing an HDFS file (like in the SparkHdfsLR example) while using the event logging configured to write logs to HDFS, it throws an exception at the end of the application. SPARK_JAVA_OPTS=-Dspark.eventLog.enabled=true -Dspark.eventLog.dir=hdfs:///history/spark/ 14/04/03 13:41:31 INFO yarn.ApplicationMaster$$anon$1: Invoking sc stop from shutdown hook Exception in thread "Thread-41" java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:398) at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1465) at org.apache.hadoop.hdfs.DFSOutputStream.sync(DFSOutputStream.java:1450) at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:116) at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:137) at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:137) at scala.Option.foreach(Option.scala:236) at org.apache.spark.util.FileLogger.flush(FileLogger.scala:137) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:69) at org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:101) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$13.apply(SparkListenerBus.scala:67) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$13.apply(SparkListenerBus.scala:67) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:67) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:31) at org.apache.spark.scheduler.LiveListenerBus.post(LiveListenerBus.scala:78) at org.apache.spark.SparkContext.postApplicationEnd(SparkContext.scala:1081) at org.apache.spark.SparkContext.stop(SparkContext.scala:828) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:460) -- This message was sent by Atlassian JIRA (v6.2#6252)
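For context, the event-logging settings quoted above can also be supplied through SparkConf rather than SPARK_JAVA_OPTS. A minimal sketch, assuming the hdfs:///history/spark/ directory from this report exists and is writable:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Same settings as the SPARK_JAVA_OPTS shown in this report.
val conf = new SparkConf()
  .setAppName("SparkHdfsLR")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///history/spark/")

val sc = new SparkContext(conf)
// ... run the job ...
sc.stop()   // the exception above is thrown from this shutdown path on yarn
{code}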
[jira] [Comment Edited] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958776#comment-13958776 ] Thomas Darimont edited comment on SPARK-1406 at 4/3/14 12:59 PM: - Hi Sean, thanks for responding so quickly :) Sure you can do that of course (thats what I currently do), but there IMHO many interesting use cases that would benefit from having direct PMML support, e.g.: 1) Initialize an algorithm with a set of prepared parameters by loading a PMML file and evaluate the algorthm with spark's infrastructure. 2) Abstract the configuration or construction of an Algorithm via some kind of Producer that gets the PMML model as an Input and returns a fully configured Spark representation of the algorithm which is encoded in the PMML. 3) Support hot-replacing an algorithm (configuration) at runtime by just providing an updated PMML model to the spark infrastructure. 4) Use the transformation / normalization for input values or make use of the dynamic model selection support build into PMML to select the appropriate algorithm (configuration) based on the input. You could even use JPMML to get the PMML object model as a starting point. was (Author: thomasd): Hi Sean, thanks for responding so quickly :) Sure you can do that of course (thats what I currently do), but there IMHO many interesting use cases that would benefit from having direct PMML support, e.g.: 1) Initialize an algorithm with a set of prepared parameters by loading a PMML file and evaluate the algorthm with spark's infrastructure. 2) Abstract the configuration or construction of an Algorithm via some kind of Producer that gets the PMML model as an Input and returns a fully configured Spark representation of the algorithm which is encoded in the PMML. 3) Support hot-replacing an algorithm (configuration) at runtime by just providing an updated PMML model to the spark infrastructure. 4) Use the transformation / normalization or even dynamic model selection support build into PMML to select the appropriate algorithm (configuration) based on the input. You could even use JPMML to get the PMML object model as a starting point. > PMML model evaluation support via MLib > -- > > Key: SPARK-1406 > URL: https://issues.apache.org/jira/browse/SPARK-1406 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Thomas Darimont > > It would be useful if spark would provide support the evaluation of PMML > models (http://www.dmg.org/v4-2/GeneralStructure.html). > This would allow to use analytical models that were created with a > statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which > would perform the actual model evaluation for a given input tuple. The PMML > model would then just contain the "parameterization" of an analytical model. > Other projects like JPMML-Evaluator do a similar thing. > https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958776#comment-13958776 ] Thomas Darimont edited comment on SPARK-1406 at 4/3/14 12:49 PM: - Hi Sean, thanks for responding so quickly :) Sure you can do that of course (thats what I currently do), but there IMHO many interesting use cases that would benefit from having direct PMML support, e.g.: 1) Initialize an algorithm with a set of prepared parameters by loading a PMML file and evaluate the algorthm with spark's infrastructure. 2) Abstract the configuration or construction of an Algorithm via some kind of Producer that gets the PMML model as an Input and returns a fully configured Spark representation of the algorithm which is encoded in the PMML. 3) Support hot-replacing an algorithm (configuration) at runtime by just providing an updated PMML model to the spark infrastructure. 4) Use the transformation / normalization or even dynamic model selection support build into PMML to select the appropriate algorithm (configuration) based on the input. You could even use JPMML to get the PMML object model as a starting point. was (Author: thomasd): Hi Sean, thanks for responding so quickly :) Sure you can do that of course (thats what I currently do), but there IMHO many interesting use cases that would benefit from having direct PMML support, e.g.: 1) Initialize an algorithm with a set of prepared parameters by loading a PMML file and evaluate the algorthm with spark's infrastructure. 2) Abstract the configuration or construction of an Algorithm via some kind of Producer that gets the PMML model as an Input and returns a fully configured Spark representation of the algorithm which is encoded in the PMML. 3) Support hot-replacing an algorithm (configuration) at runtime by just providing an updated PMML model to the spark infrastructure. You could even use JPMML to get the PMML object model as a starting point. > PMML model evaluation support via MLib > -- > > Key: SPARK-1406 > URL: https://issues.apache.org/jira/browse/SPARK-1406 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Thomas Darimont > > It would be useful if spark would provide support the evaluation of PMML > models (http://www.dmg.org/v4-2/GeneralStructure.html). > This would allow to use analytical models that were created with a > statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which > would perform the actual model evaluation for a given input tuple. The PMML > model would then just contain the "parameterization" of an analytical model. > Other projects like JPMML-Evaluator do a similar thing. > https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958776#comment-13958776 ] Thomas Darimont commented on SPARK-1406: Hi Sean, thanks for responding so quickly :) Sure you can do that of course (that's what I currently do), but there are IMHO many interesting use cases that would benefit from having direct PMML support, e.g.: 1) Initialize an algorithm with a set of prepared parameters by loading a PMML file and evaluate the algorithm with spark's infrastructure. 2) Abstract the configuration or construction of an Algorithm via some kind of Producer that gets the PMML model as an Input and returns a fully configured Spark representation of the algorithm which is encoded in the PMML. 3) Support hot-replacing an algorithm (configuration) at runtime by just providing an updated PMML model to the spark infrastructure. You could even use JPMML to get the PMML object model as a starting point. > PMML model evaluation support via MLib > -- > > Key: SPARK-1406 > URL: https://issues.apache.org/jira/browse/SPARK-1406 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Thomas Darimont > > It would be useful if spark would provide support the evaluation of PMML > models (http://www.dmg.org/v4-2/GeneralStructure.html). > This would allow to use analytical models that were created with a > statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which > would perform the actual model evaluation for a given input tuple. The PMML > model would then just contain the "parameterization" of an analytical model. > Other projects like JPMML-Evaluator do a similar thing. > https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958742#comment-13958742 ] Sean Owen commented on SPARK-1406: -- Yes, I was going to say, there's already a pretty good implementation of this. Why not simply call pmml-evaluator from within a Spark-based app to do scoring? Does it really need any particular support in MLlib? MLlib does not create PMML right now, which would seem like something to tackle before scoring them anyway. I have a meta-concern about piling on scope at such an early stage. > PMML model evaluation support via MLib > -- > > Key: SPARK-1406 > URL: https://issues.apache.org/jira/browse/SPARK-1406 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Thomas Darimont > > It would be useful if spark would provide support the evaluation of PMML > models (http://www.dmg.org/v4-2/GeneralStructure.html). > This would allow to use analytical models that were created with a > statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which > would perform the actual model evaluation for a given input tuple. The PMML > model would then just contain the "parameterization" of an analytical model. > Other projects like JPMML-Evaluator do a similar thing. > https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.2#6252)
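A rough sketch of what Sean suggests, calling an external PMML evaluator from inside a Spark application: the {{PmmlScorer}} trait and {{loadScorer}} helper below are hypothetical stand-ins for a real library such as JPMML-Evaluator; only the broadcast-and-map pattern is the point.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical evaluator facade; in practice this would wrap JPMML-Evaluator
// (or similar) behind a small serializable interface.
trait PmmlScorer extends Serializable {
  def score(features: Map[String, Double]): Double
}

// Hypothetical loader: parse a PMML document into a scorer.
def loadScorer(pmmlXml: String): PmmlScorer = ???

def scoreAll(sc: SparkContext,
             pmmlXml: String,
             rows: RDD[Map[String, Double]]): RDD[Double] = {
  // Ship the parsed model to each executor once instead of once per record.
  val scorer = sc.broadcast(loadScorer(pmmlXml))
  rows.map(row => scorer.value.score(row))
}
{code}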
[jira] [Updated] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Darimont updated SPARK-1406: --- Summary: PMML model evaluation support via MLib (was: PMML Support) > PMML model evaluation support via MLib > -- > > Key: SPARK-1406 > URL: https://issues.apache.org/jira/browse/SPARK-1406 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Thomas Darimont > > It would be useful if spark would provide support the evaluation of PMML > models (http://www.dmg.org/v4-2/GeneralStructure.html). > This would allow to use analytical models that were created with a > statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which > would perform the actual model evaluation for a given input tuple. The PMML > model would then just contain the "parameterization" of an analytical model. > Other projects like JPMML-Evaluator do a similar thing. > https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1406) PMML Support
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Darimont updated SPARK-1406: --- Description: It would be useful if spark would provide support the evaluation of PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow to use analytical models that were created with a statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which would perform the actual model evaluation for a given input tuple. The PMML model would then just contain the "parameterization" of an analytical model. Other projects like JPMML-Evaluator do a similar thing. https://github.com/jpmml/jpmml/tree/master/pmml-evaluator was: It would be useful if spark would support the evaluation of PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow to use an analytical model that was created via an statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which could perform the actual model evaluation for a given input tuple. Other projects like JPMML-Evaluator do a similar thing. https://github.com/jpmml/jpmml/tree/master/pmml-evaluator > PMML Support > > > Key: SPARK-1406 > URL: https://issues.apache.org/jira/browse/SPARK-1406 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Thomas Darimont > > It would be useful if spark would provide support the evaluation of PMML > models (http://www.dmg.org/v4-2/GeneralStructure.html). > This would allow to use analytical models that were created with a > statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which > would perform the actual model evaluation for a given input tuple. The PMML > model would then just contain the "parameterization" of an analytical model. > Other projects like JPMML-Evaluator do a similar thing. > https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1406) PMML Support
Thomas Darimont created SPARK-1406: -- Summary: PMML Support Key: SPARK-1406 URL: https://issues.apache.org/jira/browse/SPARK-1406 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Thomas Darimont It would be useful if spark would support the evaluation of PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow to use an analytical model that was created via an statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which could perform the actual model evaluation for a given input tuple. Other projects like JPMML-Evaluator do a similar thing. https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1097) ConcurrentModificationException
[ https://issues.apache.org/jira/browse/SPARK-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958671#comment-13958671 ] Tsuyoshi OZAWA commented on SPARK-1097: --- A patch by Nishkam on HADOOP-10456 has been already reviewed and will be committed in a few days against hadoop's trunk. > ConcurrentModificationException > --- > > Key: SPARK-1097 > URL: https://issues.apache.org/jira/browse/SPARK-1097 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.0 >Reporter: Fabrizio Milo > Attachments: nravi_Conf_Spark-1388.patch > > > {noformat} > 14/02/16 08:18:45 WARN TaskSetManager: Loss was due to > java.util.ConcurrentModificationException > java.util.ConcurrentModificationException > at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926) > at java.util.HashMap$KeyIterator.next(HashMap.java:960) > at java.util.AbstractCollection.addAll(AbstractCollection.java:341) > at java.util.HashSet.(HashSet.java:117) > at org.apache.hadoop.conf.Configuration.(Configuration.java:554) > at org.apache.hadoop.mapred.JobConf.(JobConf.java:439) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:154) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) > at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:32) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:72) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) > at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) > at org.apache.spark.scheduler.Task.run(Task.scala:53) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
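The trace above comes from cloning a Hadoop Configuration (whose copy constructor iterates over an internal HashMap) while another thread mutates the shared instance. A minimal sketch of one mitigation, serializing access while deriving per-task JobConfs, is shown below; this is only an illustration of the general idea, not the HADOOP-10456 patch or the change Spark ultimately made.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

object ConfCloning {
  // Process-wide lock guarding the shared, mutable base Configuration.
  private val confLock = new Object

  // Derive a task-local JobConf without racing concurrent writers of `base`.
  def newJobConf(base: Configuration): JobConf = confLock.synchronized {
    new JobConf(base)   // the copy constructor walks base's internal HashMap
  }
}
{code}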
[jira] [Resolved] (SPARK-1212) Support sparse data in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1212. Resolution: Fixed > Support sparse data in MLlib > > > Key: SPARK-1212 > URL: https://issues.apache.org/jira/browse/SPARK-1212 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 0.9.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Blocker > Fix For: 1.0.0 > > > MLlib's NaiveBayes, SGD, and KMeans accept RDD[LabeledPoint] for training and > RDD[Array[Double]] for prediction, where LabeledPoint is a wrapper of > (Double, Array[Double]). Using Array[Double] could have good performance, but > sparse data appears quite often in practice. So I created this JIRA to > discuss the plan of adding sparse data support to MLlib and track its > progress. > The goal is to support sparse data for training and prediction in all > existing algorithms in MLlib: > * Gradient Descent > * K-Means > * Naive Bayes > Previous discussions and pull requests: > * https://github.com/mesos/spark/pull/736 -- This message was sent by Atlassian JIRA (v6.2#6252)
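As a small usage sketch of the sparse support tracked here (assuming the {{org.apache.spark.mllib.linalg.Vectors}} factory that ships in 1.0.0; the sizes, indices, and values are made up):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A 10-dimensional feature vector with only three non-zero entries.
val sparseFeatures = Vectors.sparse(10, Array(0, 3, 7), Array(1.0, 2.5, -0.5))

// A labeled training example backed by the sparse vector.
val example = LabeledPoint(1.0, sparseFeatures)

// Dense and sparse vectors share the same Vector type, so they can be
// mixed freely in a training RDD.
val denseFeatures = Vectors.dense(0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0)
{code}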