[jira] [Created] (SPARK-2613) CLONE - word2vec: Distributed Representation of Words
Yifan Yang created SPARK-2613: - Summary: CLONE - word2vec: Distributed Representation of Words Key: SPARK-2613 URL: https://issues.apache.org/jira/browse/SPARK-2613 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yifan Yang Assignee: Liquan Pei We would like to add a parallel implementation of word2vec to MLlib. word2vec learns distributed representations of words by training on large data sets. The Spark programming model fits word2vec nicely, as its training algorithm is embarrassingly parallel. We will focus on the skip-gram model with negative sampling in our initial implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
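As background, the skip-gram model with negative sampling (Mikolov et al., 2013) maximizes, for each center word c and observed context word o, an objective of roughly the following form, where v and v' are the input and output embeddings, sigma is the logistic function, and the k negative samples w_i are drawn from a noise distribution P_n(w):
{code}
\log \sigma(v'_o \cdot v_c) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-v'_{w_i} \cdot v_c) \right]
{code}
Each (center, context) pair contributes an independent gradient update, which is what makes the training embarrassingly parallel across partitions of the corpus.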
[jira] [Commented] (SPARK-2421) Spark should treat writable as serializable for keys
[ https://issues.apache.org/jira/browse/SPARK-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069880#comment-14069880 ] Sandy Ryza commented on SPARK-2421: --- It should be relatively straightforward to add a WritableSerializer. One issue is that Spark doesn't pass the types in the conf the way MR does, so on the read side we need a way to know what kind of objects to instantiate. I'm messing around with a prototype that just writes out the class name as Text at the beginning of the stream. Spark should treat writable as serializable for keys Key: SPARK-2421 URL: https://issues.apache.org/jira/browse/SPARK-2421 Project: Spark Issue Type: Improvement Components: Input/Output, Java API Affects Versions: 1.0.0 Reporter: Xuefu Zhang It seems that Spark requires the key to be serializable (the class must implement the Serializable interface). In the Hadoop world, the Writable interface is used for the same purpose. A lot of existing classes, while writable, are not considered by Spark as serializable. It would be nice if Spark could treat Writable as serializable and automatically serialize and de-serialize these classes using the Writable interface. This was identified in HIVE-7279, but its benefits are global. -- This message was sent by Atlassian JIRA (v6.2#6252)
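A minimal sketch of the stream layout Sandy describes (illustrative only, not the actual prototype): the Writable's class name is written as Text at the head of the stream so the read side knows what to instantiate, and the Writable then serializes itself via write()/readFields().
{code}
import java.io.{DataInputStream, DataOutputStream, InputStream, OutputStream}
import org.apache.hadoop.io.{Text, Writable}

def writeWritable(w: Writable, out: OutputStream): Unit = {
  val dout = new DataOutputStream(out)
  new Text(w.getClass.getName).write(dout) // class name first, so the reader knows what to instantiate
  w.write(dout)                            // then the Writable's own serialization
  dout.flush()
}

def readWritable(in: InputStream): Writable = {
  val din = new DataInputStream(in)
  val className = new Text()
  className.readFields(din)
  val w = Class.forName(className.toString).newInstance().asInstanceOf[Writable]
  w.readFields(din)
  w
}
{code}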
[jira] [Commented] (SPARK-2612) ALS has data skew for popular product
[ https://issues.apache.org/jira/browse/SPARK-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069883#comment-14069883 ] Apache Spark commented on SPARK-2612: - User 'renozhang' has created a pull request for this issue: https://github.com/apache/spark/pull/1521 ALS has data skew for popular product - Key: SPARK-2612 URL: https://issues.apache.org/jira/browse/SPARK-2612 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Peng Zhang Usually there are some popular products that are related to many users in the Rating inputs. groupByKey() in updateFeatures() may cause one extra shuffle stage to gather the data of a popular product into one task, because its RDD's partitioner may not be used as the join() partitioner. The following join() then needs to shuffle the aggregated product data. That shuffle block can easily be bigger than 2G, and the shuffle fails as mentioned in SPARK-1476; increasing the number of blocks doesn't help. IMHO, groupByKey() should use the same partitioner as the other RDD in the join(). Then groupByKey() and join() will be in the same stage, and the shuffle data from many previous tasks will not hit the 2G limit. -- This message was sent by Atlassian JIRA (v6.2#6252)
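The proposed fix amounts to passing the join side's partitioner into groupByKey() so both RDDs are co-partitioned and the join becomes a narrow dependency. A runnable toy sketch of the idea (the data and names here are illustrative; the real change is inside ALS.updateFeatures()):
{code}
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext("local[4]", "als-skew-sketch")
val ratingsByProduct  = sc.parallelize(Seq((1, 5.0), (1, 3.0), (2, 4.0)))      // (productId, rating)
val featuresByProduct = sc.parallelize(Seq((1, Array(0.1)), (2, Array(0.2))))  // (productId, features)

val part = new HashPartitioner(4)

// Co-partition both sides: groupByKey() and join() then run in the same stage,
// so a popular product is no longer gathered into a single task and re-shuffled.
val grouped = ratingsByProduct.groupByKey(part)
val joined  = grouped.join(featuresByProduct.partitionBy(part))
joined.collect()
{code}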
[jira] [Created] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -PDeb)
Christian Tzolov created SPARK-2614: --- Summary: Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -PDeb) Key: SPARK-2614 URL: https://issues.apache.org/jira/browse/SPARK-2614 Project: Spark Issue Type: Improvement Components: Build, Deploy Reporter: Christian Tzolov The tar.gz distribution already includes the spark-examples.jar in the bundle. It is a common practice for installers to run SparkPi as a smoke test to verify that the installation is OK: /usr/share/spark/bin/spark-submit \ --num-executors 10 --master yarn-cluster \ --class org.apache.spark.examples.SparkPi \ /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2615) Add == support for HiveQl
Cheng Hao created SPARK-2615: Summary: Add == support for HiveQl Key: SPARK-2615 URL: https://issues.apache.org/jira/browse/SPARK-2615 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor Currently, passing == instead of = in a Hive QL expression causes an exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2615) Add == support for HiveQl
[ https://issues.apache.org/jira/browse/SPARK-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069934#comment-14069934 ] Apache Spark commented on SPARK-2615: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/1522 Add == support for HiveQl --- Key: SPARK-2615 URL: https://issues.apache.org/jira/browse/SPARK-2615 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor Currently, if passing == other than = in expression of Hive QL, will cause exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2616) Update Mesos to 0.19.1
Timothy Chen created SPARK-2616: --- Summary: Update Mesos to 0.19.1 Key: SPARK-2616 URL: https://issues.apache.org/jira/browse/SPARK-2616 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Let's update Mesos to 0.19.1 and verify that it works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2452) Multi-statement input to spark repl does not work
[ https://issues.apache.org/jira/browse/SPARK-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2452. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1441 [https://github.com/apache/spark/pull/1441] Multi-statement input to spark repl does not work - Key: SPARK-2452 URL: https://issues.apache.org/jira/browse/SPARK-2452 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Timothy Hunter Assignee: Prashant Sharma Priority: Blocker Fix For: 1.1.0 Here is an example:
{code}
scala> val x = 4 ; def f() = x
x: Int = 4
f: ()Int

scala> f()
<console>:11: error: $VAL5 is already defined as value $VAL5
val $VAL5 = INSTANCE;
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2615) Add == support for HiveQl
[ https://issues.apache.org/jira/browse/SPARK-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069948#comment-14069948 ] Cheng Hao commented on SPARK-2615: -- https://github.com/apache/spark/pull/1522 Add == support for HiveQl --- Key: SPARK-2615 URL: https://issues.apache.org/jira/browse/SPARK-2615 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor Currently, if passing == other than = in expression of Hive QL, will cause exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-2615) Add == support for HiveQl
[ https://issues.apache.org/jira/browse/SPARK-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-2615: - Comment: was deleted (was: https://github.com/apache/spark/pull/1522) Add == support for HiveQl --- Key: SPARK-2615 URL: https://issues.apache.org/jira/browse/SPARK-2615 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor Currently, if passing == other than = in expression of Hive QL, will cause exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2599) almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0
[ https://issues.apache.org/jira/browse/SPARK-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069979#comment-14069979 ] Sean Owen commented on SPARK-2599: -- Yeah they're tracking roughly the same issue but it would be good to note this discussion in SPARK-2479, including the current issue with 0. I'd favor one absolute error method, changing the one caller as needed. If necessary, a second relative error method to accommodate this one use case. almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0 --- Key: SPARK-2599 URL: https://issues.apache.org/jira/browse/SPARK-2599 Project: Spark Issue Type: Bug Components: MLlib Reporter: Doris Xin Priority: Minor DoubleWithAlmostEquals.almostEquals, when used to compare a number with 0.0, would always produce an epsilon of 1 1e-10, causing false failure when comparing very small numbers with 0.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
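To make the absolute-versus-relative distinction concrete, here is an illustrative sketch (these helpers are not the actual mllib.util.TestingUtils code): a relative-error bound shrinks with the magnitudes being compared, which is exactly the false-failure case described above when one side is 0.0, while an absolute-error bound handles it.
{code}
def absoluteEquals(x: Double, y: Double, eps: Double = 1e-10): Boolean =
  math.abs(x - y) < eps

def relativeEquals(x: Double, y: Double, eps: Double = 1e-10): Boolean =
  math.abs(x - y) < eps * math.max(math.abs(x), math.abs(y))

absoluteEquals(1e-12, 0.0)  // true: the tiny difference is within the absolute tolerance
relativeEquals(1e-12, 0.0)  // false: the bound scales with the values themselves (1e-22 here)
{code}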
[jira] [Created] (SPARK-2617) Correct doc and usage of preservesPartitioning
Xiangrui Meng created SPARK-2617: Summary: Correct doc and usage of preservesPartitioning Key: SPARK-2617 URL: https://issues.apache.org/jira/browse/SPARK-2617 Project: Spark Issue Type: Bug Components: Documentation, MLlib, Spark Core Affects Versions: 1.0.1 Reporter: Xiangrui Meng Assignee: Xiangrui Meng The name `preservesPartitioning` is ambiguous: it could mean 1) preserves the indices of partitions, or 2) preserves the partitioner. The latter is correct, and `preservesPartitioning` should really be called `preservesPartitioner`. Unfortunately, this is already part of the API and we cannot change it. We should be clear in the doc and fix incorrect usages. -- This message was sent by Atlassian JIRA (v6.2#6252)
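As a usage illustration of the intended meaning (a sketch only, not the doc fix itself; it assumes an existing SparkContext sc): the flag tells Spark that the function does not change keys, so the existing partitioner can safely be carried over to the result.
{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(new HashPartitioner(2))

val scaled = pairs.mapPartitions(
  iter => iter.map { case (k, v) => (k, v * 2) },  // keys untouched
  preservesPartitioning = true)                    // "preserves the partitioner", not the partition indices

scaled.partitioner  // Some(HashPartitioner) -- a later join/reduceByKey can avoid a shuffle
{code}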
[jira] [Resolved] (SPARK-2612) ALS has data skew for popular product
[ https://issues.apache.org/jira/browse/SPARK-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2612. -- Resolution: Fixed Fix Version/s: 1.1.0 ALS has data skew for popular product - Key: SPARK-2612 URL: https://issues.apache.org/jira/browse/SPARK-2612 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Peng Zhang Assignee: Peng Zhang Fix For: 1.1.0 Usually there are some popular products which are related with many users in Rating inputs. groupByKey() in updateFeatures() may cause one extra Shuffle stage to gather data of the popular product to one task, because it's RDD's partitioner may be not used as the join() partitioner. The following join() need to shuffle from the aggregated product data. The shuffle block can easily be bigger than 2G, and shuffle failed as mentioned in SPARK-1476 And increasing blocks number doesn't work. IMHO, groupByKey() should use the same partitioner as the other RDD in join(). So groupByKey() and join() will be in the same stage, and shuffle data from many previous tasks will not trigger 2G limits. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2612) ALS has data skew for popular product
[ https://issues.apache.org/jira/browse/SPARK-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2612: - Assignee: Peng Zhang ALS has data skew for popular product - Key: SPARK-2612 URL: https://issues.apache.org/jira/browse/SPARK-2612 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Peng Zhang Assignee: Peng Zhang Fix For: 1.1.0 Usually there are some popular products which are related with many users in Rating inputs. groupByKey() in updateFeatures() may cause one extra Shuffle stage to gather data of the popular product to one task, because it's RDD's partitioner may be not used as the join() partitioner. The following join() need to shuffle from the aggregated product data. The shuffle block can easily be bigger than 2G, and shuffle failed as mentioned in SPARK-1476 And increasing blocks number doesn't work. IMHO, groupByKey() should use the same partitioner as the other RDD in join(). So groupByKey() and join() will be in the same stage, and shuffle data from many previous tasks will not trigger 2G limits. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -PDeb)
[ https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070197#comment-14070197 ] Apache Spark commented on SPARK-2614: - User 'tzolov' has created a pull request for this issue: https://github.com/apache/spark/pull/1527 Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -PDeb) -- Key: SPARK-2614 URL: https://issues.apache.org/jira/browse/SPARK-2614 Project: Spark Issue Type: Improvement Components: Build, Deploy Reporter: Christian Tzolov The tar.gz distribution includes already the spark-examples.jar in the bundle. It is a common practice for installers to run SparkPi as a smoke test to verify that the installation is OK /usr/share/spark/bin/spark-submit \ --num-executors 10 --master yarn-cluster \ --class org.apache.spark.examples.SparkPi \ /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -Pdeb)
[ https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Tzolov updated SPARK-2614: Summary: Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -Pdeb) (was: Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -PDeb)) Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -Pdeb) -- Key: SPARK-2614 URL: https://issues.apache.org/jira/browse/SPARK-2614 Project: Spark Issue Type: Improvement Components: Build, Deploy Reporter: Christian Tzolov The tar.gz distribution includes already the spark-examples.jar in the bundle. It is a common practice for installers to run SparkPi as a smoke test to verify that the installation is OK /usr/share/spark/bin/spark-submit \ --num-executors 10 --master yarn-cluster \ --class org.apache.spark.examples.SparkPi \ /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2618) use config spark.scheduler.priority for specifying TaskSet's priority on DAGScheduler
Lianhui Wang created SPARK-2618: --- Summary: use config spark.scheduler.priority for specifying TaskSet's priority on DAGScheduler Key: SPARK-2618 URL: https://issues.apache.org/jira/browse/SPARK-2618 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2604) Spark Application hangs on yarn in edge case scenario of executor memory requirement
[ https://issues.apache.org/jira/browse/SPARK-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070227#comment-14070227 ] Twinkle Sachdeva commented on SPARK-2604: - I tried running in yarn-cluster mode after setting the property spark.yarn.max.executor.failures to some number. The application does get failed, but with a misleading exception (pasted at the end). Instead of handling the condition this way, we should probably check for the overhead memory amount at validation time itself. Please share your thoughts if you think otherwise. Stacktrace: Application application_1405933848949_0024 failed 2 times due to Error launching appattempt_1405933848949_0024_02. Got exception: java.net.ConnectException: Call From NN46/192.168.156.46 to localhost:51322 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) at java.lang.reflect.Constructor.newInstance(Unknown Source) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730) at org.apache.hadoop.ipc.Client.call(Client.java:1414) at org.apache.hadoop.ipc.Client.call(Client.java:1363) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy28.startContainers(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:118) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) Spark Application hangs on yarn in edge case scenario of executor memory requirement Key: SPARK-2604 URL: https://issues.apache.org/jira/browse/SPARK-2604 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Twinkle Sachdeva In a yarn environment, let's say: MaxAM = maximum allocatable memory, ExecMem = executor's memory. If (MaxAM >= ExecMem && (MaxAM - ExecMem) < 384m), then maximum resource validation fails w.r.t. executor memory, and the application master gets launched, but when the resource is allocated and validated again, the containers are returned and the application appears to hang. The typical use case is to ask for executor memory equal to the maximum allowed memory as per the yarn config. -- This message was sent by Atlassian JIRA (v6.2#6252)
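A rough sketch of the up-front check being suggested in the comment (illustrative only; the names and where such a check would live are assumptions, not the actual YARN client code):
{code}
// Fail fast if executor memory plus the YARN memory overhead cannot fit in the
// largest container YARN will allocate, instead of hanging at allocation time.
val memoryOverhead = 384 // MB, the default overhead referenced in the report
def validateExecutorMemory(executorMemory: Int, maxContainerMemory: Int): Unit = {
  require(executorMemory + memoryOverhead <= maxContainerMemory,
    s"Executor memory ($executorMemory MB) plus overhead ($memoryOverhead MB) " +
    s"exceeds the maximum container memory ($maxContainerMemory MB)")
}
{code}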
[jira] [Created] (SPARK-2619) Configurable file-mode for spark/bin folder in the .deb package.
Christian Tzolov created SPARK-2619: --- Summary: Configurable file-mode for spark/bin folder in the .deb package. Key: SPARK-2619 URL: https://issues.apache.org/jira/browse/SPARK-2619 Project: Spark Issue Type: Improvement Components: Build, Deploy Reporter: Christian Tzolov Currently the /bin folder in the .deb package is hardcoded to 744, so only the root user (deb.user defaults to root) can run Spark jobs. If we make the /bin file mode a configurable Maven property, we can easily generate a package with less restrictive execution rights. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2446) Add BinaryType support to Parquet I/O.
[ https://issues.apache.org/jira/browse/SPARK-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070310#comment-14070310 ] Teng Qiu commented on SPARK-2446: - hi [~marmbrus] impala creating parquet file also without UTF8 annotation for strings, i just tried the newest impala release in CDH 5.1.0, it is still so. is it possible, that add one more parameter in sqlContext.parquetFile(), to disable/enable BinaryType ? i am not sure if it's worth, but it is more flexible, and in our use case, we have many parquet files they were created by impala... :) for example change def parquetFile(path: String): SchemaRDD to def parquetFile(path: String, allowBinaryType: Boolean = true): SchemaRDD user can call sqlContext.parquetFile(xxx, false) to access parquet files made by old spark version and impala. then it is backward compatible, what do you think? Add BinaryType support to Parquet I/O. -- Key: SPARK-2446 URL: https://issues.apache.org/jira/browse/SPARK-2446 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0 To support {{BinaryType}}, the following changes are needed: - Make {{StringType}} use {{OriginalType.UTF8}} - Add {{BinaryType}} using {{PrimitiveTypeName.BINARY}} without {{OriginalType}} -- This message was sent by Atlassian JIRA (v6.2#6252)
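The signature change proposed in the comment, shown as a usage sketch (this reflects the commenter's proposal only, not an existing API; the paths and sqlContext are placeholders):
{code}
// Proposed overload from the comment above:
//   def parquetFile(path: String, allowBinaryType: Boolean = true): SchemaRDD
val withBinary   = sqlContext.parquetFile("/data/parquet/spark-written")          // default: BinaryType enabled
val binaryAsText = sqlContext.parquetFile("/data/parquet/impala-written", false)  // read BINARY columns as strings
{code}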
[jira] [Comment Edited] (SPARK-2446) Add BinaryType support to Parquet I/O.
[ https://issues.apache.org/jira/browse/SPARK-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070310#comment-14070310 ] Teng Qiu edited comment on SPARK-2446 at 7/22/14 2:47 PM: -- hi [~marmbrus] impala creating parquet file also without UTF8 annotation for strings, i just tried the newest impala release in CDH 5.1.0, it is still so. is it possible, that add one more parameter in sqlContext.parquetFile(), to disable/enable BinaryType ? i am not sure if it's worth, but it is more flexible, and in our use case, we have many parquet files they were created by impala... :) for example change def parquetFile(path: String): SchemaRDD to def parquetFile(path: String, allowBinaryType: Boolean = true): SchemaRDD by default allowBinaryType is set to true, but user can call sqlContext.parquetFile(xxx, false) to access parquet files made by old spark version and impala. then it is backward compatible, what do you think? was (Author: chutium): hi [~marmbrus] impala creating parquet file also without UTF8 annotation for strings, i just tried the newest impala release in CDH 5.1.0, it is still so. is it possible, that add one more parameter in sqlContext.parquetFile(), to disable/enable BinaryType ? i am not sure if it's worth, but it is more flexible, and in our use case, we have many parquet files they were created by impala... :) for example change def parquetFile(path: String): SchemaRDD to def parquetFile(path: String, allowBinaryType: Boolean = true): SchemaRDD user can call sqlContext.parquetFile(xxx, false) to access parquet files made by old spark version and impala. then it is backward compatible, what do you think? Add BinaryType support to Parquet I/O. -- Key: SPARK-2446 URL: https://issues.apache.org/jira/browse/SPARK-2446 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0 To support {{BinaryType}}, the following changes are needed: - Make {{StringType}} use {{OriginalType.UTF8}} - Add {{BinaryType}} using {{PrimitiveTypeName.BINARY}} without {{OriginalType}} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070367#comment-14070367 ] Ken Carlile commented on SPARK-2282: Hi Aaron, Another question for you. Would it work for me to just drop the two changed files into our install of Spark 1.0.1 release copy, or is that likely to cause issues? Thanks, Ken PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a Linux server): echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse and echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle, or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
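For reference, the in-Spark alternative mentioned in the description amounts to enabling SO_REUSEADDR before connecting; a generic sketch (not the actual PythonAccumulatorParam change, and the host/port are placeholders):
{code}
import java.net.{InetSocketAddress, Socket}

val socket = new Socket()
socket.setReuseAddress(true)  // SO_REUSEADDR: tolerate local ports lingering in TIME_WAIT
socket.connect(new InetSocketAddress("localhost", 12345))
{code}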
[jira] [Created] (SPARK-2620) case class cannot be used as key for reduce
Gerard Maas created SPARK-2620: -- Summary: case class cannot be used as key for reduce Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Priority: Critical Using a case class as a key doesn't seem to work properly on Spark 1.0.0 A minimal example: case class P(name: String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast to the expected behavior, that should be equivalent to: sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070439#comment-14070439 ] Sean Owen commented on SPARK-2620: -- Duplicate of https://issues.apache.org/jira/browse/SPARK-1199 I think? case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0 A minimal example: case class P(name:String) val ps = Array(P(alice), P(bob), P(charly), P(bob)) sc.parallelize(ps).map(x= (x,1)).reduceByKey((x,y) = x+y).collect [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast to the expected behavior, that should be equivalent to: sc.parallelize(ps).map(x= (x.name,1)).reduceByKey((x,y) = x+y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2621) Update task InputMetrics incrementally
Sandy Ryza created SPARK-2621: - Summary: Update task InputMetrics incrementally Key: SPARK-2621 URL: https://issues.apache.org/jira/browse/SPARK-2621 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Helena Edelson updated SPARK-2593: -- Issue Type: Improvement (was: Brainstorming) Add ability to pass an existing Akka ActorSystem into Spark --- Key: SPARK-2593 URL: https://issues.apache.org/jira/browse/SPARK-2593 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Helena Edelson As a developer I want to pass an existing ActorSystem into StreamingContext in load-time so that I do not have 2 actor systems running on a node. This would mean having spark's actor system on its own named-dispatchers as well as exposing the new private creation of its own actor system. If it makes sense... I would like to create an Akka Extension that wraps around Spark/Spark Streaming and Cassandra. So the creation would simply be this for a user val extension = SparkCassandra(system) and using is as easy as: import extension._ spark. // do work or, streaming. // do work and all config comes from reference.conf and user overrides of that. The conf file would pick up settings from the deployed environment first, then fallback to -D with a final fallback to configured settings. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Helena Edelson updated SPARK-2593: -- Description: As a developer I want to pass an existing ActorSystem into StreamingContext in load-time so that I do not have 2 actor systems running on a node in an Akka application. This would mean having spark's actor system on its own named-dispatchers as well as exposing the new private creation of its own actor system. I would like to create an Akka Extension that wraps around Spark/Spark Streaming and Cassandra. So the programmatic creation would simply be this for a user val extension = SparkCassandra(system) was: As a developer I want to pass an existing ActorSystem into StreamingContext in load-time so that I do not have 2 actor systems running on a node. This would mean having spark's actor system on its own named-dispatchers as well as exposing the new private creation of its own actor system. If it makes sense... I would like to create an Akka Extension that wraps around Spark/Spark Streaming and Cassandra. So the creation would simply be this for a user val extension = SparkCassandra(system) and using is as easy as: import extension._ spark. // do work or, streaming. // do work and all config comes from reference.conf and user overrides of that. The conf file would pick up settings from the deployed environment first, then fallback to -D with a final fallback to configured settings. Add ability to pass an existing Akka ActorSystem into Spark --- Key: SPARK-2593 URL: https://issues.apache.org/jira/browse/SPARK-2593 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Helena Edelson As a developer I want to pass an existing ActorSystem into StreamingContext in load-time so that I do not have 2 actor systems running on a node in an Akka application. This would mean having spark's actor system on its own named-dispatchers as well as exposing the new private creation of its own actor system. I would like to create an Akka Extension that wraps around Spark/Spark Streaming and Cassandra. So the programmatic creation would simply be this for a user val extension = SparkCassandra(system) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070443#comment-14070443 ] Gerard Maas commented on SPARK-2620: [~sowen] No, doesn't look like it is. case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0 A minimal example: case class P(name:String) val ps = Array(P(alice), P(bob), P(charly), P(bob)) sc.parallelize(ps).map(x= (x,1)).reduceByKey((x,y) = x+y).collect [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast to the expected behavior, that should be equivalent to: sc.parallelize(ps).map(x= (x.name,1)).reduceByKey((x,y) = x+y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2615) Add == support for HiveQl
[ https://issues.apache.org/jira/browse/SPARK-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070448#comment-14070448 ] Yin Huai commented on SPARK-2615: - Based on Hive language manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF), == is invalid. But, Hive actually treats == as =. I have sent an email to hive-dev list to ask it. Add == support for HiveQl --- Key: SPARK-2615 URL: https://issues.apache.org/jira/browse/SPARK-2615 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor Currently, if passing == other than = in expression of Hive QL, will cause exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070476#comment-14070476 ] Evan Chan commented on SPARK-2593: -- I would say that the base SparkContext should have this ability, after all it creates multiple actors as well as a base ActorSystem. Sharing a single ActorSystem would also speed up all of Spark's tests, many of which repeatedly create and then tear down ActorSystems (since that's what the base SparkContext does). Add ability to pass an existing Akka ActorSystem into Spark --- Key: SPARK-2593 URL: https://issues.apache.org/jira/browse/SPARK-2593 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Helena Edelson As a developer I want to pass an existing ActorSystem into StreamingContext in load-time so that I do not have 2 actor systems running on a node in an Akka application. This would mean having spark's actor system on its own named-dispatchers as well as exposing the new private creation of its own actor system. I would like to create an Akka Extension that wraps around Spark/Spark Streaming and Cassandra. So the programmatic creation would simply be this for a user val extension = SparkCassandra(system) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2622) Add Jenkins build numbers to SparkQA messages
Xiangrui Meng created SPARK-2622: Summary: Add Jenkins build numbers to SparkQA messages Key: SPARK-2622 URL: https://issues.apache.org/jira/browse/SPARK-2622 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.0.1 Reporter: Xiangrui Meng Priority: Minor It takes Jenkins 2 hours to finish testing. It is possible to have the following: {code} Build 1 started. PR updated. Build 2 started. Build 1 finished successfully. A committer merged the PR because the last build seemed to be okay. Build 2 failed. {code} It would be nice to put the build number in the SparkQA message so it is easy to match the result with the build. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070528#comment-14070528 ] Aaron Davidson commented on SPARK-2282: --- Great to hear! These files haven't been changed since the 1.0.1 release besides this patch, so it should be fine to just drop them in. (A generally safer option would be to do a git merge, though, against Spark's refs/pull/1503/head branch.) PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a Linux server): echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse and echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle, or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070543#comment-14070543 ] Daniel Siegmann commented on SPARK-2620: I have confirmed this on Spark 1.0.1 as well. As Gerard noted on the mailing list, this bug does NOT affect Spark 0.9.1 (confirmed in Spark shell). case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0 A minimal example: case class P(name:String) val ps = Array(P(alice), P(bob), P(charly), P(bob)) sc.parallelize(ps).map(x= (x,1)).reduceByKey((x,y) = x+y).collect [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast to the expected behavior, that should be equivalent to: sc.parallelize(ps).map(x= (x.name,1)).reduceByKey((x,y) = x+y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2623) Stacked Auto Encoder (Deep Learning )
Victor Fang created SPARK-2623: -- Summary: Stacked Auto Encoder (Deep Learning ) Key: SPARK-2623 URL: https://issues.apache.org/jira/browse/SPARK-2623 Project: Spark Issue Type: New Feature Reporter: Victor Fang We would like to add parallel implementation of Stacked Auto Encoder (Deep Learning ) algorithm to Spark MLLib. SAE is one of the most popular Deep Learning algorithms. It has achieved successful benchmarks in MNIST hand written classifications, Google's ICML2012 cat face paper (http://icml.cc/2012/papers/73.pdf), etc. Our focus is to leverage the RDD and get the SAE with the following capability with ease of use for both beginners and advanced researchers: 1, multi layer SAE deep network training and scoring. 2, unsupervised feature learning. 3, supervised learning with multinomial logistic regression (softmax). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2623) Stacked Auto Encoder (Deep Learning )
[ https://issues.apache.org/jira/browse/SPARK-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2623: - Assignee: Victor Fang Stacked Auto Encoder (Deep Learning ) - Key: SPARK-2623 URL: https://issues.apache.org/jira/browse/SPARK-2623 Project: Spark Issue Type: New Feature Reporter: Victor Fang Assignee: Victor Fang Labels: deeplearning, machine_learning We would like to add parallel implementation of Stacked Auto Encoder (Deep Learning ) algorithm to Spark MLLib. SAE is one of the most popular Deep Learning algorithms. It has achieved successful benchmarks in MNIST hand written classifications, Google's ICML2012 cat face paper (http://icml.cc/2012/papers/73.pdf), etc. Our focus is to leverage the RDD and get the SAE with the following capability with ease of use for both beginners and advanced researchers: 1, multi layer SAE deep network training and scoring. 2, unsupervised feature learning. 3, supervised learning with multinomial logistic regression (softmax). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2624) Datanucleus jars not accessible in yarn-cluster mode
Andrew Or created SPARK-2624: Summary: Datanucleus jars not accessible in yarn-cluster mode Key: SPARK-2624 URL: https://issues.apache.org/jira/browse/SPARK-2624 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.1 Reporter: Andrew Or Fix For: 1.1.0 This is because we add it to the class path of the command that launches spark submit, but the containers never get it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070604#comment-14070604 ] Aaron edited comment on SPARK-2620 at 7/22/14 6:05 PM: --- If you look at the diff of distinct from branch-0.9 to master you see - def distinct(numPartitions: Int): RDD[T] = + def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = Is it possible that case classes don't have an implicit ordering and that is why this fails? was (Author: aaronjosephs): If you look at the diff of distinct from branch-0.9 to master you see - def distinct(numPartitions: Int): RDD[T] = + def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = Is it possible that case classes don't have an implicit ordering and that is why this fails? case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0 A minimal example: case class P(name:String) val ps = Array(P(alice), P(bob), P(charly), P(bob)) sc.parallelize(ps).map(x= (x,1)).reduceByKey((x,y) = x+y).collect [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast to the expected behavior, that should be equivalent to: sc.parallelize(ps).map(x= (x.name,1)).reduceByKey((x,y) = x+y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070604#comment-14070604 ] Aaron commented on SPARK-2620: -- If you look at the diff of distinct from branch-0.9 to master you see - def distinct(numPartitions: Int): RDD[T] = + def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = Is it possible that case classes don't have an implicit ordering and that is why this fails? case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0 A minimal example: case class P(name:String) val ps = Array(P(alice), P(bob), P(charly), P(bob)) sc.parallelize(ps).map(x= (x,1)).reduceByKey((x,y) = x+y).collect [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast to the expected behavior, that should be equivalent to: sc.parallelize(ps).map(x= (x.name,1)).reduceByKey((x,y) = x+y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070604#comment-14070604 ] Aaron edited comment on SPARK-2620 at 7/22/14 6:05 PM: --- If you look at the diff of distinct from branch-0.9 to master you see - def distinct(numPartitions: Int): RDD[T] = + def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = Is it possible that case classes don't have an implicit ordering and that is why this fails? was (Author: aaronjosephs): If you look at the diff of distinct from branch-0.9 to master you see - def distinct(numPartitions: Int): RDD[T] = + def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = Is it possible that case classes don't have an implicit ordering and that is why this fails? case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0 A minimal example: case class P(name:String) val ps = Array(P(alice), P(bob), P(charly), P(bob)) sc.parallelize(ps).map(x= (x,1)).reduceByKey((x,y) = x+y).collect [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast to the expected behavior, that should be equivalent to: sc.parallelize(ps).map(x= (x.name,1)).reduceByKey((x,y) = x+y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2625) Fix ShuffleReadMetrics for NettyBlockFetcherIterator
Sandy Ryza created SPARK-2625: - Summary: Fix ShuffleReadMetrics for NettyBlockFetcherIterator Key: SPARK-2625 URL: https://issues.apache.org/jira/browse/SPARK-2625 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.0.0 Reporter: Sandy Ryza Priority: Minor NettyBlockFetcherIterator doesn't report fetchWaitTime and has some race conditions where multiple threads can be incrementing bytes read at the same time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2618) use config spark.scheduler.priority for specifying TaskSet's priority on DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070691#comment-14070691 ] Patrick Wendell commented on SPARK-2618: Can you explain more what you are trying to accomplish here? The Fair scheduler is designed to allow different scheduling policies. Prioritization of jobs is not handled in the DAGScheduler. use config spark.scheduler.priority for specifying TaskSet's priority on DAGScheduler - Key: SPARK-2618 URL: https://issues.apache.org/jira/browse/SPARK-2618 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang -- This message was sent by Atlassian JIRA (v6.2#6252)
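For reference, job-level prioritization in Spark is normally expressed through fair-scheduler pools rather than a DAGScheduler property; a brief sketch of how that is typically wired up (the pool name is an example, and pool weights/minShare would come from fairscheduler.xml):
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("pool-example")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

// Jobs submitted from this thread are scheduled in the "high-priority" pool.
sc.setLocalProperty("spark.scheduler.pool", "high-priority")
sc.parallelize(1 to 100).count()
{code}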
[jira] [Resolved] (SPARK-2047) Use less memory in AppendOnlyMap.destructiveSortedIterator
[ https://issues.apache.org/jira/browse/SPARK-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2047. -- Resolution: Fixed Fix Version/s: 1.1.0 Use less memory in AppendOnlyMap.destructiveSortedIterator -- Key: SPARK-2047 URL: https://issues.apache.org/jira/browse/SPARK-2047 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Assignee: Aaron Davidson Fix For: 1.1.0 This method tries to sort the key-value pairs in the map in-place but ends up allocating a Tuple2 object for each one, which allocates a nontrivial amount of memory (32 or more bytes per entry on a 64-bit JVM). We could instead try to sort the objects in-place within the data array, or allocate an int array with the indices and sort those using a custom comparator. The latter is probably easiest to begin with. -- This message was sent by Atlassian JIRA (v6.2#6252)
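A small sketch of the second idea, sorting an index array with a custom comparator so no Tuple2 is materialized per entry (illustrative only; it assumes the AppendOnlyMap layout of keys and values interleaved in one object array):
{code}
// data(2*i) holds key i, data(2*i + 1) holds value i.
// Sorting plain indices avoids allocating a Tuple2 for every entry.
def sortedIndices(data: Array[AnyRef], numPairs: Int, keyOrdering: Ordering[AnyRef]): Array[Int] = {
  val indices = Array.range(0, numPairs)
  indices.sortWith((a, b) => keyOrdering.compare(data(2 * a), data(2 * b)) < 0)
}
{code}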
[jira] [Updated] (SPARK-2047) Use less memory in AppendOnlyMap.destructiveSortedIterator
[ https://issues.apache.org/jira/browse/SPARK-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2047: - Assignee: Aaron Davidson Use less memory in AppendOnlyMap.destructiveSortedIterator -- Key: SPARK-2047 URL: https://issues.apache.org/jira/browse/SPARK-2047 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Assignee: Aaron Davidson Fix For: 1.1.0 This method tries to sort an the key-value pairs in the map in-place but ends up allocating a Tuple2 object for each one, which allocates a nontrivial amount of memory (32 or more bytes per entry on a 64-bit JVM). We could instead try to sort the objects in-place within the data array, or allocate an int array with the indices and sort those using a custom comparator. The latter is probably easiest to begin with. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2626) Stop SparkContext in all examples
Andrew Or created SPARK-2626: Summary: Stop SparkContext in all examples Key: SPARK-2626 URL: https://issues.apache.org/jira/browse/SPARK-2626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Andrew Or Fix For: 1.1.0 Event logs rely on sc.stop() to close the file. If this is never closed, the history server will not be able to find the logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2627) Check for PEP 8 compliance on all Python code in the Jenkins CI cycle
Nicholas Chammas created SPARK-2627: --- Summary: Check for PEP 8 compliance on all Python code in the Jenkins CI cycle Key: SPARK-2627 URL: https://issues.apache.org/jira/browse/SPARK-2627 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas This issue was triggered by [the discussion here|https://github.com/apache/spark/pull/1505#issuecomment-49698681]. Requirements: * make a linter script for Scala under {{dev/lint-scala}} that just calls {{scalastyle}} * make a linter script for Python under {{dev/lint-python}} that calls {{pep8}} on all Python files ** One exception to this is {{cloudpickle.py}}, which is a third-party module [we don't want to touch|https://github.com/apache/spark/pull/1505#discussion-diff-15197904] * Modify {{dev/run-tests}} to call both linter scripts * Incorporate these changes into the [Contributing to Spark|https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark] guide -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2628) Mesos backend throwing unable to find LoginModule
Timothy Chen created SPARK-2628: --- Summary: Mesos backend throwing unable to find LoginModule Key: SPARK-2628 URL: https://issues.apache.org/jira/browse/SPARK-2628 Project: Spark Issue Type: Bug Reporter: Timothy Chen http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3c1401892590126-6927.p...@n3.nabble.com%3E 14/07/22 19:57:59 INFO HttpServer: Starting HTTP Server 14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread Thread[Executor task launch worker-1,5,main] java.lang.Error: java.io.IOException: failure to login at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) Caused by: java.io.IOException: failure to login at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) ... 2 more Caused by: javax.security.auth.login.LoginException: unable to find LoginModule class: org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule at javax.security.auth.login.LoginContext.invoke(LoginContext.java:823) at javax.security.auth.login.LoginContext.access$000(LoginContext.java:203) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718) at javax.security.auth.login.LoginContext.login(LoginContext.java:590) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471) ... 6 more 14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread Thread[Executor task launch worker-0,5,main] java.lang.Error: java.io.IOException: failure to login at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) Caused by: java.io.IOException: failure to login at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) ... 
2 more Caused by: javax.security.auth.login.LoginException: unable to find LoginModule class: org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule at javax.security.auth.login.LoginContext.invoke(LoginContext.java:823) at javax.security.auth.login.LoginContext.access$000(LoginContext.java:203) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718) at javax.security.auth.login.LoginContext.login(LoginContext.java:590) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471) ... 6 more -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1166) leftover vpc_id may block the creation of new ec2 cluster
[ https://issues.apache.org/jira/browse/SPARK-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070823#comment-14070823 ] bruce szalwinski commented on SPARK-1166: - I've been able to reproduce this, but not consistently. On this occasion, I had previously started an instance, didn't use it for anything, and soon thereafter shut it down. I don't know if that means anything. Using Spark 1.0:
{code}
spark-ec2 -k mykeypair -i ~/.aws/mykeypair.pem -s 7 -r us-west-2 -t r3.large launch sparck-cluster
Setting up security groups...
Creating security group sparck-cluster-master
Creating security group sparck-cluster-slaves
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidGroup.NotFound</Code><Message>The security group 'sg-185fe07d' does not exist</Message></Error></Errors><RequestID>80f1e1e3-e340-4cd2-ba64-53c13525ab2b</RequestID></Response>
Traceback (most recent call last):
  File "./spark_ec2.py", line 909, in <module>
    main()
  File "./spark_ec2.py", line 901, in main
    real_main()
  File "./spark_ec2.py", line 779, in real_main
    (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
  File "./spark_ec2.py", line 279, in launch_cluster
    master_group.authorize(src_group=slave_group)
  File "/Users/szalwinb/playpen/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/securitygroup.py", line 184, in authorize
  File "/Users/szalwinb/playpen/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 2150, in authorize_security_group
  File "/Users/szalwinb/playpen/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 2093, in authorize_security_group_deprecated
  File "/Users/szalwinb/playpen/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", line 944, in get_status
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidGroup.NotFound</Code><Message>The security group 'sg-185fe07d' does not exist</Message></Error></Errors><RequestID>80f1e1e3-e340-4cd2-ba64-53c13525ab2b</RequestID></Response>
{code}
leftover vpc_id may block the creation of new ec2 cluster - Key: SPARK-1166 URL: https://issues.apache.org/jira/browse/SPARK-1166 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: Nan Zhu Assignee: Nan Zhu When I run the spark-ec2 script to build an EC2 cluster, for some reason I always received errors like the following:
{code}
Setting up security groups...
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'null' for protocol. VPC security group rules must specify protocols explicitly.</Message></Error></Errors><RequestID>fc56f0ba-915a-45b6-8555-05d4dd0f14ee</RequestID></Response>
Traceback (most recent call last):
  File "./spark_ec2.py", line 813, in <module>
    main()
  File "./spark_ec2.py", line 806, in main
    real_main()
  File "./spark_ec2.py", line 689, in real_main
    conn, opts, cluster_name)
  File "./spark_ec2.py", line 244, in launch_cluster
    slave_group.authorize(src_group=master_group)
  File "/Users/nanzhu/code/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/securitygroup.py", line 184, in authorize
  File "/Users/nanzhu/code/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 2181, in authorize_security_group
  File "/Users/nanzhu/code/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", line 944, in get_status
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'null' for protocol. VPC security group rules must specify protocols explicitly.</Message></Error></Errors><RequestID>fc56f0ba-915a-45b6-8555-05d4dd0f14ee</RequestID></Response>
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2628) Mesos backend throwing unable to find LoginModule
[ https://issues.apache.org/jira/browse/SPARK-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070826#comment-14070826 ] Timothy Chen commented on SPARK-2628: - [~pwendell] please assign to me, thanks! Mesos backend throwing unable to find LoginModule -- Key: SPARK-2628 URL: https://issues.apache.org/jira/browse/SPARK-2628 Project: Spark Issue Type: Bug Components: Mesos Reporter: Timothy Chen http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3c1401892590126-6927.p...@n3.nabble.com%3E 14/07/22 19:57:59 INFO HttpServer: Starting HTTP Server 14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread Thread[Executor task launch worker-1,5,main] java.lang.Error: java.io.IOException: failure to login at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) Caused by: java.io.IOException: failure to login at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) ... 2 more Caused by: javax.security.auth.login.LoginException: unable to find LoginModule class: org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule at javax.security.auth.login.LoginContext.invoke(LoginContext.java:823) at javax.security.auth.login.LoginContext.access$000(LoginContext.java:203) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718) at javax.security.auth.login.LoginContext.login(LoginContext.java:590) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471) ... 6 more 14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread Thread[Executor task launch worker-0,5,main] java.lang.Error: java.io.IOException: failure to login at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) Caused by: java.io.IOException: failure to login at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) ... 
2 more Caused by: javax.security.auth.login.LoginException: unable to find LoginModule class: org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule at javax.security.auth.login.LoginContext.invoke(LoginContext.java:823) at javax.security.auth.login.LoginContext.access$000(LoginContext.java:203) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718) at javax.security.auth.login.LoginContext.login(LoginContext.java:590) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471) ... 6 more -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2628) Mesos backend throwing unable to find LoginModule
[ https://issues.apache.org/jira/browse/SPARK-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Chen updated SPARK-2628: Component/s: Mesos Mesos backend throwing unable to find LoginModule -- Key: SPARK-2628 URL: https://issues.apache.org/jira/browse/SPARK-2628 Project: Spark Issue Type: Bug Components: Mesos Reporter: Timothy Chen http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3c1401892590126-6927.p...@n3.nabble.com%3E 14/07/22 19:57:59 INFO HttpServer: Starting HTTP Server 14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread Thread[Executor task launch worker-1,5,main] java.lang.Error: java.io.IOException: failure to login at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) Caused by: java.io.IOException: failure to login at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) ... 2 more Caused by: javax.security.auth.login.LoginException: unable to find LoginModule class: org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule at javax.security.auth.login.LoginContext.invoke(LoginContext.java:823) at javax.security.auth.login.LoginContext.access$000(LoginContext.java:203) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718) at javax.security.auth.login.LoginContext.login(LoginContext.java:590) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471) ... 6 more 14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread Thread[Executor task launch worker-0,5,main] java.lang.Error: java.io.IOException: failure to login at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) Caused by: java.io.IOException: failure to login at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) ... 
2 more Caused by: javax.security.auth.login.LoginException: unable to find LoginModule class: org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule at javax.security.auth.login.LoginContext.invoke(LoginContext.java:823) at javax.security.auth.login.LoginContext.access$000(LoginContext.java:203) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718) at javax.security.auth.login.LoginContext.login(LoginContext.java:590) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471) ... 6 more -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2452) Multi-statement input to spark repl does not work
[ https://issues.apache.org/jira/browse/SPARK-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070841#comment-14070841 ] Timothy Hunter commented on SPARK-2452: --- Excellent, thanks Patrick. Multi-statement input to spark repl does not work - Key: SPARK-2452 URL: https://issues.apache.org/jira/browse/SPARK-2452 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Timothy Hunter Assignee: Prashant Sharma Priority: Blocker Fix For: 1.1.0 Here is an example:
{code}
scala> val x = 4 ; def f() = x
x: Int = 4
f: ()Int

scala> f()
<console>:11: error: $VAL5 is already defined as value $VAL5
       val $VAL5 = INSTANCE;
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
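Until the fix referenced above lands, one workaround (my suggestion, not part of the ticket) is to enter the combined statements through the REPL's :paste mode so they compile as a single unit:
{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

val x = 4
def f() = x

// Exiting paste mode, now interpreting.
x: Int = 4
f: ()Int

scala> f()
res0: Int = 4
{code}
The exact banners vary between Scala versions; the point is only that pasting the statements together avoids the $VAL wrapper collision shown above.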
[jira] [Commented] (SPARK-1166) leftover vpc_id may block the creation of new ec2 cluster
[ https://issues.apache.org/jira/browse/SPARK-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070840#comment-14070840 ] bruce szalwinski commented on SPARK-1166: - To resolve this, I go to https://console.aws.amazon.com/vpc/home?region=us-west-2#securityGroups: and manually delete the security groups, and then I'm able to start up a cluster. leftover vpc_id may block the creation of new ec2 cluster - Key: SPARK-1166 URL: https://issues.apache.org/jira/browse/SPARK-1166 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: Nan Zhu Assignee: Nan Zhu When I run the spark-ec2 script to build an EC2 cluster, for some reason I always received errors like the following:
{code}
Setting up security groups...
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'null' for protocol. VPC security group rules must specify protocols explicitly.</Message></Error></Errors><RequestID>fc56f0ba-915a-45b6-8555-05d4dd0f14ee</RequestID></Response>
Traceback (most recent call last):
  File "./spark_ec2.py", line 813, in <module>
    main()
  File "./spark_ec2.py", line 806, in main
    real_main()
  File "./spark_ec2.py", line 689, in real_main
    conn, opts, cluster_name)
  File "./spark_ec2.py", line 244, in launch_cluster
    slave_group.authorize(src_group=master_group)
  File "/Users/nanzhu/code/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/securitygroup.py", line 184, in authorize
  File "/Users/nanzhu/code/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 2181, in authorize_security_group
  File "/Users/nanzhu/code/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", line 944, in get_status
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'null' for protocol. VPC security group rules must specify protocols explicitly.</Message></Error></Errors><RequestID>fc56f0ba-915a-45b6-8555-05d4dd0f14ee</RequestID></Response>
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2628) Mesos backend throwing unable to find LoginModule
[ https://issues.apache.org/jira/browse/SPARK-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2628: --- Assignee: Tim Chen Mesos backend throwing unable to find LoginModule -- Key: SPARK-2628 URL: https://issues.apache.org/jira/browse/SPARK-2628 Project: Spark Issue Type: Bug Components: Mesos Reporter: Timothy Chen Assignee: Tim Chen http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3c1401892590126-6927.p...@n3.nabble.com%3E 14/07/22 19:57:59 INFO HttpServer: Starting HTTP Server 14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread Thread[Executor task launch worker-1,5,main] java.lang.Error: java.io.IOException: failure to login at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) Caused by: java.io.IOException: failure to login at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) ... 2 more Caused by: javax.security.auth.login.LoginException: unable to find LoginModule class: org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule at javax.security.auth.login.LoginContext.invoke(LoginContext.java:823) at javax.security.auth.login.LoginContext.access$000(LoginContext.java:203) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718) at javax.security.auth.login.LoginContext.login(LoginContext.java:590) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471) ... 6 more 14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread Thread[Executor task launch worker-0,5,main] java.lang.Error: java.io.IOException: failure to login at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) Caused by: java.io.IOException: failure to login at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) ... 
2 more Caused by: javax.security.auth.login.LoginException: unable to find LoginModule class: org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule at javax.security.auth.login.LoginContext.invoke(LoginContext.java:823) at javax.security.auth.login.LoginContext.access$000(LoginContext.java:203) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718) at javax.security.auth.login.LoginContext.login(LoginContext.java:590) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471) ... 6 more -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1642) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083
[ https://issues.apache.org/jira/browse/SPARK-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1642: - Fix Version/s: (was: 1.1.0) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083 --- Key: SPARK-1642 URL: https://issues.apache.org/jira/browse/SPARK-1642 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Ted Malaska Assignee: Ted Malaska Priority: Minor This will add support for SSL encryption between Flume AvroSink and Spark Streaming. It is based on FLUME-2083 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1645: - Issue Type: Improvement (was: Bug) Improve Spark Streaming compatibility with Flume Key: SPARK-1645 URL: https://issues.apache.org/jira/browse/SPARK-1645 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Hari Shreedharan Currently the following issues affect Spark Streaming and Flume compatibility: * If a Spark worker goes down, it needs to be restarted on the same node, else Flume cannot send data to it. We can fix this by adding a Flume receiver that polls Flume, and a Flume sink that supports this. * The receiver sends acks to Flume before the driver knows about the data. The new receiver should also handle this case. * Data loss when the driver goes down - this is true for any streaming ingest, not just Flume. I will file a separate JIRA for this and we should work on it there. This is a longer-term project and requires considerable development work. I intend to start working on these soon. Any input is appreciated. (It'd be great if someone could add me as a contributor on JIRA, so I can assign the JIRA to myself.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1645: - Target Version/s: 1.1.0 Improve Spark Streaming compatibility with Flume Key: SPARK-1645 URL: https://issues.apache.org/jira/browse/SPARK-1645 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Currently the following issues affect Spark Streaming and Flume compatibility: * If a Spark worker goes down, it needs to be restarted on the same node, else Flume cannot send data to it. We can fix this by adding a Flume receiver that polls Flume, and a Flume sink that supports this. * The receiver sends acks to Flume before the driver knows about the data. The new receiver should also handle this case. * Data loss when the driver goes down - this is true for any streaming ingest, not just Flume. I will file a separate JIRA for this and we should work on it there. This is a longer-term project and requires considerable development work. I intend to start working on these soon. Any input is appreciated. (It'd be great if someone could add me as a contributor on JIRA, so I can assign the JIRA to myself.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -Pdeb)
[ https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070908#comment-14070908 ] Mark Hamstra commented on SPARK-2614: - It's also common for installers/admins not to want all of the examples installed on their production machines. This issue is really touching on the fact that Spark's Debian packaging isn't presently everything that it could be. It's really just a hack to allow deployment tools like Chef to be able to manage Spark. To really do Debian packaging right, we should be creating multiple packages -- perhaps spark-core, spark-sql, spark-mllib, spark-graphx, spark-examples, etc. The only thing really preventing us from doing this is that proper Debian packaging would require a package maintainer willing and committed to do all of the maintenance work. Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -Pdeb) -- Key: SPARK-2614 URL: https://issues.apache.org/jira/browse/SPARK-2614 Project: Spark Issue Type: Improvement Components: Build, Deploy Reporter: Christian Tzolov The tar.gz distribution already includes the spark-examples.jar in the bundle. It is common practice for installers to run SparkPi as a smoke test to verify that the installation is OK:
{code}
/usr/share/spark/bin/spark-submit \
  --num-executors 10 --master yarn-cluster \
  --class org.apache.spark.examples.SparkPi \
  /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
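For context, the SparkPi smoke test referenced above amounts to a small Monte Carlo estimate of pi. A self-contained sketch of the same computation (not the packaged example itself) looks roughly like this:
{code}
import scala.math.random
import org.apache.spark.{SparkConf, SparkContext}

object SparkPiSmokeTest {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("SparkPi smoke test"))
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices
    // Count how many random points in the unit square fall inside the unit circle.
    val count = sc.parallelize(1 to n, slices).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    sc.stop()
  }
}
{code}
If the cluster and the installed jars are healthy, the job prints a value close to 3.14, which is why it works well as a post-install check.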
[jira] [Updated] (SPARK-1642) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083
[ https://issues.apache.org/jira/browse/SPARK-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1642: - Target Version/s: 1.1.0 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083 --- Key: SPARK-1642 URL: https://issues.apache.org/jira/browse/SPARK-1642 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Ted Malaska Assignee: Ted Malaska Priority: Minor This will add support for SSL encryption between Flume AvroSink and Spark Streaming. It is based on FLUME-2083 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI
[ https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1853: - Target Version/s: 1.1.0 Show Streaming application code context (file, line number) in Spark Stages UI -- Key: SPARK-1853 URL: https://issues.apache.org/jira/browse/SPARK-1853 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Assignee: Mubarak Seyed Fix For: 1.1.0 Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png Right now, the code context (file, and line number) shown for streaming jobs in stages UI is meaningless as it refers to internal DStream:random line rather than user application file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2464) Twitter Receiver does not stop correctly when streamingContext.stop is called
[ https://issues.apache.org/jira/browse/SPARK-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2464: - Target Version/s: 1.1.0, 1.0.2 Twitter Receiver does not stop correctly when streamingContext.stop is called - Key: SPARK-2464 URL: https://issues.apache.org/jira/browse/SPARK-2464 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.0.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark
[ https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2345: - Issue Type: Wish (was: Bug) ForEachDStream should have an option of running the foreachfunc on Spark Key: SPARK-2345 URL: https://issues.apache.org/jira/browse/SPARK-2345 Project: Spark Issue Type: Wish Components: Streaming Reporter: Hari Shreedharan Today the Job generated simply calls the foreachfunc, but does not run it on spark itself using the sparkContext.runJob method. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1854) Add a version of StreamingContext.fileStream that take hadoop conf object
[ https://issues.apache.org/jira/browse/SPARK-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1854: - Fix Version/s: (was: 1.1.0) Add a version of StreamingContext.fileStream that take hadoop conf object - Key: SPARK-1854 URL: https://issues.apache.org/jira/browse/SPARK-1854 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2464) Twitter Receiver does not stop correctly when streamingContext.stop is called
[ https://issues.apache.org/jira/browse/SPARK-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2464: - Fix Version/s: (was: 1.0.2) (was: 1.1.0) Twitter Receiver does not stop correctly when streamingContext.stop is called - Key: SPARK-2464 URL: https://issues.apache.org/jira/browse/SPARK-2464 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.0.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2379) stopReceive in dead loop, cause stackoverflow exception
[ https://issues.apache.org/jira/browse/SPARK-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070915#comment-14070915 ] Tathagata Das commented on SPARK-2379: -- Any information on this? If we have no way to reproduce this, I will close this. stopReceive in dead loop, cause stackoverflow exception --- Key: SPARK-2379 URL: https://issues.apache.org/jira/browse/SPARK-2379 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0 Reporter: sunsc In streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala, stop will call stopReceiver, and stopReceiver will call stop if an exception occurs, which makes a dead loop. -- This message was sent by Atlassian JIRA (v6.2#6252)
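Schematically, the reported cycle is stop() calling stopReceiver(), which calls stop() again on failure; a guard flag like the one below is one way such a loop can be broken. This is an illustration with simplified names, not the actual ReceiverSupervisor code:
{code}
class SupervisorSketch {
  @volatile private var stopping = false

  def stop(message: String, error: Option[Throwable]): Unit = {
    if (stopping) return   // guard that breaks the stop -> stopReceiver -> stop cycle
    stopping = true
    stopReceiver(message, error)
  }

  def stopReceiver(message: String, error: Option[Throwable]): Unit =
    try {
      // receiver.onStop() would go here
    } catch {
      case t: Throwable => stop("Error stopping receiver", Some(t))
    }
}
{code}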
[jira] [Updated] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2447: - Target Version/s: 1.1.0 Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
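One possible shape for such a helper, assuming the classic HBase client API (HTable/Put) and hypothetical table and column names, is a foreachPartition that reuses one connection per partition and turns auto-flush off for throughput. This is only a sketch of the idea in the description, not the design to be reviewed:
{code}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

object HBaseUpsertSketch {
  // rows: (rowKey, value) pairs; "t", "cf" and "col" are hypothetical names.
  def bulkPut(rows: RDD[(String, String)]): Unit = {
    rows.foreachPartition { iter =>
      val table = new HTable(HBaseConfiguration.create(), "t")
      table.setAutoFlush(false)   // batch mutations for higher throughput
      iter.foreach { case (key, value) =>
        val put = new Put(Bytes.toBytes(key))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        table.put(put)
      }
      table.flushCommits()
      table.close()
    }
  }
}
{code}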
[jira] [Updated] (SPARK-2377) Create a Python API for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2377: - Target Version/s: 1.1.0 Create a Python API for Spark Streaming --- Key: SPARK-2377 URL: https://issues.apache.org/jira/browse/SPARK-2377 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Nicholas Chammas [Spark Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html] currently offers APIs in Scala and Java. It would be a great feature addition to have a Python API as well. This is probably a large task that will span many issues if undertaken. This ticket should provide some place to track overall progress towards an initial Python API for Spark Streaming. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2377) Create a Python API for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2377: - Fix Version/s: (was: 1.1.0) Create a Python API for Spark Streaming --- Key: SPARK-2377 URL: https://issues.apache.org/jira/browse/SPARK-2377 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Nicholas Chammas [Spark Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html] currently offers APIs in Scala and Java. It would be a great feature addition to have a Python API as well. This is probably a large task that will span many issues if undertaken. This ticket should provide some place to track overall progress towards an initial Python API for Spark Streaming. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-2377) Create a Python API for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-2377: Assignee: Tathagata Das Create a Python API for Spark Streaming --- Key: SPARK-2377 URL: https://issues.apache.org/jira/browse/SPARK-2377 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Nicholas Chammas Assignee: Tathagata Das [Spark Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html] currently offers APIs in Scala and Java. It would be a great feature addition to have a Python API as well. This is probably a large task that will span many issues if undertaken. This ticket should provide some place to track overall progress towards an initial Python API for Spark Streaming. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2447: - Component/s: Streaming Spark Core Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Tathagata Das Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1729) Make Flume pull data from source, rather than the current push model
[ https://issues.apache.org/jira/browse/SPARK-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1729: - Assignee: Hari Shreedharan (was: Tathagata Das) Make Flume pull data from source, rather than the current push model Key: SPARK-1729 URL: https://issues.apache.org/jira/browse/SPARK-1729 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Assignee: Hari Shreedharan Fix For: 1.1.0 This makes sure that if the Spark executor running the receiver goes down, the new receiver on a new node can still get data from Flume. This is not possible in the current model, as Flume is configured to push data to an executor/worker, and if that worker is down, Flume can't push data. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2599) almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0
[ https://issues.apache.org/jira/browse/SPARK-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070929#comment-14070929 ] DB Tsai commented on SPARK-2599: I'm the original author of `almostEquals`, which I implemented for my unit tests, and I also noticed that it suffers when comparing against 0.0. As [~srowen] pointed out, it's meaningless to compare against 0.0 (or a really small number) with relative error. However, people may still want to write unit tests using relative error even when comparing such small numbers. So I propose the following APIs: `a ~== b +- eps` for relative error, which falls back to absolute error when a or b is near zero (say, smaller than 1e-15); and `a === b +- eps` for absolute error, which is already in ScalaTest 2.0, but since we don't use ScalaTest 2.0 yet, we build the same API in MLlib. almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0 --- Key: SPARK-2599 URL: https://issues.apache.org/jira/browse/SPARK-2599 Project: Spark Issue Type: Bug Components: MLlib Reporter: Doris Xin Priority: Minor DoubleWithAlmostEquals.almostEquals, when used to compare a number with 0.0, would always produce an epsilon of 1 > 1e-10, causing false failure when comparing very small numbers with 0.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
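To make the proposal concrete, here is a minimal sketch of the hybrid comparison described above: relative error that falls back to absolute error when both values are near zero. The method names and the 1e-15 threshold are illustrative, not the final MLlib API:
{code}
object AlmostEqualsSketch {
  // Relative comparison that falls back to an absolute comparison when both values
  // are essentially zero, where relative error is meaningless.
  def relTol(a: Double, b: Double, eps: Double, zeroThreshold: Double = 1e-15): Boolean =
    if (math.abs(a) < zeroThreshold && math.abs(b) < zeroThreshold) {
      math.abs(a - b) < eps
    } else {
      math.abs(a - b) / math.max(math.abs(a), math.abs(b)) < eps
    }

  // Plain absolute comparison, the `a === b +- eps` case.
  def absTol(a: Double, b: Double, eps: Double): Boolean = math.abs(a - b) < eps
}
{code}
For example, relTol(1e-18, 0.0, 1e-6) passes under this scheme, whereas a purely relative check would always fail against 0.0.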
[jira] [Updated] (SPARK-1730) Make receiver store data reliably to avoid data-loss on executor failures
[ https://issues.apache.org/jira/browse/SPARK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1730: - Assignee: Hari Shreedharan Make receiver store data reliably to avoid data-loss on executor failures - Key: SPARK-1730 URL: https://issues.apache.org/jira/browse/SPARK-1730 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Assignee: Hari Shreedharan Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2438) Streaming + MLLib
[ https://issues.apache.org/jira/browse/SPARK-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2438: - Target Version/s: 1.1.0 Streaming + MLLib - Key: SPARK-2438 URL: https://issues.apache.org/jira/browse/SPARK-2438 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Reporter: Jeremy Freeman Assignee: Jeremy Freeman Labels: features This is a ticket to track progress on developing streaming analyses in MLLib. Many streaming applications benefit from or require fitting models online, where the parameters of a model (e.g. regression, clustering) are updated continually as new data arrive. This can be accomplished by incorporating MLLib algorithms into model-updating operations over DStreams. In some cases this can be achieved using existing updaters (e.g. those based on SGD), but in other cases will require custom update rules (e.g. for KMeans). The goal is to have streaming versions of many common algorithms, in particular regression, classification, clustering, and possibly dimensionality reduction. -- This message was sent by Atlassian JIRA (v6.2#6252)
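As a rough illustration of the "existing updaters" route mentioned in the ticket, a model can already be refit on every batch of a DStream via foreachRDD, warm-starting from the previous weights. This is only a sketch of that pattern using MLlib's SGD-based linear regression; the streaming wrappers themselves are what the ticket proposes to add:
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.streaming.dstream.DStream

object StreamingRegressionSketch {
  // Refit a linear model on each batch, starting from the previous weights.
  def train(data: DStream[LabeledPoint], numFeatures: Int): Unit = {
    var weights = Vectors.dense(new Array[Double](numFeatures))
    data.foreachRDD { rdd =>
      if (rdd.count() > 0) {
        val model = LinearRegressionWithSGD.train(rdd, 20, 0.1, 1.0, weights)
        weights = model.weights
      }
    }
  }
}
{code}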
[jira] [Updated] (SPARK-2438) Streaming + MLLib
[ https://issues.apache.org/jira/browse/SPARK-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2438: - Component/s: Streaming Streaming + MLLib - Key: SPARK-2438 URL: https://issues.apache.org/jira/browse/SPARK-2438 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Reporter: Jeremy Freeman Assignee: Jeremy Freeman Labels: features This is a ticket to track progress on developing streaming analyses in MLLib. Many streaming applications benefit from or require fitting models online, where the parameters of a model (e.g. regression, clustering) are updated continually as new data arrive. This can be accomplished by incorporating MLLib algorithms into model-updating operations over DStreams. In some cases this can be achieved using existing updaters (e.g. those based on SGD), but in other cases will require custom update rules (e.g. for KMeans). The goal is to have streaming versions of many common algorithms, in particular regression, classification, clustering, and possibly dimensionality reduction. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1645: - Issue Type: Improvement (was: New Feature) Improve Spark Streaming compatibility with Flume Key: SPARK-1645 URL: https://issues.apache.org/jira/browse/SPARK-1645 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Hari Shreedharan Currently the following issues affect Spark Streaming and Flume compatibility: * If a Spark worker goes down, it needs to be restarted on the same node, else Flume cannot send data to it. We can fix this by adding a Flume receiver that polls Flume, and a Flume sink that supports this. * The receiver sends acks to Flume before the driver knows about the data. The new receiver should also handle this case. * Data loss when the driver goes down - this is true for any streaming ingest, not just Flume. I will file a separate JIRA for this and we should work on it there. This is a longer-term project and requires considerable development work. I intend to start working on these soon. Any input is appreciated. (It'd be great if someone could add me as a contributor on JIRA, so I can assign the JIRA to myself.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1645: - Issue Type: New Feature (was: Improvement) Improve Spark Streaming compatibility with Flume Key: SPARK-1645 URL: https://issues.apache.org/jira/browse/SPARK-1645 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Hari Shreedharan Currently the following issues affect Spark Streaming and Flume compatibility: * If a Spark worker goes down, it needs to be restarted on the same node, else Flume cannot send data to it. We can fix this by adding a Flume receiver that polls Flume, and a Flume sink that supports this. * The receiver sends acks to Flume before the driver knows about the data. The new receiver should also handle this case. * Data loss when the driver goes down - this is true for any streaming ingest, not just Flume. I will file a separate JIRA for this and we should work on it there. This is a longer-term project and requires considerable development work. I intend to start working on these soon. Any input is appreciated. (It'd be great if someone could add me as a contributor on JIRA, so I can assign the JIRA to myself.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2438) Streaming + MLLib
[ https://issues.apache.org/jira/browse/SPARK-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2438: - Issue Type: New Feature (was: Improvement) Streaming + MLLib - Key: SPARK-2438 URL: https://issues.apache.org/jira/browse/SPARK-2438 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Jeremy Freeman Assignee: Jeremy Freeman Labels: features This is a ticket to track progress on developing streaming analyses in MLLib. Many streaming applications benefit from or require fitting models online, where the parameters of a model (e.g. regression, clustering) are updated continually as new data arrive. This can be accomplished by incorporating MLLib algorithms into model-updating operations over DStreams. In some cases this can be achieved using existing updaters (e.g. those based on SGD), but in other cases will require custom update rules (e.g. for KMeans). The goal is to have streaming versions of many common algorithms, in particular regression, classification, clustering, and possibly dimensionality reduction. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1730) Make receiver store data reliably to avoid data-loss on executor failures
[ https://issues.apache.org/jira/browse/SPARK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1730: - Target Version/s: 1.1.0 Fix Version/s: (was: 1.1.0) Make receiver store data reliably to avoid data-loss on executor failures - Key: SPARK-1730 URL: https://issues.apache.org/jira/browse/SPARK-1730 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2548) JavaRecoverableWordCount is missing
[ https://issues.apache.org/jira/browse/SPARK-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2548: - Target Version/s: 1.1.0, 1.0.2, 0.9.3 (was: 1.1.0, 0.9.3) JavaRecoverableWordCount is missing --- Key: SPARK-2548 URL: https://issues.apache.org/jira/browse/SPARK-2548 Project: Spark Issue Type: Bug Components: Documentation, Streaming Affects Versions: 0.9.2, 1.0.1 Reporter: Xiangrui Meng Priority: Minor JavaRecoverableWordCount was mentioned in the doc but not in the codebase. We need to rewrite the example because the code was lost during the migration from spark/spark-incubating to apache/spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2629) Improve performance of DStream.updateStateByKey using IndexRDD
Tathagata Das created SPARK-2629: Summary: Improve performance of DStream.updateStateByKey using IndexRDD Key: SPARK-2629 URL: https://issues.apache.org/jira/browse/SPARK-2629 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2629) Improve performance of DStream.updateStateByKey using IndexRDD
[ https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070939#comment-14070939 ] Tathagata Das commented on SPARK-2629: -- Index RDD is necessary for this improvement to be made. Improve performance of DStream.updateStateByKey using IndexRDD -- Key: SPARK-2629 URL: https://issues.apache.org/jira/browse/SPARK-2629 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.2#6252)
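For reference, the operation being optimized looks like the sketch below; today updateStateByKey rebuilds the full state RDD by cogrouping it with each new batch, which is presumably the cost an IndexRDD-backed implementation would cut down. This is ordinary current usage, not the proposed change, and the running word count is just an illustrative example:
{code}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

object UpdateStateSketch {
  // Running count per key; updateStateByKey touches every key's state on every batch,
  // and requires ssc.checkpoint(...) to be set on the StreamingContext.
  def runningCounts(words: DStream[String]): DStream[(String, Long)] = {
    val updateFunc = (newValues: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + newValues.sum)
    words.map(w => (w, 1L)).updateStateByKey[Long](updateFunc)
  }
}
{code}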
[jira] [Commented] (SPARK-1642) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083
[ https://issues.apache.org/jira/browse/SPARK-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070938#comment-14070938 ] Ted Malaska commented on SPARK-1642: Are there any changes needed here? Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083 --- Key: SPARK-1642 URL: https://issues.apache.org/jira/browse/SPARK-1642 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Ted Malaska Assignee: Ted Malaska Priority: Minor This will add support for SSL encryption between Flume AvroSink and Spark Streaming. It is based on FLUME-2083 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070966#comment-14070966 ] Marcelo Vanzin commented on SPARK-2420: --- I'm all for sanitizing dependencies, but just be aware that this change might break people's applications. Any application that uses APIs available only in Guava 11, relies on Spark's transitive dependency on Guava, and relies on the Guava classes being packaged with the Spark uber-jar will break. This is the main reason why all this dependency stuff is tricky. As soon as you package things in a particular way, people start depending on it, and now you always have to keep that in consideration when you make a change. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because the Spark build contains versions of libraries that are vastly different from the current major Hadoop version. It would be nice if we could choose versions that are in line with Hadoop, or shade them in the assembly. Here is the wish list: 1. Upgrade the protobuf version to 2.5.0 from the current 2.4.1. 2. Shade Spark's jetty and servlet dependencies in the assembly. 3. Guava version difference. Spark is using a higher version; I'm not sure what the best solution for this is. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied to Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI
[ https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1853: - Assignee: Mubarak Seyed (was: Tathagata Das) Show Streaming application code context (file, line number) in Spark Stages UI -- Key: SPARK-1853 URL: https://issues.apache.org/jira/browse/SPARK-1853 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Assignee: Mubarak Seyed Fix For: 1.1.0 Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png Right now, the code context (file, and line number) shown for streaming jobs in stages UI is meaningless as it refers to internal DStream:random line rather than user application file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2630) Input data size goes overflow when size is large then 4G in one task
Davies Liu created SPARK-2630: - Summary: Input data size goes overflow when size is large then 4G in one task Key: SPARK-2630 URL: https://issues.apache.org/jira/browse/SPARK-2630 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Priority: Critical Given one big file, such as text.4.3G, put it in one task: sc.textFile("text.4.3.G").coalesce(1).count() In the Spark Web UI, you will see that the input size is 5.4M. -- This message was sent by Atlassian JIRA (v6.2#6252)
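A multi-gigabyte input showing up as a few megabytes is the signature of a byte counter wrapping around 32 bits (4.3 GB mod 2^32 is roughly 5 MB). Whether that is the actual cause here is for the fix to confirm; the snippet below only illustrates the arithmetic:
{code}
object InputSizeOverflowSketch {
  def main(args: Array[String]): Unit = {
    val actualBytes = 4300000000L              // ~4.3 GB, illustrative figure
    val wrapped = actualBytes % (1L << 32)     // what a 32-bit byte counter would report
    println(s"real input size: $actualBytes bytes")
    println(s"after wrapping:  $wrapped bytes (~${(wrapped / 1e6).round} MB)")
  }
}
{code}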
[jira] [Updated] (SPARK-2630) Input data size goes overflow when size is large then 4G in one task
[ https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-2630: -- Attachment: overflow.tiff The input size is shown as 5.8MB, but the real input size is 4.3G. Input data size goes overflow when size is large then 4G in one task Key: SPARK-2630 URL: https://issues.apache.org/jira/browse/SPARK-2630 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Priority: Critical Attachments: overflow.tiff Given one big file, such as text.4.3G, put it in one task: sc.textFile("text.4.3.G").coalesce(1).count() In the Spark Web UI, you will see that the input size is 5.4M. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2631) In-memory Compression is not configured with SQLConf
Michael Armbrust created SPARK-2631: --- Summary: In-memory Compression is not configured with SQLConf Key: SPARK-2631 URL: https://issues.apache.org/jira/browse/SPARK-2631 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071039#comment-14071039 ] Patrick Wendell commented on SPARK-2282: [~carlilek] I'd actually recommend just pulling Spark from the branch-1.0 maintenance branch. We usually recommend users do this since we only add stability fixes on those branches. PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a Linux server): echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse and echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle, or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
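For illustration only, the SO_REUSEADDR option mentioned at the end of the description is a standard java.net socket option; a generic sketch of setting it before connecting (not the actual PySpark accumulator code) looks like this:
{code}
import java.net.{InetSocketAddress, Socket}

object ReuseAddrSketch {
  def connect(host: String, port: Int): Socket = {
    val socket = new Socket()
    // Must be set before connect/bind; lets the local port be reused even while
    // an earlier connection to the same endpoint lingers in TIME_WAIT.
    socket.setReuseAddress(true)
    socket.connect(new InetSocketAddress(host, port))
    socket
  }
}
{code}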
[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2426: - Target Version/s: (was: 1.1.0) Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add ADMM and IPM based QuadraticMinimization solvers to breeze.optimize.quadratic package. 2. Add a QpSolver object in spark mllib optimization which calls breeze 3. Add the QpSolver object in spark mllib ALS -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2613) CLONE - word2vec: Distributed Representation of Words
[ https://issues.apache.org/jira/browse/SPARK-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2613. -- Resolution: Duplicate CLONE - word2vec: Distributed Representation of Words - Key: SPARK-2613 URL: https://issues.apache.org/jira/browse/SPARK-2613 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yifan Yang Assignee: Liquan Pei Original Estimate: 672h Remaining Estimate: 672h We would like to add parallel implementation of word2vec to MLlib. word2vec finds distributed representation of words through training of large data sets. The Spark programming model fits nicely with word2vec as the training algorithm of word2vec is embarrassingly parallel. We will focus on skip-gram model and negative sampling in our initial implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1545) Add Random Forest algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1545: - Target Version/s: (was: 1.1.0) Add Random Forest algorithm to MLlib Key: SPARK-1545 URL: https://issues.apache.org/jira/browse/SPARK-1545 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Manish Amde Assignee: Manish Amde This task requires adding Random Forest support to Spark MLlib. The implementation needs to adapt the classic algorithm to the scalable tree implementation. The tasks involves: - Comparing the various tradeoffs and finalizing the algorithm before implementation - Code implementation - Unit tests - Functional tests - Performance tests - Documentation -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1547: - Target Version/s: (was: 1.1.0) Add gradient boosting algorithm to MLlib Key: SPARK-1547 URL: https://issues.apache.org/jira/browse/SPARK-1547 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Manish Amde Assignee: Manish Amde This task requires adding the gradient boosting algorithm to Spark MLlib. The implementation needs to adapt the gradient boosting algorithm to the scalable tree implementation. The tasks involves: - Comparing the various tradeoffs and finalizing the algorithm before implementation - Code implementation - Unit tests - Functional tests - Performance tests - Documentation -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2010) Support for nested data in PySpark SQL
[ https://issues.apache.org/jira/browse/SPARK-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2010: Priority: Blocker (was: Critical) Support for nested data in PySpark SQL -- Key: SPARK-2010 URL: https://issues.apache.org/jira/browse/SPARK-2010 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Kan Zhang Priority: Blocker -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2632) Importing a method of class in Spark REPL causes the REPL to pulls in unnecessary stuff.
Yin Huai created SPARK-2632: --- Summary: Importing a method of class in Spark REPL causes the REPL to pulls in unnecessary stuff. Key: SPARK-2632 URL: https://issues.apache.org/jira/browse/SPARK-2632 Project: Spark Issue Type: Bug Reporter: Yin Huai Priority: Blocker To reproduce the exception, you can start a local cluster (sbin/start-all.sh) and then open a spark shell.
{code}
class X() { println("What!"); def y = 3 }
val x = new X
import x.y
case class Person(name: String, age: Int)
sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).collect
{code}
Then you will find the exception. I am attaching the stack trace below...
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: $iwC$$iwC$$iwC$$iwC$X
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1029)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1027)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1027)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:632)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1230)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
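Until the underlying REPL wrapping is fixed, two workarounds (my suggestions, not from the ticket) usually sidestep the non-serializable result: make the REPL-defined class Serializable, or avoid importing members off the instance so later lines don't reference it at all. Whether the second one helps depends on how the REPL wraps the lines, so the first is the safer of the two:
{code}
// Workaround A: make the REPL-defined class serializable, so that even if it gets
// dragged into the closure, tasks can still serialize the result.
class X() extends Serializable { println("What!"); def y = 3 }
val x = new X

// Workaround B: copy the needed value into a plain local instead of `import x.y`,
// so the following lines never reference the X instance.
val y = x.y

case class Person(name: String, age: Int)
sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .collect()
{code}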
[jira] [Updated] (SPARK-2632) Importing a method of a class in the Spark REPL causes the REPL to pull in unnecessary stuff.
[ https://issues.apache.org/jira/browse/SPARK-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-2632: Description: Master is affected by this bug. To reproduce the exception, you can start a local cluster (sbin/start-all.sh) and then open a Spark shell.
{code}
class X() { println("What!"); def y = 3 }
val x = new X
import x.y
case class Person(name: String, age: Int)
sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).collect
{code}
Then you will find the exception. I am attaching the stack trace below... {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: $iwC$$iwC$$iwC$$iwC$X at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1029) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1027) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1027) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:632) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1230) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} was: To reproduce the exception, you can start a local cluster (sbin/start-all.sh) and then open a Spark shell.
{code}
class X() { println("What!"); def y = 3 }
val x = new X
import x.y
case class Person(name: String, age: Int)
sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).collect
{code}
Then you will find the exception. I am attaching the stack trace below... 
{code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: $iwC$$iwC$$iwC$$iwC$X at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1029) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1027) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1027) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:632) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1230) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
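As context for the reproduction above, a hedged workaround sketch (not a fix for the underlying REPL behavior): the imported method {{y}} makes the REPL-generated wrapper capture the instance {{x}}, so either dropping the {{import x.y}} or marking {{X}} as serializable lets the example run. The snippet below reuses the same file path and classes from the reproduction and is illustrative only.
{code}
// Workaround sketch only: make X Serializable so the captured REPL wrapper
// can be serialized along with the task result.
class X() extends Serializable { println("What!"); def y = 3 }
val x = new X
import x.y

case class Person(name: String, age: Int)
sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .collect()
{code}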
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071192#comment-14071192 ] Aaron Davidson commented on SPARK-2282: --- [~pwendell] That would in general be the right solution, but this particular change hasn't been merged yet (referring to the second PR on this bug, which is a more complete fix). PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a Linux server): echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse; echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle; or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
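A minimal sketch of the second option mentioned in the description (enabling SO_REUSEADDR when the socket is created); the host and port below are placeholders, not Spark's actual accumulator endpoint, and this is not the code of the merged fix.
{code}
import java.net.{InetSocketAddress, Socket}

// Sketch only: set SO_REUSEADDR before connecting, so addresses held by
// sockets lingering in TIME_WAIT can be reused instead of exhausting
// the ephemeral port range.
val socket = new Socket()
socket.setReuseAddress(true) // SO_REUSEADDR
socket.connect(new InetSocketAddress("localhost", 12345))
// ... write the accumulator updates, then close ...
socket.close()
{code}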
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071197#comment-14071197 ] Patrick Wendell commented on SPARK-2282: Ah, my bad. I was confused. PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a Linux server): echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse; echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle; or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2615) Add == support for HiveQl
[ https://issues.apache.org/jira/browse/SPARK-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071214#comment-14071214 ] Cheng Hao commented on SPARK-2615: -- Yes, that's true. But == is actually used in lots of unit test queries, and without this being fixed in Spark SQL, some of them may not be able to pass the unit tests. For example: https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/groupby_grouping_id1.q Probably we should create a JIRA issue for Hive, either for the documentation or the source code. Add == support for HiveQl --- Key: SPARK-2615 URL: https://issues.apache.org/jira/browse/SPARK-2615 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor Currently, passing == instead of = in a Hive QL expression causes an exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
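For illustration, this is the kind of query the change is about, written against a hypothetical {{src}} table (sketch only; it assumes a Spark build with Hive support and the {{HiveContext.hql}} entry point available in the Spark 1.0 line):
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// With this fix, `==` in a predicate should parse and behave the same as `=`
// instead of raising an exception.
hiveContext.hql("SELECT key, value FROM src WHERE key = 238").collect()
hiveContext.hql("SELECT key, value FROM src WHERE key == 238").collect()
{code}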
[jira] [Commented] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -Pdeb)
[ https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071220#comment-14071220 ] Christian Tzolov commented on SPARK-2614: - Fair point [~markhamstra]. I agree about the benefits of having dedicated packages. Actually I've tried a patch that wraps the examples.jar in a separate spark-examples Debian package. The code is here: https://github.com/tzolov/spark/tree/SPARK-2614-2 Shall I open a pull request for this implementation instead? At least until the proper packaging comes into place. Regarding the proper Debian (and hopefully RPM) packaging, I am afraid I have too little experience with this subject to get it right. Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -Pdeb) -- Key: SPARK-2614 URL: https://issues.apache.org/jira/browse/SPARK-2614 Project: Spark Issue Type: Improvement Components: Build, Deploy Reporter: Christian Tzolov The tar.gz distribution already includes the spark-examples.jar in the bundle. It is a common practice for installers to run SparkPi as a smoke test to verify that the installation is OK: /usr/share/spark/bin/spark-submit \ --num-executors 10 --master yarn-cluster \ --class org.apache.spark.examples.SparkPi \ /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-975) Spark Replay Debugger
[ https://issues.apache.org/jira/browse/SPARK-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071225#comment-14071225 ] Phuoc Do commented on SPARK-975: Cheng Lian, some JS libraries that can draw flow diagrams: http://www.graphdracula.net/ http://www.daviddurman.com/automatic-graph-layout-with-jointjs-and-dagre.html Your diagram in the 01/May/14 comment doesn't look like this one: http://spark-replay-debugger-overview.readthedocs.org/en/latest/_static/als-1-large.png Has the requirement changed? Spark Replay Debugger - Key: SPARK-975 URL: https://issues.apache.org/jira/browse/SPARK-975 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 0.9.0 Reporter: Cheng Lian Labels: arthur, debugger Attachments: RDD DAG.png The Spark debugger was first mentioned as {{rddbg}} in the [RDD technical report|http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf]. [Arthur|https://github.com/mesos/spark/tree/arthur], authored by [Ankur Dave|https://github.com/ankurdave], is an old implementation of the Spark debugger, which demonstrated both the elegance and power behind the RDD abstraction. Unfortunately, the corresponding GitHub branch was not merged into the master branch and development stopped 2 years ago. For more information about Arthur, please refer to [the Spark Debugger Wiki page|https://github.com/mesos/spark/wiki/Spark-Debugger] in the old GitHub repository. As a useful tool for Spark application debugging and analysis, it would be nice to have a complete Spark debugger. In [PR-224|https://github.com/apache/incubator-spark/pull/224], I propose a new implementation of the Spark debugger, the Spark Replay Debugger (SRD). [PR-224|https://github.com/apache/incubator-spark/pull/224] is only a preview for discussion. In the current version, I only implemented features that can illustrate the basic mechanisms. There are still features that appeared in Arthur but are missing in SRD, such as checksum-based nondeterminism detection and single-task debugging with a conventional debugger (like {{jdb}}). However, these features can be easily built upon the current SRD framework. To minimize code review effort, I didn't include them in the current version intentionally. Attached is the visualization of the MLlib ALS application (with 1 iteration) generated by SRD. For more information, please refer to [the SRD overview document|http://spark-replay-debugger-overview.readthedocs.org/en/latest/]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-975) Spark Replay Debugger
[ https://issues.apache.org/jira/browse/SPARK-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071232#comment-14071232 ] Cheng Lian commented on SPARK-975: -- Hey [~phuocd], that image actually shows exactly the same issue I commented on. Take RDD #0 to #4 as an example: #0 to #3 form a stage, and #0, #1, #2 and #4 form another. These two stages share #0 to #2, and should overlap, and the generated dot file describes the topology correctly. But Graphviz gives wrong bounding boxes, just like the right part of the image I used in the comment. Spark Replay Debugger - Key: SPARK-975 URL: https://issues.apache.org/jira/browse/SPARK-975 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 0.9.0 Reporter: Cheng Lian Labels: arthur, debugger Attachments: RDD DAG.png The Spark debugger was first mentioned as {{rddbg}} in the [RDD technical report|http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf]. [Arthur|https://github.com/mesos/spark/tree/arthur], authored by [Ankur Dave|https://github.com/ankurdave], is an old implementation of the Spark debugger, which demonstrated both the elegance and power behind the RDD abstraction. Unfortunately, the corresponding GitHub branch was not merged into the master branch and development stopped 2 years ago. For more information about Arthur, please refer to [the Spark Debugger Wiki page|https://github.com/mesos/spark/wiki/Spark-Debugger] in the old GitHub repository. As a useful tool for Spark application debugging and analysis, it would be nice to have a complete Spark debugger. In [PR-224|https://github.com/apache/incubator-spark/pull/224], I propose a new implementation of the Spark debugger, the Spark Replay Debugger (SRD). [PR-224|https://github.com/apache/incubator-spark/pull/224] is only a preview for discussion. In the current version, I only implemented features that can illustrate the basic mechanisms. There are still features that appeared in Arthur but are missing in SRD, such as checksum-based nondeterminism detection and single-task debugging with a conventional debugger (like {{jdb}}). However, these features can be easily built upon the current SRD framework. To minimize code review effort, I didn't include them in the current version intentionally. Attached is the visualization of the MLlib ALS application (with 1 iteration) generated by SRD. For more information, please refer to [the SRD overview document|http://spark-replay-debugger-overview.readthedocs.org/en/latest/]. -- This message was sent by Atlassian JIRA (v6.2#6252)