[jira] [Resolved] (SPARK-2805) Update akka to version 2.3.4
[ https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2805. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Anand Avati Update akka to version 2.3.4 Key: SPARK-2805 URL: https://issues.apache.org/jira/browse/SPARK-2805 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati Assignee: Anand Avati Fix For: 1.2.0 akka-2.3 is the lowest version available for Scala 2.11. akka-2.3 depends on protobuf 2.5, while Hadoop-1 requires protobuf 2.4.1. To reconcile the conflicting dependencies, we need to release an akka-2.3.x-shaded-protobuf artifact with protobuf 2.5 shaded inside. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2805) Update akka to version 2.3.4
[ https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2805: --- Summary: Update akka to version 2.3.4 (was: Update akka to version 2.3) Update akka to version 2.3.4 Key: SPARK-2805 URL: https://issues.apache.org/jira/browse/SPARK-2805 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati Fix For: 1.2.0 akka-2.3 is the lowest version available for Scala 2.11. akka-2.3 depends on protobuf 2.5, while Hadoop-1 requires protobuf 2.4.1. To reconcile the conflicting dependencies, we need to release an akka-2.3.x-shaded-protobuf artifact with protobuf 2.5 shaded inside.
[jira] [Updated] (SPARK-2805) Update akka to version 2.3
[ https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2805: --- Summary: Update akka to version 2.3 (was: update akka to version 2.3) Update akka to version 2.3 -- Key: SPARK-2805 URL: https://issues.apache.org/jira/browse/SPARK-2805 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati Fix For: 1.2.0 akka-2.3 is the lowest version available for Scala 2.11. akka-2.3 depends on protobuf 2.5, while Hadoop-1 requires protobuf 2.4.1. To reconcile the conflicting dependencies, we need to release an akka-2.3.x-shaded-protobuf artifact with protobuf 2.5 shaded inside.
[jira] [Created] (SPARK-3872) Rewrite the test for ActorInputStream.
Prashant Sharma created SPARK-3872: -- Summary: Rewrite the test for ActorInputStream. Key: SPARK-3872 URL: https://issues.apache.org/jira/browse/SPARK-3872 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Prashant Sharma
[jira] [Assigned] (SPARK-3872) Rewrite the test for ActorInputStream.
[ https://issues.apache.org/jira/browse/SPARK-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma reassigned SPARK-3872: -- Assignee: Prashant Sharma Rewrite the test for ActorInputStream. --- Key: SPARK-3872 URL: https://issues.apache.org/jira/browse/SPARK-3872 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Prashant Sharma Assignee: Prashant Sharma
[jira] [Created] (SPARK-3873) Scala style: check import ordering
Reynold Xin created SPARK-3873: -- Summary: Scala style: check import ordering Key: SPARK-3873 URL: https://issues.apache.org/jira/browse/SPARK-3873 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin
[jira] [Closed] (SPARK-3844) Truncate appName in WebUI if it is too long
[ https://issues.apache.org/jira/browse/SPARK-3844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3844. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Target Version/s: 1.1.1, 1.2.0 (was: 1.2.0) Truncate appName in WebUI if it is too long --- Key: SPARK-3844 URL: https://issues.apache.org/jira/browse/SPARK-3844 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Trivial Fix For: 1.1.1, 1.2.0 Attachments: long-title.png If `appName` is too long, it may move off the navbar. We can put the full name inside `title` attribute while truncating the displayed name.
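The truncation idea in SPARK-3844 can be sketched as follows. This is a hypothetical illustration, not the actual Spark UI patch: `truncate` is an invented helper, and in the real web UI the untruncated name would go into the HTML `title` attribute of the displayed element so it still appears on hover.

```java
// Hypothetical sketch of SPARK-3844's idea: show a shortened app name in the
// navbar while keeping the full name available (via the HTML `title` attribute
// in the actual web UI). `truncate` is illustrative, not Spark source code.
public class TruncateSketch {
    static String truncate(String name, int max) {
        // Short names pass through unchanged; long ones are cut and get an ellipsis.
        if (name.length() <= max) return name;
        return name.substring(0, max - 3) + "...";
    }

    public static void main(String[] args) {
        System.out.println(truncate("my-app", 20));
        System.out.println(truncate("a-very-long-application-name-that-overflows", 20));
    }
}
```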
[jira] [Commented] (SPARK-3873) Scala style: check import ordering
[ https://issues.apache.org/jira/browse/SPARK-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164841#comment-14164841 ] Patrick Wendell commented on SPARK-3873: If we can do this it would be super, duper awesome. Scala style: check import ordering -- Key: SPARK-3873 URL: https://issues.apache.org/jira/browse/SPARK-3873 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Reynold Xin
[jira] [Commented] (SPARK-3834) Backticks not correctly handled in subquery aliases
[ https://issues.apache.org/jira/browse/SPARK-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164857#comment-14164857 ] Ravindra Pesala commented on SPARK-3834: Ok [~marmbrus], I will work on it. Backticks not correctly handled in subquery aliases --- Key: SPARK-3834 URL: https://issues.apache.org/jira/browse/SPARK-3834 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Ravindra Pesala Priority: Blocker [~ravi.pesala] assigning to you since you fixed the last problem here. Let me know if you don't have time to work on this or if you have any questions.
[jira] [Created] (SPARK-3874) Provide stable TaskContext API
Patrick Wendell created SPARK-3874: -- Summary: Provide stable TaskContext API Key: SPARK-3874 URL: https://issues.apache.org/jira/browse/SPARK-3874 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Prashant Sharma We made some improvements in SPARK-3543 but for Spark 1.2 we should convert TaskContext into a fully stable API. To do this I’d suggest the following changes - note that some of this reverses parts of SPARK-3543. The goal is to provide a class that users can’t easily construct and exposes only the public functionality. 1. Separate TaskContext into a public abstract class (TaskContext) and a private implementation called TaskContextImpl. The former should be a Java abstract class - the latter should be a private[spark] Scala class to reduce visibility (or maybe we can keep it as Java and tell people not to use it?). 2. The TaskContext abstract class will have (NOTE: this changes getXX() to XX() intentionally): public isCompleted(), public isInterrupted(), public addTaskCompletionListener(...), public addTaskCompletionCallback(...) (deprecated), public stageId(), public partitionId(), public attemptId(), public isRunningLocally(); STATIC: public get(); set() and unset() at default visibility. 3. A new private[spark] static object TaskContextHelper in the same package as TaskContext will exist to expose set() and unset() from within Spark using forwarder methods that just call TaskContext.set(). If someone within Spark wants to set this they call TaskContextHelper.set() and it forwards it. 4. TaskContextImpl will be used whenever we construct a TaskContext internally.
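The four-point proposal above can be sketched in one self-contained file. This is illustrative only: the class and method names follow the proposal, but bodies, constructor arguments, and visibility are simplified (Java has no direct equivalent of Scala's `private[spark]`, so nested package-private classes stand in for it here).

```java
// Sketch of the SPARK-3874 proposal: a public abstract TaskContext, a separate
// implementation class, and a helper that forwards the package-visible
// set()/unset(). Names follow the proposal; bodies are simplified.
public class TaskContextSketch {
    static abstract class TaskContext {
        private static final ThreadLocal<TaskContext> current = new ThreadLocal<>();

        // set()/unset() at default (package) visibility, per point 2 of the proposal.
        static void set(TaskContext tc) { current.set(tc); }
        static void unset() { current.remove(); }
        public static TaskContext get() { return current.get(); }

        public abstract boolean isCompleted();
        public abstract int stageId();
        public abstract int partitionId();
    }

    // Would be private[spark] in Scala; constructed whenever Spark needs a context.
    static class TaskContextImpl extends TaskContext {
        private final int stageId;
        private final int partitionId;
        TaskContextImpl(int stageId, int partitionId) {
            this.stageId = stageId;
            this.partitionId = partitionId;
        }
        public boolean isCompleted() { return false; }
        public int stageId() { return stageId; }
        public int partitionId() { return partitionId; }
    }

    // Forwarder (point 3): internal code calls TaskContextHelper.set(), which
    // just delegates to the package-visible TaskContext.set().
    static class TaskContextHelper {
        static void set(TaskContext tc) { TaskContext.set(tc); }
        static void unset() { TaskContext.unset(); }
    }

    public static void main(String[] args) {
        TaskContextHelper.set(new TaskContextImpl(3, 7));
        System.out.println(TaskContext.get().stageId());      // 3
        System.out.println(TaskContext.get().partitionId());  // 7
        TaskContextHelper.unset();
    }
}
```

Users see only the abstract class and its static `get()`; they cannot easily construct or replace the context, which is the stated goal of the proposal.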
[jira] [Created] (SPARK-3875) Add TEMP DIRECTORY configuration
Patrick Liu created SPARK-3875: -- Summary: Add TEMP DIRECTORY configuration Key: SPARK-3875 URL: https://issues.apache.org/jira/browse/SPARK-3875 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Liu Currently, Spark uses java.io.tmpdir to find the /tmp/ directory. The /tmp/ directory is then used to: 1. set up the HTTP file server; 2. hold the broadcast directory; 3. fetch dependency files or jars on executors. The size of the /tmp/ directory will keep growing, and the free space on the system disk will shrink. I think we could add a configuration, spark.tmp.dir, in conf/spark-env.sh or conf/spark-defaults.conf to set this particular directory, for example to a data disk. If spark.tmp.dir is not set, use the default java.io.tmpdir.
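The proposed fallback can be sketched as below. `resolveTmpDir` is a hypothetical helper written for illustration, not actual Spark code; the point is only the resolution order: a user-set `spark.tmp.dir` wins, otherwise `java.io.tmpdir` is used.

```java
import java.util.Map;

// Hypothetical sketch of the SPARK-3875 proposal: prefer a user-configured
// spark.tmp.dir and fall back to the JVM's java.io.tmpdir when it is unset.
public class TmpDirSketch {
    static String resolveTmpDir(Map<String, String> conf) {
        String configured = conf.get("spark.tmp.dir");
        if (configured != null && !configured.isEmpty()) {
            return configured;  // explicit setting, e.g. a data disk
        }
        return System.getProperty("java.io.tmpdir");  // default behavior today
    }

    public static void main(String[] args) {
        System.out.println(resolveTmpDir(Map.of("spark.tmp.dir", "/data/spark-tmp")));
        System.out.println(resolveTmpDir(Map.of()));  // falls back to java.io.tmpdir
    }
}
```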
[jira] [Commented] (SPARK-3874) Provide stable TaskContext API
[ https://issues.apache.org/jira/browse/SPARK-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164874#comment-14164874 ] Reynold Xin commented on SPARK-3874: The proposal LGTM. Provide stable TaskContext API -- Key: SPARK-3874 URL: https://issues.apache.org/jira/browse/SPARK-3874 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Prashant Sharma We made some improvements in SPARK-3543 but for Spark 1.2 we should convert TaskContext into a fully stable API. To do this I’d suggest the following changes - note that some of this reverses parts of SPARK-3543. The goal is to provide a class that users can’t easily construct and exposes only the public functionality. 1. Separate TaskContext into a public abstract class (TaskContext) and a private implementation called TaskContextImpl. The former should be a Java abstract class - the latter should be a private[spark] Scala class to reduce visibility (or maybe we can keep it as Java and tell people not to use it?). 2. The TaskContext abstract class will have (NOTE: this changes getXX() to XX() intentionally): public isCompleted(), public isInterrupted(), public addTaskCompletionListener(...), public addTaskCompletionCallback(...) (deprecated), public stageId(), public partitionId(), public attemptId(), public isRunningLocally(); STATIC: public get(); set() and unset() at default visibility. 3. A new private[spark] static object TaskContextHelper in the same package as TaskContext will exist to expose set() and unset() from within Spark using forwarder methods that just call TaskContext.set(). If someone within Spark wants to set this they call TaskContextHelper.set() and it forwards it. 4. TaskContextImpl will be used whenever we construct a TaskContext internally.
[jira] [Commented] (SPARK-3875) Add TEMP DIRECTORY configuration
[ https://issues.apache.org/jira/browse/SPARK-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164876#comment-14164876 ] Patrick Liu commented on SPARK-3875: https://github.com/apache/spark/pull/2729 Add TEMP DIRECTORY configuration Key: SPARK-3875 URL: https://issues.apache.org/jira/browse/SPARK-3875 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Liu Currently, Spark uses java.io.tmpdir to find the /tmp/ directory. The /tmp/ directory is then used to: 1. set up the HTTP file server; 2. hold the broadcast directory; 3. fetch dependency files or jars on executors. The size of the /tmp/ directory will keep growing, and the free space on the system disk will shrink. I think we could add a configuration, spark.tmp.dir, in conf/spark-env.sh or conf/spark-defaults.conf to set this particular directory, for example to a data disk. If spark.tmp.dir is not set, use the default java.io.tmpdir.
[jira] [Commented] (SPARK-3875) Add TEMP DIRECTORY configuration
[ https://issues.apache.org/jira/browse/SPARK-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164880#comment-14164880 ] Apache Spark commented on SPARK-3875: - User 'kelepi' has created a pull request for this issue: https://github.com/apache/spark/pull/2729 Add TEMP DIRECTORY configuration Key: SPARK-3875 URL: https://issues.apache.org/jira/browse/SPARK-3875 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Liu Currently, Spark uses java.io.tmpdir to find the /tmp/ directory. The /tmp/ directory is then used to: 1. set up the HTTP file server; 2. hold the broadcast directory; 3. fetch dependency files or jars on executors. The size of the /tmp/ directory will keep growing, and the free space on the system disk will shrink. I think we could add a configuration, spark.tmp.dir, in conf/spark-env.sh or conf/spark-defaults.conf to set this particular directory, for example to a data disk. If spark.tmp.dir is not set, use the default java.io.tmpdir.
[jira] [Issue Comment Deleted] (SPARK-3875) Add TEMP DIRECTORY configuration
[ https://issues.apache.org/jira/browse/SPARK-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Liu updated SPARK-3875: --- Comment: was deleted (was: https://github.com/apache/spark/pull/2729) Add TEMP DIRECTORY configuration Key: SPARK-3875 URL: https://issues.apache.org/jira/browse/SPARK-3875 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Liu Currently, Spark uses java.io.tmpdir to find the /tmp/ directory. The /tmp/ directory is then used to: 1. set up the HTTP file server; 2. hold the broadcast directory; 3. fetch dependency files or jars on executors. The size of the /tmp/ directory will keep growing, and the free space on the system disk will shrink. I think we could add a configuration, spark.tmp.dir, in conf/spark-env.sh or conf/spark-defaults.conf to set this particular directory, for example to a data disk. If spark.tmp.dir is not set, use the default java.io.tmpdir.
[jira] [Commented] (SPARK-3845) SQLContext(...) should inherit configurations from SparkContext
[ https://issues.apache.org/jira/browse/SPARK-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164882#comment-14164882 ] Jianshi Huang commented on SPARK-3845: -- Looks like it's fixed in the latest 1.2.0 snapshot. In 1.1.0, sqlContext.getAllConfs returns an empty map. Jianshi SQLContext(...) should inherit configurations from SparkContext --- Key: SPARK-3845 URL: https://issues.apache.org/jira/browse/SPARK-3845 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang It's very confusing that Spark configurations (e.g. spark.serializer, spark.speculation, etc.) can be set in the spark-defaults.conf file, while Spark SQL configurations (e.g. spark.sql.inMemoryColumnarStorage.compressed, spark.sql.codegen, etc.) have to be set either via sqlContext.setConf or sql("SET ..."). When I do: val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext) I would expect sqlContext to recognize all the SQL configurations that come with sparkContext. Jianshi
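The expected behavior can be illustrated with a small sketch: a SQL context picking up the `spark.sql.*` entries already present in its parent context's configuration. `inheritSqlConf` is a hypothetical helper for illustration, not the actual Spark implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the behavior requested in SPARK-3845: a new
// SQLContext-like object should inherit the spark.sql.* settings already
// present in the parent SparkContext's configuration.
public class ConfInheritSketch {
    static Map<String, String> inheritSqlConf(Map<String, String> sparkConf) {
        Map<String, String> sqlConf = new HashMap<>();
        for (Map.Entry<String, String> e : sparkConf.entrySet()) {
            // Copy only SQL-related settings into the child configuration.
            if (e.getKey().startsWith("spark.sql.")) {
                sqlConf.put(e.getKey(), e.getValue());
            }
        }
        return sqlConf;
    }

    public static void main(String[] args) {
        Map<String, String> parent = Map.of(
            "spark.serializer", "org.apache.spark.serializer.KryoSerializer",
            "spark.sql.codegen", "true");
        System.out.println(inheritSqlConf(parent));
    }
}
```

With such inheritance, `sqlContext.getAllConfs` on a context built from a configured SparkContext would no longer come back empty.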
[jira] [Resolved] (SPARK-3158) Avoid 1 extra aggregation for DecisionTree training
[ https://issues.apache.org/jira/browse/SPARK-3158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3158. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2708 [https://github.com/apache/spark/pull/2708] Avoid 1 extra aggregation for DecisionTree training --- Key: SPARK-3158 URL: https://issues.apache.org/jira/browse/SPARK-3158 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Qiping Li Fix For: 1.2.0 Improvement: computation Currently, the implementation does one unnecessary aggregation step. The aggregation step for level L (to choose splits) gives enough information to set the predictions of any leaf nodes at level L+1. We can use that info and skip the aggregation step for the last level of the tree (which only has leaf nodes). This update could be done by: * allocating a root node before the loop in the main train() method * allocating nodes for level L+1 while choosing splits for level L * caching stats in these newly allocated nodes, so that we can calculate predictions if we know they will be leaves * DecisionTree.findBestSplits can just return doneTraining This will let us cache impurity and avoid re-calculating it in calculateGainForSplit. Some above notes were copied from discussion in [https://github.com/apache/spark/pull/2341]
[jira] [Created] (SPARK-3876) Doing a RDD map/reduce within a DStream map fails with a high enough input rate
Andrei Filip created SPARK-3876: --- Summary: Doing a RDD map/reduce within a DStream map fails with a high enough input rate Key: SPARK-3876 URL: https://issues.apache.org/jira/browse/SPARK-3876 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: Andrei Filip Having a custom receiver that generates random strings at custom rates: JavaRandomSentenceReceiver A class that does work on a received string:
class LengthGetter implements Serializable {
    public int getStrLength(String s) {
        return s.length();
    }
}
The following code:
List<LengthGetter> objList = Arrays.asList(new LengthGetter(), new LengthGetter(), new LengthGetter());
final JavaRDD<LengthGetter> objRdd = sc.parallelize(objList);
JavaInputDStream<String> sentences = jssc.receiverStream(new JavaRandomSentenceReceiver(frequency));
sentences.map(new Function<String, Integer>() {
    @Override
    public Integer call(final String input) throws Exception {
        Integer res = objRdd.map(new Function<LengthGetter, Integer>() {
            @Override
            public Integer call(LengthGetter lg) throws Exception {
                return lg.getStrLength(input);
            }
        }).reduce(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer left, Integer right) throws Exception {
                return left + right;
            }
        });
        return res;
    }
}).print();
fails for high enough frequencies with the following stack trace:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3.0:0 failed 1 times, most recent failure: Exception failure in TID 3 on host localhost: java.lang.NullPointerException
org.apache.spark.rdd.RDD.map(RDD.scala:270)
org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:72)
org.apache.spark.api.java.JavaRDD.map(JavaRDD.scala:29)
Other information that might be useful is that my current batch duration is set to 1sec and the frequencies for JavaRandomSentenceReceiver at which the application fails are as low as 2Hz (1Hz for example works)
[jira] [Commented] (SPARK-3830) Implement genetic algorithms in MLLib
[ https://issues.apache.org/jira/browse/SPARK-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164979#comment-14164979 ] Apache Spark commented on SPARK-3830: - User 'epahomov' has created a pull request for this issue: https://github.com/apache/spark/pull/2731 Implement genetic algorithms in MLLib - Key: SPARK-3830 URL: https://issues.apache.org/jira/browse/SPARK-3830 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Egor Pakhomov Assignee: Egor Pakhomov Priority: Minor Implement evolutionary computation algorithm in MLLib
[jira] [Commented] (SPARK-3854) Scala style: require spaces before `{`
[ https://issues.apache.org/jira/browse/SPARK-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164978#comment-14164978 ] Kousuke Saruta commented on SPARK-3854: --- [~joshrosen] I tried to write code to check spaces before '{' as follows.
{code}
package org.apache.spark.scalastyle

import org.scalastyle.{PositionError, ScalariformChecker, ScalastyleError}
import scala.collection.mutable.{ListBuffer, Queue}
import scalariform.lexer.{Token, Tokens}
import scalariform.lexer.Tokens._
import scalariform.parser.CompilationUnit

class SparkSpaceBeforeLeftBraceChecker extends ScalariformChecker {
  val errorKey: String = "insert.a.single.space.before.left.brace"
  val rememberQueue: Queue[Token] = Queue[Token]()

  // The list of disallowed tokens before left brace without single space.
  val disallowedTokensBeforeLBrace = Seq (
    ARROW, ELSE, OP, RPAREN, TRY, MATCH, NEW, DO, FINALLY, PACKAGE, RETURN, THROW, YIELD, VARID
  )

  override def verify(ast: CompilationUnit): List[ScalastyleError] = {
    var list: ListBuffer[ScalastyleError] = new ListBuffer[ScalastyleError]
    for (token <- ast.tokens) {
      rememberToken(token)
      if (isLBrace(token) && isTokenAfterSpecificTokens(token) && !hasSingleWhiteSpaceBefore(token)) {
        list += new PositionError(token.offset)
      }
    }
    list.toList
  }

  private def rememberToken(x: Token) = {
    rememberQueue.enqueue(x)
    if (rememberQueue.size > 2) {
      rememberQueue.dequeue
    }
    x
  }

  private def isTokenAfterSpecificTokens(x: Token) = {
    val previousToken = rememberQueue.head
    disallowedTokensBeforeLBrace.contains(previousToken.tokenType)
  }

  private def isLBrace(x: Token) = x.tokenType == Tokens.LBRACE

  private def hasSingleWhiteSpaceBefore(x: Token) =
    x.associatedWhitespaceAndComments.whitespaces.size == 1
}
{code}
How does this look?
Scala style: require spaces before `{` -- Key: SPARK-3854 URL: https://issues.apache.org/jira/browse/SPARK-3854 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Josh Rosen We should require spaces before opening curly braces. This isn't in the style guide, but it probably should be:
{code}
// Correct:
if (true) {
  println("Wow!")
}

// Incorrect:
if (true){
  println("Wow!")
}
{code}
See https://github.com/apache/spark/pull/1658#discussion-diff-18611791 for an example in the wild. {{git grep "){"}} shows only a few occurrences of this style.
[jira] [Created] (SPARK-3877) The exit code of spark-submit is still 0 when a YARN application fails
Shixiong Zhu created SPARK-3877: --- Summary: The exit code of spark-submit is still 0 when a YARN application fails Key: SPARK-3877 URL: https://issues.apache.org/jira/browse/SPARK-3877 Project: Spark Issue Type: Bug Components: YARN Reporter: Shixiong Zhu Priority: Minor When a YARN application fails (yarn-cluster mode), the exit code of spark-submit is still 0. It's hard for people to write automatic scripts to run Spark jobs on YARN because the failure cannot be detected in these scripts.
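The automation problem described here comes down to branching on an exit code. In the sketch below, `run_job` is a hypothetical stand-in for a `spark-submit --master yarn-cluster ...` invocation; a wrapper script written this way can only detect failure if spark-submit propagates a nonzero exit code.

```shell
#!/bin/sh
# run_job is a hypothetical stand-in for spark-submit; it exits with the code
# it is given, so we can show how a wrapper script reacts to success/failure.
run_job() {
    return "$1"
}

# With a correct exit code, automation like this works:
if run_job 0; then
    echo "job succeeded"
fi

if ! run_job 1; then
    echo "job failed, aborting pipeline"
fi
# The bug: if spark-submit always exits 0, the failure branch never fires.
```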
[jira] [Closed] (SPARK-3507) Create RegressionLearner trait and make some current code implement it
[ https://issues.apache.org/jira/browse/SPARK-3507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pakhomov closed SPARK-3507. Resolution: Duplicate duplicate of SPARK-1856 Create RegressionLearner trait and make some current code implement it -- Key: SPARK-3507 URL: https://issues.apache.org/jira/browse/SPARK-3507 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: Egor Pakhomov Assignee: Egor Pakhomov Priority: Minor Original Estimate: 168h Remaining Estimate: 168h Here at Yandex, during implementation of gradient boosting in Spark and creating our ML tool for internal use, we found the following serious problems in MLlib: There is no Regression/Classification learner model abstraction. We were building abstract data processing pipelines which should work with just some regression, with the exact algorithm specified outside this code. There is no abstraction which would allow me to do that. (It's the main reason for all further problems.) There is no common practice in MLlib for testing algorithms: every model generates its own random test data. There are no easily extractable test cases applicable to another algorithm. There are no benchmarks for comparing algorithms. After implementing a new algorithm it's very hard to understand how it should be tested. Lack of serialization testing: MLlib algorithms don't contain tests which verify that a model works after serialization. During implementation of a new algorithm it's hard to understand what API you should create and which interface to implement. A start on solving all these problems must be made by creating a common interface for typical algorithms/models: regression, classification, clustering, collaborative filtering. All main tests should be written against these interfaces, so when a new algorithm is implemented, all it should do is pass the already written tests. This allows us to have manageable quality across the whole library. There should be a couple of benchmarks which allow a new Spark user to get a feeling for which algorithm to use. The test set against these abstractions should contain serialization tests. In production, most of the time there is no need for a model that can't be stored. As the first step of this roadmap I'd like to create a trait RegressionLearner, add methods to current algorithms to implement this trait, and create some tests against it.
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when a YARN application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164993#comment-14164993 ] Apache Spark commented on SPARK-3877: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/2732 The exit code of spark-submit is still 0 when a YARN application fails --- Key: SPARK-3877 URL: https://issues.apache.org/jira/browse/SPARK-3877 Project: Spark Issue Type: Bug Components: YARN Reporter: Shixiong Zhu Priority: Minor Labels: yarn When a YARN application fails (yarn-cluster mode), the exit code of spark-submit is still 0. It's hard for people to write automatic scripts to run Spark jobs on YARN because the failure cannot be detected in these scripts.
[jira] [Updated] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Ishikawa updated SPARK-2429: --- Attachment: The Result of Benchmarking a Hierarchical Clustering.pdf Sorry for making some mistakes. I fixed them. - Cluster Spec - Typo mistakes Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Attachments: The Result of Benchmarking a Hierarchical Clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean such as negative dot or cosine are necessary.
[jira] [Updated] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Ishikawa updated SPARK-2429: --- Attachment: (was: The Result of Benchmarking a Hierarchical Clustering.pdf) Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Attachments: The Result of Benchmarking a Hierarchical Clustering.pdf Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean such as negative dot or cosine are necessary.
[jira] [Created] (SPARK-3878) Benchmarks and common tests for MLlib algorithms
Egor Pakhomov created SPARK-3878: Summary: Benchmarks and common tests for MLlib algorithms Key: SPARK-3878 URL: https://issues.apache.org/jira/browse/SPARK-3878 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Egor Pakhomov There is no common practice in MLlib for testing algorithms: every model generates its own random test data. There are no easily extractable test cases applicable to another algorithm. There are no benchmarks for comparing algorithms. After implementing a new algorithm it's very hard to understand how it should be tested. Lack of serialization testing: MLlib algorithms don't contain tests which verify that a model works after serialization.
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when a YARN application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165124#comment-14165124 ] Thomas Graves commented on SPARK-3877: -- This looks like a dup of SPARK-2167, or perhaps actually a subset of that, since I think you only handle the YARN mode. Does this cover both client and cluster mode? The exit code of spark-submit is still 0 when a YARN application fails --- Key: SPARK-3877 URL: https://issues.apache.org/jira/browse/SPARK-3877 Project: Spark Issue Type: Bug Components: YARN Reporter: Shixiong Zhu Priority: Minor Labels: yarn When a YARN application fails (yarn-cluster mode), the exit code of spark-submit is still 0. This makes it hard to write automation scripts for running Spark jobs on YARN, because the failure cannot be detected in those scripts.
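The automation-script problem described above boils down to a wrapper like the following; the submission command is a placeholder, and today this check never fires for yarn-cluster failures, which is exactly the bug.

```shell
run_job() {
    # stand-in for: spark-submit --master yarn-cluster --class ... app.jar
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "job failed with exit code $status" >&2
    fi
    return "$status"
}
```

Once spark-submit propagates the YARN application's final status as its own exit code, a wrapper like this can detect the failure.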
[jira] [Updated] (SPARK-3850) Scala style: disallow trailing spaces
[ https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-3850: Summary: Scala style: disallow trailing spaces (was: Scala style: Disallow trailing spaces) Scala style: disallow trailing spaces - Key: SPARK-3850 URL: https://issues.apache.org/jira/browse/SPARK-3850 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Nicholas Chammas
[jira] [Created] (SPARK-3879) spark-shell.cmd fails giving error !=x was unexpected at this time
Venkata Ramana G created SPARK-3879: --- Summary: spark-shell.cmd fails giving error !=x was unexpected at this time Key: SPARK-3879 URL: https://issues.apache.org/jira/browse/SPARK-3879 Project: Spark Issue Type: Bug Components: Spark Shell Environment: Windows Reporter: Venkata Ramana G spark-shell.cmd gives the error "!=x was unexpected at this time". This problem was introduced by SPARK-2058.
[jira] [Commented] (SPARK-3879) spark-shell.cmd fails giving error !=x was unexpected at this time
[ https://issues.apache.org/jira/browse/SPARK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165209#comment-14165209 ] Venkata Ramana G commented on SPARK-3879: - I have fixed this and am about to submit a PR. spark-shell.cmd fails giving error !=x was unexpected at this time Key: SPARK-3879 URL: https://issues.apache.org/jira/browse/SPARK-3879 Project: Spark Issue Type: Bug Components: Spark Shell Environment: Windows Reporter: Venkata Ramana G spark-shell.cmd gives the error "!=x was unexpected at this time". This problem was introduced by SPARK-2058.
[jira] [Commented] (SPARK-3879) spark-shell.cmd fails giving error !=x was unexpected at this time
[ https://issues.apache.org/jira/browse/SPARK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165225#comment-14165225 ] Venkata Ramana G commented on SPARK-3879: - This was already fixed under SPARK-3808, so this issue can be closed. spark-shell.cmd fails giving error !=x was unexpected at this time Key: SPARK-3879 URL: https://issues.apache.org/jira/browse/SPARK-3879 Project: Spark Issue Type: Bug Components: Spark Shell Environment: Windows Reporter: Venkata Ramana G spark-shell.cmd gives the error "!=x was unexpected at this time". This problem was introduced by SPARK-2058.
[jira] [Closed] (SPARK-3879) spark-shell.cmd fails giving error !=x was unexpected at this time
[ https://issues.apache.org/jira/browse/SPARK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Venkata Ramana G closed SPARK-3879. --- Resolution: Duplicate spark-shell.cmd fails giving error !=x was unexpected at this time Key: SPARK-3879 URL: https://issues.apache.org/jira/browse/SPARK-3879 Project: Spark Issue Type: Bug Components: Spark Shell Environment: Windows Reporter: Venkata Ramana G spark-shell.cmd gives the error "!=x was unexpected at this time". This problem was introduced by SPARK-2058.
[jira] [Updated] (SPARK-3850) Scala style: disallow trailing spaces
[ https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-3850: Description: [Ted Yu on the dev list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E] suggested using {{WhitespaceEndOfLineChecker}} here: http://www.scalastyle.org/rules-0.1.0.html Scala style: disallow trailing spaces - Key: SPARK-3850 URL: https://issues.apache.org/jira/browse/SPARK-3850 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Nicholas Chammas [Ted Yu on the dev list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E] suggested using {{WhitespaceEndOfLineChecker}} here: http://www.scalastyle.org/rules-0.1.0.html
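For reference, enabling that checker in a Scalastyle configuration file would look roughly like the snippet below, based on the rules page linked above; the exact attributes may differ between Scalastyle versions.

```xml
<check level="error" class="org.scalastyle.file.WhitespaceEndOfLineChecker" enabled="true"/>
```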
[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165293#comment-14165293 ] Dev Lakhani commented on SPARK-3644: Hi, I am doing some work on the REST/JSON aspects and would be happy to take this on. Can someone assign it to me and/or help me get started? We need to first draft the various endpoints and document them somewhere. Thanks, Dev REST API for Spark application info (jobs / stages / tasks / storage info) -- Key: SPARK-3644 URL: https://issues.apache.org/jira/browse/SPARK-3644 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Reporter: Josh Rosen This JIRA is a forum to draft a design proposal for a REST interface for accessing information about Spark applications, such as job / stage / task / storage status. There have been a number of proposals to serve JSON representations of the information displayed in Spark's web UI. Given that we might redesign the pages of the web UI (and possibly re-implement the UI as a client of a REST API), the API endpoints and their responses should be independent of what we choose to display on particular web UI pages / layouts. Let's start a discussion of what a good REST API would look like from first principles. We can discuss what URLs / endpoints expose access to data, how our JSON responses will be formatted, how fields will be named, how the API will be documented and tested, etc. Some links for inspiration: https://developer.github.com/v3/ http://developer.netflix.com/docs/REST_API_Reference https://helloreverb.com/developers/swagger
[jira] [Updated] (SPARK-1486) Support multi-model training in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1486: - Assignee: (was: Burak Yavuz) Support multi-model training in MLlib - Key: SPARK-1486 URL: https://issues.apache.org/jira/browse/SPARK-1486 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Priority: Critical It is rare in practice to train just one model with a given set of parameters. Usually, this is done by training multiple models with different sets of parameters and then selecting the best based on their performance on the validation set. MLlib should provide native support for multi-model training/scoring. It requires decoupling of concepts like problem, formulation, algorithm, parameter set, and model, which are missing in MLlib now. MLI implements similar concepts, which we can borrow. There are different approaches for multi-model training: 0) Keep one copy of the data, and train models one after another (or maybe in parallel, depending on the scheduler). 1) Keep one copy of the data, and train multiple models at the same time (similar to `runs` in KMeans). 2) Make multiple copies of the data (still stored in a distributed fashion), and use more cores to distribute the work. 3) Collect the data, make the entire dataset available on workers, and train one or more models on each worker. Users should be able to choose which execution mode they want to use. Note that 3) could cover many use cases in practice when the training data is not huge, e.g., 1GB. This task will be divided into sub-tasks and this JIRA is created to discuss the design and track the overall progress.
[jira] [Updated] (SPARK-1486) Support multi-model training in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1486: - Assignee: Burak Yavuz Support multi-model training in MLlib - Key: SPARK-1486 URL: https://issues.apache.org/jira/browse/SPARK-1486 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Burak Yavuz Priority: Critical It is rare in practice to train just one model with a given set of parameters. Usually, this is done by training multiple models with different sets of parameters and then selecting the best based on their performance on the validation set. MLlib should provide native support for multi-model training/scoring. It requires decoupling of concepts like problem, formulation, algorithm, parameter set, and model, which are missing in MLlib now. MLI implements similar concepts, which we can borrow. There are different approaches for multi-model training: 0) Keep one copy of the data, and train models one after another (or maybe in parallel, depending on the scheduler). 1) Keep one copy of the data, and train multiple models at the same time (similar to `runs` in KMeans). 2) Make multiple copies of the data (still stored in a distributed fashion), and use more cores to distribute the work. 3) Collect the data, make the entire dataset available on workers, and train one or more models on each worker. Users should be able to choose which execution mode they want to use. Note that 3) could cover many use cases in practice when the training data is not huge, e.g., 1GB. This task will be divided into sub-tasks and this JIRA is created to discuss the design and track the overall progress.
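Approach 1) above — one pass over the data updating several models at once — can be illustrated with a toy pure-Python SGD on single-feature least squares. This is a hypothetical sketch, not MLlib's API; the function and parameter names are made up.

```python
def train_multi(data, lambdas, epochs=50, lr=0.1):
    # one weight per candidate model (here: one L2 penalty per model);
    # all models are updated while streaming the same copy of the data
    weights = [0.0] * len(lambdas)
    for _ in range(epochs):
        for x, y in data:
            for i, lam in enumerate(lambdas):
                # gradient of 0.5*(w*x - y)^2 + 0.5*lam*w^2
                grad = (weights[i] * x - y) * x + lam * weights[i]
                weights[i] -= lr * grad
    return weights
```

Each extra hyperparameter setting costs only extra arithmetic, not an extra scan of the data, which is the main appeal of this execution mode over approach 0).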
[jira] [Created] (SPARK-3880) HBase as data source to SparkSQL
Yan created SPARK-3880: -- Summary: HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Reporter: Yan Fix For: 1.3.0
[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-3880: --- Component/s: SQL Fix Version/s: (was: 1.3.0) HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan
[jira] [Commented] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165359#comment-14165359 ] Nicholas Chammas commented on SPARK-3376: - [~matei], [~rxin], [~pwendell]: This is something to have on your radars, I believe. Memory-based shuffle strategy to reduce overhead of disk I/O Key: SPARK-3376 URL: https://issues.apache.org/jira/browse/SPARK-3376 Project: Spark Issue Type: Planned Work Reporter: uncleGen Priority: Trivial I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestions. Based on the work in SPARK-2044, it is feasible to have several implementations of shuffle.
[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-3880: --- Attachment: HBaseOnSpark.docx Design Document HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan Attachments: HBaseOnSpark.docx
[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165364#comment-14165364 ] Daniel Darabos commented on SPARK-3644: --- Hi Dev, thanks for the offer! Have you seen Kousuke's PR? https://github.com/apache/spark/pull/2333 seems to cover a lot of ground. Maybe he or the reviewers there can tell you how to make yourself useful! Unrelatedly, I wanted to mention that you can disregard my earlier comments. We cannot use XHR on these endpoints, since a different port means a different security domain. And anyway it turned out to be really easy to use a custom SparkListener for what we wanted to do. Sorry for the noise. REST API for Spark application info (jobs / stages / tasks / storage info) -- Key: SPARK-3644 URL: https://issues.apache.org/jira/browse/SPARK-3644 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Reporter: Josh Rosen This JIRA is a forum to draft a design proposal for a REST interface for accessing information about Spark applications, such as job / stage / task / storage status. There have been a number of proposals to serve JSON representations of the information displayed in Spark's web UI. Given that we might redesign the pages of the web UI (and possibly re-implement the UI as a client of a REST API), the API endpoints and their responses should be independent of what we choose to display on particular web UI pages / layouts. Let's start a discussion of what a good REST API would look like from first-principles. We can discuss what urls / endpoints expose access to data, how our JSON responses will be formatted, how fields will be named, how the API will be documented and tested, etc. 
Some links for inspiration: https://developer.github.com/v3/ http://developer.netflix.com/docs/REST_API_Reference https://helloreverb.com/developers/swagger
[jira] [Updated] (SPARK-3881) This is a test JIRA
[ https://issues.apache.org/jira/browse/SPARK-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Spark User updated SPARK-3881: -- Fix Version/s: (was: 1.2.0) 1.1.1 This is a test JIRA --- Key: SPARK-3881 URL: https://issues.apache.org/jira/browse/SPARK-3881 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Spark User Fix For: 1.1.1
[jira] [Created] (SPARK-3881) This is a test JIRA
Spark User created SPARK-3881: - Summary: This is a test JIRA Key: SPARK-3881 URL: https://issues.apache.org/jira/browse/SPARK-3881 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Spark User Fix For: 1.2.0
[jira] [Updated] (SPARK-3881) This is a test JIRA
[ https://issues.apache.org/jira/browse/SPARK-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Spark User updated SPARK-3881: -- Fix Version/s: 1.3.0 This is a test JIRA --- Key: SPARK-3881 URL: https://issues.apache.org/jira/browse/SPARK-3881 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Spark User Fix For: 1.3.0
[jira] [Updated] (SPARK-3881) This is a test JIRA
[ https://issues.apache.org/jira/browse/SPARK-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Spark User updated SPARK-3881: -- Fix Version/s: (was: 1.1.1) This is a test JIRA --- Key: SPARK-3881 URL: https://issues.apache.org/jira/browse/SPARK-3881 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Spark User Fix For: 1.3.0
[jira] [Updated] (SPARK-3881) This is a test JIRA
[ https://issues.apache.org/jira/browse/SPARK-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Spark User updated SPARK-3881: -- Fix Version/s: 1.3.0 This is a test JIRA --- Key: SPARK-3881 URL: https://issues.apache.org/jira/browse/SPARK-3881 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Spark User Assignee: Spark User Fix For: 1.3.0
[jira] [Updated] (SPARK-3881) This is a test JIRA
[ https://issues.apache.org/jira/browse/SPARK-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Spark User updated SPARK-3881: -- Fix Version/s: 1.2.0 This is a test JIRA --- Key: SPARK-3881 URL: https://issues.apache.org/jira/browse/SPARK-3881 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Spark User Assignee: Spark User Fix For: 1.2.0, 1.3.0
[jira] [Resolved] (SPARK-3881) This is a test JIRA
[ https://issues.apache.org/jira/browse/SPARK-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Spark User resolved SPARK-3881. --- Resolution: Invalid This is a test JIRA --- Key: SPARK-3881 URL: https://issues.apache.org/jira/browse/SPARK-3881 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Spark User Assignee: Spark User Fix For: 1.2.0, 1.3.0
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165480#comment-14165480 ] RJ Nowling commented on SPARK-2429: --- Great work, Yu! Ok, first off, let me make sure I understand what you're doing. You start with 2 centers. You assign all the points. You then apply KMeans recursively to each cluster, splitting each center into 2 centers. Each instance of KMeans stops when the error is below a certain value or a fixed number of iterations have been run. I think your analysis of the overall run time is good and probably what we expect. Can you break down the timing to see which parts are the most expensive? Maybe we can figure out where to optimize it. A few thoughts on optimization: 1. It might be good to convert everything to Breeze vectors before you do any operations -- otherwise you convert the same vectors over and over again. KMeans converts them at the beginning and converts the vectors for the centers back at the end. 2. Instead of passing the centers as part of the EuclideanClosestCenterFinder, look into using a broadcast variable. See the latest KMeans implementation. This could improve performance by 10%+. 3. You may want to look into using reduceByKey or similar RDD operations -- they will enable parallel reductions, which will be faster than a loop on the master. If you look at the JIRAs and PRs, there is some recent work to speed up KMeans -- maybe some of that is applicable? I'll probably have more questions -- it's a good way of helping me understand what you're doing :) Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Attachments: The Result of Benchmarking a Hierarchical Clustering.pdf Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib.
Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean, such as negative dot product or cosine, is necessary.
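The reduceByKey suggestion in point 3 of the comment above can be mimicked in plain Python to show the shape of the computation: per-key partial aggregates (which Spark would build per partition, in parallel) replace a loop over every point on the master. The names here are illustrative, not Spark's API.

```python
def reduce_by_key(pairs, combine):
    # pairs: iterable of (key, value); combine must be associative,
    # which is what lets Spark apply it in parallel per partition
    acc = {}
    for k, v in pairs:
        acc[k] = combine(acc[k], v) if k in acc else v
    return acc

def new_centers(assignments):
    # KMeans-style center update: sum and count per cluster id, then
    # divide -- only the small per-key totals reach the driver
    sums = reduce_by_key(((k, (p, 1)) for k, p in assignments),
                         lambda a, b: (a[0] + b[0], a[1] + b[1]))
    return {k: s / n for k, (s, n) in sums.items()}
```

With RDDs, the same per-key merge happens inside each partition first, so the master never iterates over individual points.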
[jira] [Resolved] (SPARK-3741) ConnectionManager.sendMessage may not propagate errors to MessageStatus
[ https://issues.apache.org/jira/browse/SPARK-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3741. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2593 [https://github.com/apache/spark/pull/2593] ConnectionManager.sendMessage may not propagate errors to MessageStatus --- Key: SPARK-3741 URL: https://issues.apache.org/jira/browse/SPARK-3741 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Minor Fix For: 1.2.0 If a network exception happens, ConnectionManager.sendMessage won't notify MessageStatus.
[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165500#comment-14165500 ] Josh Rosen commented on SPARK-3644: --- I think that a REST/JSON API is going to share many of the same design concerns as a stable Java/Scala-based progress reporting API for Spark. It would be great to use consistent naming across both APIs. SPARK-2321 deals with a Java API to expose programmatic access to many of the same things that a REST API would expose. I have a pull request open for discussing the design of this API: https://github.com/apache/spark/pull/2696 It would be great if anyone here would comment on that PR / JIRA so that we can work out the basic issues of what to expose / how to expose it in the Java API. Once we've figured this out, providing a REST wrapper should be fairly trivial. REST API for Spark application info (jobs / stages / tasks / storage info) -- Key: SPARK-3644 URL: https://issues.apache.org/jira/browse/SPARK-3644 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Reporter: Josh Rosen This JIRA is a forum to draft a design proposal for a REST interface for accessing information about Spark applications, such as job / stage / task / storage status. There have been a number of proposals to serve JSON representations of the information displayed in Spark's web UI. Given that we might redesign the pages of the web UI (and possibly re-implement the UI as a client of a REST API), the API endpoints and their responses should be independent of what we choose to display on particular web UI pages / layouts. Let's start a discussion of what a good REST API would look like from first-principles. We can discuss what urls / endpoints expose access to data, how our JSON responses will be formatted, how fields will be named, how the API will be documented and tested, etc. 
Some links for inspiration: https://developer.github.com/v3/ http://developer.netflix.com/docs/REST_API_Reference https://helloreverb.com/developers/swagger
[jira] [Commented] (SPARK-3854) Scala style: require spaces before `{`
[ https://issues.apache.org/jira/browse/SPARK-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165507#comment-14165507 ] Josh Rosen commented on SPARK-3854: --- Hey [~sarutak], I don't really know anything about how Scalastyle works, so I'm going to defer to [~prashant_] (ScrapCodes on GitHub), who implemented our current Scalastyle extensions: https://github.com/apache/spark/tree/master/project/spark-style/src/main/scala/org/apache/spark/scalastyle Scala style: require spaces before `{` -- Key: SPARK-3854 URL: https://issues.apache.org/jira/browse/SPARK-3854 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Josh Rosen We should require spaces before opening curly braces. This isn't in the style guide, but it probably should be: {code} // Correct: if (true) { println("Wow!") } // Incorrect: if (true){ println("Wow!") } {code} See https://github.com/apache/spark/pull/1658#discussion-diff-18611791 for an example in the wild. {{git grep "){"}} shows only a few occurrences of this style.
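As a stand-in for what such a rule would check, here is a tiny regex-based checker; this is hypothetical illustration code, not the actual Scalastyle implementation.

```python
import re

# flag any `){` with no space between the paren and the brace,
# the pattern the proposed style rule forbids
PATTERN = re.compile(r'\)\{')

def violations(source):
    # return 1-based line numbers that break the proposed rule
    return [i + 1 for i, line in enumerate(source.splitlines())
            if PATTERN.search(line)]
```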
[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected
[ https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165513#comment-14165513 ] Matt Cheah commented on SPARK-3736: --- Are the two linked cases above different though? (1) If the worker itself gets locked up, the master sends a heartbeat but the worker doesn't respond, and the master drops the connection with the worker. However the master doesn't send a message to the worker indicating this disconnection, so the worker can't know to reconnect. To repro this I set a breakpoint in the Worker's heartbeat reception code and let the worker time out, and after the worker times out it never receives a DisassociatedEvent, nor is Worker.masterDisconnected() ever called. (2) If the master crashes, the Worker receives a DisassociatedEvent and sits idly. We can fix this by actively attempting to reconnect. Clearly we can address the second case with the Worker actively trying to reconnect itself. But how can we address the first case? Workers should reconnect to Master if disconnected -- Key: SPARK-3736 URL: https://issues.apache.org/jira/browse/SPARK-3736 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Andrew Ash Assignee: Matthew Cheah Priority: Critical In standalone mode, when a worker gets disconnected from the master for some reason it never attempts to reconnect. In this situation you have to bounce the worker before it will reconnect to the master. The preferred alternative is to follow what Hadoop does -- when there's a disconnect, attempt to reconnect at a particular interval until successful (I think it repeats indefinitely every 10sec). 
This has been observed by: - [~pkolaczk] in http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html - [~romi-totango] in http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html - [~aash]
[jira] [Comment Edited] (SPARK-3736) Workers should reconnect to Master if disconnected
[ https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165513#comment-14165513 ] Matt Cheah edited comment on SPARK-3736 at 10/9/14 6:42 PM: Are the two linked cases above different though? (1) If the worker itself gets locked up, the master sends a heartbeat but the worker doesn't respond, and the master drops the connection with the worker. However the master doesn't send a message to the worker indicating this disconnection, so the worker can't know to reconnect. To repro this I set a breakpoint in the Worker's heartbeat reception code and let the worker time out, and after the worker times out it never receives a DissassociatedEvent, nor is Worker.masterDisconnected() ever called. (2) If the master crashes, the Worker receives a DissassociatedEvent and sits idly. We can fix this with making the Worker actively attempt to reconnect. Clearly we can address the second case with the Worker actively trying to reconnect itself. But how can we address the first case? was (Author: mcheah): Are the two linked cases above different though? (1) If the worker itself gets locked up, the master sends a heartbeat but the worker doesn't respond, and the master drops the connection with the worker. However the master doesn't send a message to the worker indicating this disconnection, so the worker can't know to reconnect. To repro this I set a breakpoint in the Worker's heartbeat reception code and let the worker time out, and after the worker times out it never receives a DissassociatedEvent, nor is Worker.masterDisconnected() ever called. (2) If the master crashes, the Worker receives a DissassociatedEvent and sits idly. We can fix this with actively attempting to reconnect. Clearly we can address the second case with the Worker actively trying to reconnect itself. But how can we address the first case? 
Workers should reconnect to Master if disconnected -- Key: SPARK-3736 URL: https://issues.apache.org/jira/browse/SPARK-3736 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Andrew Ash Assignee: Matthew Cheah Priority: Critical In standalone mode, when a worker gets disconnected from the master for some reason it never attempts to reconnect. In this situation you have to bounce the worker before it will reconnect to the master. The preferred alternative is to follow what Hadoop does -- when there's a disconnect, attempt to reconnect at a particular interval until successful (I think it repeats indefinitely every 10sec). This has been observed by: - [~pkolaczk] in http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html - [~romi-totango] in http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html - [~aash] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
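The Hadoop-style policy the report describes (on disconnect, retry at a fixed interval until successful) can be sketched as a small retry loop. Everything below is illustrative: `registerWithMaster` and the 10-second default are assumptions for the sketch, not Spark's actual worker API.

```scala
import scala.annotation.tailrec

// Retry registration with the master at a fixed interval until it
// succeeds (or a retry budget, unbounded by default, is exhausted).
def reconnectUntilRegistered(
    registerWithMaster: () => Boolean, // hypothetical: true once the master acks
    intervalMs: Long = 10000L,         // assumed 10s cadence, per the report
    maxAttempts: Int = Int.MaxValue): Boolean = {
  @tailrec
  def loop(attempt: Int): Boolean = {
    if (attempt >= maxAttempts) false
    else if (registerWithMaster()) true
    else {
      Thread.sleep(intervalMs)
      loop(attempt + 1)
    }
  }
  loop(0)
}
```

A worker would invoke this from its disconnect handler; the loop is deliberately dumb (fixed interval, no backoff) to match the described Hadoop behavior.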
[jira] [Created] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
Davis Shepherd created SPARK-3882: - Summary: JobProgressListener gets permanently out of sync with long running job Key: SPARK-3882 URL: https://issues.apache.org/jira/browse/SPARK-3882 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.2 Reporter: Davis Shepherd A long running spark context (non-streaming) will eventually start throwing the following in the driver: java.util.NoSuchElementException: key not found: 12771 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46) 2014-10-09 18:45:33,523 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw an exception java.util.NoSuchElementException: key not found: 12782 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46) And the ui will show running jobs that are in fact no longer running and never clean them up. (see attached screenshot) The result is that the ui becomes unusable, and the JobProgressListener leaks
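The failure mode in the trace above is a listener indexing a mutable map with a stage ID that was never inserted (or was already evicted), so the `NoSuchElementException` escapes and poisons the listener bus. A minimal sketch of the bug and a defensive alternative, using hypothetical names (`stageIdToPool`, `onStageCompleted*`) rather than Spark's actual fields:

```scala
import scala.collection.mutable

val stageIdToPool = mutable.HashMap[Int, String]()

// Mirrors the failing pattern: apply() on a mutable.HashMap throws
// NoSuchElementException for unknown keys.
def onStageCompletedUnsafe(stageId: Int): String =
  stageIdToPool(stageId)

// Defensive variant: use get() and skip unknown stages instead of
// crashing the whole listener bus thread.
def onStageCompletedSafe(stageId: Int): Option[String] =
  stageIdToPool.get(stageId) match {
    case Some(pool) => Some(pool)
    case None =>
      println(s"Ignoring completion event for unknown stage $stageId")
      None
  }
```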
[jira] [Updated] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-3882: -- Attachment: Screen Shot 2014-10-03 at 12.50.59 PM.png Lots of orphaned jobs. JobProgressListener gets permanently out of sync with long running job -- Key: SPARK-3882 URL: https://issues.apache.org/jira/browse/SPARK-3882 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.2 Reporter: Davis Shepherd Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png A long running spark context (non-streaming) will eventually start throwing the following in the driver: java.util.NoSuchElementException: key not found: 12771 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46) 2014-10-09 18:45:33,523 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw an exception java.util.NoSuchElementException: key not found: 12782 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at
[jira] [Comment Edited] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165535#comment-14165535 ] Davis Shepherd edited comment on SPARK-3882 at 10/9/14 6:51 PM: Attached web ui screenshot. was (Author: dgshep): Lots of orphaned jobs. JobProgressListener gets permanently out of sync with long running job -- Key: SPARK-3882 URL: https://issues.apache.org/jira/browse/SPARK-3882 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.2 Reporter: Davis Shepherd Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png A long running spark context (non-streaming) will eventually start throwing the following in the driver: java.util.NoSuchElementException: key not found: 12771 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46) 2014-10-09 18:45:33,523 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw an exception java.util.NoSuchElementException: key not found: 12782 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at
[jira] [Created] (SPARK-3883) Provide SSL support for Akka and HttpServer based connections
Jacek Lewandowski created SPARK-3883: Summary: Provide SSL support for Akka and HttpServer based connections Key: SPARK-3883 URL: https://issues.apache.org/jira/browse/SPARK-3883 Project: Spark Issue Type: Story Components: Spark Core Reporter: Jacek Lewandowski Spark uses at least 4 logical communication channels: 1. Control messages - Akka based 2. JARs and other files - Jetty based (HttpServer) 3. Computation results - Java NIO based 4. Web UI - Jetty based The aim of this feature is to enable SSL for (1) and (2). Why: Spark configuration is sent through (1). Spark configuration may contain sensitive information like credentials for accessing external data sources or streams. Application JAR files (2) may include the application logic and therefore they may include information about the structure of the external data sources, and credentials as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
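For channel (1), Akka's remoting already exposes SSL settings in its configuration. A hedged sketch of what enabling it might look like, with placeholder store paths and passwords; the exact keys should be verified against the Akka 2.3 reference configuration rather than taken from here:

```
# Hypothetical HOCON fragment; paths and passwords are placeholders.
akka.remote.netty.tcp {
  enable-ssl = true
  security {
    key-store = "/path/to/keystore.jks"
    key-store-password = "changeit"
    key-password = "changeit"
    trust-store = "/path/to/truststore.jks"
    trust-store-password = "changeit"
    protocol = "TLSv1"
  }
}
```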
[jira] [Resolved] (SPARK-3711) Optimize where in clause filter queries
[ https://issues.apache.org/jira/browse/SPARK-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3711. - Resolution: Fixed Fix Version/s: (was: 1.1.1) 1.2.0 Issue resolved by pull request 2561 [https://github.com/apache/spark/pull/2561] Optimize where in clause filter queries --- Key: SPARK-3711 URL: https://issues.apache.org/jira/browse/SPARK-3711 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Yash Datta Priority: Minor Fix For: 1.2.0 The In case class is replaced by an InSet class when all the filters are literals; InSet uses a HashSet instead of a Sequence, giving a significant performance improvement. The maximum improvement should be visible when a small percentage of a large dataset matches the filter list -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
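The data-structure change the description refers to can be illustrated standalone: membership against a `Seq` is a linear scan per row, while a `HashSet` lookup is constant time. The `In`/`InSet` names are Catalyst's; the snippet below models only the lookup, not the expression classes themselves.

```scala
// A "col IN (literals...)" predicate evaluated once per row.
val literals: Seq[Int] = (1 to 1000).toSeq
val literalSet: Set[Int] = literals.toSet  // built once, reused per row

def inSeq(v: Int): Boolean = literals.contains(v)   // O(n) linear scan
def inSet(v: Int): Boolean = literalSet.contains(v) // O(1) hash lookup
```

Over millions of rows the per-row cost dominates, which is why the win is largest when few rows match and every row must still be tested against the full literal list.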
[jira] [Resolved] (SPARK-3806) minor bug in CliSuite
[ https://issues.apache.org/jira/browse/SPARK-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3806. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2666 [https://github.com/apache/spark/pull/2666] minor bug in CliSuite - Key: SPARK-3806 URL: https://issues.apache.org/jira/browse/SPARK-3806 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.2.0 CliSuite throws an exception as follows: Exception in thread "Thread-6" java.lang.IndexOutOfBoundsException: 6 at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43) at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47) at org.apache.spark.sql.hive.thriftserver.CliSuite.org$apache$spark$sql$hive$thriftserver$CliSuite$$captureOutput$1(CliSuite.scala:67) at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78) at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78) at scala.sys.process.ProcessLogger$$anon$1.out(ProcessLogger.scala:96) at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135) at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135) at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:175) at scala.sys.process.BasicIO$.processLinesFully(BasicIO.scala:179) at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:164) at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:162) at scala.sys.process.ProcessBuilderImpl$Simple$$anonfun$3.apply$mcV$sp(ProcessBuilderImpl.scala:73) at scala.sys.process.ProcessImpl$Spawn$$anon$1.run(ProcessImpl.scala:22) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165702#comment-14165702 ] Reynold Xin commented on SPARK-3376: It is definitely possible. We should evaluate the benefit. What I've found recently is that with SSDs and zero-copy send, disk-based shuffle can be pretty fast as well. That is, the network (assuming 10G) is the new bottleneck. Memory-based shuffle strategy to reduce overhead of disk I/O Key: SPARK-3376 URL: https://issues.apache.org/jira/browse/SPARK-3376 Project: Spark Issue Type: Planned Work Reporter: uncleGen Priority: Trivial I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestions. Based on the work (SPARK-2044), it is feasible to have several implementations of shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3868) Hard to recognize which module is tested from unit-tests.log
[ https://issues.apache.org/jira/browse/SPARK-3868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-3868: - Assignee: Josh Rosen Hard to recognize which module is tested from unit-tests.log Key: SPARK-3868 URL: https://issues.apache.org/jira/browse/SPARK-3868 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20 Reporter: cocoatomo Assignee: Josh Rosen Labels: pyspark, testing Fix For: 1.2.0 The ./python/run-tests script displays messages about which test it is currently running on stdout but does not write them to unit-tests.log. This makes it harder to recognize which test programs were executed and which test failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3835) Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI
[ https://issues.apache.org/jira/browse/SPARK-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165722#comment-14165722 ] Matt Cheah commented on SPARK-3835: --- Any updates on this? Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI Key: SPARK-3835 URL: https://issues.apache.org/jira/browse/SPARK-3835 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.1.0 Reporter: Matt Cheah Labels: UI Spark applications that crash or are killed are listed as FINISHED in the Spark UI. It looks like the Master only passes back a list of Running applications and a list of Completed applications. All of the applications under Completed have status FINISHED; however, if they were killed manually they should show CANCELLED, and if they failed they should read FAILED. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3868) Hard to recognize which module is tested from unit-tests.log
[ https://issues.apache.org/jira/browse/SPARK-3868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3868: -- Assignee: cocoatomo (was: Josh Rosen) Hard to recognize which module is tested from unit-tests.log Key: SPARK-3868 URL: https://issues.apache.org/jira/browse/SPARK-3868 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20 Reporter: cocoatomo Assignee: cocoatomo Labels: pyspark, testing Fix For: 1.2.0 The ./python/run-tests script displays messages about which test it is currently running on stdout but does not write them to unit-tests.log. This makes it harder to recognize which test programs were executed and which test failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3835) Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI
[ https://issues.apache.org/jira/browse/SPARK-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165722#comment-14165722 ] Matt Cheah edited comment on SPARK-3835 at 10/9/14 8:48 PM: Any updates on this? I've tried tackling it myself but I'm actually not sure how possible this is - killing a JVM just causes a DisassociatedEvent to be fired... but a DisassociatedEvent is also fired if SparkContext.stop() is called, making it hard to tell if a context was stopped gracefully or forcefully. was (Author: mcheah): Any updates on this? Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI Key: SPARK-3835 URL: https://issues.apache.org/jira/browse/SPARK-3835 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.1.0 Reporter: Matt Cheah Labels: UI Spark applications that crash or are killed are listed as FINISHED in the Spark UI. It looks like the Master only passes back a list of Running applications and a list of Completed applications. All of the applications under Completed have status FINISHED; however, if they were killed manually they should show CANCELLED, and if they failed they should read FAILED. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
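One hedged way to disambiguate the two cases the comment describes: have the context send an explicit graceful-stop notice before shutting down, so a disassociation that arrives *without* a preceding notice can be classified as KILLED. This is a purely illustrative state machine, not Spark's actual master/driver protocol.

```scala
sealed trait AppState
case object Running  extends AppState
case object Finished extends AppState
case object Killed   extends AppState

// Tracks one application on the master side. If a graceful-stop notice
// (hypothetical message) arrived before the disassociation, the stop was
// intentional; otherwise the JVM likely died or was kill -9'd.
final class AppTracker {
  private var gracefulStopSeen = false
  private var state: AppState = Running

  def onGracefulStopNotice(): Unit = gracefulStopSeen = true

  def onDisassociated(): Unit =
    state = if (gracefulStopSeen) Finished else Killed

  def currentState: AppState = state
}
```

The limitation Matt raises still applies: if the notice itself is lost (network partition, hard lockup before send), a graceful stop is misclassified as KILLED, so a timeout or ack would be needed in practice.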
[jira] [Resolved] (SPARK-3853) JsonRDD does not support converting fields to type Timestamp
[ https://issues.apache.org/jira/browse/SPARK-3853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3853. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2720 [https://github.com/apache/spark/pull/2720] JsonRDD does not support converting fields to type Timestamp Key: SPARK-3853 URL: https://issues.apache.org/jira/browse/SPARK-3853 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Michael Timper Fix For: 1.2.0 Create a SchemaRDD using eventsSchema = sqlContext.jsonRDD(jsonEventsRdd, schemaWithTimestampField) eventsSchema.registerTempTable("events") sqlContext.sql("select max(time_field) from events") Throws this exception: scala.MatchError: TimestampType (of class org.apache.spark.sql.catalyst.types.TimestampType$) org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:357) org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:391) .. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3884) Don't set SPARK_SUBMIT_DRIVER_MEMORY if deploy mode is cluster
Sandy Ryza created SPARK-3884: - Summary: Don't set SPARK_SUBMIT_DRIVER_MEMORY if deploy mode is cluster Key: SPARK-3884 URL: https://issues.apache.org/jira/browse/SPARK-3884 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3884) If deploy mode is cluster, --driver-memory shouldn't apply to client JVM
[ https://issues.apache.org/jira/browse/SPARK-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-3884: -- Summary: If deploy mode is cluster, --driver-memory shouldn't apply to client JVM (was: Don't set SPARK_SUBMIT_DRIVER_MEMORY if deploy mode is cluster) If deploy mode is cluster, --driver-memory shouldn't apply to client JVM Key: SPARK-3884 URL: https://issues.apache.org/jira/browse/SPARK-3884 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3835) Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI
[ https://issues.apache.org/jira/browse/SPARK-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165768#comment-14165768 ] Nan Zhu commented on SPARK-3835: Does this problem still exist? I once reported the same thing in SPARK-1118 Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI Key: SPARK-3835 URL: https://issues.apache.org/jira/browse/SPARK-3835 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.1.0 Reporter: Matt Cheah Labels: UI Spark applications that crash or are killed are listed as FINISHED in the Spark UI. It looks like the Master only passes back a list of Running applications and a list of Completed applications. All of the applications under Completed have status FINISHED; however, if they were killed manually they should show CANCELLED, and if they failed they should read FAILED. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3885) Provide mechanism to remove accumulators once they are no longer used
Josh Rosen created SPARK-3885: - Summary: Provide mechanism to remove accumulators once they are no longer used Key: SPARK-3885 URL: https://issues.apache.org/jira/browse/SPARK-3885 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.0.2, 1.2.0 Reporter: Josh Rosen Spark does not currently provide any mechanism to delete accumulators after they are no longer used. This can lead to OOMs for long-lived SparkContexts that create many large accumulators. Part of the problem is that accumulators are registered in a global {{Accumulators}} registry. Maybe the fix would be as simple as using weak references in the Accumulators registry so that accumulators can be GC'd once they can no longer be used. In the meantime, here's a workaround that users can try: Accumulators have a public setValue() method that can be called (only by the driver) to change an accumulator’s value. You might be able to use this to reset accumulators’ values to smaller objects (e.g. the “zero” object of whatever your accumulator type is, or ‘null’ if you’re sure that the accumulator will never be accessed again). This issue was originally reported by [~nkronenfeld] on the dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-Accumulator-question-td8709.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
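The weak-reference idea can be sketched with a toy registry that holds accumulators behind `java.lang.ref.WeakReference`, so the registry itself never keeps them alive; once no user code references an accumulator, it becomes collectible and its entry can be pruned. This is a hypothetical mini-registry, not Spark's actual `Accumulators` object.

```scala
import java.lang.ref.WeakReference
import scala.collection.mutable

object WeakRegistry {
  private val refs = mutable.HashMap[Long, WeakReference[AnyRef]]()

  // Register without taking a strong reference to the accumulator.
  def register(id: Long, acc: AnyRef): Unit =
    refs(id) = new WeakReference(acc)

  // Returns None if never registered or already garbage-collected.
  def get(id: Long): Option[AnyRef] =
    refs.get(id).flatMap(ref => Option(ref.get))

  // Drop entries whose referent has been collected.
  def prune(): Unit = {
    val dead = refs.collect { case (id, ref) if ref.get == null => id }
    dead.foreach(refs.remove)
  }
}
```

The `setValue()` workaround in the description complements this: it shrinks a live accumulator's payload, while weak references let the registry entry itself disappear.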
[jira] [Commented] (SPARK-3884) If deploy mode is cluster, --driver-memory shouldn't apply to client JVM
[ https://issues.apache.org/jira/browse/SPARK-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165776#comment-14165776 ] Sandy Ryza commented on SPARK-3884: --- Accidentally assigned this to myself, but others should feel free to pick it up If deploy mode is cluster, --driver-memory shouldn't apply to client JVM Key: SPARK-3884 URL: https://issues.apache.org/jira/browse/SPARK-3884 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3835) Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI
[ https://issues.apache.org/jira/browse/SPARK-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165785#comment-14165785 ] Matt Cheah commented on SPARK-3835: --- This is the opposite problem, actually - a Spark context that is killed forcefully, i.e. kill -9 on the JVM hosting the context, is shown as FINISHED but should be shown as KILLED. Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI Key: SPARK-3835 URL: https://issues.apache.org/jira/browse/SPARK-3835 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.1.0 Reporter: Matt Cheah Labels: UI Spark applications that crash or are killed are listed as FINISHED in the Spark UI. It looks like the Master only passes back a list of Running applications and a list of Completed applications. All of the applications under Completed have status FINISHED; however, if they were killed manually they should show CANCELLED, and if they failed they should read FAILED. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3814) Bitwise & does not work in Hive
[ https://issues.apache.org/jira/browse/SPARK-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165790#comment-14165790 ] Ravindra Pesala commented on SPARK-3814: https://github.com/apache/spark/pull/2736 Bitwise & does not work in Hive Key: SPARK-3814 URL: https://issues.apache.org/jira/browse/SPARK-3814 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yana Kadiyska Priority: Minor Error: java.lang.RuntimeException: Unsupported language features in query: select (case when bit_field & 1=1 then r_end - r_start else NULL end) from mytable where pkey='0178-2014-07' LIMIT 2 TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME mytable TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_FUNCTION when = & TOK_TABLE_OR_COL bit_field 1 1 - TOK_TABLE_OR_COL r_end TOK_TABLE_OR_COL r_start TOK_NULL TOK_WHERE = TOK_TABLE_OR_COL pkey '0178-2014-07' TOK_LIMIT 2 SQLState: null ErrorCode: 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
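For reference, the bitwise flag test from the failing query is standard SQL that other engines accept directly. The sketch below replays it against SQLite as a stand-in for Hive; the table contents are invented for illustration, and the point is only the `&` operator that Spark SQL's parser rejected.

```python
import sqlite3

# Reconstruct the shape of the failing query: use a bitwise AND to test the
# low bit of a flag column. SQLite supports the & operator directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (bit_field INTEGER, r_start INTEGER, r_end INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)",
                 [(3, 10, 25), (2, 5, 9)])  # 3 has the low bit set, 2 does not

rows = conn.execute(
    "SELECT CASE WHEN bit_field & 1 = 1 THEN r_end - r_start ELSE NULL END "
    "FROM mytable"
).fetchall()
print(rows)  # [(15,), (None,)]
```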
[jira] [Commented] (SPARK-3834) Backticks not correctly handled in subquery aliases
[ https://issues.apache.org/jira/browse/SPARK-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165800#comment-14165800 ] Ravindra Pesala commented on SPARK-3834: https://github.com/apache/spark/pull/2737 Backticks not correctly handled in subquery aliases --- Key: SPARK-3834 URL: https://issues.apache.org/jira/browse/SPARK-3834 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Ravindra Pesala Priority: Blocker [~ravi.pesala] assigning to you since you fixed the last problem here. Let me know if you don't have time to work on this or if you have any questions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3835) Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI
[ https://issues.apache.org/jira/browse/SPARK-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165807#comment-14165807 ] Nan Zhu commented on SPARK-3835: ah, I see, didn't look at your description closely. Does shutdown hook work? Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI Key: SPARK-3835 URL: https://issues.apache.org/jira/browse/SPARK-3835 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.1.0 Reporter: Matt Cheah Labels: UI Spark applications that crash or are killed are listed as FINISHED in the Spark UI. It looks like the Master only passes back a list of Running applications and a list of Completed applications. All of the applications under Completed have status FINISHED; however, if they were killed manually they should show CANCELLED, and if they failed they should read FAILED. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3835) Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI
[ https://issues.apache.org/jira/browse/SPARK-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165815#comment-14165815 ] Nan Zhu commented on SPARK-3835: no...it cannot capture kill -9 Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI Key: SPARK-3835 URL: https://issues.apache.org/jira/browse/SPARK-3835 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.1.0 Reporter: Matt Cheah Labels: UI Spark applications that crash or are killed are listed as FINISHED in the Spark UI. It looks like the Master only passes back a list of Running applications and a list of Completed applications. All of the applications under Completed have status FINISHED; however, if they were killed manually they should show CANCELLED, and if they failed they should read FAILED. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
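The exchange above can be demonstrated outside the JVM as well. The sketch below (Unix-only, with a Python child process standing in for the JVM hosting the context) shows that a graceful-termination hook runs on a plain kill (SIGTERM) but that nothing in the process can run on kill -9 (SIGKILL) - which is why a shutdown hook cannot report KILLED status in that case.

```python
import os
import signal
import subprocess
import sys
import tempfile
import textwrap
import time

# Child: install a SIGTERM handler (the analogue of a JVM shutdown hook)
# that leaves a marker file behind before exiting.
child_src = textwrap.dedent("""
    import signal, sys, time
    marker = sys.argv[1]
    def on_term(signum, frame):
        open(marker, "w").write("hook ran")
        sys.exit(0)
    signal.signal(signal.SIGTERM, on_term)
    time.sleep(30)
""")

def run_and_kill(sig):
    marker = tempfile.mktemp()
    p = subprocess.Popen([sys.executable, "-c", child_src, marker])
    time.sleep(0.5)          # give the child time to install its handler
    p.send_signal(sig)
    p.wait()
    return os.path.exists(marker)

print(run_and_kill(signal.SIGTERM))  # True  -- the hook ran
print(run_and_kill(signal.SIGKILL))  # False -- nothing in the process ran
```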
[jira] [Resolved] (SPARK-3339) Support for skipping json lines that fail to parse
[ https://issues.apache.org/jira/browse/SPARK-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3339. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2680 [https://github.com/apache/spark/pull/2680] Support for skipping json lines that fail to parse -- Key: SPARK-3339 URL: https://issues.apache.org/jira/browse/SPARK-3339 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Yin Huai Priority: Critical Fix For: 1.2.0 When dealing with large datasets there is always some data that fails to parse. Would be nice to handle this instead of throwing an exception requiring the user to filter it out manually. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
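The behavior the issue asks for amounts to a tolerant line parser: skip (and count) records that fail to parse instead of aborting the whole job. A minimal sketch, independent of Spark's actual implementation:

```python
import json

def parse_json_lines(lines):
    """Parse a dataset of JSON lines, skipping corrupt records."""
    good, bad = [], 0
    for line in lines:
        try:
            good.append(json.loads(line))
        except ValueError:
            bad += 1   # corrupt record: skip it rather than fail the query
    return good, bad

records, dropped = parse_json_lines([
    '{"a": 1}',
    '{"a": 2',      # truncated -- fails to parse
    '{"a": 3}',
])
print(records, dropped)  # [{'a': 1}, {'a': 3}] 1
```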
[jira] [Updated] (SPARK-3796) Create shuffle service for external block storage
[ https://issues.apache.org/jira/browse/SPARK-3796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-3796: -- Description: This task will be broken up into two parts: the first being to refactor our internal shuffle service to use a BlockTransferService which we can easily extract out into its own service, and the second being to actually do the extraction. Here is the design document for the low-level service, nicknamed Sluice, on top of which will be Spark's BlockTransferService API: https://docs.google.com/document/d/1zKf3qloBu3dmv2AFyQTwEpumWRPUT5bcAUKB5PGNfx0 Create shuffle service for external block storage - Key: SPARK-3796 URL: https://issues.apache.org/jira/browse/SPARK-3796 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Patrick Wendell Assignee: Aaron Davidson This task will be broken up into two parts: the first being to refactor our internal shuffle service to use a BlockTransferService which we can easily extract out into its own service, and the second being to actually do the extraction. Here is the design document for the low-level service, nicknamed Sluice, on top of which will be Spark's BlockTransferService API: https://docs.google.com/document/d/1zKf3qloBu3dmv2AFyQTwEpumWRPUT5bcAUKB5PGNfx0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3412) Add Missing Types for Row API
[ https://issues.apache.org/jira/browse/SPARK-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3412. - Resolution: Fixed Issue resolved by pull request 2529 [https://github.com/apache/spark/pull/2529] Add Missing Types for Row API - Key: SPARK-3412 URL: https://issues.apache.org/jira/browse/SPARK-3412 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3858) SchemaRDD.generate ignores alias argument
[ https://issues.apache.org/jira/browse/SPARK-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3858. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2721 [https://github.com/apache/spark/pull/2721] SchemaRDD.generate ignores alias argument - Key: SPARK-3858 URL: https://issues.apache.org/jira/browse/SPARK-3858 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Nathan Howell Priority: Minor Fix For: 1.2.0 The {{alias}} argument to {{SchemaRDD.generate}} is discarded and a constant {{None}} is supplied to the {{logical.Generate}} constructor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3813) Support case when conditional functions in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-3813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3813. - Resolution: Fixed Issue resolved by pull request 2678 [https://github.com/apache/spark/pull/2678] Support case when conditional functions in Spark SQL -- Key: SPARK-3813 URL: https://issues.apache.org/jira/browse/SPARK-3813 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala Fix For: 1.2.0 SQL queries which use the following conditional functions are not supported in Spark SQL. {code} CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END {code} The same functions work in Spark HiveQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
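Both conditional forms from the issue follow standard SQL semantics. The sketch below runs them against SQLite as a stand-in, to show the expected results of the "simple" CASE (compare one expression against values) and the "searched" CASE (evaluate boolean branches in order):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simple CASE: CASE a WHEN b THEN c ... ELSE f END
simple = conn.execute(
    "SELECT CASE 2 WHEN 1 THEN 'one' WHEN 2 THEN 'two' ELSE 'other' END"
).fetchone()[0]

# Searched CASE: CASE WHEN a THEN b ... ELSE e END
searched = conn.execute(
    "SELECT CASE WHEN 2 > 3 THEN 'bigger' ELSE 'smaller' END"
).fetchone()[0]

print(simple, searched)  # two smaller
```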
[jira] [Commented] (SPARK-3873) Scala style: check import ordering
[ https://issues.apache.org/jira/browse/SPARK-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165916#comment-14165916 ] Marcelo Vanzin commented on SPARK-3873: --- Actually looking at this, since I've been playing with the scalariform API in other places anyway... Scala style: check import ordering -- Key: SPARK-3873 URL: https://issues.apache.org/jira/browse/SPARK-3873 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Reynold Xin Assignee: Marcelo Vanzin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3886) Choose the batch size of serializer based on size of object
Davies Liu created SPARK-3886: - Summary: Choose the batch size of serializer based on size of object Key: SPARK-3886 URL: https://issues.apache.org/jira/browse/SPARK-3886 Project: Spark Issue Type: Improvement Reporter: Davies Liu The default batch size (1024) may not work for huge objects, so it is better to choose a proper batch size based on the size of the objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
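The idea can be sketched as a serializer that adapts its batch size toward a target serialized chunk size. This is an illustration of the heuristic only, not PySpark's actual serializer code; the function and parameter names are invented.

```python
import pickle

def adaptive_batches(objects, target_bytes=65536, initial=16):
    """Yield pickled chunks, growing/shrinking the batch size so each
    serialized chunk stays near target_bytes (instead of a fixed 1024)."""
    batch_size = initial
    it = iter(objects)
    while True:
        chunk = []
        for _ in range(batch_size):
            try:
                chunk.append(next(it))
            except StopIteration:
                break
        if not chunk:
            return
        payload = pickle.dumps(chunk)
        yield payload
        # halve the batch for huge objects, double it for tiny ones
        if len(payload) > target_bytes and batch_size > 1:
            batch_size //= 2
        elif len(payload) < target_bytes // 2:
            batch_size *= 2

chunks = list(adaptive_batches(range(100), target_bytes=256))
# all objects survive the round trip, in order
print([x for c in chunks for x in pickle.loads(c)] == list(range(100)))  # True
```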
[jira] [Resolved] (SPARK-3772) RDD operation on IPython REPL failed with an illegal port number
[ https://issues.apache.org/jira/browse/SPARK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3772. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2651 [https://github.com/apache/spark/pull/2651] RDD operation on IPython REPL failed with an illegal port number Key: SPARK-3772 URL: https://issues.apache.org/jira/browse/SPARK-3772 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0 Reporter: cocoatomo Labels: pyspark Fix For: 1.2.0 To reproduce this issue, we should execute following commands on the commit: 6e27cb630de69fa5acb510b4e2f6b980742b1957. {quote} $ PYSPARK_PYTHON=ipython ./bin/pyspark ... In [1]: file = sc.textFile('README.md') In [2]: file.first() ... 14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded 14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1 14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at PythonRDD.scala:334 14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at PythonRDD.scala:334) with 1 output partitions (allowLocal=true) 14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at PythonRDD.scala:334) 14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List() 14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List() 14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD at PythonRDD.scala:44), which has no missing parents 14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with curMem=57388, maxMem=278019440 14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.4 KB, free 265.1 MB) 14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PythonRDD[2] at RDD at 
PythonRDD.scala:44) 14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1207 bytes) 14/10/03 08:50:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 14/10/03 08:50:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalArgumentException: port out of range:1027423549 at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143) at java.net.InetSocketAddress.<init>(InetSocketAddress.java:188) at java.net.Socket.<init>(Socket.java:244) at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75) at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:100) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:71) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
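The `port out of range:1027423549` failure is the JVM parsing stray bytes from the daemon's stdout (polluted here by running under IPython) as a port number. A hedged sketch of the kind of defensive check involved - the function name is illustrative, not the actual PythonWorkerFactory fix:

```python
def parse_daemon_port(raw):
    """Parse a TCP port read from a worker daemon's stdout, validating the
    range before it ever reaches a socket constructor."""
    port = int(raw)
    if not 0 < port <= 65535:
        raise ValueError(
            "port out of range: %d (daemon stdout may contain stray output)" % port)
    return port

print(parse_daemon_port("50123"))    # 50123
try:
    parse_daemon_port("1027423549")  # the value from the stack trace above
except ValueError as e:
    print(e)
```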
[jira] [Commented] (SPARK-3883) Provide SSL support for Akka and HttpServer based connections
[ https://issues.apache.org/jira/browse/SPARK-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165986#comment-14165986 ] Jacek Lewandowski commented on SPARK-3883: -- https://github.com/apache/spark/pull/2739 Provide SSL support for Akka and HttpServer based connections - Key: SPARK-3883 URL: https://issues.apache.org/jira/browse/SPARK-3883 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Jacek Lewandowski Spark uses at least 4 logical communication channels: 1. Control messages - Akka based 2. JARs and other files - Jetty based (HttpServer) 3. Computation results - Java NIO based 4. Web UI - Jetty based The aim of this feature is to enable SSL for (1) and (2). Why: Spark configuration is sent through (1). Spark configuration may contain sensitive information like credentials for accessing external data sources or streams. Application JAR files (2) may include the application logic and therefore they may include information about the structure of the external data sources, and credentials as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3887) ConnectionManager should log remote exception when reporting remote errors
Josh Rosen created SPARK-3887: - Summary: ConnectionManager should log remote exception when reporting remote errors Key: SPARK-3887 URL: https://issues.apache.org/jira/browse/SPARK-3887 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen When reporting that a remote error occurred, the ConnectionManager should also log the stacktrace of the remote exception. This can be accomplished by sending the remote exception's stacktrace as the payload in the negative ACK / error message that's sent by the error-handling code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
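The proposed change can be sketched as follows; the request/reply shapes are illustrative, not the actual ConnectionManager wire format. The point is that the error reply carries the remote stack trace as its payload instead of a bare failure flag:

```python
import traceback

def handle_request(payload):
    """Simulated remote handler: on failure, return a negative ACK whose
    payload is the formatted remote stack trace."""
    try:
        raise RuntimeError("block not found: %s" % payload)
    except Exception:
        # negative ACK carrying the remote stack trace, so the requesting
        # side can log the real cause instead of a generic "remote error"
        return {"ok": False, "error": traceback.format_exc()}

reply = handle_request("shuffle_0_1_2")
print(reply["ok"])                       # False
print("RuntimeError" in reply["error"])  # True: the remote cause is visible
```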
[jira] [Created] (SPARK-3888) Limit the memory used by python worker
Davies Liu created SPARK-3888: - Summary: Limit the memory used by python worker Key: SPARK-3888 URL: https://issues.apache.org/jira/browse/SPARK-3888 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Right now, we do not limit the memory used by Python workers, so they may run out of memory and freeze the OS. It is safer to have a configurable hard limit for it, which should be larger than spark.executor.python.memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
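On Unix, one way to enforce such a hard limit inside a worker process is `resource.setrlimit`: cap the address space so a runaway worker gets a MemoryError instead of driving the machine into swap. This is a hedged sketch of the idea only - the `spark.executor.python.memory` name comes from the issue text, everything else is an assumption, and `resource` is Unix-only.

```python
import resource

def apply_memory_limit(limit_bytes):
    """Tighten the soft address-space limit for this process; never raise
    the hard limit. Allocations past the limit then fail with MemoryError."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))

four_gb = 1 << 32
apply_memory_limit(four_gb)
print(resource.getrlimit(resource.RLIMIT_AS)[0] == four_gb)  # True
```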
[jira] [Created] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
Aaron Davidson created SPARK-3889: - Summary: JVM dies with SIGBUS, resulting in ConnectionManager failed ACK Key: SPARK-3889 URL: https://issues.apache.org/jira/browse/SPARK-3889 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Here's the first part of the core dump: {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704 # # JRE version: 7.0_25-b30 # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops) # Problematic frame: # v ~StubRoutines::jbyte_disjoint_arraycopy # # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try ulimit -c unlimited before starting Java again # # If you would like to submit a bug report, please include # instructions on how to reproduce the bug and visit: # https://bugs.launchpad.net/ubuntu/+source/openjdk-7/ # --- T H R E A D --- Current thread (0x7fa4b0631000): JavaThread Executor task launch worker-170 daemon [_thread_in_Java, id=6783, stack(0x7fa4448ef000,0x7fa4449f)] siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), si_addr=0x7fa428f79000 {code} Here is the only useful content I can find related to JVM and SIGBUS from Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664 It appears it may be related to disposing byte buffers, which we do in the ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of them in BufferMessage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
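The hypothesis in the report is a use-after-dispose on a memory-mapped buffer. The sketch below makes the same mistake observable safely in Python, whose `mmap` guards use-after-close with an exception; the JVM has no such guard for manually disposed MappedByteBuffers, so the equivalent access surfaces as a SIGBUS inside an arraycopy stub.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"shuffle block bytes")
m = mmap.mmap(fd, 0)          # map the "shuffle file"
data = bytes(m[:7])
print(data)                   # b'shuffle'

m.close()                     # "dispose" the mapping
error_msg = ""
try:
    m[:7]                     # use after dispose
except ValueError as e:
    error_msg = str(e)
print(error_msg)              # Python raises; the JVM would just crash

os.close(fd)
os.unlink(path)
```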
[jira] [Updated] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
[ https://issues.apache.org/jira/browse/SPARK-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-3889: -- Description: Here's the first part of the core dump, possibly caused by a job which shuffles a lot of very small partitions. {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704 # # JRE version: 7.0_25-b30 # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops) # Problematic frame: # v ~StubRoutines::jbyte_disjoint_arraycopy # # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try ulimit -c unlimited before starting Java again # # If you would like to submit a bug report, please include # instructions on how to reproduce the bug and visit: # https://bugs.launchpad.net/ubuntu/+source/openjdk-7/ # --- T H R E A D --- Current thread (0x7fa4b0631000): JavaThread Executor task launch worker-170 daemon [_thread_in_Java, id=6783, stack(0x7fa4448ef000,0x7fa4449f)] siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), si_addr=0x7fa428f79000 {code} Here is the only useful content I can find related to JVM and SIGBUS from Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664 It appears it may be related to disposing byte buffers, which we do in the ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of them in BufferMessage. was: Here's the first part of the core dump: {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704 # # JRE version: 7.0_25-b30 # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops) # Problematic frame: # v ~StubRoutines::jbyte_disjoint_arraycopy # # Failed to write core dump. Core dumps have been disabled. 
To enable core dumping, try ulimit -c unlimited before starting Java again # # If you would like to submit a bug report, please include # instructions on how to reproduce the bug and visit: # https://bugs.launchpad.net/ubuntu/+source/openjdk-7/ # --- T H R E A D --- Current thread (0x7fa4b0631000): JavaThread Executor task launch worker-170 daemon [_thread_in_Java, id=6783, stack(0x7fa4448ef000,0x7fa4449f)] siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), si_addr=0x7fa428f79000 {code} Here is the only useful content I can find related to JVM and SIGBUS from Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664 It appears it may be related to disposing byte buffers, which we do in the ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of them in BufferMessage. JVM dies with SIGBUS, resulting in ConnectionManager failed ACK --- Key: SPARK-3889 URL: https://issues.apache.org/jira/browse/SPARK-3889 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Here's the first part of the core dump, possibly caused by a job which shuffles a lot of very small partitions. {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704 # # JRE version: 7.0_25-b30 # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops) # Problematic frame: # v ~StubRoutines::jbyte_disjoint_arraycopy # # Failed to write core dump. Core dumps have been disabled. 
To enable core dumping, try ulimit -c unlimited before starting Java again # # If you would like to submit a bug report, please include # instructions on how to reproduce the bug and visit: # https://bugs.launchpad.net/ubuntu/+source/openjdk-7/ # --- T H R E A D --- Current thread (0x7fa4b0631000): JavaThread Executor task launch worker-170 daemon [_thread_in_Java, id=6783, stack(0x7fa4448ef000,0x7fa4449f)] siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), si_addr=0x7fa428f79000 {code} Here is the only useful content I can find related to JVM and SIGBUS from Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664 It appears it may be related to disposing byte buffers, which we do in the ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of them in BufferMessage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
[jira] [Resolved] (SPARK-3798) Corrupted projection in Generator
[ https://issues.apache.org/jira/browse/SPARK-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3798. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2656 [https://github.com/apache/spark/pull/2656] Corrupted projection in Generator - Key: SPARK-3798 URL: https://issues.apache.org/jira/browse/SPARK-3798 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 In some cases it is possible for the output of a generator to change, resulting in a corrupted projection and thus incorrect data from a query that uses a generator (e.g., LATERAL VIEW explode). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3811) More robust / standard Utils.deleteRecursively, Utils.createTempDir
[ https://issues.apache.org/jira/browse/SPARK-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3811. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2670 [https://github.com/apache/spark/pull/2670] More robust / standard Utils.deleteRecursively, Utils.createTempDir --- Key: SPARK-3811 URL: https://issues.apache.org/jira/browse/SPARK-3811 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sean Owen Priority: Minor Fix For: 1.2.0 I noticed a few issues with how temp directories are created and deleted: *Minor* * Guava's {{Files.createTempDir()}} plus {{File.deleteOnExit()}} is used in many tests to make a temp dir, but {{Utils.createTempDir()}} seems to be the standard Spark mechanism * Call to {{File.deleteOnExit()}} could be pushed into {{Utils.createTempDir()}} as well, along with this replacement. * _I messed up the message in an exception in {{Utils}} in SPARK-3794; fixed here_ *Bit Less Minor* * {{Utils.deleteRecursively()}} fails immediately if any {{IOException}} occurs, instead of trying to delete any remaining files and subdirectories. I've observed this leave temp dirs around. I suggest changing it to continue in the face of an exception and throw one of the possibly several exceptions that occur at the end. * {{Utils.createTempDir()}} will add a JVM shutdown hook every time the method is called. Even if the subdir is the parent of another parent dir, since this check is inside the hook. However {{Utils}} manages a set of all dirs to delete on shutdown already, called {{shutdownDeletePaths}}. A single hook can be registered to delete all of these on exit. This is how Tachyon temp paths are cleaned up in {{TachyonBlockManager}}. I noticed a few other things that might be changed but wanted to ask first: * Shouldn't the set of dirs to delete be {{File}}, not just {{String}} paths? 
* {{Utils}} manages the set of {{TachyonFile}} that have been registered for deletion, but the shutdown hook is managed in {{TachyonBlockManager}}. Should this logic not live together, and not in {{Utils}}? It's more specific to Tachyon, and looks a bit odd to import in such a generic place. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
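The "continue in the face of an exception" behavior proposed for {{Utils.deleteRecursively}} can be sketched with `shutil.rmtree`'s error callback: remove everything that can be removed, collect the failures, and raise one of them at the end instead of aborting on the first. Names here are illustrative, not Spark's actual Utils API.

```python
import os
import shutil
import tempfile

def delete_recursively(path):
    """Delete a directory tree, continuing past individual failures and
    raising one of the collected errors (if any) at the end."""
    errors = []
    def on_error(func, p, exc_info):
        errors.append((p, exc_info[1]))   # remember the failure, keep going
    shutil.rmtree(path, onerror=on_error)
    if errors:
        raise errors[0][1]                # surface one collected error

d = tempfile.mkdtemp()
os.makedirs(os.path.join(d, "a", "b"))
open(os.path.join(d, "a", "b", "f.txt"), "w").close()
delete_recursively(d)
print(os.path.exists(d))  # False
```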
[jira] [Updated] (SPARK-3811) More robust / standard Utils.deleteRecursively, Utils.createTempDir
[ https://issues.apache.org/jira/browse/SPARK-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3811: Assignee: Sean Owen More robust / standard Utils.deleteRecursively, Utils.createTempDir --- Key: SPARK-3811 URL: https://issues.apache.org/jira/browse/SPARK-3811 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.2.0 I noticed a few issues with how temp directories are created and deleted: *Minor* * Guava's {{Files.createTempDir()}} plus {{File.deleteOnExit()}} is used in many tests to make a temp dir, but {{Utils.createTempDir()}} seems to be the standard Spark mechanism * Call to {{File.deleteOnExit()}} could be pushed into {{Utils.createTempDir()}} as well, along with this replacement. * _I messed up the message in an exception in {{Utils}} in SPARK-3794; fixed here_ *Bit Less Minor* * {{Utils.deleteRecursively()}} fails immediately if any {{IOException}} occurs, instead of trying to delete any remaining files and subdirectories. I've observed this leave temp dirs around. I suggest changing it to continue in the face of an exception and throw one of the possibly several exceptions that occur at the end. * {{Utils.createTempDir()}} will add a JVM shutdown hook every time the method is called. Even if the subdir is the parent of another parent dir, since this check is inside the hook. However {{Utils}} manages a set of all dirs to delete on shutdown already, called {{shutdownDeletePaths}}. A single hook can be registered to delete all of these on exit. This is how Tachyon temp paths are cleaned up in {{TachyonBlockManager}}. I noticed a few other things that might be changed but wanted to ask first: * Shouldn't the set of dirs to delete be {{File}}, not just {{String}} paths? * {{Utils}} manages the set of {{TachyonFile}} that have been registered for deletion, but the shutdown hook is managed in {{TachyonBlockManager}}. 
Should this logic not live together, and not in {{Utils}}? It's more specific to Tachyon, and looks a bit odd to import in such a generic place. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3824) Spark SQL should cache in MEMORY_AND_DISK by default
[ https://issues.apache.org/jira/browse/SPARK-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3824. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2686 [https://github.com/apache/spark/pull/2686] Spark SQL should cache in MEMORY_AND_DISK by default Key: SPARK-3824 URL: https://issues.apache.org/jira/browse/SPARK-3824 Project: Spark Issue Type: Bug Components: SQL Reporter: Patrick Wendell Assignee: Cheng Lian Priority: Blocker Fix For: 1.2.0 Spark SQL currently uses MEMORY_ONLY as the default format. Due to the use of column buffers however, there is a huge cost to having to recompute blocks, much more so than Spark core. Especially since now we are more conservative about caching blocks and sometimes won't cache blocks we think might exceed memory, it seems good to keep persisted blocks on disk by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3834) Backticks not correctly handled in subquery aliases
[ https://issues.apache.org/jira/browse/SPARK-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3834. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2737 [https://github.com/apache/spark/pull/2737] Backticks not correctly handled in subquery aliases --- Key: SPARK-3834 URL: https://issues.apache.org/jira/browse/SPARK-3834 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Ravindra Pesala Priority: Blocker Fix For: 1.2.0 [~ravi.pesala] assigning to you since you fixed the last problem here. Let me know if you don't have time to work on this or if you have any questions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166168#comment-14166168 ]

Aaron Staple commented on SPARK-1503:
-------------------------------------
Hi, I'd like to try working on this ticket. If you'd like to assign it to me, I can write a short spec and then work on a PR.

Implement Nesterov's accelerated first-order method
    Key: SPARK-1503
    URL: https://issues.apache.org/jira/browse/SPARK-1503
    Project: Spark
    Issue Type: New Feature
    Components: MLlib
    Reporter: Xiangrui Meng

Nesterov's accelerated first-order method is a drop-in replacement for steepest descent, but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives.
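The ticket doesn't include code, but the core idea is easy to sketch in plain Python (this is an illustrative toy on a 1-D quadratic, not MLlib code; the learning rate and momentum values are arbitrary). The defining trick of Nesterov's method versus classical momentum is evaluating the gradient at a "look-ahead" point rather than at the current iterate:

```python
# Toy sketch of Nesterov's accelerated gradient (NOT MLlib code).
# Minimizes f(x) = 0.5 * x^2, whose gradient is simply x.
def nesterov(grad, x0, lr=0.1, momentum=0.9, steps=100):
    x, v = x0, 0.0
    for _ in range(steps):
        # Key difference from plain momentum: evaluate the gradient at
        # the look-ahead point x + momentum * v, not at x itself.
        g = grad(x + momentum * v)
        v = momentum * v - lr * g
        x = x + v
    return x

x_min = nesterov(lambda x: x, x0=10.0)  # converges toward the minimum at 0
```

A real MLlib implementation would apply the same update to a weight vector with gradients averaged over an RDD of examples, which is what the TFOCS reference covers in full generality.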
[jira] [Created] (SPARK-3890) remove redundant spark.executor.memory in doc
WangTaoTheTonic created SPARK-3890:
-----------------------------------
    Summary: remove redundant spark.executor.memory in doc
    Key: SPARK-3890
    URL: https://issues.apache.org/jira/browse/SPARK-3890
    Project: Spark
    Issue Type: Improvement
    Components: Documentation
    Reporter: WangTaoTheTonic
    Priority: Minor

It seems there is a redundant spark.executor.memory config item in the docs.
[jira] [Updated] (SPARK-3795) Add scheduler hooks/heuristics for adding and removing executors
[ https://issues.apache.org/jira/browse/SPARK-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-3795:
-----------------------------
    Affects Version/s: 1.1.0

Add scheduler hooks/heuristics for adding and removing executors
    Key: SPARK-3795
    URL: https://issues.apache.org/jira/browse/SPARK-3795
    Project: Spark
    Issue Type: Sub-task
    Components: Spark Core
    Affects Versions: 1.1.0
    Reporter: Patrick Wendell
    Assignee: Andrew Or

To support dynamic scaling of a Spark application, Spark's scheduler will need to have hooks around explicitly decommissioning executors. We'll also need basic heuristics governing when to start/stop executors based on load. An initial goal is to keep this very simple.
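The ticket asks for something "very simple"; a heuristic of that flavor can be sketched in a few lines of plain Python (the function name, thresholds, and policy below are all invented for illustration and are not Spark's actual API):

```python
# Hypothetical sketch of a simple load-based scaling heuristic (not
# Spark's actual API): add executors when tasks are backlogged, remove
# idle ones otherwise, never dropping below a configured floor.
def scale_decision(pending_tasks, idle_executors, total_executors,
                   backlog_threshold=10, min_executors=1):
    """Return >0 executors to add, <0 to remove, or 0 to hold steady."""
    if pending_tasks > backlog_threshold:
        # One extra executor per threshold's worth of backlog.
        return pending_tasks // backlog_threshold
    if idle_executors > 0:
        # Remove idle executors, but keep at least min_executors alive.
        return -min(idle_executors, total_executors - min_executors)
    return 0
```

For example, a backlog of 25 pending tasks against a threshold of 10 requests 2 more executors, while 3 idle executors out of 4 (with a floor of 1) releases all 3. The real scheduler hooks would additionally need timeouts so executors aren't churned on transient load spikes.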
[jira] [Created] (SPARK-3891) Support Hive Percentile UDAF with array of percentile values
Anand Mohan Tumuluri created SPARK-3891:
----------------------------------------
    Summary: Support Hive Percentile UDAF with array of percentile values
    Key: SPARK-3891
    URL: https://issues.apache.org/jira/browse/SPARK-3891
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.2.0
    Environment: Spark 1.2.0 trunk (ac302052870a650d56f2d3131c27755bb2960ad7) on CDH 5.1.0, CentOS 6.5, 8x 2GHz, 24GB RAM
    Reporter: Anand Mohan Tumuluri

Spark PR 2620 brings in support for the Hive percentile UDAF. However, the Hive percentile and percentile_approx UDAFs also support returning an array of percentile values, with the syntax percentile(BIGINT col, array(p1 [, p2]...)) or percentile_approx(DOUBLE col, array(p1 [, p2]...) [, B]).

These queries fail with the error below:

0: jdbc:hive2://dev-uuppala.sfohi.philips.com> select name, percentile(turnaroundtime, array(0, 0.25, 0.5, 0.75, 1)) from exam group by name;

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 25.0 failed 4 times, most recent failure: Lost task 1.3 in stage 25.0 (TID 305, Dev-uuppala.sfohi.philips.com): java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be cast to [Ljava.lang.Object;
    org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getListLength(StandardListObjectInspector.java:83)
    org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$ListConverter.convert(ObjectInspectorConverters.java:259)
    org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ConversionHelper.convertIfNecessary(GenericUDFUtils.java:349)
    org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge$GenericUDAFBridgeEvaluator.iterate(GenericUDAFBridge.java:170)
    org.apache.spark.sql.hive.HiveUdafFunction.update(hiveUdfs.scala:342)
    org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)
    org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
    org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
    org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    org.apache.spark.scheduler.Task.run(Task.scala:56)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)
Driver stacktrace: (state=,code=0)
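For readers unfamiliar with the array form of the UDAF, the expected semantics are straightforward: one result per requested percentile. A pure-Python illustration (this is not Hive's implementation; it just shows the intended output shape, using linear interpolation between sorted values):

```python
# Illustration of what percentile(col, array(p1, p2, ...)) should
# return: a list with one interpolated percentile per requested p.
# (Not Hive's actual implementation.)
def percentiles(values, ps):
    xs = sorted(values)
    out = []
    for p in ps:
        rank = p * (len(xs) - 1)          # fractional rank in [0, n-1]
        lo = int(rank)
        hi = min(lo + 1, len(xs) - 1)
        frac = rank - lo
        out.append(xs[lo] * (1 - frac) + xs[hi] * frac)
    return out

percentiles([1, 2, 3, 4, 5], [0, 0.25, 0.5, 0.75, 1])
# one value per requested percentile
```

The reported ClassCastException happens before any such computation: Spark SQL hands Hive's StandardListObjectInspector a Scala ArrayBuffer where Hive expects a Java Object[], so the array-of-percentiles argument never survives conversion.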
[jira] [Created] (SPARK-3892) Map type should have typeName
Adrian Wang created SPARK-3892:
-------------------------------
    Summary: Map type should have typeName
    Key: SPARK-3892
    URL: https://issues.apache.org/jira/browse/SPARK-3892
    Project: Spark
    Issue Type: Bug
    Reporter: Adrian Wang
[jira] [Created] (SPARK-3893) declare mutableMap/mutableSet explicitly
sjk created SPARK-3893:
-----------------------
    Summary: declare mutableMap/mutableSet explicitly
    Key: SPARK-3893
    URL: https://issues.apache.org/jira/browse/SPARK-3893
    Project: Spark
    Issue Type: Sub-task
    Components: Spark Core
    Affects Versions: 1.1.0
    Reporter: sjk

{code:java}
// current
val workers = new HashSet[WorkerInfo]

// suggested
val workers = new mutable.HashSet[WorkerInfo]
{code}

Another benefit is that the explicit mutable. prefix reminds us to ask whether an immutable collection could be used instead; most of the maps we currently use are mutable.