[jira] [Commented] (SPARK-1498) Spark can hang if pyspark tasks fail
[ https://issues.apache.org/jira/browse/SPARK-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199968#comment-14199968 ] Kay Ousterhout commented on SPARK-1498: --- I closed this since 0.9 seems pretty ancient now.

Spark can hang if pyspark tasks fail
Key: SPARK-1498 URL: https://issues.apache.org/jira/browse/SPARK-1498 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0, 0.9.1, 0.9.2 Reporter: Kay Ousterhout Fix For: 1.0.0

In pyspark, when some kinds of jobs fail, Spark hangs rather than returning an error. This is partially a scheduler problem -- the scheduler sometimes thinks failed tasks succeed, even though they have a stack trace and exception. You can reproduce this problem with:
{code}
ardd = sc.parallelize([(1,2,3), (4,5,6)])
brdd = sc.parallelize([(1,2,6), (4,5,9)])
ardd.join(brdd).count()
{code}
The last line will run forever (the problem in this code is that the RDD entries have 3 values instead of the expected 2). I haven't verified if this is a problem for 1.0 as well as 0.9. Thanks to Shivaram for helping diagnose this issue!
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1498) Spark can hang if pyspark tasks fail
[ https://issues.apache.org/jira/browse/SPARK-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-1498. --- Resolution: Fixed

Spark can hang if pyspark tasks fail
Key: SPARK-1498 URL: https://issues.apache.org/jira/browse/SPARK-1498 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0, 0.9.1, 0.9.2 Reporter: Kay Ousterhout Fix For: 1.0.0

In pyspark, when some kinds of jobs fail, Spark hangs rather than returning an error. This is partially a scheduler problem -- the scheduler sometimes thinks failed tasks succeed, even though they have a stack trace and exception. You can reproduce this problem with:
{code}
ardd = sc.parallelize([(1,2,3), (4,5,6)])
brdd = sc.parallelize([(1,2,6), (4,5,9)])
ardd.join(brdd).count()
{code}
The last line will run forever (the problem in this code is that the RDD entries have 3 values instead of the expected 2). I haven't verified if this is a problem for 1.0 as well as 0.9. Thanks to Shivaram for helping diagnose this issue!
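The failure mode above comes from join expecting (key, value) 2-tuples. A minimal pure-Python sketch of pair-RDD join semantics (not PySpark's actual implementation) shows why a 3-tuple breaks the join: unpacking each element as a pair fails, and in Spark 0.9 that worker-side exception was mistaken for success, hanging the job.

```python
from collections import defaultdict

def pair_join(a, b):
    """Inner join two lists of (key, value) pairs, like rdd.join()."""
    left = defaultdict(list)
    for k, v in a:  # raises ValueError if an element is not a 2-tuple
        left[k].append(v)
    return [(k, (v, w)) for k, w in b if k in left for v in left[k]]

# Well-formed (key, value) pairs join as expected.
good = pair_join([(1, 2), (4, 5)], [(1, 6), (4, 9)])

# 3-element tuples are not (key, value) pairs, so unpacking fails --
# the analogue of the worker-side exception the 0.9 scheduler mishandled.
try:
    pair_join([(1, 2, 3), (4, 5, 6)], [(1, 2, 6), (4, 5, 9)])
    failed = False
except ValueError:
    failed = True
```

With the fix from this issue, PySpark surfaces such an error instead of hanging.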
[jira] [Created] (SPARK-4266) Avoid $$ JavaScript for StagePages with huge numbers of tables
Kay Ousterhout created SPARK-4266: - Summary: Avoid $$ JavaScript for StagePages with huge numbers of tables Key: SPARK-4266 URL: https://issues.apache.org/jira/browse/SPARK-4266 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout

Some of the new JavaScript added to handle hiding metrics significantly slows the page load for stages with a lot of tasks (e.g., for a job with 10K tasks, it took over a minute for the page to finish loading in Chrome on my laptop). There are at least two issues here:
(1) The new table-striping JavaScript is much slower than the old CSS. The fancier JavaScript is only needed for the stage summary table, so we should change the task table back to using CSS so that it doesn't slow the page load for jobs with lots of tasks.
(2) The JavaScript associated with hiding metrics is expensive when jobs have lots of tasks, I think because the jQuery selectors have to traverse a much larger DOM. ID selectors are much more efficient, so we should consider switching to these, and/or avoiding this code in additional-metrics.js:
{code}
$("input:checkbox:not(:checked)").each(function() {
    var column = "table ." + $(this).attr("name");
    $(column).hide();
});
{code}
by initially hiding the data when we generate the page in the render function instead, which should be easy to do.
[jira] [Updated] (SPARK-4266) Avoid $$ JavaScript for StagePages with huge numbers of tables
[ https://issues.apache.org/jira/browse/SPARK-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4266: --- Priority: Critical (was: Major)

Avoid $$ JavaScript for StagePages with huge numbers of tables
Key: SPARK-4266 URL: https://issues.apache.org/jira/browse/SPARK-4266 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout Priority: Critical

Some of the new JavaScript added to handle hiding metrics significantly slows the page load for stages with a lot of tasks (e.g., for a job with 10K tasks, it took over a minute for the page to finish loading in Chrome on my laptop). There are at least two issues here:
(1) The new table-striping JavaScript is much slower than the old CSS. The fancier JavaScript is only needed for the stage summary table, so we should change the task table back to using CSS so that it doesn't slow the page load for jobs with lots of tasks.
(2) The JavaScript associated with hiding metrics is expensive when jobs have lots of tasks, I think because the jQuery selectors have to traverse a much larger DOM. ID selectors are much more efficient, so we should consider switching to these, and/or avoiding this code in additional-metrics.js:
{code}
$("input:checkbox:not(:checked)").each(function() {
    var column = "table ." + $(this).attr("name");
    $(column).hide();
});
{code}
by initially hiding the data when we generate the page in the render function instead, which should be easy to do.
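Point (2) above proposes hiding hidden-by-default metric columns at render time rather than with a jQuery pass after page load. A hypothetical server-side sketch (names are illustrative, not Spark's actual render code) of what that looks like:

```python
# Emit hidden-by-default metric cells with an inline style at render time,
# so the browser never needs to traverse the DOM to hide them afterward.
def render_cell(value, metric, hidden_metrics):
    style = ' style="display:none"' if metric in hidden_metrics else ""
    return f'<td class="{metric}"{style}>{value}</td>'

# One task row with two metric columns; "gc_time" is hidden by default.
row = "".join(
    render_cell(v, m, hidden_metrics={"gc_time"})
    for m, v in [("duration", "12 s"), ("gc_time", "1 s")]
)
```

The checkbox handler then only needs to toggle visibility for the (few) columns the user actually clicks, instead of scanning every unchecked column on load.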
[jira] [Resolved] (SPARK-4255) Table striping is incorrect on page load
[ https://issues.apache.org/jira/browse/SPARK-4255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-4255. --- Resolution: Fixed Fix Version/s: 1.2.0 Fixed by https://github.com/apache/spark/commit/2c84178b8283269512b1c968b9995a7bdedd7aa5 Table striping is incorrect on page load Key: SPARK-4255 URL: https://issues.apache.org/jira/browse/SPARK-4255 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor Fix For: 1.2.0 Currently, table striping (where every other row is grey, to aid readability) happens before table rows get hidden (because some metrics are hidden by default). If an odd number of contiguous table rows are hidden on page load, this means adjacent rows can both end up the same color. We should fix the table striping to happen after row hiding.
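The striping-after-hiding fix can be sketched in a few lines of pure Python (illustrative only, not the UI's actual code): compute alternating stripe classes over the *visible* rows, so a hidden row never advances the stripe counter.

```python
def stripe(rows):
    """rows: list of (name, hidden) pairs -> list of (name, css_class).

    Alternates "even"/"odd" classes across visible rows only, so two
    adjacent visible rows can never end up the same color.
    """
    out, visible_idx = [], 0
    for name, hidden in rows:
        if hidden:
            out.append((name, "hidden"))
        else:
            out.append((name, "odd" if visible_idx % 2 else "even"))
            visible_idx += 1
    return out

result = stripe([("r1", False), ("r2", True), ("r3", False)])
```

Striping before hiding would have given r1 and r3 the classes "even" and "even" once r2 disappeared; counting visible rows gives "even" and "odd".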
[jira] [Resolved] (SPARK-4186) Support binaryFiles and binaryRecords API in Python
[ https://issues.apache.org/jira/browse/SPARK-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-4186. -- Resolution: Fixed Fix Version/s: 1.2.0 Support binaryFiles and binaryRecords API in Python --- Key: SPARK-4186 URL: https://issues.apache.org/jira/browse/SPARK-4186 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Reporter: Matei Zaharia Assignee: Davies Liu Fix For: 1.2.0 After SPARK-2759, we should expose these methods in Python. Shouldn't be too hard to add.
[jira] [Updated] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated SPARK-4267: -- Description: Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0, so I compiled with protobuf 2.5.0 like this:
{code}
./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0
{code}
Then Spark on YARN fails to launch jobs with an NPE:
{code}
$ bin/spark-shell --master yarn-client
scala> sc.textFile("hdfs:///user/ozawa/wordcountInput20G").flatMap(line => line.split(" ")).map(word => (word, 1)).persist().reduceByKey((a, b) => a + b, 16).saveAsTextFile("hdfs:///user/ozawa/sparkWordcountOutNew2");
java.lang.NullPointerException
  at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284)
  at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291)
  at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480)
  at $iwC$$iwC$$iwC$$iwC.<init>(<console>:13)
  at $iwC$$iwC$$iwC.<init>(<console>:18)
  at $iwC$$iwC.<init>(<console>:20)
  at $iwC.<init>(<console>:22)
  at <init>(<console>:24)
  at .<init>(<console>:28)
  at .<clinit>(<console>)
  at .<init>(<console>:7)
  at .<clinit>(<console>)
  at $print(<console>)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
  at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
  at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
  at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823)
  at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868)
  at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780)
{code}
[jira] [Created] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
Tsuyoshi OZAWA created SPARK-4267: - Summary: Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Reporter: Tsuyoshi OZAWA

Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0, so I compiled with protobuf 2.5.0 like this:
{code}
./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0
{code}
Then Spark on YARN fails to launch jobs with an NPE:
{code}
$ bin/spark-shell --master yarn-client
scala> sc.textFile("hdfs:///user/ozawa/wordcountInput20G").flatMap(line => line.split(" ")).map(word => (word, 1)).persist().reduceByKey((a, b) => a + b, 16).saveAsTextFile("hdfs:///user/ozawa/sparkWordcountOutNew2");
java.lang.NullPointerException
  at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284)
  at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291)
  at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480)
  at $iwC$$iwC$$iwC$$iwC.<init>(<console>:13)
  at $iwC$$iwC$$iwC.<init>(<console>:18)
  at $iwC$$iwC.<init>(<console>:20)
  at $iwC.<init>(<console>:22)
  at <init>(<console>:24)
  at .<init>(<console>:28)
  at .<clinit>(<console>)
  at .<init>(<console>:7)
  at .<clinit>(<console>)
  at $print(<console>)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
  at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
  at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
  at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823)
  at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868)
{code}
[jira] [Updated] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated SPARK-4267: -- Description: Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0, so I compiled with protobuf 2.5.0 like this:
{code}
./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0
{code}
Then Spark on YARN fails to launch jobs with an NPE:
{code}
$ bin/spark-shell --master yarn-client
scala> sc.textFile("hdfs:///user/ozawa/wordcountInput20G").flatMap(line => line.split(" ")).map(word => (word, 1)).persist().reduceByKey((a, b) => a + b, 16).saveAsTextFile("hdfs:///user/ozawa/sparkWordcountOutNew2");
java.lang.NullPointerException
  at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284)
  at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291)
  at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480)
  at $iwC$$iwC$$iwC$$iwC.<init>(<console>:13)
  at $iwC$$iwC$$iwC.<init>(<console>:18)
  at $iwC$$iwC.<init>(<console>:20)
  at $iwC.<init>(<console>:22)
  at <init>(<console>:24)
  at .<init>(<console>:28)
  at .<clinit>(<console>)
  at .<init>(<console>:7)
  at .<clinit>(<console>)
  at $print(<console>)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
  at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
  at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
  at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823)
  at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868)
  at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780)
{code}
[jira] [Commented] (SPARK-4239) support view in HiveQL
[ https://issues.apache.org/jira/browse/SPARK-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1420#comment-1420 ] Apache Spark commented on SPARK-4239: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/3131 support view in HiveQL -- Key: SPARK-4239 URL: https://issues.apache.org/jira/browse/SPARK-4239 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang
[jira] [Created] (SPARK-4268) Use #::: to get benefit from Stream in SqlLexical.allCaseVersions
Shixiong Zhu created SPARK-4268: --- Summary: Use #::: to get benefit from Stream in SqlLexical.allCaseVersions Key: SPARK-4268 URL: https://issues.apache.org/jira/browse/SPARK-4268 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Trivial `allCaseVersions` uses `++` to concatenate two Streams. However, to keep the benefit of Stream's laziness, using `#:::` would be better.
[jira] [Commented] (SPARK-4268) Use #::: to get benefit from Stream in SqlLexical.allCaseVersions
[ https://issues.apache.org/jira/browse/SPARK-4268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1428#comment-1428 ] Apache Spark commented on SPARK-4268: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3132 Use #::: to get benefit from Stream in SqlLexical.allCaseVersions - Key: SPARK-4268 URL: https://issues.apache.org/jira/browse/SPARK-4268 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Trivial `allCaseVersions` uses `++` to concatenate two Streams. However, to keep the benefit of Stream's laziness, using `#:::` would be better.
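The eager-vs-lazy concatenation distinction behind this issue can be shown with a Python analogy (not Scala): eagerly building a concatenated list materializes everything up front, while chaining iterators produces elements only as they are consumed, which is the benefit `#:::` preserves for Scala Streams.

```python
import itertools

consumed = []

def trace(it):
    """Wrap an iterator and record each element as it is actually pulled."""
    for x in it:
        consumed.append(x)
        yield x

# Lazy concatenation: nothing is consumed until someone asks for elements.
lazy = itertools.chain(trace(iter([1, 2])), trace(iter([3, 4])))
first = next(lazy)          # only the first element is produced
consumed_after_one = list(consumed)

rest = list(lazy)           # draining the chain consumes the remainder
```

An eager `list(xs) + list(ys)` would have recorded all four elements before the first one was even used; the chained version touches only what the caller demands.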
[jira] [Created] (SPARK-4269) Make wait time in BroadcastHashJoin configurable
Jacky Li created SPARK-4269: --- Summary: Make wait time in BroadcastHashJoin configurable Key: SPARK-4269 URL: https://issues.apache.org/jira/browse/SPARK-4269 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jacky Li Fix For: 1.2.0 In BroadcastHashJoin, a hard-coded value (5 minutes) is currently used as the time to wait for the execution and broadcast of the small table. In my opinion, it should be a configurable value, since the broadcast may exceed 5 minutes in some cases, e.g. in a busy or congested network environment.
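A minimal sketch of the proposal, in Python for illustration: read the wait time from a configuration map with the current hard-coded value as the default. The config key name here is an assumption for the example, not a setting confirmed by this issue.

```python
# Hard-coded today; becomes the *default* once the value is configurable.
DEFAULT_BROADCAST_TIMEOUT_S = 5 * 60

def broadcast_timeout(conf):
    """Return the broadcast wait time in seconds.

    Falls back to the old 5-minute constant when the (hypothetical)
    key is absent, so existing deployments keep their behavior.
    """
    return int(conf.get("spark.sql.broadcastTimeout", DEFAULT_BROADCAST_TIMEOUT_S))

default = broadcast_timeout({})
tuned = broadcast_timeout({"spark.sql.broadcastTimeout": "600"})
```

Defaulting to the previous constant is the usual way to make a hard-coded value configurable without changing behavior for anyone who does not set it.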
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200037#comment-14200037 ] zzc commented on SPARK-2468: Hi Aaron Davidson, I set spark.shuffle.blockTransferService=netty and spark.shuffle.io.mode=nio and ran successfully on CentOS 5.8 with 12G of files, but when I set spark.shuffle.blockTransferService=netty and spark.shuffle.io.mode=epoll, there is an error:
{code}
Exception in thread main java.lang.UnsatisfiedLinkError: /tmp/libnetty-transport-native-epoll7072694982027222413.so: /lib64/libc.so.6: version `GLIBC_2.10' not found
{code}
I only find GLIBC_2.5 on CentOS 5.8 and cannot upgrade it. How can I resolve this?

Netty-based block server / client module
Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0

Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user-space buffer, and then back into a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and context-switches between kernel and user space. It also creates unnecessary buffers in the JVM that increase GC pressure. Instead, we should use FileChannel.transferTo, which handles this in kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty-based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default.
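The zero-copy path the issue describes (the Python equivalent of Java's FileChannel.transferTo) can be demonstrated on Linux with os.sendfile: the kernel copies file bytes straight to the socket buffer, with no round trip through a user-space buffer. This is a minimal illustration, not Spark's shuffle code.

```python
import os
import socket
import tempfile

payload = b"shuffle block bytes"

# A file standing in for an on-disk shuffle block.
with tempfile.TemporaryFile() as src:
    src.write(payload)
    src.flush()

    # A connected socket pair standing in for the network transfer.
    out_sock, in_sock = socket.socketpair()

    # Kernel-space copy: no read() into a Python/user-space buffer.
    sent = os.sendfile(out_sock.fileno(), src.fileno(), 0, len(payload))
    out_sock.close()

    received = in_sock.recv(1024)
    in_sock.close()
```

The read/write alternative would cross the kernel/user boundary twice per block and allocate JVM (or here, Python) buffers along the way, which is exactly the overhead the Netty module avoids.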
[jira] [Created] (SPARK-4270) Fix Cast from DateType to DecimalType.
Takuya Ueshin created SPARK-4270: Summary: Fix Cast from DateType to DecimalType. Key: SPARK-4270 URL: https://issues.apache.org/jira/browse/SPARK-4270 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin {{Cast}} from {{DateType}} to {{DecimalType}} throws {{NullPointerException}}.
[jira] [Commented] (SPARK-2815) Compilation failed upon the hadoop version 2.0.0-cdh4.5.0
[ https://issues.apache.org/jira/browse/SPARK-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200046#comment-14200046 ] Oliver Bye commented on SPARK-2815: --- This is still an issue for CDH4.6. I can confirm https://github.com/apache/spark/pull/151/files does fix this issue, specifically the patch to yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala. I've applied the 2-line change based on v1.1.0 and made it available here:
{noformat}
git clone https://github.com/olibye/spark.git --branch SPARK-2815 --single-branch
mvn -T 1C -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.6.0 -DskipTests install
{noformat}
I've not created a pull request, as this appears to be an unsupported branch, and it would just be a repeat. Compilation failed upon the hadoop version 2.0.0-cdh4.5.0 - Key: SPARK-2815 URL: https://issues.apache.org/jira/browse/SPARK-2815 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: pengyanhong Assignee: Guoqiang Li compile fail via SPARK_HADOOP_VERSION=2.0.0-cdh4.5.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt assembly, finally get error message : [error] (yarn-stable/compile:compile) Compilation failed, the following is the detail error on console: [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:26: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.YarnClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:40: not found: value YarnClient [error] val yarnClient = YarnClient.createYarnClient [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:32: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error] 
/Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:33: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:36: object util is not a member of package org.apache.hadoop.yarn.webapp [error] import org.apache.hadoop.yarn.webapp.util.WebAppUtils [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:64: value RM_AM_MAX_ATTEMPTS is not a member of object org.apache.hadoop.yarn.conf.YarnConfiguration [error] YarnConfiguration.RM_AM_MAX_ATTEMPTS, YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:66: not found: type AMRMClient [error] private var amClient: AMRMClient[ContainerRequest] = _ [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:92: not found: value AMRMClient [error] amClient = AMRMClient.createAMRMClient() [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:137: not found: value WebAppUtils [error] val proxy = WebAppUtils.getProxyHostAndPort(conf) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:40: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:618: not found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] 
/Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:596: not found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:577: not found: type AMRMClient [error] amClient:
[jira] [Commented] (SPARK-4270) Fix Cast from DateType to DecimalType.
[ https://issues.apache.org/jira/browse/SPARK-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200045#comment-14200045 ] Apache Spark commented on SPARK-4270: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/3134 Fix Cast from DateType to DecimalType. -- Key: SPARK-4270 URL: https://issues.apache.org/jira/browse/SPARK-4270 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin {{Cast}} from {{DateType}} to {{DecimalType}} throws {{NullPointerException}}.
[jira] [Comment Edited] (SPARK-2815) Compilation failed upon the hadoop version 2.0.0-cdh4.5.0
[ https://issues.apache.org/jira/browse/SPARK-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200046#comment-14200046 ] Oliver Bye edited comment on SPARK-2815 at 11/6/14 10:11 AM: - This is still an issue for CDH4.6. I can confirm https://github.com/apache/spark/pull/151/files does fix this issue, specifically the patch to yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala. I've applied the 2-line change based on v1.1.0 and made it available here:
{noformat}
git clone https://github.com/olibye/spark.git --branch SPARK-2815 --single-branch
mvn -T 1C -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.6.0 -DskipTests install
{noformat}
I've not created a pull request, as this appears to be an unsupported branch, and it would just be a repeat. was (Author: olibye): This is still an issue for CDH4.6. I can confirm https://github.com/apache/spark/pull/151/files does fix this issue, specifically the patch to yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala. I've applied the 2-line change based on v1.1.0 and made it available here: git clone https://github.com/olibye/spark.git --branch SPARK-2815 --single-branch mvn -T 1C -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.6.0 -DskipTests install I've not created a pull request, as this appears to be an unsupported branch, and it would just be a repeat. 
Compilation failed upon the hadoop version 2.0.0-cdh4.5.0 - Key: SPARK-2815 URL: https://issues.apache.org/jira/browse/SPARK-2815 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: pengyanhong Assignee: Guoqiang Li compile fail via SPARK_HADOOP_VERSION=2.0.0-cdh4.5.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt assembly, finally get error message : [error] (yarn-stable/compile:compile) Compilation failed, the following is the detail error on console: [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:26: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.YarnClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:40: not found: value YarnClient [error] val yarnClient = YarnClient.createYarnClient [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:32: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:33: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:36: object util is not a member of package org.apache.hadoop.yarn.webapp [error] import org.apache.hadoop.yarn.webapp.util.WebAppUtils [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:64: value RM_AM_MAX_ATTEMPTS is not a member of object org.apache.hadoop.yarn.conf.YarnConfiguration [error] YarnConfiguration.RM_AM_MAX_ATTEMPTS, 
YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:66: not found: type AMRMClient [error] private var amClient: AMRMClient[ContainerRequest] = _ [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:92: not found: value AMRMClient [error] amClient = AMRMClient.createAMRMClient() [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:137: not found: value WebAppUtils [error] val proxy = WebAppUtils.getProxyHostAndPort(conf) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:40: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error]
[jira] [Updated] (SPARK-3000) drop old blocks to disk in parallel when memory is not large enough for caching new blocks
[ https://issues.apache.org/jira/browse/SPARK-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhang, Liye updated SPARK-3000: --- Attachment: Spark-3000 Design Doc.pdf drop old blocks to disk in parallel when memory is not large enough for caching new blocks -- Key: SPARK-3000 URL: https://issues.apache.org/jira/browse/SPARK-3000 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Zhang, Liye Assignee: Zhang, Liye Attachments: Spark-3000 Design Doc.pdf In Spark, an RDD can be cached in memory for later use, and the cached memory size is *spark.executor.memory * spark.storage.memoryFraction* for Spark versions before 1.1.0, and *spark.executor.memory * spark.storage.memoryFraction * spark.storage.safetyFraction* after [SPARK-1777|https://issues.apache.org/jira/browse/SPARK-1777]. For storage level *MEMORY_AND_DISK*, when free memory is not enough to cache new blocks, old blocks might be dropped to disk to free up memory for new blocks. This operation is processed by _ensureFreeSpace_ in _MemoryStore.scala_; an *accountingLock* is always held by the caller to ensure only one thread is dropping blocks. This method cannot fully use the disks' throughput when there are multiple disks on the worker node. When testing our workload, we found this is a real bottleneck when the size of the old blocks to be dropped is large. We have tested the parallel method on Spark 1.0, and the speedup is significant. So it's necessary to make the block-dropping operation parallel.
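A toy sketch of the proposal (the block/disk model is illustrative, not Spark's MemoryStore): instead of dropping evicted blocks serially under one lock, hand them to a thread pool so writes to different disks can proceed concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def drop_blocks_parallel(blocks, disks, max_workers=4):
    """Assign each block to a disk round-robin and 'write' them concurrently.

    Serial dropping under a single accountingLock can drive at most one
    disk at a time; a pool lets writes to different disks overlap.
    """
    placed = {}

    def drop(i, block):
        disk = disks[i % len(disks)]
        placed[block] = disk  # stands in for the actual disk write
        return block

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() blocks until every drop has completed, like waiting for
        # all evictions to finish before admitting the new block.
        list(pool.map(drop, range(len(blocks)), blocks))
    return placed

placed = drop_blocks_parallel(["b0", "b1", "b2"], ["/disk1", "/disk2"])
```

The real change also has to keep the memory-accounting bookkeeping consistent while the writes run in parallel, which is what the attached design doc addresses.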
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200096#comment-14200096 ] Lianhui Wang commented on SPARK-2468: - [~adav] I used your branch and the memory overhead on YARN still exists. [~zzcclp] How about your test? Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user-space buffer, and then back into a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and context-switches between kernel and user space. It also creates unnecessary buffers in the JVM that increase GC pressure. Instead, we should use FileChannel.transferTo, which handles this in kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty-based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default.
[jira] [Created] (SPARK-4271) jetty Server can't tryport+1
honestold3 created SPARK-4271: - Summary: jetty Server can't tryport+1 Key: SPARK-4271 URL: https://issues.apache.org/jira/browse/SPARK-4271 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: operating system language is Chinese. Reporter: honestold3 If the operating system language is not English, org.apache.spark.util.Utils.isBindCollision cannot match the BindException message, so the Jetty server can't try port+1.
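The report boils down to matching on a localized exception message, which is fragile; checking the error code is locale-independent. A minimal hedged sketch (the helper `is_bind_collision` is an illustration, not the Spark method):

```python
import errno
import socket

def is_bind_collision(exc):
    # Locale-independent: inspect the error code, not the (possibly
    # translated) message text that broke the Chinese-locale case above.
    return isinstance(exc, OSError) and exc.errno == errno.EADDRINUSE

s1 = socket.socket()
s1.bind(("127.0.0.1", 0))          # grab an ephemeral port
port = s1.getsockname()[1]
s2 = socket.socket()
try:
    s2.bind(("127.0.0.1", port))   # second bind on the same port collides
    collided = False
except OSError as e:
    collided = is_bind_collision(e)
finally:
    s1.close()
    s2.close()
```

A caller that detects the collision this way can then retry on port+1 regardless of the OS language.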
[jira] [Commented] (SPARK-4271) jetty Server can't tryport+1
[ https://issues.apache.org/jira/browse/SPARK-4271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200128#comment-14200128 ] Apache Spark commented on SPARK-4271: - User 'honestold3' has created a pull request for this issue: https://github.com/apache/spark/pull/3135 jetty Server can't tryport+1 Key: SPARK-4271 URL: https://issues.apache.org/jira/browse/SPARK-4271 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: operating system language is Chinese. Reporter: honestold3 If the operating system language is not English, org.apache.spark.util.Utils.isBindCollision cannot match the BindException message, so the Jetty server can't try port+1.
[jira] [Resolved] (SPARK-4271) jetty Server can't tryport+1
[ https://issues.apache.org/jira/browse/SPARK-4271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4271. -- Resolution: Duplicate Well, here's a good example. I'm sure this is a duplicate of https://issues.apache.org/jira/browse/SPARK-4169, which is solved better in its PR, and that PR is still awaiting review/commit. jetty Server can't tryport+1 Key: SPARK-4271 URL: https://issues.apache.org/jira/browse/SPARK-4271 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: operating system language is Chinese. Reporter: honestold3 If the operating system language is not English, org.apache.spark.util.Utils.isBindCollision cannot match the BindException message, so the Jetty server can't try port+1.
[jira] [Created] (SPARK-4272) Add more unwrap functions for primitive type in TableReader
Cheng Hao created SPARK-4272: Summary: Add more unwrap functions for primitive type in TableReader Key: SPARK-4272 URL: https://issues.apache.org/jira/browse/SPARK-4272 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor Currently, data unwrapping supports only a couple of the primitive types, not all of them. This does not cause an exception, but it costs some performance in table scanning for types like binary, date, timestamp, decimal, etc.
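The idea can be sketched as a dispatch table: types with a registered fast-path unwrapper are converted directly, while anything missing from the table (the situation described in the ticket) falls through to a generic, slower path. This is a pure-Python illustration; the table contents are assumptions, not Hive's ObjectInspector API:

```python
import datetime
import decimal

# Hypothetical fast-path unwrappers keyed by runtime type. Types missing
# from the table fall back to a generic, slower conversion.
FAST_UNWRAP = {
    int: lambda v: v,
    float: lambda v: v,
    bytes: lambda v: v,                    # binary
    datetime.date: lambda v: v,
    decimal.Decimal: lambda v: float(v),
}

def unwrap(value):
    fast = FAST_UNWRAP.get(type(value))
    if fast is not None:
        return fast(value)
    return str(value)                      # generic slow path
```

Adding more entries to the table is the moral equivalent of the PR: it widens the fast path without changing behavior for already-supported types.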
[jira] [Commented] (SPARK-4272) Add more unwrap functions for primitive type in TableReader
[ https://issues.apache.org/jira/browse/SPARK-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200160#comment-14200160 ] Apache Spark commented on SPARK-4272: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3136 Add more unwrap functions for primitive type in TableReader --- Key: SPARK-4272 URL: https://issues.apache.org/jira/browse/SPARK-4272 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor Currently, the data unwrap only support couple of primitive types, not all, it will not cause exception, but may get some performance in table scanning for the type like binary, date, timestamp, decimal etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4273) Providing ExternalSet to avoid OOM when count(distinct)
YanTang Zhai created SPARK-4273: --- Summary: Providing ExternalSet to avoid OOM when count(distinct) Key: SPARK-4273 URL: https://issues.apache.org/jira/browse/SPARK-4273 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: YanTang Zhai Priority: Minor Some tasks may OOM during count(distinct) if they need to process many records. CombineSetsAndCountFunction puts all records into an OpenHashSet; if it fetches many records, it may occupy a large amount of memory. I think a data structure ExternalSet, like ExternalAppendOnlyMap, could be provided to store OpenHashSet data on disk when its capacity exceeds some threshold. For example, OpenHashSet1 (ohs1) has [d, b, c, a]. It is spilled to file1 sorted by hashCode, so file1 contains [a, b, c, d]. The procedure could be indicated as follows: ohs1 [d, b, c, a] -> [a, b, c, d] -> file1; ohs2 [e, f, g, a] -> [a, e, f, g] -> file2; ohs3 [e, h, i, g] -> [e, g, h, i] -> file3; ohs4 [j, h, a] -> [a, h, j] -> sortedSet. On output, all keys with the same hashCode are put into one OpenHashSet, and then the iterator of that OpenHashSet is accessed. The procedure could be indicated as follows: file1 -> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a; file1 -> b -> ohsB; ohsB -> b; file1 -> c -> ohsC; ohsC -> c; file1 -> d -> ohsD; ohsD -> d; file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE -> e; ... I think using the ExternalSet could avoid OOM in count(distinct). Comments welcome.
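The spill-and-merge procedure described above can be sketched in pure Python: when the in-memory set exceeds a threshold it is spilled into hash-partitioned buckets (standing in for the sorted spill files), and the final count deduplicates each bucket separately, so the full distinct set is never held at once. Names and thresholds are illustrative only:

```python
def external_distinct_count(records, num_partitions=4, memory_limit=3):
    """Count distinct records without keeping them all in one set.

    `spills[p]` stands in for an on-disk spill file; hash partitioning
    guarantees every copy of a value lands in the same partition, so
    partitions can be deduplicated independently of one another.
    """
    spills = [[] for _ in range(num_partitions)]
    current = set()

    def spill():
        for r in current:
            spills[hash(r) % num_partitions].append(r)
        current.clear()

    for r in records:
        current.add(r)
        if len(current) >= memory_limit:   # "capacity exceeds some threshold"
            spill()
    spill()                                # flush the remainder
    return sum(len(set(partition)) for partition in spills)
```

Peak memory is bounded by the threshold plus one partition's distinct values, which is the property that avoids the OOM.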
[jira] [Commented] (SPARK-4273) Providing ExternalSet to avoid OOM when count(distinct)
[ https://issues.apache.org/jira/browse/SPARK-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200172#comment-14200172 ] Apache Spark commented on SPARK-4273: - User 'YanTangZhai' has created a pull request for this issue: https://github.com/apache/spark/pull/3137 Providing ExternalSet to avoid OOM when count(distinct) --- Key: SPARK-4273 URL: https://issues.apache.org/jira/browse/SPARK-4273 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: YanTang Zhai Priority: Minor Some task may OOM when count(distinct) if it needs to process many records. CombineSetsAndCountFunction puts all records into an OpenHashSet, if it fetchs many records, it may occupy large memory. I think a data structure ExternalSet like ExternalAppendOnlyMap could be provided to store OpenHashSet data in disks when it's capacity exceeds some threshold. For example, OpenHashSet1(ohs1) has [d, b, c, a]. It is spilled to file1 with hashCode sorted, then the file1 contains [a, b, c, d]. The procedure could be indicated as follows: ohs1 [d, b, c, a] = [a, b, c, d] = file1 ohs2 [e, f, g, a] = [a, e, f, g] = file2 ohs3 [e, h, i, g] = [e, g, h, i] = file3 ohs4 [j, h, a] = [a, h, j] = sortedSet When output, all keys with the same hashCode will be put into a OpenHashSet, then the iterator of this OpenHashSet is accessing. The procedure could be indicated as follows: file1- a - ohsA; file2 - a - ohsA; sortedSet - a - ohsA; ohsA - a; file1 - b - ohsB; ohsB - b; file1 - c - ohsC; ohsC - c; file1 - d - ohsD; ohsD - d; file2 - e - ohsE; file3 - e - ohsE; ohsE- e; ... I think using the ExternalSet could avoid OOM when count(distinct). Welcomes comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4249) A problem of EdgePartitionBuilder in Graphx
[ https://issues.apache.org/jira/browse/SPARK-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200188#comment-14200188 ] Apache Spark commented on SPARK-4249: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/3138 A problem of EdgePartitionBuilder in Graphx --- Key: SPARK-4249 URL: https://issues.apache.org/jira/browse/SPARK-4249 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0 Reporter: Cookies Priority: Minor https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartitionBuilder.scala#L48 In the function toEdgePartition, the snippet index.update(srcIds(0), 0) runs while the elements of the array srcIds are still uninitialized and all 0. The effect is that if the vertex ids don't contain 0, indexSize equals (realIndexSize + 1) and a spurious 0 is added to the `index`. It seems that all versions have this problem.
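The off-by-one can be reproduced in miniature: in the buggy version the index is seeded from a pre-allocated, still-zeroed srcIds array, so the key 0 sneaks in whenever the real vertex ids don't contain it. This is a hedged pure-Python model of the Scala code, not the code itself:

```python
def build_index_buggy(edges):
    # src_ids is pre-allocated; reading src_ids[0] before it is filled
    # yields the default value 0, mirroring index.update(srcIds(0), 0).
    src_ids = [0] * len(edges)
    index = {src_ids[0]: 0}               # spuriously seeds key 0
    for i, (src, _dst) in enumerate(edges):
        src_ids[i] = src
        if src not in index:
            index[src] = i
    return index

def build_index_fixed(edges):
    # Seed the index only from values that have actually been written.
    index = {}
    for i, (src, _dst) in enumerate(edges):
        index.setdefault(src, i)
    return index
```

For edges whose source ids are {5, 7}, the buggy index ends up with three keys (0, 5, 7) instead of two, which is exactly the indexSize == realIndexSize + 1 effect described above.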
[jira] [Created] (SPARK-4274) Hive comparison test framework doesn't print effective information while logical plan analyzing failed
Cheng Hao created SPARK-4274: Summary: Hive comparison test framework doesn't print effective information while logical plan analyzing failed Key: SPARK-4274 URL: https://issues.apache.org/jira/browse/SPARK-4274 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor The Hive comparison test framework doesn't print an informative message if the unit test fails during logical plan analysis.
[jira] [Commented] (SPARK-4274) Hive comparison test framework doesn't print effective information while logical plan analyzing failed
[ https://issues.apache.org/jira/browse/SPARK-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200228#comment-14200228 ] Apache Spark commented on SPARK-4274: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3139 Hive comparison test framework doesn't print effective information while logical plan analyzing failed --- Key: SPARK-4274 URL: https://issues.apache.org/jira/browse/SPARK-4274 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor The Hive comparison test framework doesn't print an informative message if the unit test fails during logical plan analysis.
[jira] [Commented] (SPARK-3913) Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn Application Listener and killApplication() API
[ https://issues.apache.org/jira/browse/SPARK-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200270#comment-14200270 ] Thomas Graves commented on SPARK-3913: -- Note that for item 3 you should be able to use the yarn cli command to kill the application: 'yarn kill -applicationId appid'. Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn Application Listener and killApplication() API -- Key: SPARK-3913 URL: https://issues.apache.org/jira/browse/SPARK-3913 Project: Spark Issue Type: Improvement Components: YARN Reporter: Chester When working with Spark in YARN deployment mode, we have three issues: 1) We don't know the YARN maximum capacity (memory and cores) before we specify the number of executors and the memory for the Spark driver and executors. If we set a big number, the job can potentially exceed the limit and get killed. It would be better to let the application know the YARN resource capacity ahead of time so the Spark config can be adjusted dynamically. 2) Once the job has started, we would like some feedback from the YARN application. Currently, the Spark client basically blocks the call and returns when the job is finished, failed, or killed. If the job runs for a few hours, we have no idea how far it has gone, the progress and resource usage, the tracking URL, etc. 3) Once the job is started, you basically can't stop it. The YARN client API stop doesn't work in most cases in our experience. But the YARN API that does work is killApplication(appId). So we need to expose this killApplication() API to the Spark YARN client as well. I will create one pull request and try to address these problems.
[jira] [Commented] (SPARK-4238) Perform network-level retry of shuffle file fetches
[ https://issues.apache.org/jira/browse/SPARK-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200334#comment-14200334 ] Nan Zhu commented on SPARK-4238: Is it related to https://issues.apache.org/jira/browse/SPARK-4188? Perform network-level retry of shuffle file fetches --- Key: SPARK-4238 URL: https://issues.apache.org/jira/browse/SPARK-4238 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical During periods of high network (or GC) load, it is not uncommon for IOExceptions to crop up around connection failures when fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files, which causes us to mark the executor as lost and recompute all of its shuffle outputs. We should allow retrying at the network level in the event of an IOException in order to avoid this circumstance.
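The proposed behavior can be sketched as a small retry wrapper: transient IOExceptions are retried at the fetch level a bounded number of times before being surfaced, instead of immediately treating the executor as lost. Illustrative only; the names and retry policy are assumptions, not Spark's shuffle client:

```python
import time

def fetch_with_retry(fetch, max_retries=3, backoff_s=0.0):
    """Call `fetch`, retrying transient IOErrors up to `max_retries` times."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except IOError:
            if attempt == max_retries:
                raise                      # exhausted: surface the failure
            time.sleep(backoff_s)          # simple fixed backoff

# Simulated flaky network fetch: fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("connection reset during shuffle fetch")
    return b"shuffle-block"
```

With the wrapper, the two transient failures never escalate; only a fetch that keeps failing past the retry budget is reported upward.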
[jira] [Commented] (SPARK-2815) Compilation failed upon the hadoop version 2.0.0-cdh4.5.0
[ https://issues.apache.org/jira/browse/SPARK-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200356#comment-14200356 ] Apache Spark commented on SPARK-2815: - User 'pengyanhong' has created a pull request for this issue: https://github.com/apache/spark/pull/3140 Compilation failed upon the hadoop version 2.0.0-cdh4.5.0 - Key: SPARK-2815 URL: https://issues.apache.org/jira/browse/SPARK-2815 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: pengyanhong Assignee: Guoqiang Li compile fail via SPARK_HADOOP_VERSION=2.0.0-cdh4.5.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt assembly, finally get error message : [error] (yarn-stable/compile:compile) Compilation failed, the following is the detail error on console: [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:26: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.YarnClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:40: not found: value YarnClient [error] val yarnClient = YarnClient.createYarnClient [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:32: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:33: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:36: object util is not a member of package org.apache.hadoop.yarn.webapp [error] import 
org.apache.hadoop.yarn.webapp.util.WebAppUtils [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:64: value RM_AM_MAX_ATTEMPTS is not a member of object org.apache.hadoop.yarn.conf.YarnConfiguration [error] YarnConfiguration.RM_AM_MAX_ATTEMPTS, YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:66: not found: type AMRMClient [error] private var amClient: AMRMClient[ContainerRequest] = _ [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:92: not found: value AMRMClient [error] amClient = AMRMClient.createAMRMClient() [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:137: not found: value WebAppUtils [error] val proxy = WebAppUtils.getProxyHostAndPort(conf) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:40: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:618: not found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:596: not found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:577: not found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] 
/Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:410: value CONTAINER_ID is not a member of object org.apache.hadoop.yarn.api.ApplicationConstants.Environment [error] val containerIdString = System.getenv(ApplicationConstants.Environment.CONTAINER_ID.name()) [error] ^
[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200373#comment-14200373 ] Debasish Das commented on SPARK-4231: - [~coderxiang] [~mengxr] [~srowen] I looked at Sean's implementation of the MAP metric for recommendation engines, and it computes the rank of the predicted test set over all user/product predictions... I don't see how I can send the rank vector to the RankingMetrics API right now: http://cloudera.github.io/oryx/xref/com/cloudera/oryx/als/computation/local/ComputeMAP.html Right now, for each user/product I send predictions and labels from the test set to the RankingMetrics API, but there is no rank order defined... I retrieved the predictions using the ALS.predict(userId, productId) API... Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Debasish Das Fix For: 1.2.0 Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for movielens dataset but after addition of RankingMetrics and enhancements to ALS, it is critical to look at not only the RMSE but also measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag that takes an input whether user/product recommendation is being validated.
[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200396#comment-14200396 ] Sean Owen commented on SPARK-4231: -- So this method basically computes where each test item would rank if you asked for a list of recommendations that ranks every single item. It's not necessarily efficient, but is simple. The reason I did it that way was to avoid recreating a lot of the recommender ranking logic. I don't think one has to define MAP this way -- I effectively averaged over all k to the # of items. Yes I found the straightforward definition hard to implement at scale. I ended up opting to compute an approximation of AUC for recommender eval in this next version I'm working on: https://github.com/OryxProject/oryx/blob/master/oryx-ml-mllib/src/main/java/com/cloudera/oryx/ml/mllib/als/AUC.java#L106 Sorry for the hard-to-read Java 7; going to redo this in Java 8 soon. Basically you're just sampling random relevant/not-relevant pairs and comparing their scores. You might consider that. I dunno if it's worth bothering with a toy implementation in the examples. The example is already just to show Spark really not ALS. Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Debasish Das Fix For: 1.2.0 Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for movielens dataset but after addition of RankingMetrics and enhancements to ALS, it is critical to look at not only the RMSE but also measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag that takes an input whether user/product recommendation is being validated. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4275) ./sbt/sbt assembly command fails if path has space in the name
Ravi Kiran created SPARK-4275: - Summary: ./sbt/sbt assembly command fails if path has space in the name Key: SPARK-4275 URL: https://issues.apache.org/jira/browse/SPARK-4275 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Ravi Kiran Priority: Trivial I downloaded branch-1.1 to build Spark from scratch on my Mac. The path had a space in it, like /Users/rkgurram/VirtualBox VMs/SPARK/spark-branch-1.1. 1) I cd to the above directory. 2) Ran ./sbt/sbt assembly. The command fails with weird messages.
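The failure mode here is classic shell word-splitting: an unquoted variable expansion inside a launcher script breaks a path containing a space into two arguments. A small hedged Python illustration of the quoting difference (the path mirrors the one in the report):

```python
import shlex

path = "/Users/rkgurram/VirtualBox VMs/SPARK/spark-branch-1.1"

# What an unquoted $path expansion effectively does: word-splitting
# turns one path into two bogus arguments.
naive_tokens = path.split()

# What a properly quoted "$path" preserves: a single argument.
quoted_tokens = shlex.split(shlex.quote(path))
```

The usual fix in scripts like sbt/sbt is to quote every path expansion ("$FWDIR" style) so the space survives as part of one argument.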
[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200426#comment-14200426 ] Debasish Das commented on SPARK-4231: - [~srowen] I need a standard metric to report the ALS enhancements proposed in this PR: https://github.com/apache/spark/pull/2705 and more work that we are doing in this direction... The metric should be consistent for topic modeling using LSA as well, since the PR can solve LSA/PLSA... I have not yet gone into measures like perplexity from the LDA PRs, as they are even more complicated, but MAP, prec@k and ndcg@k are the measures people report LDA results with as well... Does the MAP definition I used in this PR look correct to you? Let me look into the AUC example... Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Debasish Das Fix For: 1.2.0 Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for movielens dataset but after addition of RankingMetrics and enhancements to ALS, it is critical to look at not only the RMSE but also measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag that takes an input whether user/product recommendation is being validated.
[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200470#comment-14200470 ] Sean Owen commented on SPARK-4231: -- Yes I'm mostly questioning implementing this in examples. The definition in RankingMetrics looks like the usual one to me -- average from 1 to min(# recs, # relevant items). You could say the version you found above is 'extended' to look into the long tail (# recs = # items), although the long tail doesn't affect MAP much. Same definition, different limit. precision@k does not have the same question since there is one k value, not lots. AUC may not help you if you're comparing to other things for which you don't have AUC. It was a side comment mostly. (Anyway there is already an AUC implementation here which I am trying to see if I can use.) Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Debasish Das Fix For: 1.2.0 Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for movielens dataset but after addition of RankingMetrics and enhancements to ALS, it is critical to look at not only the RMSE but also measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag that takes an input whether user/product recommendation is being validated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
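The two MAP variants being compared in this thread differ only in the cut-off: the usual definition averages precision at each hit up to min(#recommendations, #relevant), while the "extended" version ranks every single item. A hedged pure-Python sketch of the usual definition (not the MLlib implementation):

```python
def average_precision(recommended, relevant):
    """Average precision over a ranked recommendation list,
    normalized by min(#relevant, #recommended)."""
    relevant = set(relevant)
    hits, score = 0, 0.0
    for i, item in enumerate(recommended):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)        # precision@(i+1) at each hit
    denom = min(len(relevant), len(recommended))
    return score / denom if denom else 0.0

def mean_average_precision(recommendations, labels):
    pairs = list(zip(recommendations, labels))
    return sum(average_precision(r, t) for r, t in pairs) / len(pairs)
```

Extending the recommendation list toward the full item catalog only adds low-precision tail terms, which is why, as noted above, the long tail barely moves MAP.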
[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200491#comment-14200491 ] shane knapp commented on SPARK-4216: Yes, the sparkQA OAuth token was set inside the top-level ghprb settings. The side effect of this was that projects like tachyon had sparkQA posting on their projects and not amplab jenkins. Eliminate duplicate Jenkins GitHub posts from AMPLab Key: SPARK-4216 URL: https://issues.apache.org/jira/browse/SPARK-4216 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor * [Real Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873361] * [Imposter Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873366]
[jira] [Resolved] (SPARK-644) Jobs canceled due to repeated executor failures may hang
[ https://issues.apache.org/jira/browse/SPARK-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-644. - Resolution: Fixed Jobs canceled due to repeated executor failures may hang Key: SPARK-644 URL: https://issues.apache.org/jira/browse/SPARK-644 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.6.1 Reporter: Josh Rosen Assignee: Josh Rosen In order to prevent an infinite loop, the standalone master aborts jobs that experience more than 10 executor failures (see https://github.com/mesos/spark/pull/210). Currently, the master crashes when aborting jobs (this is the issue that uncovered SPARK-643). If we fix the crash, which involves removing a {{throw}} from the actor's {{receive}} method, then these failures can lead to a hang because they cause the job to be removed from the master's scheduler, but the upstream scheduler components aren't notified of the failure and will wait for the job to finish. I've considered fixing this by adding additional callbacks to propagate the failure to the higher-level schedulers. It might be cleaner to move the decision to abort the job into the higher-level layers of the scheduler, sending an {{AbortJob(jobId)}} method to the Master. The Client is already notified of executor state changes, so it may be able to make the decision to abort (or defer that decision to a higher layer). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-643) Standalone master crashes during actor restart
[ https://issues.apache.org/jira/browse/SPARK-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-643. - Resolution: Fixed Standalone master crashes during actor restart -- Key: SPARK-643 URL: https://issues.apache.org/jira/browse/SPARK-643 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.6.1 Reporter: Josh Rosen Assignee: Josh Rosen The standalone master will crash if it restarts due to an exception: {code} 12/12/15 03:10:47 ERROR master.Master: Job SkewBenchmark wth ID job-20121215031047- failed 11 times. spark.SparkException: Job SkewBenchmark wth ID job-20121215031047- failed 11 times. at spark.deploy.master.Master$$anonfun$receive$1.apply(Master.scala:103) at spark.deploy.master.Master$$anonfun$receive$1.apply(Master.scala:62) at akka.actor.Actor$class.apply(Actor.scala:318) at spark.deploy.master.Master.apply(Master.scala:17) at akka.actor.ActorCell.invoke(ActorCell.scala:626) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197) at akka.dispatch.Mailbox.run(Mailbox.scala:179) at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516) at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259) at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975) at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479) at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) 12/12/15 03:10:47 INFO master.Master: Starting Spark master at spark://ip-10-226-87-193:7077 12/12/15 03:10:47 INFO io.IoWorker: IoWorker thread 'spray-io-worker-1' started 12/12/15 03:10:47 ERROR master.Master: Failed to create web UI akka.actor.InvalidActorNameException:actor name HttpServer is not unique! 
[05aed000-4665-11e2-b361-12313d316833] at akka.actor.ActorCell.actorOf(ActorCell.scala:392) at akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.liftedTree1$1(ActorRefProvider.scala:394) at akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.apply(ActorRefProvider.scala:394) at akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.apply(ActorRefProvider.scala:392) at akka.actor.Actor$class.apply(Actor.scala:318) at akka.actor.LocalActorRefProvider$Guardian.apply(ActorRefProvider.scala:388) at akka.actor.ActorCell.invoke(ActorCell.scala:626) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197) at akka.dispatch.Mailbox.run(Mailbox.scala:179) at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516) at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259) at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975) at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479) at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) {code} When the Master actor restarts, Akka calls the {{postRestart}} hook. [By default|http://doc.akka.io/docs/akka/snapshot/general/supervision.html#supervision-restart], this calls {{preStart}}. The standalone master's {{preStart}} method tries to start the webUI but crashes because it is already running. I ran into this after a job failed more than 11 times, which causes the Master to throw a SparkException from its {{receive}} method. The solution is to implement a custom {{postRestart}} hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4276) Spark streaming requires at least two working thread
varun sharma created SPARK-4276: --- Summary: Spark streaming requires at least two working thread Key: SPARK-4276 URL: https://issues.apache.org/jira/browse/SPARK-4276 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: varun sharma Fix For: 1.1.0 Spark streaming requires at least two working threads. But the example in spark/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala has // Create the context with a 1 second batch size val sparkConf = new SparkConf().setAppName("NetworkWordCount") val ssc = new StreamingContext(sparkConf, Seconds(1)) which creates only one thread. It should have at least 2 threads: http://spark.apache.org/docs/latest/streaming-programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
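The underlying starvation is not Spark-specific. As an illustration (plain Python threads and a hypothetical starvation_demo helper, not the Spark scheduler): with a single worker slot, a long-running receiver occupies the only slot and batch processing never runs; with two slots, both make progress. In the Scala example itself, the local fix is a master with at least two threads, e.g. setMaster("local[2]").

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def starvation_demo(num_threads):
    """Return True if the 'processing' task ran while the receiver held a slot."""
    stop = threading.Event()
    processed = threading.Event()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        pool.submit(stop.wait)             # long-running receiver occupies a slot
        pool.submit(processed.set)         # batch processing wants a slot too
        ran = processed.wait(timeout=1.0)  # starves if the receiver has the only slot
        stop.set()                         # let the receiver finish
    return ran

assert starvation_demo(1) is False   # like local[1]: processing never runs
assert starvation_demo(2) is True    # like local[2]: both make progress
```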
[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200514#comment-14200514 ] Matei Zaharia commented on SPARK-677: - [~joshrosen] is this fixed now? PySpark should not collect results through local filesystem --- Key: SPARK-677 URL: https://issues.apache.org/jira/browse/SPARK-677 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 0.7.0 Reporter: Josh Rosen Py4J is slow when transferring large arrays, so PySpark currently dumps data to the disk and reads it back in order to collect() RDDs. On large enough datasets, this data will spill from the buffer cache and write to the physical disk, resulting in terrible performance. Instead, we should stream the data from Java to Python over a local socket or a FIFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
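The proposed fix, streaming records over a local socket instead of a temp file, can be sketched as follows. This is an illustrative Python sketch with hypothetical helper names (send_objs/recv_objs), not PySpark's actual implementation: records are length-prefixed, pickled, and streamed over a socketpair, with a zero-length frame as the end-of-stream marker.

```python
import pickle
import socket
import struct
import threading

def _read_exact(sock, n):
    # Read exactly n bytes from the socket.
    buf = b""
    while len(buf) < n:
        buf += sock.recv(n - len(buf))
    return buf

def send_objs(sock, objs):
    # Stream length-prefixed pickled records, never touching the filesystem.
    for obj in objs:
        payload = pickle.dumps(obj)
        sock.sendall(struct.pack(">I", len(payload)) + payload)
    sock.sendall(struct.pack(">I", 0))  # end-of-stream marker

def recv_objs(sock):
    out = []
    while True:
        n = struct.unpack(">I", _read_exact(sock, 4))[0]
        if n == 0:
            return out
        out.append(pickle.loads(_read_exact(sock, n)))

a, b = socket.socketpair()
t = threading.Thread(target=send_objs, args=(a, [(1, 2), (3, 4)]))
t.start()
result = recv_objs(b)
t.join()
assert result == [(1, 2), (3, 4)]
```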
[jira] [Resolved] (SPARK-681) Optimize hashtables used in Spark
[ https://issues.apache.org/jira/browse/SPARK-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-681. - Resolution: Fixed Optimize hashtables used in Spark - Key: SPARK-681 URL: https://issues.apache.org/jira/browse/SPARK-681 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia The hash tables used in cogroup, join, etc take up a lot more space than they need to because they're using linked data structures. It would be nice to write a custom open hashtable class to use instead, especially since these tables are append-only. A custom one would likely run better than fastutil as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
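The append-only open-addressing idea can be sketched as below (illustrative Python with a hypothetical AppendOnlyMap name; Spark's eventual class of the same name follows a similar design): keys and values sit in flat parallel arrays with linear probing, so there are no per-entry linked nodes to allocate or chase.

```python
class AppendOnlyMap:
    """Open-addressing hash map (linear probing) for append-only workloads."""
    def __init__(self):
        self._keys = [None] * 8
        self._vals = [None] * 8
        self._size = 0

    def _slot(self, key):
        i = hash(key) % len(self._keys)
        while self._keys[i] is not None and self._keys[i] != key:
            i = (i + 1) % len(self._keys)  # linear probing
        return i

    def merge(self, key, value, combine):
        # Insert, or combine with the existing value (cogroup/join style).
        if (self._size + 1) * 2 > len(self._keys):
            self._grow()
        i = self._slot(key)
        if self._keys[i] is None:
            self._keys[i], self._vals[i] = key, value
            self._size += 1
        else:
            self._vals[i] = combine(self._vals[i], value)

    def _grow(self):
        # Double the flat arrays and rehash; no per-entry nodes involved.
        old_keys, old_vals = self._keys, self._vals
        self._keys = [None] * (len(old_keys) * 2)
        self._vals = [None] * len(self._keys)
        for k, v in zip(old_keys, old_vals):
            if k is not None:
                i = self._slot(k)
                self._keys[i], self._vals[i] = k, v

    def items(self):
        return {k: v for k, v in zip(self._keys, self._vals) if k is not None}

m = AppendOnlyMap()
for k, v in [("a", 1), ("b", 2), ("a", 3)]:
    m.merge(k, v, lambda x, y: x + y)
assert m.items() == {"a": 4, "b": 2}
```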
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200519#comment-14200519 ] Aaron Davidson commented on SPARK-2468: --- [~zzcclp] Use of epoll mode is highly dependent on your environment, and I personally would not recommend it due to known netty bugs which may cause it to be less stable. We have found nio mode to be sufficiently performant in our testing (and netty actually still tries to use epoll if it's available as its selector). [~lianhuiwang] Could you please elaborate on what you mean? Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and context-switches between kernel and user space. It also creates unnecessary buffers in the JVM, which increases GC pressure. Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty-based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
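The zero-copy path that FileChannel.transferTo provides on the JVM has a rough Python analogue in socket.sendfile (backed by os.sendfile where the platform supports it, with a plain-send fallback), which makes the idea easy to demonstrate. A small sketch, with a temp file standing in for a shuffle block:

```python
import os
import socket
import tempfile

# Write a small "shuffle block" to disk.
blk = tempfile.NamedTemporaryFile(delete=False)
blk.write(b"shuffle-bytes" * 100)
blk.close()

server, client = socket.socketpair()
with open(blk.name, "rb") as f:
    # socket.sendfile uses os.sendfile where available: the kernel moves
    # file pages straight to the socket, skipping the user-space copy
    # (the same idea as FileChannel.transferTo on the JVM).
    sent = server.sendfile(f)
server.close()

received = b""
while True:
    chunk = client.recv(65536)
    if not chunk:
        break
    received += chunk
client.close()
os.unlink(blk.name)

assert sent == 1300 and received == b"shuffle-bytes" * 100
```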
[jira] [Commented] (SPARK-4188) Shuffle fetches should be retried at a lower level
[ https://issues.apache.org/jira/browse/SPARK-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200527#comment-14200527 ] Apache Spark commented on SPARK-4188: - User 'aarondav' has created a pull request for this issue: https://github.com/apache/spark/pull/3101 Shuffle fetches should be retried at a lower level -- Key: SPARK-4188 URL: https://issues.apache.org/jira/browse/SPARK-4188 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson During periods of high network (or GC) load, it is not uncommon that IOExceptions crop up around connection failures when fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files, which causes us to mark the executor as lost and recompute all of its shuffle outputs. We should allow retrying at the network level in the event of an IOException in order to avoid this circumstance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
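The proposed behavior can be sketched as a transport-level retry wrapper. The names here (fetch_with_retry, flaky_fetch) are hypothetical and the real change lives in Spark's network layer, but the idea is the same: swallow a bounded number of IOExceptions before escalating to an executor-lost decision.

```python
import time

def fetch_with_retry(fetch, max_retries=3, backoff_s=0.0):
    """Retry a flaky fetch a few times before declaring the executor lost.
    `fetch` is any zero-arg callable that may raise IOError."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except IOError:
            if attempt == max_retries:
                raise               # only now escalate to "executor lost"
            time.sleep(backoff_s)   # brief pause before the retry

calls = {"n": 0}
def flaky_fetch():
    # Fails twice (transient connection errors), then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("connection reset while fetching shuffle block")
    return b"block-data"

assert fetch_with_retry(flaky_fetch) == b"block-data"
assert calls["n"] == 3
```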
[jira] [Closed] (SPARK-4238) Perform network-level retry of shuffle file fetches
[ https://issues.apache.org/jira/browse/SPARK-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson closed SPARK-4238. - Resolution: Duplicate Perform network-level retry of shuffle file fetches --- Key: SPARK-4238 URL: https://issues.apache.org/jira/browse/SPARK-4238 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical During periods of high network (or GC) load, it is not uncommon that IOExceptions crop up around connection failures when fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files, which causes us to mark the executor as lost and recompute all of its shuffle outputs. We should allow retrying at the network level in the event of an IOException in order to avoid this circumstance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4188) Shuffle fetches should be retried at a lower level
[ https://issues.apache.org/jira/browse/SPARK-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-4188: -- Description: During periods of high network (or GC) load, it is not uncommon that IOExceptions crop up around connection failures when fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files, which causes us to mark the executor as lost and recompute all of its shuffle outputs. We should allow retrying at the network level in the event of an IOException in order to avoid this circumstance. was:Sometimes fetches will fail due to garbage collection pauses or network load. A simple retry could save recomputation of a lot of shuffle data, especially if it's below the task level (i.e., on the level of a single fetch). Shuffle fetches should be retried at a lower level -- Key: SPARK-4188 URL: https://issues.apache.org/jira/browse/SPARK-4188 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson During periods of high network (or GC) load, it is not uncommon that IOExceptions crop up around connection failures when fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files, which causes us to mark the executor as lost and recompute all of its shuffle outputs. We should allow retrying at the network level in the event of an IOException in order to avoid this circumstance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-993) Don't reuse Writable objects in HadoopRDDs by default
[ https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-993. - Resolution: Won't Fix We investigated this for 1.0 but found that many InputFormats behave wrongly if you try to clone the object, so we won't fix it. Don't reuse Writable objects in HadoopRDDs by default - Key: SPARK-993 URL: https://issues.apache.org/jira/browse/SPARK-993 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Right now we reuse them as an optimization, which leads to weird results when you call collect() on a file with distinct items. We should instead make that behavior optional through a flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4238) Perform network-level retry of shuffle file fetches
[ https://issues.apache.org/jira/browse/SPARK-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200526#comment-14200526 ] Aaron Davidson commented on SPARK-4238: --- Oh, whoops. Perform network-level retry of shuffle file fetches --- Key: SPARK-4238 URL: https://issues.apache.org/jira/browse/SPARK-4238 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical During periods of high network (or GC) load, it is not uncommon that IOExceptions crop up around connection failures when fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files, which causes us to mark the executor as lost and recompute all of its shuffle outputs. We should allow retrying at the network level in the event of an IOException in order to avoid this circumstance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-993) Don't reuse Writable objects in HadoopRDDs by default
[ https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200531#comment-14200531 ] Matei Zaharia commented on SPARK-993: - Arun, you'd see this issue if you do collect() or take() and then println. The problem is that the same Text object (for example) is referenced for all records in the dataset. The counts will be okay. Don't reuse Writable objects in HadoopRDDs by default - Key: SPARK-993 URL: https://issues.apache.org/jira/browse/SPARK-993 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Right now we reuse them as an optimization, which leads to weird results when you call collect() on a file with distinct items. We should instead make that behavior optional through a flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
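The aliasing Matei describes is easy to reproduce with any mutable, reused record object. A minimal Python sketch (the Text class here is a hypothetical stand-in for a Hadoop Writable, not the real class):

```python
class Text:
    """Stand-in for a Hadoop Writable: a mutable, reusable record holder."""
    def __init__(self):
        self.value = None
    def set(self, value):
        self.value = value

def read_records_reusing(lines):
    record = Text()                # one shared object, reused per record
    for line in lines:
        record.set(line)
        yield record               # every yield is the SAME object

collected = list(read_records_reusing(["a", "b", "c"]))
# All three references point at the last record read:
assert [t.value for t in collected] == ["c", "c", "c"]

# Per-record processing (counting, etc.) is unaffected, but collect()-style
# materialization needs to copy each value out before the next record:
copied = [t.value for t in read_records_reusing(["a", "b", "c"])]
assert copied == ["a", "b", "c"]
```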
[jira] [Closed] (SPARK-1000) Crash when running SparkPi example with local-cluster
[ https://issues.apache.org/jira/browse/SPARK-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-1000. Resolution: Cannot Reproduce Crash when running SparkPi example with local-cluster - Key: SPARK-1000 URL: https://issues.apache.org/jira/browse/SPARK-1000 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: xiajunluan Assignee: Andrew Or when I run SparkPi with local-cluster[2,2,512], it will throw following exception at the end of job. WARNING: An exception was thrown by an exception handler. java.util.concurrent.RejectedExecutionException at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658) at org.jboss.netty.channel.socket.nio.AbstractNioWorker.start(AbstractNioWorker.java:184) at org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:330) at org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:35) at org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:313) at org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:35) at org.jboss.netty.channel.socket.nio.AbstractNioChannelSink.execute(AbstractNioChannelSink.java:34) at org.jboss.netty.channel.Channels.fireExceptionCaughtLater(Channels.java:504) at org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:47) at org.jboss.netty.channel.Channels.fireChannelOpen(Channels.java:170) at org.jboss.netty.channel.socket.nio.NioClientSocketChannel.init(NioClientSocketChannel.java:79) at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.newChannel(NioClientSocketChannelFactory.java:176) at 
org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.newChannel(NioClientSocketChannelFactory.java:82) at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:213) at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:183) at akka.remote.netty.ActiveRemoteClient$$anonfun$connect$1.apply$mcV$sp(Client.scala:173) at akka.util.Switch.liftedTree1$1(LockUtil.scala:33) at akka.util.Switch.transcend(LockUtil.scala:32) at akka.util.Switch.switchOn(LockUtil.scala:55) at akka.remote.netty.ActiveRemoteClient.connect(Client.scala:158) at akka.remote.netty.NettyRemoteTransport.send(NettyRemoteSupport.scala:153) at akka.remote.RemoteActorRef.$bang(RemoteActorRefProvider.scala:247) at akka.actor.LocalDeathWatch$$anonfun$publish$1.apply(ActorRefProvider.scala:559) at akka.actor.LocalDeathWatch$$anonfun$publish$1.apply(ActorRefProvider.scala:559) at scala.collection.Iterator$class.foreach(Iterator.scala:772) at scala.collection.immutable.VectorIterator.foreach(Vector.scala:648) at scala.collection.IterableLike$class.foreach(IterableLike.scala:73) at scala.collection.immutable.Vector.foreach(Vector.scala:63) at akka.actor.LocalDeathWatch.publish(ActorRefProvider.scala:559) at akka.remote.RemoteDeathWatch.publish(RemoteActorRefProvider.scala:280) at akka.remote.RemoteDeathWatch.publish(RemoteActorRefProvider.scala:262) at akka.actor.ActorCell.doTerminate(ActorCell.scala:701) at akka.actor.ActorCell.handleChildTerminated(ActorCell.scala:747) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:608) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:209) at akka.dispatch.Mailbox.run(Mailbox.scala:178) at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516) at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259) at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975) at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479) at 
akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1023) Remove Thread.sleep(5000) from TaskSchedulerImpl
[ https://issues.apache.org/jira/browse/SPARK-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1023. -- Resolution: Fixed Remove Thread.sleep(5000) from TaskSchedulerImpl Key: SPARK-1023 URL: https://issues.apache.org/jira/browse/SPARK-1023 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Fix For: 1.0.0 This causes the unit tests to take super long. We should figure out why this exists and see if we can lower it or do something smarter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
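The usual replacement for a fixed sleep in tests is a condition- or event-based wait with a generous timeout: it returns as soon as the condition holds instead of always paying the full delay. A sketch of the pattern in Python (illustrative only, not the actual TaskSchedulerImpl change):

```python
import threading
import time

def wait_fixed(ready, delay_s=0.5):
    # Fixed sleep: always pays the full delay, even if ready long before.
    time.sleep(delay_s)
    return ready.is_set()

def wait_event(ready, timeout_s=5.0):
    # Event-based wait: returns as soon as the condition holds.
    return ready.wait(timeout=timeout_s)

ready = threading.Event()
threading.Timer(0.05, ready.set).start()  # condition becomes true shortly

start = time.monotonic()
assert wait_event(ready, timeout_s=5.0)
elapsed = time.monotonic() - start
assert elapsed < 1.0          # returned promptly, not after the full timeout

assert wait_fixed(ready) is True  # correct, but always burns the full delay
```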
[jira] [Resolved] (SPARK-1185) In Spark Programming Guide, Master URLs should mention yarn-client
[ https://issues.apache.org/jira/browse/SPARK-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1185. -- Resolution: Fixed In Spark Programming Guide, Master URLs should mention yarn-client Key: SPARK-1185 URL: https://issues.apache.org/jira/browse/SPARK-1185 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 0.9.0 Reporter: Sandy Pérez González Assignee: Sandy Pérez González It would also be helpful to mention that the reason a host:port isn't required for YARN mode is that it comes from the Hadoop configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200538#comment-14200538 ] Norman He commented on SPARK-2447: -- Hi Ted and Tathagata Das, Would Spark/Spark Streaming consider making HBaseContext a facade to enclose all the simple HBase Get/Put methods? Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? (python may be a different Jira it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (May be in a separate Jira need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4276) Spark streaming requires at least two working thread
[ https://issues.apache.org/jira/browse/SPARK-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200539#comment-14200539 ] Sean Owen commented on SPARK-4276: -- This is basically the same concern addressed already by https://issues.apache.org/jira/browse/SPARK-4040 no? This code can set a master of, say, local[2], but it was my understanding that all examples don't set a master and this is supplied by spark-submit. Then again SPARK-4040 changed the doc example to set a local[2] master. hm. Spark streaming requires at least two working thread Key: SPARK-4276 URL: https://issues.apache.org/jira/browse/SPARK-4276 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: varun sharma Fix For: 1.1.0 Spark streaming requires at least two working threads.But example in spark/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala has // Create the context with a 1 second batch size val sparkConf = new SparkConf().setAppName(NetworkWordCount) val ssc = new StreamingContext(sparkConf, Seconds(1)) which creates only 1 thread. It should have atleast 2 threads: http://spark.apache.org/docs/latest/streaming-programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4276) Spark streaming requires at least two working thread
[ https://issues.apache.org/jira/browse/SPARK-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200540#comment-14200540 ] Apache Spark commented on SPARK-4276: - User 'svar29' has created a pull request for this issue: https://github.com/apache/spark/pull/3141 Spark streaming requires at least two working thread Key: SPARK-4276 URL: https://issues.apache.org/jira/browse/SPARK-4276 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: varun sharma Fix For: 1.1.0 Spark streaming requires at least two working threads.But example in spark/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala has // Create the context with a 1 second batch size val sparkConf = new SparkConf().setAppName(NetworkWordCount) val ssc = new StreamingContext(sparkConf, Seconds(1)) which creates only 1 thread. It should have atleast 2 threads: http://spark.apache.org/docs/latest/streaming-programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-2237) Add ZLIBCompressionCodec code
[ https://issues.apache.org/jira/browse/SPARK-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-2237. Resolution: Won't Fix Add ZLIBCompressionCodec code - Key: SPARK-2237 URL: https://issues.apache.org/jira/browse/SPARK-2237 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Yanjie Gao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2348) In Windows, having an environment variable named 'classpath' gives an error
[ https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2348: - Priority: Critical (was: Major) In Windows, having an environment variable named 'classpath' gives an error --- Key: SPARK-2348 URL: https://issues.apache.org/jira/browse/SPARK-2348 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Windows 7 Enterprise Reporter: Chirag Todarka Assignee: Chirag Todarka Priority: Critical Operating System: Windows 7 Enterprise If an environment variable named 'classpath' is set, then starting 'spark-shell' gives the error below: mydir\spark\binspark-shell Failed to initialize compiler: object scala.runtime in compiler mirror not found . ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler acces sed before init set up. Assuming no postInit code. Failed to initialize compiler: object scala.runtime in compiler mirror not found . ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. Exception in thread main java.lang.AssertionError: assertion failed: null at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.sca la:202) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(Spar kILoop.scala:929) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop. scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop. 
scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClass Loader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200554#comment-14200554 ] Ted Malaska commented on SPARK-2447: Hey Norman, Totally agree. TD and I talked about SparkOnHBase at Hadoop World. Times were crazy leading up to Hadoop World. So I'm doing the following things: 1. I'm writing up a Blog for SparkOnHBase 2. TD is working on directions for how this code should be integrated with Spark 3. I have been working out little bugs with Java integration 4. I want to build a couple more examples 5. I'm having a problem with Maven where the Java JUnits are not executing 6. I'm adding support for Kerberos But yes, the facade is coming. :) Let me know if you want to help. Just do a pull request on https://github.com/tmalaska/SparkOnHBase Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? 
(python may be a different Jira it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (May be in a separate Jira need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4276) Spark streaming requires at least two working thread
[ https://issues.apache.org/jira/browse/SPARK-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200571#comment-14200571 ] varun sharma commented on SPARK-4276: - [~srowen] Yeah, right. I spent a lot of time figuring this out, as the example in the repo and in the documentation are different. I now understand the concept of resource starvation here. Thanks for pointing it out and sorry for the inconvenience. Spark streaming requires at least two working thread Key: SPARK-4276 URL: https://issues.apache.org/jira/browse/SPARK-4276 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: varun sharma Fix For: 1.1.0 Spark streaming requires at least two working threads. But the example in spark/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala has // Create the context with a 1 second batch size val sparkConf = new SparkConf().setAppName("NetworkWordCount") val ssc = new StreamingContext(sparkConf, Seconds(1)) which creates only one thread. It should have at least 2 threads: http://spark.apache.org/docs/latest/streaming-programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200593#comment-14200593 ] Norman He commented on SPARK-2447: -- Yes, I would like to help. Let me start with facade work in scala first. Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But first thoughts is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher through put. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? (python may be a different Jira it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (May be in a separate Jira need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200602#comment-14200602 ] Ted Malaska commented on SPARK-2447: Cool thanks. Connect me at ted.mala...@cloudera.com and I'll set up a webex for some time in the future so we can get this going. Thanks Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But first thoughts is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher through put. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? (python may be a different Jira it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (May be in a separate Jira need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4277) Support external shuffle service on Worker
Aaron Davidson created SPARK-4277: - Summary: Support external shuffle service on Worker Key: SPARK-4277 URL: https://issues.apache.org/jira/browse/SPARK-4277 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson It's less useful to have an external shuffle service on the Spark Standalone Worker than on YARN or Mesos (as executor allocations tend to be more static), but it would be good to help test the code path. It would also make Spark more resilient to particular executor failures. Cool side-feature: When SPARK-4236 is fixed and integrated, the Worker will take care of cleaning up executor directories, which will mean executors terminate more quickly and that we don't leak data if the executor dies forcefully.
[jira] [Resolved] (SPARK-4264) SQL HashJoin induces refCnt = 0 error in ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-4264. --- Resolution: Fixed SQL HashJoin induces refCnt = 0 error in ShuffleBlockFetcherIterator -- Key: SPARK-4264 URL: https://issues.apache.org/jira/browse/SPARK-4264 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Blocker This is because it calls hasNext twice, which unintuitively invokes the completion iterator twice.
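The failure mode Aaron describes is easy to reproduce outside Spark. Below is a hypothetical Python model (not Spark's actual Scala CompletionIterator) of a completion iterator whose cleanup callback is not idempotent: calling hasNext a second time after exhaustion fires the cleanup again, which is exactly how a buffer's refCnt can be decremented to 0 twice. Guarding the callback with a "fired" flag is one fix, shown via the `guard` switch.

```python
class CompletionIterator:
    """Hypothetical Python model of a completion iterator. With
    guard=False, *every* has_next() call on an exhausted iterator
    re-runs the completion callback -- the bug pattern behind the
    refCnt = 0 error. With guard=True the callback runs exactly once."""

    _SENTINEL = object()

    def __init__(self, it, completion, guard=False):
        self._it = iter(it)
        self._completion = completion
        self._guard = guard
        self._fired = False
        self._peeked = self._SENTINEL

    def has_next(self):
        if self._peeked is not self._SENTINEL:
            return True
        self._peeked = next(self._it, self._SENTINEL)
        if self._peeked is self._SENTINEL:
            if not (self._guard and self._fired):
                self._fired = True
                self._completion()  # cleanup, e.g. releasing fetched buffers
            return False
        return True

    def next(self):
        if not self.has_next():
            raise StopIteration
        value, self._peeked = self._peeked, self._SENTINEL
        return value
```

With `guard=False`, two trailing `has_next()` calls run the cleanup twice; with `guard=True` it runs once, no matter how often `has_next()` is re-checked.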
[jira] [Commented] (SPARK-4277) Support external shuffle service on Worker
[ https://issues.apache.org/jira/browse/SPARK-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200631#comment-14200631 ] Apache Spark commented on SPARK-4277: - User 'aarondav' has created a pull request for this issue: https://github.com/apache/spark/pull/3142
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200634#comment-14200634 ] Patrick Wendell commented on SPARK-2447: Hey All, I have a question about this - is there any reason this can't exist as a user library instead of being merged into Spark itself? For utility libraries like this, I could see ones coming for Cassandra, Mongo, etc... I don't see it scaling to put and maintain all of these in the Spark code base. At the same time, however, they are super useful. As an alternative - what if it lived in HBase, similar to e.g. the Hadoop InputFormat implementation?
[jira] [Resolved] (SPARK-4249) A problem of EdgePartitionBuilder in Graphx
[ https://issues.apache.org/jira/browse/SPARK-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-4249. --- Resolution: Fixed Fix Version/s: 1.0.3 1.1.1 1.2.0 Issue resolved by pull request 3138 [https://github.com/apache/spark/pull/3138] A problem of EdgePartitionBuilder in Graphx --- Key: SPARK-4249 URL: https://issues.apache.org/jira/browse/SPARK-4249 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0 Reporter: Cookies Priority: Minor Fix For: 1.2.0, 1.1.1, 1.0.3 https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartitionBuilder.scala#L48 In the function toEdgePartition, the code snippet index.update(srcIds(0), 0) runs while the elements of the array srcIds are not yet initialized and are all 0. The effect is that if the vertex ids don't contain 0, indexSize equals (realIndexSize + 1) and a spurious entry for 0 is added to the `index`. It seems that all versions have this problem.
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200646#comment-14200646 ] Ted Malaska commented on SPARK-2447: I totally understand. I also don't know the answer. That is why I made the GitHub project. Thankfully my employer also sees value in this project, and it will be moving to Cloudera Labs in the coming weeks. All that means is I will have more help supporting it. A blog post and more improvements will be coming in the next few weeks.
[jira] [Resolved] (SPARK-4173) EdgePartitionBuilder uses wrong value for first clustered index
[ https://issues.apache.org/jira/browse/SPARK-4173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-4173. --- Resolution: Fixed Fix Version/s: 1.0.3 1.1.1 1.2.0 Issue resolved by pull request 3138 [https://github.com/apache/spark/pull/3138] EdgePartitionBuilder uses wrong value for first clustered index --- Key: SPARK-4173 URL: https://issues.apache.org/jira/browse/SPARK-4173 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: Ankur Dave Assignee: Ankur Dave Fix For: 1.2.0, 1.1.1, 1.0.3 Lines 48 and 49 in EdgePartitionBuilder reference {{srcIds}} before it has been initialized, causing an incorrect value to be stored for the first cluster. https://github.com/apache/spark/blob/23468e7e96bf047ba53806352558b9d661567b23/graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartitionBuilder.scala#L48-49
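SPARK-4249 and this issue describe the same defect, which can be modeled in a few lines. The sketch below is a simplified, hypothetical Python rendering of the clustered-index construction (not the actual Scala code): the buggy variant seeds the "current source id" from a slot the JVM zero-initializes, so whenever vertex id 0 is absent from the partition the index gains a spurious entry for 0.

```python
def build_index_buggy(src_ids):
    # Mimics toEdgePartition before the fix: index.update(srcIds(0), 0)
    # runs before srcIds is filled, and a zero-initialized slot seeds
    # the "current" source id.
    index = {0: 0}          # spurious unless 0 is a real vertex id
    curr = 0                # seeded from an uninitialized (zeroed) slot
    for i, s in enumerate(src_ids):
        if s != curr:
            curr = s
            index[s] = i    # record where this source id's cluster starts
    return index

def build_index_fixed(src_ids):
    # Fixed version: derive the first cluster from the first real edge.
    index = {}
    curr = None
    for i, s in enumerate(src_ids):
        if s != curr:
            curr = s
            index[s] = i
    return index
```

When 0 is not among the source ids, the buggy index has one extra entry (`realIndexSize + 1`), matching the symptom reported in SPARK-4249; when 0 is present, the two agree, which is why the bug was easy to miss.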
[jira] [Created] (SPARK-4278) SparkSQL job failing with java.lang.ClassCastException
Parviz Deyhim created SPARK-4278: Summary: SparkSQL job failing with java.lang.ClassCastException Key: SPARK-4278 URL: https://issues.apache.org/jira/browse/SPARK-4278 Project: Spark Issue Type: Bug Components: SQL Reporter: Parviz Deyhim The following job fails with a java.lang.ClassCastException. Ideally SparkSQL should have the ability to ignore records that don't conform to the inferred schema. The steps that get me to this error: 1) infer a schema from a small subset of data 2) apply the schema to a larger dataset 3) do a simple join of two datasets sample code: {code}
val sampleJson = sqlContext.jsonRDD(sc.textFile(".../dt=2014-10-10/file.snappy"))
val mydata = sqlContext.jsonRDD(larger_dataset, sampleJson.schema)
mydata.registerTempTable("mytable1")

// other dataset:
val x = sc.textFile(...)
case class Dataset(a: String, state: String, b: String, z: String, c: String, d: String)
val xSchemaRDD = x.map(_.split("\t")).map(f => Dataset(f(0), f(1), f(2), f(3), f(4), f(5)))
xSchemaRDD.registerTempTable("mytable2")
{code}
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:389)
org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:397)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.AbstractTraversable.map(Traversable.scala:105)
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:397)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$4.apply(JsonRDD.scala:410)
scala.Option.map(Option.scala:145)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:409)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:407)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:407)
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:398)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$4.apply(JsonRDD.scala:410)
scala.Option.map(Option.scala:145)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:409)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:407)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:407)
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:398)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$4.apply(JsonRDD.scala:410)
scala.Option.map(Option.scala:145)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:409)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:407)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:407)
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:41)
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:41)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
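The root cause visible in the stack trace is that jsonRDD infers the narrowest type that fits the sampled values and then applies it strictly to the full dataset. Below is a toy Python model of that failure mode (hypothetical helper names, not the actual JsonRDD code): a sample of small integers infers a 32-bit type, and a larger value in the full dataset then fails the cast, analogous to `java.lang.Long cannot be cast to java.lang.Integer`.

```python
INT32_MAX = 2**31 - 1

def infer_int_type(sample):
    """Infer the narrowest integer type that fits the sampled values,
    the way schema inference over a small subset behaves."""
    return "int" if all(abs(v) <= INT32_MAX for v in sample) else "long"

def enforce_type(value, typ):
    """Strictly apply the inferred type; a wider value fails, analogous
    to the Long -> Integer cast error in JsonRDD.enforceCorrectType."""
    if typ == "int" and abs(value) > INT32_MAX:
        raise TypeError(f"{value} cannot be cast to int")
    return value
```

Inferring from a sample that happens to include a long-valued record, or declaring the wider type up front instead of reusing `sampleJson.schema`, avoids the cast failure.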
[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200686#comment-14200686 ] Josh Rosen commented on SPARK-677: -- No, it's still an issue in 1.2.0: {code}
def collect(self):
    """
    Return a list that contains all of the elements in this RDD.
    """
    with SCCallSiteSync(self.context) as css:
        bytesInJava = self._jrdd.collect().iterator()
    return list(self._collect_iterator_through_file(bytesInJava))

def _collect_iterator_through_file(self, iterator):
    # Transferring lots of data through Py4J can be slow because
    # socket.readline() is inefficient. Instead, we'll dump the data to a
    # file and read it back.
    tempFile = NamedTemporaryFile(delete=False, dir=self.ctx._temp_dir)
    tempFile.close()
    self.ctx._writeToFile(iterator, tempFile.name)
    # Read the data into Python and deserialize it:
    with open(tempFile.name, 'rb') as tempFile:
        for item in self._jrdd_deserializer.load_stream(tempFile):
            yield item
    os.unlink(tempFile.name)
{code}
PySpark should not collect results through local filesystem --- Key: SPARK-677 URL: https://issues.apache.org/jira/browse/SPARK-677 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 0.7.0 Reporter: Josh Rosen Py4J is slow when transferring large arrays, so PySpark currently dumps data to the disk and reads it back in order to collect() RDDs. On large enough datasets, this data will spill from the buffer cache and write to the physical disk, resulting in terrible performance. Instead, we should stream the data from Java to Python over a local socket or a FIFO.
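The fix the issue proposes, streaming serialized records over a local socket instead of bouncing them through a temp file, can be sketched in pure Python. This is a hypothetical illustration only: a thread plays the JVM side, and pickle stands in for PySpark's real serializers.

```python
import pickle
import socket
import threading

def stream_collect(items):
    """Sketch of the SPARK-677 proposal: stream serialized records over
    a local socket pair instead of writing them to a temp file. Here a
    writer thread stands in for the JVM producer."""
    left, right = socket.socketpair()

    def writer():
        with left.makefile("wb") as f:
            for item in items:
                pickle.dump(item, f)     # serialize each record onto the socket
        left.close()                     # half-close so the reader sees EOF

    t = threading.Thread(target=writer)
    t.start()

    out = []
    with right.makefile("rb") as f:
        while True:
            try:
                out.append(pickle.load(f))   # deserialize until EOF
            except EOFError:
                break
    t.join()
    return out
```

The data never touches the filesystem, so nothing spills from the buffer cache to physical disk on large collects.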
[jira] [Created] (SPARK-4279) Implementing TinkerPop on top of GraphX
Brennon York created SPARK-4279: --- Summary: Implementing TinkerPop on top of GraphX Key: SPARK-4279 URL: https://issues.apache.org/jira/browse/SPARK-4279 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Brennon York Priority: Minor [TinkerPop|https://github.com/tinkerpop] is a great abstraction for graph databases and has been implemented across various graph database backends. Has anyone thought about integrating the TinkerPop framework with GraphX to enable GraphX as another backend? Not sure if this has been brought up or not, but would certainly volunteer to spearhead this effort if the community thinks it to be a good idea.
[jira] [Created] (SPARK-4280) In dynamic allocation, add option to never kill executors with cached blocks
Sandy Ryza created SPARK-4280: - Summary: In dynamic allocation, add option to never kill executors with cached blocks Key: SPARK-4280 URL: https://issues.apache.org/jira/browse/SPARK-4280 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Even with the external shuffle service, this is useful in situations like Hive on Spark where a query might require caching some data. We want to be able to give back executors after the job ends, but not during the job if it would delete intermediate results.
[jira] [Commented] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200762#comment-14200762 ] Josh Rosen commented on SPARK-4133: --- We can't remove that StreamingContext constructor since we don't want to break binary compatibility, but we can add a warning / raise an error. I'm working on a patch for this now. PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0 -- Key: SPARK-4133 URL: https://issues.apache.org/jira/browse/SPARK-4133 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: Antonio Jesus Navarro Priority: Blocker Attachments: spark_ex.logs Snappy-related problems found when trying to upgrade an existing Spark Streaming app from 1.0.2 to 1.1.0. We cannot run an existing 1.0.2 Spark app once upgraded to 1.1.0; an IOException is thrown by snappy (parsing_error(2)). {code}
Executor task launch worker-0 DEBUG storage.BlockManager - Getting local block broadcast_0
Executor task launch worker-0 DEBUG storage.BlockManager - Level for block broadcast_0 is StorageLevel(true, true, false, true, 1)
Executor task launch worker-0 DEBUG storage.BlockManager - Getting block broadcast_0 from memory
Executor task launch worker-0 DEBUG storage.BlockManager - Getting local block broadcast_0
Executor task launch worker-0 DEBUG executor.Executor - Task 0's epoch is 0
Executor task launch worker-0 DEBUG storage.BlockManager - Block broadcast_0 not registered locally
Executor task launch worker-0 INFO broadcast.TorrentBroadcast - Started reading broadcast variable 0
sparkDriver-akka.actor.default-dispatcher-4 INFO receiver.ReceiverSupervisorImpl - Registered receiver 0
Executor task launch worker-0 INFO util.RecurringTimer - Started timer for BlockGenerator at time 1414656492400
Executor task launch worker-0 INFO receiver.BlockGenerator - Started BlockGenerator
Thread-87 INFO receiver.BlockGenerator - Started block pushing thread
Executor task launch worker-0 INFO receiver.ReceiverSupervisorImpl - Starting receiver
sparkDriver-akka.actor.default-dispatcher-5 INFO scheduler.ReceiverTracker - Registered receiver for stream 0 from akka://sparkDriver
Executor task launch worker-0 INFO kafka.KafkaReceiver - Starting Kafka Consumer Stream with group: stratioStreaming
Executor task launch worker-0 INFO kafka.KafkaReceiver - Connecting to Zookeeper: node.stratio.com:2181
sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters]
sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters]
sparkDriver-akka.actor.default-dispatcher-6 DEBUG local.LocalActor - [actor] received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters]
sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] handled message (8.442354 ms) StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters]
sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] handled message (8.412421 ms) StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters]
sparkDriver-akka.actor.default-dispatcher-6 DEBUG local.LocalActor - [actor] handled message (8.385471 ms) StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters]
Executor task launch worker-0 INFO utils.VerifiableProperties - Verifying properties
Executor task launch worker-0 INFO utils.VerifiableProperties - Property group.id is overridden to stratioStreaming
Executor task launch worker-0 INFO utils.VerifiableProperties - Property zookeeper.connect is overridden to node.stratio.com:2181
Executor task launch worker-0 INFO utils.VerifiableProperties - Property zookeeper.connection.timeout.ms is overridden to 1
Executor task launch worker-0 INFO broadcast.TorrentBroadcast - Reading broadcast variable 0 took 0.033998997 s
Executor task launch worker-0 INFO consumer.ZookeeperConsumerConnector - [stratioStreaming_ajn-stratio-1414656492293-8ecb3e3a], Connecting to zookeeper instance at node.stratio.com:2181
Executor task launch worker-0 DEBUG zkclient.ZkConnection - Creating new ZookKeeper instance to connect to node.stratio.com:2181.
ZkClient-EventThread-169-node.stratio.com:2181 INFO zkclient.ZkEventThread - Starting ZkClient event thread.
{code}
[jira] [Commented] (SPARK-4280) In dynamic allocation, add option to never kill executors with cached blocks
[ https://issues.apache.org/jira/browse/SPARK-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200785#comment-14200785 ] Andrew Or commented on SPARK-4280: -- Hey [~sandyr], does this require the application to explicitly uncache the RDDs? My concern is that we cache some blocks behind the application's back (e.g. broadcast, streaming blocks) and we don't currently display these on the UI, in which case the application will never remove executors. It seems that some mechanism to unconditionally blow away all the blocks on an executor would be handy.
[jira] [Commented] (SPARK-4280) In dynamic allocation, add option to never kill executors with cached blocks
[ https://issues.apache.org/jira/browse/SPARK-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200801#comment-14200801 ] Sandy Ryza commented on SPARK-4280: --- My thinking was that it would just be based on whether the node reports it's storing blocks. So, e.g., applications wouldn't need to explicitly uncache the RDDs in situations where Spark garbage collects them because their references go away. Ideally we would make a special case for broadcast variables, because they're not unique to any node. Will look into whether there's a good way to do this.
[jira] [Created] (SPARK-4281) Yarn shuffle service jars need to include dependencies
Andrew Or created SPARK-4281: Summary: Yarn shuffle service jars need to include dependencies Key: SPARK-4281 URL: https://issues.apache.org/jira/browse/SPARK-4281 Project: Spark Issue Type: Bug Components: Build, YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Patrick Wendell Priority: Blocker When we package, we only get the jars with the classes. We need to make an assembly jar for the network-yarn module that includes all of its dependencies.
[jira] [Reopened] (SPARK-4148) PySpark's sample uses the same seed for all partitions
[ https://issues.apache.org/jira/browse/SPARK-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-4148: -- Reopening this issue because branch-1.0 is not fixed. PySpark's sample uses the same seed for all partitions -- Key: SPARK-4148 URL: https://issues.apache.org/jira/browse/SPARK-4148 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2, 1.1.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.1.1, 1.2.0 The current way of seed distribution makes the random sequences from partition i and i+1 offset by 1. {code}
In [14]: import random
In [15]: r1 = random.Random(10)
In [16]: r1.randint(0, 1)
Out[16]: 1
In [17]: r1.random()
Out[17]: 0.4288890546751146
In [18]: r1.random()
Out[18]: 0.5780913011344704
In [19]: r2 = random.Random(10)
In [20]: r2.randint(0, 1)
Out[20]: 1
In [21]: r2.randint(0, 1)
Out[21]: 0
In [22]: r2.random()
Out[22]: 0.5780913011344704
{code}
So the second value from partition 1 is the same as the first value from partition 2.
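The overlap Xiangrui demonstrates comes from reusing one base seed and merely advancing the stream per partition; deriving an independent seed per partition removes it. The sketch below models this in Python 3 (hypothetical helper names, modeled on the JIRA description rather than the actual PySpark code; the advance step uses `random()` draws, whereas the transcript above used `randint`, which consumed a `random()` call on Python 2):

```python
import hashlib
import random

def partition_rng_old(seed, split):
    # Old scheme: every partition shares one base seed, merely advanced
    # by `split` draws, so adjacent partitions emit the same sequence
    # shifted by one value.
    r = random.Random(seed)
    for _ in range(split):
        r.random()
    return r

def partition_rng_new(seed, split):
    # Fixed scheme: derive an independent integer seed per partition by
    # hashing (seed, split), so the streams are unrelated.
    h = hashlib.sha256(f"{seed}:{split}".encode()).digest()
    return random.Random(int.from_bytes(h[:8], "big"))
```

With the old scheme, the second value from partition 1 equals the first value from partition 2, exactly the symptom in the report; with the hash-derived seeds the streams no longer overlap.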
[jira] [Updated] (SPARK-4148) PySpark's sample uses the same seed for all partitions
[ https://issues.apache.org/jira/browse/SPARK-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4148: - Fix Version/s: 1.2.0
[jira] [Created] (SPARK-4282) Stopping flag in YarnClientSchedulerBackend should be volatile
Kousuke Saruta created SPARK-4282: - Summary: Stopping flag in YarnClientSchedulerBackend should be volatile Key: SPARK-4282 URL: https://issues.apache.org/jira/browse/SPARK-4282 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Kousuke Saruta In YarnClientSchedulerBackend, a variable named stopping is used as a flag, and it is accessed by multiple threads, so it should be volatile.
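The underlying pattern is a stop flag written by one thread and polled by another; on the JVM the flag needs volatile for cross-thread visibility, while in Python the idiomatic equivalent is threading.Event. A hypothetical sketch of the pattern (class and method names are illustrative, not Spark's):

```python
import threading
import time

class MonitorThread:
    """Illustrative stop-flag pattern. In Scala/Java the boolean would
    need @volatile so the monitoring thread reliably sees the write
    performed by stop() on another thread."""
    def __init__(self):
        self._stopping = threading.Event()
        self.stopped_cleanly = False

    def run(self):
        while not self._stopping.is_set():  # re-reads the flag each iteration
            time.sleep(0.005)
        self.stopped_cleanly = True

    def stop(self):
        self._stopping.set()                # write is visible to the poller

m = MonitorThread()
t = threading.Thread(target=m.run)
t.start()
m.stop()
t.join(timeout=5)
```

Without the visibility guarantee, the JVM poller may keep reading a stale `false` and spin forever, which is why a plain boolean is insufficient there.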
[jira] [Commented] (SPARK-4282) Stopping flag in YarnClientSchedulerBackend should be volatile
[ https://issues.apache.org/jira/browse/SPARK-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201032#comment-14201032 ] Apache Spark commented on SPARK-4282: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/3143
[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201042#comment-14201042 ] Patrick Wendell commented on SPARK-3561: Hey [~ozhurakousky] - as I said before, I just don't think it's a good idea to make the internals of Spark execution pluggable, given our approach to API compatibility. This attempts to open up many internal APIs. One of the main reasons for this proposal was to have better elasticity on YARN. Toward that end, can you try out the new elastic scaling code for YARN (SPARK-3174)? It will ship in Spark 1.2 and be one of the main new features. It integrates nicely with YARN's native shuffle service; in fact, IIRC, the main design of the shuffle service in YARN was specifically for this purpose. In that way, I think elements of this proposal are indeed making it into Spark. In terms of breaking out the initialization of a SparkContext - that is probably a good idea (it's been discussed separately). Allow for pluggable execution contexts in Spark --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large-scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to the Hadoop execution environment - as a non-public API (@Experimental) not exposed to end users of Spark.
The trait will define 6 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
* persist
* unpersist
Each method directly maps to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via a master URL of the form execution-context:foo.bar.MyJobExecutionContext, with a default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to that implementation. An integrator will now have the option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details.
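The plugin mechanism the proposal describes (select a delegate by class name parsed from a master URL, falling back to a default) can be sketched as follows. This is a hypothetical illustration using names from the JIRA description; it is not Spark's actual API, and the single runJob operation stands in for the six proposed methods.

```java
// Hypothetical sketch of the proposed pluggable execution context:
// a delegate interface resolved by reflection from a master URL of the
// form "execution-context:<class>", with a built-in default otherwise.
import java.util.List;
import java.util.function.Function;

interface JobExecutionContext {
    <T, U> U runJob(Iterable<T> data, Function<Iterable<T>, U> func);
}

class DefaultExecutionContext implements JobExecutionContext {
    @Override
    public <T, U> U runJob(Iterable<T> data, Function<Iterable<T>, U> func) {
        return func.apply(data); // default: run the function directly
    }
}

public class ExecutionContextLoader {
    static JobExecutionContext forMasterUrl(String masterUrl) throws Exception {
        String prefix = "execution-context:";
        if (masterUrl.startsWith(prefix)) {
            String className = masterUrl.substring(prefix.length());
            return (JobExecutionContext) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        }
        return new DefaultExecutionContext(); // non-plugin master URLs
    }

    public static void main(String[] args) throws Exception {
        // Resolve the default implementation through the plugin path.
        JobExecutionContext ctx = forMasterUrl("execution-context:DefaultExecutionContext");
        int sum = ctx.runJob(List.of(1, 2, 3), it -> {
            int s = 0;
            for (int x : it) s += x;
            return s;
        });
        System.out.println(sum); // 6
    }
}
```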
[jira] [Created] (SPARK-4283) Spark source code does not correctly import into eclipse
Yang Yang created SPARK-4283: Summary: Spark source code does not correctly import into eclipse Key: SPARK-4283 URL: https://issues.apache.org/jira/browse/SPARK-4283 Project: Spark Issue Type: Bug Components: Build Reporter: Yang Yang Priority: Minor When I import the Spark source into Eclipse, either via mvn eclipse:eclipse followed by importing existing general projects, or by importing existing Maven projects, Eclipse does not recognize the project as a Scala project. I am adding a new plugin so that the import works.
[jira] [Updated] (SPARK-4283) Spark source code does not correctly import into eclipse
[ https://issues.apache.org/jira/browse/SPARK-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Yang updated SPARK-4283: - Attachment: spark_eclipse.diff patch for all the pom files Spark source code does not correctly import into eclipse Key: SPARK-4283 URL: https://issues.apache.org/jira/browse/SPARK-4283 Project: Spark Issue Type: Bug Components: Build Reporter: Yang Yang Priority: Minor Attachments: spark_eclipse.diff When I import the Spark source into Eclipse, either via mvn eclipse:eclipse followed by importing existing general projects, or by importing existing Maven projects, Eclipse does not recognize the project as a Scala project. I am adding a new plugin so that the import works.
[jira] [Commented] (SPARK-3797) Run the shuffle service inside the YARN NodeManager as an AuxiliaryService
[ https://issues.apache.org/jira/browse/SPARK-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201077#comment-14201077 ] Apache Spark commented on SPARK-3797: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/3144 Run the shuffle service inside the YARN NodeManager as an AuxiliaryService -- Key: SPARK-3797 URL: https://issues.apache.org/jira/browse/SPARK-3797 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 1.1.0 Reporter: Patrick Wendell Assignee: Andrew Or Fix For: 1.2.0 It's also worth considering running the shuffle service in a YARN container beside the executor(s) on each node.
[jira] [Created] (SPARK-4284) BinaryClassificationMetrics precision-recall method names should correspond to return types
Joseph K. Bradley created SPARK-4284: Summary: BinaryClassificationMetrics precision-recall method names should correspond to return types Key: SPARK-4284 URL: https://issues.apache.org/jira/browse/SPARK-4284 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor BinaryClassificationMetrics has several methods which work with (recall, precision) pairs, but the method names all use the reversed order ("pr", i.e. precision-recall). This order should be fixed.
[jira] [Created] (SPARK-4286) Support External Shuffle Service with Mesos integration
Timothy Chen created SPARK-4286: --- Summary: Support External Shuffle Service with Mesos integration Key: SPARK-4286 URL: https://issues.apache.org/jira/browse/SPARK-4286 Project: Spark Issue Type: Task Reporter: Timothy Chen With the new external shuffle service added, we also need to make the Mesos integration able to launch the shuffle service and support auto-scaling executors.
[jira] [Created] (SPARK-4285) Transpose RDD[Vector] to column store for ML
Joseph K. Bradley created SPARK-4285: Summary: Transpose RDD[Vector] to column store for ML Key: SPARK-4285 URL: https://issues.apache.org/jira/browse/SPARK-4285 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Priority: Minor For certain ML algorithms, a column store is more efficient than a row store (which is currently used everywhere). E.g., deep decision trees can be faster to train when partitioning by features. Proposal: Provide a method with the following API (probably in util/):
```
def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)]
```
The input Vectors will be data rows/instances, and the output Vectors will be columns/features paired with column/feature indices. **Question**: Is it important to maintain matrix structure? That is, should output Vectors in the same partition be adjacent columns in the matrix?
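The core of the proposed rowToColumnStore is a transpose from row-major records to (column index, column values) pairs. A local, in-memory sketch of that logic (the real proposal operates on a distributed RDD[Vector]; the names here are illustrative):

```java
// Local sketch of the row-to-column transpose behind the proposed
// rowToColumnStore: rows of doubles in, (columnIndex -> column) out.
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RowToColumnStore {
    static Map<Integer, double[]> rowToColumnStore(List<double[]> rows) {
        if (rows.isEmpty()) return Map.of();
        int numCols = rows.get(0).length;
        Map<Integer, double[]> columns = new TreeMap<>();
        for (int j = 0; j < numCols; j++) {
            double[] col = new double[rows.size()];
            for (int i = 0; i < rows.size(); i++) {
                col[i] = rows.get(i)[j]; // j-th feature of the i-th instance
            }
            columns.put(j, col);
        }
        return columns;
    }

    public static void main(String[] args) {
        List<double[]> rows = List.of(
                new double[]{1, 2, 3},
                new double[]{4, 5, 6});
        Map<Integer, double[]> cols = rowToColumnStore(rows);
        System.out.println(Arrays.toString(cols.get(0))); // [1.0, 4.0]
        System.out.println(Arrays.toString(cols.get(2))); // [3.0, 6.0]
    }
}
```

In the distributed setting the interesting part, which this sketch sidesteps, is exactly the question the issue raises: how column chunks are partitioned and whether adjacent columns land in the same partition.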
[jira] [Updated] (SPARK-4286) Support External Shuffle Service with Mesos integration
[ https://issues.apache.org/jira/browse/SPARK-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Chen updated SPARK-4286: Component/s: Mesos Support External Shuffle Service with Mesos integration --- Key: SPARK-4286 URL: https://issues.apache.org/jira/browse/SPARK-4286 Project: Spark Issue Type: Task Components: Mesos Reporter: Timothy Chen With the new external shuffle service added, we also need to make the Mesos integration able to launch the shuffle service and support auto-scaling executors.
[jira] [Commented] (SPARK-4286) Support External Shuffle Service with Mesos integration
[ https://issues.apache.org/jira/browse/SPARK-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201312#comment-14201312 ] Timothy Chen commented on SPARK-4286: - Please assign to me, thanks. Support External Shuffle Service with Mesos integration --- Key: SPARK-4286 URL: https://issues.apache.org/jira/browse/SPARK-4286 Project: Spark Issue Type: Task Components: Mesos Reporter: Timothy Chen With the new external shuffle service added, we also need to make the Mesos integration able to launch the shuffle service and support auto-scaling executors.
[jira] [Updated] (SPARK-4285) Transpose RDD[Vector] to column store for ML
[ https://issues.apache.org/jira/browse/SPARK-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4285: - Issue Type: Sub-task (was: New Feature) Parent: SPARK-3717 Transpose RDD[Vector] to column store for ML Key: SPARK-4285 URL: https://issues.apache.org/jira/browse/SPARK-4285 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Joseph K. Bradley Priority: Minor For certain ML algorithms, a column store is more efficient than a row store (which is currently used everywhere). E.g., deep decision trees can be faster to train when partitioning by features. Proposal: Provide a method with the following API (probably in util/):
```
def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)]
```
The input Vectors will be data rows/instances, and the output Vectors will be columns/features paired with column/feature indices. **Question**: Is it important to maintain matrix structure? That is, should output Vectors in the same partition be adjacent columns in the matrix?
[jira] [Updated] (SPARK-4285) Transpose RDD[Vector] to column store for ML
[ https://issues.apache.org/jira/browse/SPARK-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4285: - Assignee: Joseph K. Bradley Transpose RDD[Vector] to column store for ML Key: SPARK-4285 URL: https://issues.apache.org/jira/browse/SPARK-4285 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor For certain ML algorithms, a column store is more efficient than a row store (which is currently used everywhere). E.g., deep decision trees can be faster to train when partitioning by features. Proposal: Provide a method with the following API (probably in util/):
```
def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)]
```
The input Vectors will be data rows/instances, and the output Vectors will be columns/features paired with column/feature indices. **Question**: Is it important to maintain matrix structure? That is, should output Vectors in the same partition be adjacent columns in the matrix?
[jira] [Updated] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3066: - Target Version/s: (was: 1.2.0) Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can optimize this by: 1) collecting one side (either user or product) and broadcasting it as a matrix; 2) using level-3 BLAS to compute inner products; 3) using Utils.takeOrdered to find the top-k.
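The optimization outline above can be sketched per user. This is a hypothetical helper, not MLlib code: the product-factor matrix plays the role of the broadcast side, the inner loop stands in for the level-3 BLAS call, and a bounded min-heap stands in for Utils.takeOrdered.

```java
// Sketch of per-user top-k recommendation: score a user's factor vector
// against all product factors and keep the k highest-scoring product ids.
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

public class TopKRecommend {
    static int[] recommend(double[] userFactors, double[][] productFactors, int k) {
        // Min-heap over (score, productId): the smallest of the current
        // top-k sits at the head and is evicted first.
        PriorityQueue<double[]> heap =
                new PriorityQueue<>(Comparator.comparingDouble(a -> a[0]));
        for (int p = 0; p < productFactors.length; p++) {
            double score = 0;
            for (int f = 0; f < userFactors.length; f++) {
                score += userFactors[f] * productFactors[p][f]; // inner product
            }
            heap.offer(new double[]{score, p});
            if (heap.size() > k) heap.poll();
        }
        int[] top = new int[heap.size()];
        for (int i = top.length - 1; i >= 0; i--) top[i] = (int) heap.poll()[1];
        return top; // best product first
    }

    public static void main(String[] args) {
        double[] user = {1.0, 0.5};
        double[][] products = {{0.1, 0.1}, {2.0, 1.0}, {1.0, 1.0}};
        // scores: 0.15, 2.5, 1.5 -> top-2 product ids are 1 then 2
        System.out.println(Arrays.toString(recommend(user, products, 2))); // [1, 2]
    }
}
```

Doing this for all users at once is where the real cost lies; the point of the BLAS-3 suggestion is to batch these inner loops into a single matrix-matrix multiply per partition.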
[jira] [Closed] (SPARK-4277) Support external shuffle service on Worker
[ https://issues.apache.org/jira/browse/SPARK-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4277. Resolution: Fixed Fix Version/s: 1.2.0 Support external shuffle service on Worker -- Key: SPARK-4277 URL: https://issues.apache.org/jira/browse/SPARK-4277 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 1.2.0 It's less useful to have an external shuffle service on the Spark Standalone Worker than on YARN or Mesos (as executor allocations tend to be more static), but it would be good to help test the code path. It would also make Spark more resilient to particular executor failures. Cool side-feature: When SPARK-4236 is fixed and integrated, the Worker will take care of cleaning up executor directories, which will mean executors terminate more quickly and that we don't leak data if the executor dies forcefully.
[jira] [Updated] (SPARK-4277) Support external shuffle service on Worker
[ https://issues.apache.org/jira/browse/SPARK-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4277: - Affects Version/s: 1.2.0 Support external shuffle service on Worker -- Key: SPARK-4277 URL: https://issues.apache.org/jira/browse/SPARK-4277 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 1.2.0 It's less useful to have an external shuffle service on the Spark Standalone Worker than on YARN or Mesos (as executor allocations tend to be more static), but it would be good to help test the code path. It would also make Spark more resilient to particular executor failures. Cool side-feature: When SPARK-4236 is fixed and integrated, the Worker will take care of cleaning up executor directories, which will mean executors terminate more quickly and that we don't leak data if the executor dies forcefully.
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201378#comment-14201378 ] zzc commented on SPARK-2468: @Aaron Davidson, Thank you for your recommendation. By the way, can PR #3101 be merged into master today? @Lianhui Wang, I haven't tested it. Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user-space buffer, and then back into a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and context-switches between kernel and user space. It also creates unnecessary buffers in the JVM that increase GC pressure. Instead, we should use FileChannel.transferTo, which handles this in kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty-based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default.
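The zero-copy primitive the issue proposes is java.nio's FileChannel.transferTo. A minimal, self-contained sketch (using a file-to-file copy rather than a socket, purely for illustration): the kernel moves the bytes without round-tripping them through a user-space buffer.

```java
// Illustrative use of FileChannel.transferTo: copy a file's contents to
// another channel without pulling the data into a user-space buffer.
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopyDemo {
    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("block", ".dat");
        Path dst = Files.createTempFile("copy", ".dat");
        Files.write(src, "shuffle block payload".getBytes());

        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            long pos = 0, size = in.size();
            // transferTo may move fewer bytes than requested, so loop.
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
        }
        System.out.println(new String(Files.readAllBytes(dst)));
    }
}
```

In the shuffle-send case the target channel would be a socket channel, which is exactly where transferTo can avoid the kernel/user copies and context switches described above.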
[jira] [Created] (SPARK-4287) Add TTL-based cleanup in external shuffle service
Andrew Or created SPARK-4287: Summary: Add TTL-based cleanup in external shuffle service Key: SPARK-4287 URL: https://issues.apache.org/jira/browse/SPARK-4287 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or A problem with long-running SparkContexts using the external shuffle service is that their shuffle files may never be cleaned up. This is because the ContextCleaner may no longer have executors to go through to clean up these files (they may be killed intentionally). We should have a TTL-based timeout that does this as a backup, defaulting to something like one week.
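A TTL-based backup cleaner of the kind proposed can be sketched as a scan that deletes files whose last-modified time is older than the TTL. This is a hypothetical illustration, not Spark's actual implementation:

```java
// Sketch of a TTL-based backup cleaner: delete files in a directory whose
// last-modified time is older than a configurable TTL.
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

public class TtlCleaner {
    static int cleanOlderThan(Path dir, long ttlMillis) throws IOException {
        long cutoff = System.currentTimeMillis() - ttlMillis;
        int deleted = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)
                        && Files.getLastModifiedTime(f).toMillis() < cutoff) {
                    Files.delete(f);
                    deleted++;
                }
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("shuffle");
        // One file backdated eight days, one fresh file.
        Path oldFile = Files.createFile(dir.resolve("old.data"));
        Files.setLastModifiedTime(oldFile, FileTime.fromMillis(
                System.currentTimeMillis() - 8L * 24 * 60 * 60 * 1000));
        Files.createFile(dir.resolve("fresh.data"));

        // TTL of one week, as the issue suggests for a default.
        int n = cleanOlderThan(dir, 7L * 24 * 60 * 60 * 1000);
        System.out.println(n + " file(s) deleted");
    }
}
```

A real implementation would run this periodically from the shuffle service and scope it to per-application shuffle directories.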
[jira] [Created] (SPARK-4288) Add Sparse Autoencoder algorithm to MLlib
Guoqiang Li created SPARK-4288: -- Summary: Add Sparse Autoencoder algorithm to MLlib Key: SPARK-4288 URL: https://issues.apache.org/jira/browse/SPARK-4288 Project: Spark Issue Type: Bug Components: MLlib Reporter: Guoqiang Li
[jira] [Created] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.
Corey J. Nolet created SPARK-4289: - Summary: Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance. Key: SPARK-4289 URL: https://issues.apache.org/jira/browse/SPARK-4289 Project: Spark Issue Type: Bug Reporter: Corey J. Nolet This one is easy to reproduce.
```
val job = new Job(sc.hadoopConfiguration)
```
I'm not sure what the solution would be off hand as it's happening when the shell is calling toString() on the instance of Job. The problem is, because of the failure, the instance is never actually assigned to the job val.
java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
at org.apache.hadoop.mapreduce.Job.toString(Job.java:452)
at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
at <init>(<console>:10)
at <clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
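The failure mode in SPARK-4289 is a toString() that throws while the object is still in its initial state, which trips up a REPL that eagerly prints every evaluated value. A minimal reproduction with a hypothetical class (not Hadoop's Job):

```java
// Minimal reproduction of the SPARK-4289 failure mode: toString() throws
// until the object reaches RUNNING state, mimicking Hadoop's Job.
public class StatefulToString {
    enum State { DEFINE, RUNNING }
    private State state = State.DEFINE;

    @Override
    public String toString() {
        if (state != State.RUNNING) {
            throw new IllegalStateException(
                    "Job in state " + state + " instead of RUNNING");
        }
        return "Job[RUNNING]";
    }

    public static void main(String[] args) {
        StatefulToString job = new StatefulToString();
        try {
            System.out.println(job); // mimics the REPL echoing the value
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
        job.state = State.RUNNING;
        System.out.println(job);     // safe once running
    }
}
```

In the Spark shell the exception is thrown by the REPL's own value-printing step, so the val is never bound; a common workaround is to wrap the assignment so the REPL has nothing to print.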