[jira] [Resolved] (SPARK-2630) Input data size of CoalescedRDD is incorrect
[ https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-2630. --- Resolution: Fixed Fix Version/s: 1.2.0 Input data size of CoalescedRDD is incorrect Key: SPARK-2630 URL: https://issues.apache.org/jira/browse/SPARK-2630 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Assignee: Andrew Ash Priority: Blocker Fix For: 1.2.0 Attachments: overflow.tiff Given one big file, such as text.4.3G, put it in one task: {code} sc.textFile("text.4.3G").coalesce(1).count() {code} In the Web UI of Spark, you will see that the input size is 5.4M. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
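The numbers in the report, together with the attachment name overflow.tiff, are consistent with a 32-bit overflow: a byte count of roughly 4.3 GB truncated to a signed 32-bit integer leaves only about 5 MB. A hedged sketch of the arithmetic (that the UI sums input sizes into a 32-bit value is an assumption here, not confirmed by the report):

```python
import ctypes

def wrapped_size(nbytes: int) -> int:
    """Truncate a byte count to a signed 32-bit integer, as summing
    into a 32-bit accumulator would."""
    return ctypes.c_int32(nbytes & 0xFFFFFFFF).value

# wrapped_size(4_300_000_000) == 5_032_704, i.e. roughly 5 MB --
# the same order of magnitude as the 5.4M shown in the Web UI.
```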
[jira] [Reopened] (SPARK-2630) Input data size of CoalescedRDD is incorrect
[ https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reopened SPARK-2630: --- not merged yet, sorry.
[jira] [Commented] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...
[ https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157765#comment-14157765 ] Ángel Álvarez commented on SPARK-2256: -- It seems the problem has been solved in Spark 1.1.0 !!! pyspark: RDD.take doesn't work ... sometimes ... -- Key: SPARK-2256 URL: https://issues.apache.org/jira/browse/SPARK-2256 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: local file/remote HDFS Reporter: Ángel Álvarez Labels: RDD, pyspark, take, windows Attachments: A_test.zip If I try to take some lines from a file, sometimes it doesn't work. Code: myfile = sc.textFile("A_ko") print myfile.take(10) Stacktrace: 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19 Traceback (most recent call last): File "mytest.py", line 19, in <module> print myfile.take(10) File "spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py", line 868, in take iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator() File "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py", line 537, in __call__ File "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py", line 300, in get_return_value Test data: START TEST DATA A A A
[jira] [Created] (SPARK-3775) Not suitable error message in spark-shell.cmd
Masayoshi TSUZUKI created SPARK-3775: Summary: Not suitable error message in spark-shell.cmd Key: SPARK-3775 URL: https://issues.apache.org/jira/browse/SPARK-3775 Project: Spark Issue Type: Improvement Reporter: Masayoshi TSUZUKI Priority: Trivial In a Windows environment, when we execute bin\spark-shell.cmd before we build Spark, we get an error message like this: {quote} Failed to find Spark assembly JAR. You need to build Spark with sbt\sbt assembly before running this program. {quote} But this message is not suitable because: * Maven is also available to build Spark, and it now works in Windows without Cygwin ([SPARK-3061]). * The equivalent error message in the Linux version (bin/spark-shell) doesn't mention how to build. bq. You need to build Spark before running this program. * sbt\sbt can't be executed in Windows without Cygwin because it's a bash script. So this message should be modified to match the Linux version.
[jira] [Commented] (SPARK-3775) Not suitable error message in spark-shell.cmd
[ https://issues.apache.org/jira/browse/SPARK-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157818#comment-14157818 ] Apache Spark commented on SPARK-3775: - User 'tsudukim' has created a pull request for this issue: https://github.com/apache/spark/pull/2640
[jira] [Resolved] (SPARK-3366) Compute best splits distributively in decision tree
[ https://issues.apache.org/jira/browse/SPARK-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3366. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2595 [https://github.com/apache/spark/pull/2595] Compute best splits distributively in decision tree --- Key: SPARK-3366 URL: https://issues.apache.org/jira/browse/SPARK-3366 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Qiping Li Fix For: 1.2.0 The current implementation computes all best splits locally on the driver, which makes the driver a bottleneck for both communication and computation. It would be nice if we could compute the best splits distributively.
[jira] [Created] (SPARK-3776) Wrong conversion to Catalyst for Option[Product]
Renat Yusupov created SPARK-3776: Summary: Wrong conversion to Catalyst for Option[Product] Key: SPARK-3776 URL: https://issues.apache.org/jira/browse/SPARK-3776 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Renat Yusupov Fix For: 1.2.0 Method ScalaReflection.convertToCatalyst makes a wrong conversion for Option[Product] data. For example:
{code}
case class A(intValue: Int)
case class B(optionA: Option[A])
val b = B(Some(A(5)))
{code}
ScalaReflection.convertToCatalyst(b) returns Seq(A(5)) instead of Seq(Seq(5)).
[jira] [Commented] (SPARK-3776) Wrong conversion to Catalyst for Option[Product]
[ https://issues.apache.org/jira/browse/SPARK-3776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157913#comment-14157913 ] Apache Spark commented on SPARK-3776: - User 'r3natko' has created a pull request for this issue: https://github.com/apache/spark/pull/2641
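The recursion the reporter expects can be sketched in Python (a stand-in for the Scala ScalaReflection.convertToCatalyst, with dataclasses playing the role of case classes): a product becomes the list of its converted fields, and an optional value is unwrapped before conversion, so an Option wrapping a product recurses into its fields too.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Optional

def convert_to_catalyst(value):
    """Sketch of the expected behavior, not Spark's actual implementation:
    None propagates, a dataclass becomes the list of its converted fields,
    everything else passes through unchanged."""
    if value is None:
        return None
    if is_dataclass(value):
        return [convert_to_catalyst(getattr(value, f.name)) for f in fields(value)]
    return value

@dataclass
class A:
    intValue: int

@dataclass
class B:
    optionA: Optional[A]

# convert_to_catalyst(B(A(5))) yields [[5]] -- the analogue of Seq(Seq(5)),
# rather than the reported Seq(A(5)) where the nested product is left unconverted.
```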
[jira] [Commented] (SPARK-2421) Spark should treat writable as serializable for keys
[ https://issues.apache.org/jira/browse/SPARK-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157936#comment-14157936 ] Brian Husted commented on SPARK-2421: - To work around the problem, one must map the Writable to a String (org.apache.hadoop.io.Text in the case below). This is an issue when sorting large amounts of data, since Spark will attempt to write out the entire dataset (spill) to perform the data conversion. On a 500GB file this fills up more than 100GB of space on each node in our 12 node cluster, which is very inefficient. We are currently using Spark 1.0.2. Any thoughts here are appreciated. Our code that attempts to mimic a map/reduce sort in Spark:
{code}
// read in the hadoop sequence file to sort
val file = sc.sequenceFile(input, classOf[Text], classOf[Text])
// this is the conversion we would like to avoid: mapping the Hadoop Text
// input to Strings so that sortByKey will run
val converted = file.map { case (k, v) => (k.toString, v.toString) }
// perform the sort on the converted data
val sortedOutput = converted.sortByKey(true, 1)
// write out the results as a sequence file
sortedOutput.saveAsSequenceFile(output, Some(classOf[DefaultCodec]))
{code}
Spark should treat writable as serializable for keys Key: SPARK-2421 URL: https://issues.apache.org/jira/browse/SPARK-2421 Project: Spark Issue Type: Improvement Components: Input/Output, Java API Affects Versions: 1.0.0 Reporter: Xuefu Zhang It seems that Spark requires the key to be serializable (the class must implement the Serializable interface). In the Hadoop world, the Writable interface is used for the same purpose. A lot of existing classes, while writable, are not considered by Spark as serializable. It would be nice if Spark could treat Writable as serializable and automatically serialize and de-serialize these classes using the writable interface. This was identified in HIVE-7279, but its benefits are global.
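The idea behind the request can be sketched without any Hadoop dependency: a Writable-style object serializes itself through write/readFields-like methods, so a generic round trip through a byte buffer is all that is needed to ship it between nodes. (MiniText below is a hypothetical stand-in for org.apache.hadoop.io.Text, not a real Hadoop class.)

```python
import io
import struct

class MiniText:
    """Stand-in for Hadoop's Text: it knows how to serialize itself
    via write/read_fields, mirroring the Writable contract."""
    def __init__(self, value: str = ""):
        self.value = value

    def write(self, out) -> None:
        data = self.value.encode("utf-8")
        out.write(struct.pack(">I", len(data)))  # length prefix, big-endian
        out.write(data)

    def read_fields(self, inp) -> None:
        (n,) = struct.unpack(">I", inp.read(4))
        self.value = inp.read(n).decode("utf-8")

def round_trip(t: MiniText) -> MiniText:
    """Serialize via the Writable-style interface and deserialize a copy,
    the way a framework could auto-(de)serialize Writable keys."""
    buf = io.BytesIO()
    t.write(buf)
    buf.seek(0)
    copy = MiniText()
    copy.read_fields(buf)
    return copy
```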
[jira] [Created] (SPARK-3777) Display Executor ID for Tasks in Stage page
Shixiong Zhu created SPARK-3777: --- Summary: Display Executor ID for Tasks in Stage page Key: SPARK-3777 URL: https://issues.apache.org/jira/browse/SPARK-3777 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.1.0, 1.0.2, 1.0.0 Reporter: Shixiong Zhu Priority: Minor Now the Stage page only displays Executor(host) for tasks. However, there may be more than one executor running on the same host. Currently, when a task hangs, I only know the host of the faulty executor, so I have to check all executors on that host. Adding the Executor ID would help locate the faulty executor.
[jira] [Commented] (SPARK-3777) Display Executor ID for Tasks in Stage page
[ https://issues.apache.org/jira/browse/SPARK-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157944#comment-14157944 ] Apache Spark commented on SPARK-3777: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/2642
[jira] [Resolved] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3764. -- Resolution: Not a Problem Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. 
I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the hadoop version declared in {{pom.xml}} remains 1.0.4, so a hadoop version mismatch happens. FYI: sbt seems to publish an 'effective pom'-like pom file, so the dependencies are correctly resolved.
[jira] [Commented] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path
[ https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157946#comment-14157946 ] Tom Weber commented on SPARK-3769: -- I believe I originally called it on the driver side, but the addFile call makes a local copy, so when you call it there, you get the local copy path, which isn't the same path as where it ends up on the remote worker nodes. I'm good with stripping the path off and only passing the file name itself to the get call. SparkFiles.get gives me the wrong fully qualified path -- Key: SPARK-3769 URL: https://issues.apache.org/jira/browse/SPARK-3769 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0 Environment: linux host, and linux grid. Reporter: Tom Weber Priority: Minor My Spark program runs on my host, submitting work to my grid. JavaSparkContext sc = new JavaSparkContext(conf); final String path = args[1]; sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */ The log shows: 14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to /tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas 14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986 Those are paths on my host machine. The location this file gets on the grid nodes is: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas while the call to get the path in my code, which runs in my mapPartitions function on the grid nodes, is: String pgm = SparkFiles.get(path); and this returns the following string: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas So, am I expected to take the qualified path that was given to me and parse it to get only the file name at the end, and then concatenate that with what I get from the SparkFiles.getRootDirectory() call in order to get this to work?
Or pass only the parsed file name to the SparkFiles.get method? It seems as though I should be able to pass the same file specification to both sc.addFile() and SparkFiles.get() and get the correct location of the file. Thanks, Tom
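The workaround discussed above can be sketched as follows: since addFile publishes the file under its base name, only that name is passed to SparkFiles.get rather than the full driver-side path. (spark_files_name is a hypothetical helper, not part of any Spark API.)

```python
import os.path

def spark_files_name(driver_path: str) -> str:
    """Strip the driver-local directories, keeping only the base file name
    that addFile distributes the file under."""
    return os.path.basename(driver_path)

# spark_files_name("/opt/tom/SparkFiles.sas") == "SparkFiles.sas"
# ...which would then be passed to SparkFiles.get on the workers.
```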
[jira] [Created] (SPARK-3778) newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
Thomas Graves created SPARK-3778: Summary: newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn Key: SPARK-3778 URL: https://issues.apache.org/jira/browse/SPARK-3778 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Thomas Graves The newAPIHadoopRDD routine doesn't properly add the credentials to the conf to be able to access secure hdfs. Note that newAPIHadoopFile does handle these because the org.apache.hadoop.mapreduce.Job automatically adds it for you.
[jira] [Closed] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...
[ https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-2256. Resolution: Fixed Fix Version/s: 1.1.0
[jira] [Created] (SPARK-3779) yarn spark.yarn.applicationMaster.waitTries config should be changed to a time period
Thomas Graves created SPARK-3779: Summary: yarn spark.yarn.applicationMaster.waitTries config should be changed to a time period Key: SPARK-3779 URL: https://issues.apache.org/jira/browse/SPARK-3779 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves In PR https://github.com/apache/spark/pull/2577 I added support for using spark.yarn.applicationMaster.waitTries in client mode. But the time it waits between loops is different, so it could be confusing to the user. We also don't document how long each loop takes, so this config really isn't clear. We should just change this config to be time-based, in ms or seconds.
[jira] [Created] (SPARK-3780) YarnAllocator should look at the container completed diagnostic message
Thomas Graves created SPARK-3780: Summary: YarnAllocator should look at the container completed diagnostic message Key: SPARK-3780 URL: https://issues.apache.org/jira/browse/SPARK-3780 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves Yarn will give us a diagnostic message along with a container complete notification. We should print that diagnostic message for the Spark user. For instance, I believe that if the container gets killed for being over its memory limit, YARN would give us a useful diagnostic saying that. This would be really useful for the user to be able to see.
[jira] [Created] (SPARK-3781) code style format
sjk created SPARK-3781: -- Summary: code style format Key: SPARK-3781 URL: https://issues.apache.org/jira/browse/SPARK-3781 Project: Spark Issue Type: Improvement Reporter: sjk
[jira] [Commented] (SPARK-3781) code style format
[ https://issues.apache.org/jira/browse/SPARK-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158074#comment-14158074 ] Apache Spark commented on SPARK-3781: - User 'shijinkui' has created a pull request for this issue: https://github.com/apache/spark/pull/2644 code style format - Key: SPARK-3781 URL: https://issues.apache.org/jira/browse/SPARK-3781 Project: Spark Issue Type: Improvement Reporter: sjk
[jira] [Commented] (SPARK-3783) The type parameters for SparkContext.accumulable are inconsistent with Accumulable itself
[ https://issues.apache.org/jira/browse/SPARK-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158176#comment-14158176 ] Nathan Kronenfeld commented on SPARK-3783: -- https://github.com/apache/spark/pull/2637 The type parameters for SparkContext.accumulable are inconsistent with Accumulable itself Key: SPARK-3783 URL: https://issues.apache.org/jira/browse/SPARK-3783 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Nathan Kronenfeld Priority: Minor Original Estimate: 10m Remaining Estimate: 10m SparkContext.accumulable takes type parameters [T, R] and passes them to Accumulable in that order. Accumulable takes type parameters [R, T]. So T for SparkContext.accumulable corresponds to R for Accumulable and vice versa. Minor, but very confusing.
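The mismatch can be illustrated with a simplified, dependency-free analogue (these are sketch classes, not Spark's real ones): Acc below takes [R, T] with R the accumulated type and T the element type, while the factory mirrors SparkContext.accumulable's [T, R] order, so its T plays the role of Acc's R and vice versa.

```python
from typing import Callable, Generic, TypeVar

R = TypeVar("R")
T = TypeVar("T")

class Acc(Generic[R, T]):
    """Analogue of Accumulable[R, T]: R is the accumulated value's type,
    T is the type of elements added in."""
    def __init__(self, value: R, add: Callable[[R, T], R]):
        self.value = value
        self._add = add

    def add(self, elem: T) -> None:
        self.value = self._add(self.value, elem)

def accumulable(initial: T, add: Callable[[T, R], T]) -> "Acc[T, R]":
    # Mirrors SparkContext.accumulable's [T, R] order: the name T here
    # actually plays the role of Acc's R (and R the role of Acc's T),
    # which is exactly the confusion the report describes.
    return Acc(initial, add)
```

Reading `accumulable` against `Acc`, the swapped names make the factory's signature look backwards even though the code is correct, matching the "minor, but very confusing" assessment.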
[jira] [Created] (SPARK-3785) Support off-loading computations to a GPU
Thomas Darimont created SPARK-3785: -- Summary: Support off-loading computations to a GPU Key: SPARK-3785 URL: https://issues.apache.org/jira/browse/SPARK-3785 Project: Spark Issue Type: Brainstorming Components: MLlib Reporter: Thomas Darimont Priority: Minor Are there any plans to add support for off-loading computations to the GPU, e.g. via an OpenCL binding? http://www.jocl.org/ https://code.google.com/p/javacl/ http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL
[jira] [Closed] (SPARK-2058) SPARK_CONF_DIR should override all present configs
[ https://issues.apache.org/jira/browse/SPARK-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-2058. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 SPARK_CONF_DIR should override all present configs -- Key: SPARK-2058 URL: https://issues.apache.org/jira/browse/SPARK-2058 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.0.0, 1.0.1, 1.1.0 Reporter: Eugen Cepoi Assignee: Eugen Cepoi Priority: Critical Fix For: 1.1.1, 1.2.0 When the user defines SPARK_CONF_DIR, I think Spark should use all the configs available there, not only spark-env. This involves changing SparkSubmitArguments to first read from SPARK_CONF_DIR, and updating the scripts to add SPARK_CONF_DIR to the computed classpath for configs such as log4j, metrics, etc. I have already prepared a PR for this.
[jira] [Updated] (SPARK-2058) SPARK_CONF_DIR should override all present configs
[ https://issues.apache.org/jira/browse/SPARK-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2058: - Assignee: Eugen Cepoi
[jira] [Commented] (SPARK-3706) Cannot run IPython REPL with IPYTHON set to 1 and PYSPARK_PYTHON unset
[ https://issues.apache.org/jira/browse/SPARK-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158200#comment-14158200 ] Josh Rosen commented on SPARK-3706: --- This introduced a problem, since after this patch we now use IPython on the workers; see SPARK-3772 for more details. Cannot run IPython REPL with IPYTHON set to 1 and PYSPARK_PYTHON unset Key: SPARK-3706 URL: https://issues.apache.org/jira/browse/SPARK-3706 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0 Reporter: cocoatomo Labels: pyspark Fix For: 1.2.0 h3. Problem The section Using the shell in the Spark Programming Guide (https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell) says that we can run the pyspark REPL through IPython. But the following command does not run IPython but the default Python executable. {quote} $ IPYTHON=1 ./bin/pyspark Python 2.7.8 (default, Jul 2 2014, 10:14:46) ... {quote} The spark/bin/pyspark script at commit b235e013638685758885842dc3268e9800af3678 decides which executable and options to use in the following way.
# if PYSPARK_PYTHON is unset
#* → defaults to python
# if IPYTHON_OPTS is set
#* → set IPYTHON to 1
# some python script passed to ./bin/pyspark → run it with ./bin/spark-submit
#* out of this issue's scope
# if IPYTHON is set to 1
#* → execute $PYSPARK_PYTHON (default: ipython) with arguments $IPYTHON_OPTS
#* otherwise execute $PYSPARK_PYTHON
Therefore, when PYSPARK_PYTHON is unset, python is executed even though IPYTHON is 1. In other words, when PYSPARK_PYTHON is unset, IPYTHON_OPTS and IPYTHON have no effect on deciding which command to use.
||PYSPARK_PYTHON||IPYTHON_OPTS||IPYTHON||resulting command||expected command||
|(unset → defaults to python)|(unset)|(unset)|python|(same)|
|(unset → defaults to python)|(unset)|1|python|ipython|
|(unset → defaults to python)|an_option|(unset → set to 1)|python an_option|ipython an_option|
|(unset → defaults to python)|an_option|1|python an_option|ipython an_option|
|ipython|(unset)|(unset)|ipython|(same)|
|ipython|(unset)|1|ipython|(same)|
|ipython|an_option|(unset → set to 1)|ipython an_option|(same)|
|ipython|an_option|1|ipython an_option|(same)|
h3. Suggestion The pyspark script should first determine whether the user wants to run IPython or another executable.
# if IPYTHON_OPTS is set
#* set IPYTHON to 1
# if IPYTHON has the value 1
#* PYSPARK_PYTHON defaults to ipython if not set
# PYSPARK_PYTHON defaults to python if not set
See the pull request for the detailed modification.
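The suggested ordering can be sketched as a pure function (a Python analogue of the shell logic, not the actual bin/pyspark script); it returns only the executable, with IPYTHON_OPTS still appended as arguments afterwards. Unset variables are modeled as None.

```python
def pick_python(pyspark_python, ipython_opts, ipython):
    """Return the executable the suggested logic would choose: IPYTHON_OPTS
    being set forces IPYTHON=1, and only then does PYSPARK_PYTHON default
    to ipython instead of python. An explicit PYSPARK_PYTHON always wins."""
    want_ipython = ipython_opts is not None or ipython == "1"
    if pyspark_python is not None:
        return pyspark_python
    return "ipython" if want_ipython else "python"

# With everything unset: python. With IPYTHON=1 and PYSPARK_PYTHON unset:
# ipython -- matching the "expected command" column of the table above.
```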
[jira] [Commented] (SPARK-2058) SPARK_CONF_DIR should override all present configs
[ https://issues.apache.org/jira/browse/SPARK-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158197#comment-14158197 ] Andrew Or commented on SPARK-2058: -- To give a quick update, this change has not made it into any release yet. It will be in the future releases 1.1.1 and 1.2.0, however.
[jira] [Created] (SPARK-3786) Speedup tests of PySpark
Davies Liu created SPARK-3786: - Summary: Speedup tests of PySpark Key: SPARK-3786 URL: https://issues.apache.org/jira/browse/SPARK-3786 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu It takes about 20 minutes (about 25% of the whole test run) to run all the tests of PySpark. The slowest ones are tests.py and streaming/tests.py; they create a new JVM and SparkContext for each test case. It would be faster to reuse the SparkContext for most cases.
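The reuse idea can be sketched as a memoized factory: the expensive context is created once on first use and shared by every subsequent test case, instead of being rebuilt per test. (FakeContext is a stand-in for SparkContext; in a unittest suite the same effect can be had with a class-level setUpClass fixture.)

```python
from functools import lru_cache

class FakeContext:
    """Stand-in for the expensive SparkContext + JVM startup."""
    pass

@lru_cache(maxsize=None)
def shared_context() -> FakeContext:
    # Built only on the first call; every later call returns the same
    # instance, so the startup cost is paid once per test process.
    return FakeContext()
```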
[jira] [Updated] (SPARK-3696) Do not override user-defined conf_dir in spark-config.sh
[ https://issues.apache.org/jira/browse/SPARK-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3696: - Assignee: WangTaoTheTonic Do not override user-defined conf_dir in spark-config.sh Key: SPARK-3696 URL: https://issues.apache.org/jira/browse/SPARK-3696 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Priority: Minor Fix For: 1.1.1, 1.2.0 Many scripts use spark-config.sh, in which SPARK_CONF_DIR is directly assigned SPARK_HOME/conf. This is inconvenient for those who define SPARK_CONF_DIR in their environment.
[jira] [Closed] (SPARK-3696) Do not override user-defined conf_dir in spark-config.sh
[ https://issues.apache.org/jira/browse/SPARK-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3696. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Target Version/s: 1.1.1, 1.2.0 Do not override user-defined conf_dir in spark-config.sh Key: SPARK-3696 URL: https://issues.apache.org/jira/browse/SPARK-3696 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Priority: Minor Fix For: 1.1.1, 1.2.0
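The fix amounts to a default-only assignment. A minimal sketch of the spark-config.sh change (illustrative, not the exact script) uses shell parameter expansion so a user-exported SPARK_CONF_DIR wins:

```shell
# Hedged sketch: assign SPARK_CONF_DIR only when the user has not already
# exported it, via ${var:-default} parameter expansion.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"
export SPARK_CONF_DIR="${SPARK_CONF_DIR:-"${SPARK_HOME}/conf"}"
echo "$SPARK_CONF_DIR"
```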
[jira] [Updated] (SPARK-1655) In naive Bayes, store conditional probabilities distributively.
[ https://issues.apache.org/jira/browse/SPARK-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1655: - Assignee: Aaron Staple In naive Bayes, store conditional probabilities distributively. --- Key: SPARK-1655 URL: https://issues.apache.org/jira/browse/SPARK-1655 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple In the current implementation, we collect all conditional probabilities to the driver node. When there are many labels and many features, this puts heavy load on the driver. For scalability, we should provide a way to store conditional probabilities distributively.
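The scaling concern can be illustrated with a toy sketch: instead of one driver-side matrix of shape (numLabels, numFeatures), per-label conditional counts are sharded by label so no single node holds them all. This is purely illustrative Python, not MLlib's API:

```python
from collections import defaultdict

def shard_conditional_counts(examples, num_shards):
    """Toy sketch: spread per-(label, feature) counts across shards keyed
    by label, standing in for distributed storage of the conditional
    probability table instead of collecting it to the driver."""
    shards = [defaultdict(int) for _ in range(num_shards)]
    for label, features in examples:
        for f in features:
            shards[label % num_shards][(label, f)] += 1
    return shards
```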
[jira] [Closed] (SPARK-2778) Add unit tests for Yarn integration
[ https://issues.apache.org/jira/browse/SPARK-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-2778. Resolution: Fixed Target Version/s: 1.2.0 Add unit tests for Yarn integration --- Key: SPARK-2778 URL: https://issues.apache.org/jira/browse/SPARK-2778 Project: Spark Issue Type: Test Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.2.0 Attachments: yarn-logs.txt It would be nice to add some Yarn integration tests to the unit tests in Spark; Yarn provides a MiniYARNCluster class that can be used to spawn a cluster locally. UPDATE: These tests are causing exceptions in our nightly build: {code} sbt.ForkMain$ForkError: Call From test04.amplab/10.123.1.2 to test04.amplab:0 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.reflect.GeneratedConstructorAccessor5.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730) at org.apache.hadoop.ipc.Client.call(Client.java:1351) at org.apache.hadoop.ipc.Client.call(Client.java:1300) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152) at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246) at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69) at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69) at org.apache.spark.Logging$class.logInfo(Logging.scala:59) at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:35) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:68) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:61) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141) at org.apache.spark.SparkContext.init(SparkContext.scala:310) at org.apache.spark.deploy.yarn.YarnClusterDriver$.main(YarnClusterSuite.scala:140) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply$mcV$sp(YarnClusterSuite.scala:91) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158) at org.scalatest.Suite$class.withFixture(Suite.scala:1121) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167) at org.scalatest.FunSuite.runTest(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) at
[jira] [Closed] (SPARK-3710) YARN integration test is flaky
[ https://issues.apache.org/jira/browse/SPARK-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3710. Resolution: Fixed Fix Version/s: 1.2.0 YARN integration test is flaky -- Key: SPARK-3710 URL: https://issues.apache.org/jira/browse/SPARK-3710 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Patrick Wendell Assignee: Marcelo Vanzin Priority: Blocker Fix For: 1.2.0 This has been regularly failing the master build: Example failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/738/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink One thing to look at is whether the YARN mini cluster makes assumptions about being able to bind to specific ports. {code} sbt.ForkMain$ForkError: Call From test04.amplab/10.123.1.2 to test04.amplab:0 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.reflect.GeneratedConstructorAccessor5.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730) at org.apache.hadoop.ipc.Client.call(Client.java:1351) at org.apache.hadoop.ipc.Client.call(Client.java:1300) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152) at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246) at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69) at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69) at org.apache.spark.Logging$class.logInfo(Logging.scala:59) at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:35) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:68) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:61) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141) at org.apache.spark.SparkContext.init(SparkContext.scala:310) at org.apache.spark.deploy.yarn.YarnClusterDriver$.main(YarnClusterSuite.scala:140) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply$mcV$sp(YarnClusterSuite.scala:91) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158) at org.scalatest.Suite$class.withFixture(Suite.scala:1121) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559) at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167) at org.scalatest.FunSuite.runTest(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
[jira] [Commented] (SPARK-3710) YARN integration test is flaky
[ https://issues.apache.org/jira/browse/SPARK-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158326#comment-14158326 ] Andrew Or commented on SPARK-3710: -- https://github.com/apache/spark/pull/2605 YARN integration test is flaky -- Key: SPARK-3710 URL: https://issues.apache.org/jira/browse/SPARK-3710 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Patrick Wendell Assignee: Marcelo Vanzin Priority: Blocker Fix For: 1.2.0
[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158363#comment-14158363 ] Reza Farivar commented on SPARK-3785: - Olivier Chafik, who wrote JavaCL (which you mentioned in your description), also has a beta-stage ScalaCL package on GitHub: https://github.com/ochafik/ScalaCL There was also another project trying to bring OpenCL to Java: Aparapi. The neat thing about Aparapi is that it doesn't require you to write OpenCL kernels in C; instead it translates Java loops into OpenCL code at run time. The ScalaCL project seems to have similar goals for Scala. Support off-loading computations to a GPU - Key: SPARK-3785 URL: https://issues.apache.org/jira/browse/SPARK-3785 Project: Spark Issue Type: Brainstorming Components: MLlib Reporter: Thomas Darimont Priority: Minor Are there any plans to add support for off-loading computations to the GPU, e.g. via an OpenCL binding? http://www.jocl.org/ https://code.google.com/p/javacl/ http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL
[jira] [Commented] (SPARK-3710) YARN integration test is flaky
[ https://issues.apache.org/jira/browse/SPARK-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158364#comment-14158364 ] Marcelo Vanzin commented on SPARK-3710: --- For some reason, the e-mail for this bug ended up in my spam box. Anyway, the fix was also tracked in SPARK-2778. YARN integration test is flaky -- Key: SPARK-3710 URL: https://issues.apache.org/jira/browse/SPARK-3710 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Patrick Wendell Assignee: Marcelo Vanzin Priority: Blocker Fix For: 1.2.0
[jira] [Commented] (SPARK-3710) YARN integration test is flaky
[ https://issues.apache.org/jira/browse/SPARK-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158376#comment-14158376 ] Marcelo Vanzin commented on SPARK-3710: --- I filed a Yarn bug (YARN-2642), although we can't get rid of the workaround since we need to support existing versions of Yarn. YARN integration test is flaky -- Key: SPARK-3710 URL: https://issues.apache.org/jira/browse/SPARK-3710 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Patrick Wendell Assignee: Marcelo Vanzin Priority: Blocker Fix For: 1.2.0
[jira] [Resolved] (SPARK-3007) Add Dynamic Partition support to Spark Sql hive
[ https://issues.apache.org/jira/browse/SPARK-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3007. Resolution: Fixed Okay, this was merged again: https://github.com/apache/spark/pull/2616 Add Dynamic Partition support to Spark Sql hive --- Key: SPARK-3007 URL: https://issues.apache.org/jira/browse/SPARK-3007 Project: Spark Issue Type: Improvement Components: SQL Reporter: baishuo Fix For: 1.2.0
[jira] [Resolved] (SPARK-3212) Improve the clarity of caching semantics
[ https://issues.apache.org/jira/browse/SPARK-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3212. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2501 [https://github.com/apache/spark/pull/2501] Improve the clarity of caching semantics Key: SPARK-3212 URL: https://issues.apache.org/jira/browse/SPARK-3212 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 Right now there are a bunch of different ways to cache tables in Spark SQL. For example: - tweets.cache() - sql("SELECT * FROM tweets").cache() - table("tweets").cache() - tweets.cache().registerTempTable("tweets") - sql("CACHE TABLE tweets") - cacheTable("tweets") Each of the above commands has subtly different semantics, leading to a very confusing user experience. Ideally, we would stop doing caching based on simple table names and instead have a phase of optimization that does intelligent matching of query plans with available cached data.
[jira] [Resolved] (SPARK-1379) Calling .cache() on a SchemaRDD should do something more efficient than caching the individual row objects.
[ https://issues.apache.org/jira/browse/SPARK-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1379. - Resolution: Fixed Calling .cache() on a SchemaRDD should do something more efficient than caching the individual row objects. --- Key: SPARK-1379 URL: https://issues.apache.org/jira/browse/SPARK-1379 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Since rows aren't black boxes, we could use InMemoryColumnarTableScan. This would significantly reduce GC pressure on the workers.
[jira] [Resolved] (SPARK-3641) Correctly populate SparkPlan.currentContext
[ https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3641. - Resolution: Fixed Assignee: Michael Armbrust (was: Yin Huai) Correctly populate SparkPlan.currentContext --- Key: SPARK-3641 URL: https://issues.apache.org/jira/browse/SPARK-3641 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yin Huai Assignee: Michael Armbrust Priority: Critical After creating a new SQLContext, we need to populate SparkPlan.currentContext before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD populates SparkPlan.currentContext. SQLContext.applySchema is missing this call, and we can hit an NPE as described in http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table.
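The failure mode is a missing thread-local initialization. A rough Python analogue of the pattern (names are illustrative, not Spark SQL's actual internals):

```python
import threading

_current = threading.local()

class SQLContext:
    def set_as_current(self):
        # Analogue of populating SparkPlan.currentContext: every code path
        # that constructs plan nodes must do this first.
        _current.ctx = self

class SparkPlan:
    def __init__(self):
        ctx = getattr(_current, "ctx", None)
        if ctx is None:
            # Analogue of the NPE: the context was never populated.
            raise RuntimeError("no current SQLContext set for this thread")
        self.ctx = ctx

def create_schema_rdd(ctx):
    ctx.set_as_current()   # createSchemaRDD populates the context first...
    return SparkPlan()

def apply_schema_buggy(ctx):
    return SparkPlan()     # ...while the buggy path skips that call
```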
[jira] [Resolved] (SPARK-1671) Cached tables should follow write-through policy
[ https://issues.apache.org/jira/browse/SPARK-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1671. - Resolution: Fixed I'm gonna mark this as resolved now that we do at least invalidate the cache when writing through. We can create a follow-up JIRA for partial invalidation if we want. Cached tables should follow write-through policy Key: SPARK-1671 URL: https://issues.apache.org/jira/browse/SPARK-1671 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Cheng Lian Assignee: Michael Armbrust Labels: cache, column Writing (insert / load) to a cached table causes cache inconsistency, and users have to unpersist and cache the whole table again. The write-through policy may be implemented with {{RDD.union}}.
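The resolution described above (invalidate the cache on write, rather than a true write-through union) can be sketched as follows; the class and method names are hypothetical, not Spark SQL's API:

```python
class CachedTable:
    """Minimal sketch of invalidate-on-write: a write drops the cached
    snapshot, so the next read recomputes and never sees stale data."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._cache = None

    def cache(self):
        self._cache = list(self._rows)   # materialize a snapshot

    def insert(self, row):
        self._rows.append(row)
        self._cache = None               # invalidate instead of patching

    def scan(self):
        return self._cache if self._cache is not None else list(self._rows)
```

A write-through variant would instead update the cached snapshot in place (the {{RDD.union}} idea in the description); invalidation is the simpler policy that at least restores consistency.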
[jira] [Updated] (SPARK-2973) Add a way to show tables without executing a job
[ https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2973: Assignee: Cheng Lian (was: Michael Armbrust) Add a way to show tables without executing a job Key: SPARK-2973 URL: https://issues.apache.org/jira/browse/SPARK-2973 Project: Spark Issue Type: Improvement Components: SQL Reporter: Aaron Davidson Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 Right now, sql("show tables").collect() will start a Spark job which shows up in the UI. There should be a way to get these without this step.
[jira] [Closed] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
[ https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3535. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Spark on Mesos not correctly setting heap overhead -- Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Assignee: Brenden Matthews Fix For: 1.1.1, 1.2.0 Spark on Mesos does not account for any memory overhead. The result is that tasks are OOM-killed nearly 95% of the time. As with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63
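The kind of calculation being proposed can be sketched as follows. The 15% factor and the 384 MB floor are illustrative values in the spirit of the Hadoop-on-Mesos code linked above, not Spark's actual constants:

```python
def mesos_memory_request(executor_memory_mb, overhead_fraction=0.15,
                         floor_mb=384):
    """Sketch: ask Mesos for heap plus JVM overhead so tasks are not
    OOM-killed. Reserves max(floor_mb, fraction-of-heap) on top of the
    executor heap size. All constants here are illustrative."""
    overhead = max(floor_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead
```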
[jira] [Updated] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
[ https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3535: - Assignee: Brenden Matthews Spark on Mesos not correctly setting heap overhead -- Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Assignee: Brenden Matthews Spark on Mesos does not account for any memory overhead. The result is that tasks are OOM killed nearly 95% of the time. Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63
[jira] [Updated] (SPARK-3775) Not suitable error message in spark-shell.cmd
[ https://issues.apache.org/jira/browse/SPARK-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3775: - Affects Version/s: 1.1.0 Not suitable error message in spark-shell.cmd - Key: SPARK-3775 URL: https://issues.apache.org/jira/browse/SPARK-3775 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Priority: Trivial In a Windows environment, when we execute bin\spark-shell.cmd before building Spark, we get an error message like this: {quote} Failed to find Spark assembly JAR. You need to build Spark with sbt\sbt assembly before running this program. {quote} But this message is not suitable because: * Maven is also available to build Spark, and it now works in Windows without cygwin ([SPARK-3061]). * The equivalent error message in the Linux version (bin/spark-shell) doesn't mention the way to build. bq. You need to build Spark before running this program. * sbt\sbt can't be executed in Windows without cygwin because it's a bash script. So this message should be modified to match the Linux version.
[jira] [Closed] (SPARK-3775) Not suitable error message in spark-shell.cmd
[ https://issues.apache.org/jira/browse/SPARK-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3775. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Assignee: Masayoshi TSUZUKI Target Version/s: 1.1.1, 1.2.0 Not suitable error message in spark-shell.cmd - Key: SPARK-3775 URL: https://issues.apache.org/jira/browse/SPARK-3775 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Assignee: Masayoshi TSUZUKI Priority: Trivial Fix For: 1.1.1, 1.2.0 In a Windows environment, when we execute bin\spark-shell.cmd before building Spark, we get an error message like this: {quote} Failed to find Spark assembly JAR. You need to build Spark with sbt\sbt assembly before running this program. {quote} But this message is not suitable because: * Maven is also available to build Spark, and it now works in Windows without cygwin ([SPARK-3061]). * The equivalent error message in the Linux version (bin/spark-shell) doesn't mention the way to build. bq. You need to build Spark before running this program. * sbt\sbt can't be executed in Windows without cygwin because it's a bash script. So this message should be modified to match the Linux version.
[jira] [Closed] (SPARK-3774) typo comment in bin/utils.sh
[ https://issues.apache.org/jira/browse/SPARK-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3774. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Assignee: Masayoshi TSUZUKI Target Version/s: 1.1.1, 1.2.0 typo comment in bin/utils.sh Key: SPARK-3774 URL: https://issues.apache.org/jira/browse/SPARK-3774 Project: Spark Issue Type: Improvement Components: PySpark, Spark Shell Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Assignee: Masayoshi TSUZUKI Priority: Trivial Fix For: 1.1.1, 1.2.0 typo comment in bin/utils.sh {code} # Gather all all spark-submit options into SUBMISSION_OPTS {code}
[jira] [Updated] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.
[ https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3606: - Fix Version/s: 1.2.0 Spark-on-Yarn AmIpFilter does not work with Yarn HA. Key: SPARK-3606 URL: https://issues.apache.org/jira/browse/SPARK-3606 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.2.0 The current IP filter only considers one of the RMs in an HA setup. If the active RM is not the configured one, you get a connection refused error when clicking on the Spark AM links in the RM UI. Similar to YARN-1811, but for Spark.
[jira] [Updated] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.
[ https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3606: - Target Version/s: 1.1.1, 1.2.0 Affects Version/s: (was: 1.2.0) Spark-on-Yarn AmIpFilter does not work with Yarn HA. Key: SPARK-3606 URL: https://issues.apache.org/jira/browse/SPARK-3606 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.2.0 The current IP filter only considers one of the RMs in an HA setup. If the active RM is not the configured one, you get a connection refused error when clicking on the Spark AM links in the RM UI. Similar to YARN-1811, but for Spark.
[jira] [Updated] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.
[ https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3606: - Affects Version/s: 1.2.0 Spark-on-Yarn AmIpFilter does not work with Yarn HA. Key: SPARK-3606 URL: https://issues.apache.org/jira/browse/SPARK-3606 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.2.0 The current IP filter only considers one of the RMs in an HA setup. If the active RM is not the configured one, you get a connection refused error when clicking on the Spark AM links in the RM UI. Similar to YARN-1811, but for Spark.
[jira] [Closed] (SPARK-3763) The example of building with sbt should be sbt assembly instead of sbt compile
[ https://issues.apache.org/jira/browse/SPARK-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3763. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Kousuke Saruta The example of building with sbt should be sbt assembly instead of sbt compile -- Key: SPARK-3763 URL: https://issues.apache.org/jira/browse/SPARK-3763 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Trivial Fix For: 1.2.0 In building-spark.md, there are some examples for making an assembled package with Maven, but the example for building with sbt only covers compiling.
[jira] [Resolved] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-1860. --- Resolution: Fixed Fixed by mccheah in https://github.com/apache/spark/pull/2609 Standalone Worker cleanup should not clean up running executors --- Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Blocker With default settings, the standalone worker cleanup code cleans up all application data every 7 days. This includes jars that were added to any executors that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Executors' log/data folders should not be cleaned up while they're still running. Until then, this behavior should not be enabled by default.
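The guard this ticket asks for amounts to an extra predicate in the cleanup pass: a directory is deleted only if it is both older than the TTL and not owned by a still-running application. The sketch below is illustrative, with hypothetical names, not the code from the linked pull request:

```python
import time

def dirs_to_clean(app_dirs, running_apps, ttl_seconds, now=None):
    """Return app directories eligible for cleanup: older than the TTL
    AND not owned by a still-running application (the missing check this
    issue reports). `app_dirs` maps directory name -> last-modified time."""
    now = time.time() if now is None else now
    return [d for d, mtime in app_dirs.items()
            if (now - mtime) > ttl_seconds and d not in running_apps]
```

Without the `d not in running_apps` condition, a long-running streaming application's jars get deleted out from under its executors once the TTL elapses, which is the failure mode described above.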
[jira] [Commented] (SPARK-3786) Speedup tests of PySpark
[ https://issues.apache.org/jira/browse/SPARK-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158552#comment-14158552 ] Apache Spark commented on SPARK-3786: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2646 Speedup tests of PySpark Key: SPARK-3786 URL: https://issues.apache.org/jira/browse/SPARK-3786 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Davies Liu It takes about 20 minutes (about 25% of all the tests) to run all the tests of PySpark. The slowest ones are tests.py and streaming/tests.py; they create a new JVM and SparkContext for each test case, and it would be faster to reuse the SparkContext for most cases.
[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158571#comment-14158571 ] Sandy Ryza commented on SPARK-3561: --- I think there may be somewhat of a misunderstanding about the relationship between Spark and YARN. YARN is not an execution environment, but a cluster resource manager that has the ability to start processes on behalf of execution engines like Spark. Spark already supports YARN as a cluster resource manager, but YARN doesn't provide its own execution engine. YARN doesn't provide a stateless shuffle (although execution engines built atop it like MR and Tez do). If I understand, the broader intent is to decouple the Spark API from the execution engine it runs on top of. Changing the title to reflect this. That, the Spark API is currently very tightly integrated with its execution engine, and frankly, decoupling the two so that Spark would be able to run on top of execution engines with similar properties seems more trouble than its worth. Native Hadoop/YARN integration for batch/ETL workloads -- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. 
The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. Pull Request will be posted shortly as well
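The dispatch the proposal describes — choosing an execution-context implementation from a master URL of the form execution-context:foo.bar.MyJobExecutionContext, falling back to the default otherwise — can be sketched as follows. This is an illustrative sketch, not the proposed Scala code; a registry dict stands in for reflective class loading, and `run_job` is a toy:

```python
PREFIX = "execution-context:"

class DefaultExecutionContext:
    """Stand-in for the default context holding the existing SparkContext logic."""
    def run_job(self, partitions, func):
        return [func(p) for p in partitions]

# Stands in for reflectively loading the class named in the master URL.
_REGISTRY = {"foo.bar.MyJobExecutionContext": DefaultExecutionContext}

def execution_context_for(master_url):
    """Pick the JobExecutionContext implementation from the master URL;
    any other master (local, yarn, ...) gets the default context."""
    if master_url.startswith(PREFIX):
        cls = _REGISTRY[master_url[len(PREFIX):]]
        return cls()
    return DefaultExecutionContext()
```

The point of the design is that SparkContext's public API is untouched: its methods delegate to whichever context the master URL selected, so integrators swap engines without users changing code.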
[jira] [Created] (SPARK-3787) Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version
Kousuke Saruta created SPARK-3787: - Summary: Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version Key: SPARK-3787 URL: https://issues.apache.org/jira/browse/SPARK-3787 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Kousuke Saruta When we build with sbt with profile for hadoop and without property for hadoop version like: {code} sbt/sbt -Phadoop-2.2 assembly {code}
[jira] [Updated] (SPARK-3787) Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version
[ https://issues.apache.org/jira/browse/SPARK-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3787: -- Description: When we build with sbt with profile for hadoop and without property for hadoop version like: {code} sbt/sbt -Phadoop-2.2 assembly {code} jar name is always used default version (1.0.4). When we build with maven with same condition for sbt, default version for each profile. For instance, if we build like: {code} mvn -Phadoop-2.2 package {code} jar name is used hadoop2.2.0 as a default version of hadoop-2.2. was: When we build with sbt with profile for hadoop and without property for hadoop version like: {code} sbt/sbt -Phadoop-2.2 assembly {code} Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version --- Key: SPARK-3787 URL: https://issues.apache.org/jira/browse/SPARK-3787 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Kousuke Saruta When we build with sbt with profile for hadoop and without property for hadoop version like: {code} sbt/sbt -Phadoop-2.2 assembly {code} jar name is always used default version (1.0.4). When we build with maven with same condition for sbt, default version for each profile. For instance, if we build like: {code} mvn -Phadoop-2.2 package {code} jar name is used hadoop2.2.0 as a default version of hadoop-2.2.
[jira] [Commented] (SPARK-3787) Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version
[ https://issues.apache.org/jira/browse/SPARK-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158602#comment-14158602 ] Apache Spark commented on SPARK-3787: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2647 Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version --- Key: SPARK-3787 URL: https://issues.apache.org/jira/browse/SPARK-3787 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Kousuke Saruta When we build with sbt with profile for hadoop and without property for hadoop version like: {code} sbt/sbt -Phadoop-2.2 assembly {code} jar name is always used default version (1.0.4). When we build with maven with same condition for sbt, default version for each profile. For instance, if we build like: {code} mvn -Phadoop-2.2 package {code} jar name is used hadoop2.2.0 as a default version of hadoop-2.2.
[jira] [Updated] (SPARK-3787) Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version
[ https://issues.apache.org/jira/browse/SPARK-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3787: -- Description: When we build with sbt with profile for hadoop and without property for hadoop version like: {code} sbt/sbt -Phadoop-2.2 assembly {code} jar name is always used default version (1.0.4). When we build with maven with same condition for sbt, default version for each profile is used. For instance, if we build like: {code} mvn -Phadoop-2.2 package {code} jar name is used hadoop2.2.0 as a default version of hadoop-2.2. was: When we build with sbt with profile for hadoop and without property for hadoop version like: {code} sbt/sbt -Phadoop-2.2 assembly {code} jar name is always used default version (1.0.4). When we build with maven with same condition for sbt, default version for each profile. For instance, if we build like: {code} mvn -Phadoop-2.2 package {code} jar name is used hadoop2.2.0 as a default version of hadoop-2.2. Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version --- Key: SPARK-3787 URL: https://issues.apache.org/jira/browse/SPARK-3787 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Kousuke Saruta When we build with sbt with profile for hadoop and without property for hadoop version like: {code} sbt/sbt -Phadoop-2.2 assembly {code} jar name is always used default version (1.0.4). When we build with maven with same condition for sbt, default version for each profile is used. For instance, if we build like: {code} mvn -Phadoop-2.2 package {code} jar name is used hadoop2.2.0 as a default version of hadoop-2.2.
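The expected resolution behavior described in this issue — an explicit -Dhadoop.version wins, otherwise the active profile's default applies, and only as a last resort the global default (1.0.4) — can be sketched as a small function. The profile mapping and the jar-name pattern here are illustrative assumptions, not the actual build definition:

```python
# Illustrative default mapping; hadoop-2.2 defaulting to 2.2.0 is the
# behavior the reporter observes with Maven.
PROFILE_DEFAULT_HADOOP = {"hadoop-2.2": "2.2.0"}

def assembly_jar_name(spark_version, profile=None, hadoop_version=None):
    """Expected resolution order: explicit -Dhadoop.version, then the
    active profile's default, then the global default (1.0.4)."""
    hv = hadoop_version or PROFILE_DEFAULT_HADOOP.get(profile, "1.0.4")
    return f"spark-assembly-{spark_version}-hadoop{hv}.jar"
```

The bug is that the sbt build skips the middle step: with -Phadoop-2.2 but no -Dhadoop.version it falls straight through to 1.0.4, whereas Maven honors the profile default.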
[jira] [Commented] (SPARK-3561) Decouple Spark's API from its execution engine
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158614#comment-14158614 ] Oleg Zhurakousky commented on SPARK-3561: - [~sandyr] Indeed YARN is a _resource manager_ that supports multiple execution environments by helping with resource allocation and management. On the other hand, Spark, Tez and many other (custom) execution environments are currently run on YARN. (NOTE: Custom execution environments on YARN are becoming very common in large enterprises). Such decoupling will ensure that Spark can integrate with any and all (where applicable) in a pluggable and extensible fashion. Decouple Spark's API from its execution engine -- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. 
JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. Pull Request will be posted shortly as well
[jira] [Comment Edited] (SPARK-3561) Decouple Spark's API from its execution engine
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158614#comment-14158614 ] Oleg Zhurakousky edited comment on SPARK-3561 at 10/3/14 10:34 PM: --- [~sandyr] Indeed YARN is a _resource manager_ that supports multiple execution environments by facilitating resource allocation and management. On the other hand, Spark, Tez and many other (custom) execution environments are currently run on YARN. (NOTE: Custom execution environments on YARN are becoming very common in large enterprises). Such decoupling will ensure that Spark can integrate with any and all (where applicable) in a pluggable and extensible fashion. was (Author: ozhurakousky): [~sandyr] Indeed YARN is a _resource manager_ that supports multiple execution environments by helping with resource allocation and management. On the other hand, Spark, Tez and many other (custom) execution environments are currently run on YARN. (NOTE: Custom execution environments on YARN are becoming very common in large enterprises). Such decoupling will ensure that Spark can integrate with any and all (where applicable) in a pluggable and extensible fashion. Decouple Spark's API from its execution engine -- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. 
Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. Pull Request will be posted shortly as well
[jira] [Updated] (SPARK-3561) Decouple Spark's API from its execution engine
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-3561: -- Description: Currently Spark's API is tightly coupled with its backend execution engine. It could be useful to provide a point of pluggability between the two to allow Spark to run on other DAG execution engines with similar distributed memory abstractions. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc for more details. Pull Request will be posted shortly as well was: Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. 
Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc for more details. Pull Request will be posted shortly as well Decouple Spark's API from its execution engine -- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark's API is tightly coupled with its backend execution engine. It could be useful to provide a point of pluggability between the two to allow Spark to run on other DAG execution engines with similar distributed memory abstractions. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. 
JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. Pull Request will be posted shortly as well
[jira] [Updated] (SPARK-3561) Decouple Spark's API from its execution engine
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-3561: -- Description: Currently Spark's API is tightly coupled with its backend execution engine. It could be useful to provide a point of pluggability between the two to allow Spark to run on other DAG execution engines with similar distributed memory abstractions. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc for more details. Pull Request will be posted shortly as well was: Currently Spark's API is tightly coupled with its backend execution engine. It could be useful to provide a point of pluggability between the two to allow Spark to run on other DAG execution engines with similar distributed memory abstractions. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. 
JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. An integrator will now have an option to provide custom implementation of DefaultExecutionContext by either implementing it from scratch or extending form DefaultExecutionContext. Please see the attached design doc for more details. Pull Request will be posted shortly as well Decouple Spark's API from its execution engine -- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
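The proposed pluggable boundary can be sketched as follows. This is an illustration only, written in Python for brevity (the actual proposal is a Scala trait inside Spark Core); the method signatures and the load_execution_context helper are assumptions, not the real API.

```python
# Illustrative sketch only: the real proposal is a Scala trait in Spark Core.
# Method signatures here are assumptions based on the four operations named above.
from abc import ABC, abstractmethod


class JobExecutionContext(ABC):
    """Pluggable boundary between Spark's API and its execution engine."""

    @abstractmethod
    def hadoop_file(self, path, min_partitions):
        ...

    @abstractmethod
    def new_api_hadoop_file(self, path, conf):
        ...

    @abstractmethod
    def broadcast(self, value):
        ...

    @abstractmethod
    def run_job(self, rdd, func):
        ...


class DefaultExecutionContext(JobExecutionContext):
    """Default implementation: stands in for the behavior SparkContext has today."""

    def hadoop_file(self, path, min_partitions):
        return ("hadoop_file", path, min_partitions)

    def new_api_hadoop_file(self, path, conf):
        return ("new_api_hadoop_file", path, conf)

    def broadcast(self, value):
        return ("broadcast", value)

    def run_job(self, rdd, func):
        # Apply func to each "partition" of the stand-in RDD.
        return [func(p) for p in rdd]


def load_execution_context(master_url):
    """Resolve a context from a master URL of the form
    execution-context:<fully.qualified.ClassName>; otherwise use the default."""
    prefix = "execution-context:"
    if master_url.startswith(prefix):
        # A real implementation would reflectively instantiate the named class;
        # this sketch only records which class was requested.
        return master_url[len(prefix):]
    return DefaultExecutionContext()
```

With the default context, jobs behave exactly as before; a master URL such as execution-context:foo.bar.MyJobExecutionContext would select the custom implementation instead.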
[jira] [Comment Edited] (SPARK-3561) Decouple Spark's API from its execution engine
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158571#comment-14158571 ] Sandy Ryza edited comment on SPARK-3561 at 10/3/14 11:00 PM: - I think there may be somewhat of a misunderstanding about the relationship between Spark and YARN. YARN is not an execution environment, but a cluster resource manager that has the ability to start processes on behalf of execution engines like Spark. Spark already supports YARN as a cluster resource manager, but YARN doesn't provide its own execution engine. YARN doesn't provide a stateless shuffle (although execution engines built atop it like MR and Tez do). If I understand, the broader intent is to decouple the Spark API from the execution engine it runs on top of. Changing the title to reflect this. That said, the Spark API is currently very tightly integrated with its execution engine, and frankly, decoupling the two so that Spark would be able to run on top of execution engines with similar properties seems more trouble than it's worth. was (Author: sandyr): I think there may be somewhat of a misunderstanding about the relationship between Spark and YARN. YARN is not an execution environment, but a cluster resource manager that has the ability to start processes on behalf of execution engines like Spark. Spark already supports YARN as a cluster resource manager, but YARN doesn't provide its own execution engine. YARN doesn't provide a stateless shuffle (although execution engines built atop it like MR and Tez do). If I understand, the broader intent is to decouple the Spark API from the execution engine it runs on top of. Changing the title to reflect this. That, the Spark API is currently very tightly integrated with its execution engine, and frankly, decoupling the two so that Spark would be able to run on top of execution engines with similar properties seems more trouble than its worth. 
Decouple Spark's API from its execution engine -- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark's user-facing API is tightly coupled with its backend execution engine. It could be useful to provide a point of pluggability between the two to allow Spark to run on other DAG execution engines with similar distributed memory abstractions. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding method in the current version of SparkContext. JobExecutionContext implementations will be accessed by SparkContext via a master URL of the form execution-context:foo.bar.MyJobExecutionContext, with a default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to that implementation. An integrator will then have the option to provide a custom implementation of DefaultExecutionContext, either by implementing it from scratch or by extending from DefaultExecutionContext. Please see the attached design doc for more details. A pull request will be posted shortly as well.
[jira] [Created] (SPARK-3788) Yarn dist cache code is not friendly to HDFS HA, Federation
Marcelo Vanzin created SPARK-3788: - Summary: Yarn dist cache code is not friendly to HDFS HA, Federation Key: SPARK-3788 URL: https://issues.apache.org/jira/browse/SPARK-3788 Project: Spark Issue Type: Bug Components: YARN Reporter: Marcelo Vanzin There are two bugs here. 1. The {{compareFs()}} method in ClientBase considers the 'host' part of the URI to be an actual host. In the case of HA and Federation, that's a namespace name, which doesn't resolve to anything. So in those cases, {{compareFs()}} always says the file systems are different. 2. In {{prepareLocalResources()}}, when adding a file to the distributed cache, that is done with the common FileSystem object instantiated at the start of the method. In the case of Federation that doesn't work: the qualified URL's scheme may differ from the non-qualified one, so the FileSystem instance will not work. Fixes are pretty trivial.
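To see why bug 1 bites under HA and Federation, here is a small sketch. The real {{compareFs()}} is Scala inside ClientBase; this Python version (both function names hypothetical) only illustrates the idea: resolving the URI authority as a hostname fails for nameservice IDs like "ns1", while a plain textual comparison of scheme and authority avoids DNS entirely.

```python
# Hypothetical illustration of the compareFs() problem (the real code is Scala).
# Under HDFS HA/Federation the URI authority is a nameservice ID (e.g. "ns1"),
# not a resolvable host, so host resolution either fails or never matches.
import socket
from urllib.parse import urlparse


def compare_fs_by_resolved_host(uri_a, uri_b):
    """Buggy approach: treat the URI authority as a real host and resolve it."""
    a, b = urlparse(uri_a), urlparse(uri_b)
    if a.scheme != b.scheme:
        return False
    try:
        return socket.gethostbyname(a.netloc) == socket.gethostbyname(b.netloc)
    except socket.gaierror:
        # Nameservice IDs don't resolve: we conservatively answer "different",
        # which forces an unnecessary copy of resources into HDFS.
        return False


def compare_fs_by_authority(uri_a, uri_b):
    """Fix sketch: compare scheme and authority textually, with no DNS lookup."""
    a, b = urlparse(uri_a), urlparse(uri_b)
    return a.scheme == b.scheme and a.netloc.lower() == b.netloc.lower()
```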
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158646#comment-14158646 ] Andrew Ash commented on SPARK-1860: --- [~ilikerps] this ticket mentioned turning the cleanup code on by default once this ticket was fixed. Should we change the defaults to have this on by default? Standalone Worker cleanup should not clean up running executors --- Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Blocker With its default settings, the standalone worker cleanup code cleans up all application data every 7 days. This includes jars that were added to any executors that happen to be running for longer than 7 days, hitting streaming jobs especially hard. An executor's log/data folders should not be cleaned up while it is still running. Until then, this behavior should not be enabled by default.
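For reference, the standalone worker cleanup behavior discussed above is controlled by the following properties (defaults shown as of the Spark 1.x line; worth verifying against the documentation for the version in use):

```properties
# Disabled by default, pending the fix to exclude running executors
spark.worker.cleanup.enabled=false
# How often, in seconds, the worker scans for old application work dirs
spark.worker.cleanup.interval=1800
# TTL in seconds for application work dirs: 7 days = 604800
spark.worker.cleanup.appDataTtl=604800
```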
[jira] [Commented] (SPARK-3788) Yarn dist cache code is not friendly to HDFS HA, Federation
[ https://issues.apache.org/jira/browse/SPARK-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158657#comment-14158657 ] Marcelo Vanzin commented on SPARK-3788: --- Note: 2 above only applies to branch-1.1. It was fixed in master by https://github.com/apache/spark/commit/c4022dd5.
[jira] [Commented] (SPARK-3788) Yarn dist cache code is not friendly to HDFS HA, Federation
[ https://issues.apache.org/jira/browse/SPARK-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158710#comment-14158710 ] Marcelo Vanzin commented on SPARK-3788: --- Ah, 2 was fixed in branch-1.1 as part of SPARK-2577. So only issue 1 remains.
[jira] [Created] (SPARK-3789) Python bindings for GraphX
Ameet Talwalkar created SPARK-3789: -- Summary: Python bindings for GraphX Key: SPARK-3789 URL: https://issues.apache.org/jira/browse/SPARK-3789 Project: Spark Issue Type: New Feature Components: GraphX, PySpark Reporter: Ameet Talwalkar
[jira] [Commented] (SPARK-3788) Yarn dist cache code is not friendly to HDFS HA, Federation
[ https://issues.apache.org/jira/browse/SPARK-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158725#comment-14158725 ] Apache Spark commented on SPARK-3788: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/2650
[jira] [Commented] (SPARK-3314) Script creation of AMIs
[ https://issues.apache.org/jira/browse/SPARK-3314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158777#comment-14158777 ] Nicholas Chammas commented on SPARK-3314: - Hey [~holdenk], I think this is a great issue to work on. There was a related discussion on the dev list about using [Packer|http://www.packer.io/] to do this. I will be looking into this option and will report back here. Script creation of AMIs --- Key: SPARK-3314 URL: https://issues.apache.org/jira/browse/SPARK-3314 Project: Spark Issue Type: Improvement Components: EC2 Reporter: holdenk Priority: Minor The current Spark AMIs have been built up over time. It would be useful to provide a script which can be used to bootstrap from a fresh Amazon AMI. We could also update the project's AMIs at the same time to be based on a newer version, so we don't have to wait so long for security updates to be installed.
[jira] [Commented] (SPARK-3772) RDD operation on IPython REPL failed with an illegal port number
[ https://issues.apache.org/jira/browse/SPARK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158865#comment-14158865 ] Apache Spark commented on SPARK-3772: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/2651 RDD operation on IPython REPL failed with an illegal port number Key: SPARK-3772 URL: https://issues.apache.org/jira/browse/SPARK-3772 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0 Reporter: cocoatomo Labels: pyspark To reproduce this issue, execute the following commands on commit 6e27cb630de69fa5acb510b4e2f6b980742b1957: {quote} $ PYSPARK_PYTHON=ipython ./bin/pyspark ... In [1]: file = sc.textFile('README.md') In [2]: file.first() ... 14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded 14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1 14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at PythonRDD.scala:334 14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at PythonRDD.scala:334) with 1 output partitions (allowLocal=true) 14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at PythonRDD.scala:334) 14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List() 14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List() 14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD at PythonRDD.scala:44), which has no missing parents 14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with curMem=57388, maxMem=278019440 14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.4 KB, free 265.1 MB) 14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PythonRDD[2] at RDD 
at PythonRDD.scala:44) 14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1207 bytes) 14/10/03 08:50:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 14/10/03 08:50:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalArgumentException: port out of range:1027423549 at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143) at java.net.InetSocketAddress.<init>(InetSocketAddress.java:188) at java.net.Socket.<init>(Socket.java:244) at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75) at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:100) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:71) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) {quote}
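A clue to the "illegal port": 1027423549 is exactly the four bytes b"====" interpreted as a big-endian integer, which is consistent with the worker launcher reading stray text (for example IPython banner output) from the daemon's stdout where the port bytes were expected. This Python sketch shows the arithmetic; the read_daemon_port helper is hypothetical, not PySpark's actual code.

```python
# The illegal port 1027423549 in the traceback equals b"====" decoded as a
# 4-byte big-endian integer -- consistent with stray text landing on the
# daemon's stdout where the launcher expected the port number.
import struct

port = struct.unpack(">i", b"====")[0]
print(port)  # 1027423549

# A defensive reader (hypothetical) would validate the range before connecting:
def read_daemon_port(raw: bytes) -> int:
    (candidate,) = struct.unpack(">i", raw[:4])
    if not (0 < candidate <= 65535):
        raise ValueError(f"port out of range: {candidate}")
    return candidate
```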
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158949#comment-14158949 ] Reza Zadeh commented on SPARK-3434: --- Any updates, Shivaraman? Distributed block matrix Key: SPARK-3434 URL: https://issues.apache.org/jira/browse/SPARK-3434 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng This JIRA is for discussing distributed matrices stored as block sub-matrices. The main challenge is choosing a partitioning scheme that allows adding linear algebra operations in the future, e.g.: 1. matrix multiplication 2. matrix factorization (QR, LU, ...) Let's discuss the partitioning and storage and how they fit into the above use cases. Questions: 1. Should it be backed by a single RDD that contains all of the sub-matrices, or by many RDDs, each containing only one sub-matrix?
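The "single RDD of sub-matrices" option from question 1 can be sketched as below. A plain Python dict stands in for the RDD of ((block_row, block_col), sub-matrix) pairs; the function names and the fixed square block size are illustrative assumptions, not an MLlib API.

```python
# Minimal sketch: blocks keyed by (block_row, block_col), with a dict standing
# in for the RDD. Real code would use an RDD and a join on the inner index.

def to_blocks(mat, bs):
    """Partition a dense matrix (list of lists) into bs x bs sub-matrices."""
    n, m = len(mat), len(mat[0])
    assert n % bs == 0 and m % bs == 0, "sketch assumes evenly dividing blocks"
    return {(bi, bj): [row[bj * bs:(bj + 1) * bs] for row in mat[bi * bs:(bi + 1) * bs]]
            for bi in range(n // bs) for bj in range(m // bs)}

def mat_mul(a, b):
    """Dense multiply of two small square blocks."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def block_multiply(a_blocks, b_blocks, bs):
    """C[i,k] = sum_j A[i,j] * B[j,k]: the join-on-inner-index pattern a
    block-partitioned distributed multiply would express as a shuffle."""
    zero = [[0] * bs for _ in range(bs)]
    out = {}
    for (i, j), a in a_blocks.items():
        for (j2, k), b in b_blocks.items():
            if j == j2:
                out[(i, k)] = mat_add(out.get((i, k), zero), mat_mul(a, b))
    return out
```

The key design point this exposes is exactly the one the JIRA raises: the partitioning of the block keys determines how much data the inner-index join shuffles, for multiplication as well as for factorizations.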