[jira] [Created] (SPARK-1678) Compression loses repeated values.
Michael Armbrust created SPARK-1678: --- Summary: Compression loses repeated values. Key: SPARK-1678 URL: https://issues.apache.org/jira/browse/SPARK-1678 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.0.0 Here's a test case:
{code}
test("all the same strings") {
  sparkContext.parallelize(1 to 1000).map(_ => StringData("test")).registerAsTable("test1000")
  assert(sql("SELECT * FROM test1000").count() === 1000)
  cacheTable("test1000")
  assert(sql("SELECT * FROM test1000").count() === 1000)
}
{code}
The first assert passes; the second one fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
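The ticket doesn't name the faulty encoder, but the invariant the test exercises is easy to state: whatever encoding the in-memory columnar cache uses must round-trip repeated values exactly. A minimal run-length-encoding sketch in Python (purely illustrative, not Spark's actual compression code) shows the property the second assert checks:

```python
# Hypothetical illustration, not Spark's columnar compression code:
# any cache encoding must reproduce repeated values exactly.

def rle_encode(values):
    """Run-length encode a sequence into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    """Expand [value, count] pairs back into the original sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

# Mirrors the failing test: 1000 copies of the same string must survive
# a round trip through the encoding.
data = ["test"] * 1000
roundtrip = rle_decode(rle_encode(data))
assert roundtrip == data
assert len(roundtrip) == 1000
```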
[jira] [Resolved] (SPARK-1004) PySpark on YARN
[ https://issues.apache.org/jira/browse/SPARK-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1004. Resolution: Fixed Issue resolved by pull request 30 [https://github.com/apache/spark/pull/30] PySpark on YARN --- Key: SPARK-1004 URL: https://issues.apache.org/jira/browse/SPARK-1004 Project: Spark Issue Type: Sub-task Components: PySpark Reporter: Josh Rosen Assignee: Sandy Ryza Priority: Blocker Fix For: 1.0.0 This is for tracking progress on supporting YARN in PySpark. We might be able to use {{yarn-client}} mode (https://spark.incubator.apache.org/docs/latest/running-on-yarn.html#launch-spark-application-with-yarn-client-mode). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1626) Update Spark YARN docs to use spark-submit
[ https://issues.apache.org/jira/browse/SPARK-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1626. Resolution: Duplicate Update Spark YARN docs to use spark-submit -- Key: SPARK-1626 URL: https://issues.apache.org/jira/browse/SPARK-1626 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Reporter: Patrick Wendell Assignee: Sandy Ryza Priority: Blocker Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1492) running-on-yarn doc should use spark-submit script for examples
[ https://issues.apache.org/jira/browse/SPARK-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1492: --- Priority: Blocker (was: Major) running-on-yarn doc should use spark-submit script for examples --- Key: SPARK-1492 URL: https://issues.apache.org/jira/browse/SPARK-1492 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.0 Reporter: Thomas Graves Assignee: Sandy Ryza Priority: Blocker The spark-class script puts out lots of warnings telling users to use the spark-submit script with new options. We should update the running-on-yarn.md docs to have examples using the spark-submit script rather than spark-class. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1466) Pyspark doesn't check if gateway process launches correctly
[ https://issues.apache.org/jira/browse/SPARK-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1466: --- Fix Version/s: (was: 1.0.0) 1.0.1 Pyspark doesn't check if gateway process launches correctly --- Key: SPARK-1466 URL: https://issues.apache.org/jira/browse/SPARK-1466 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0, 0.9.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Blocker Fix For: 1.0.1 If the gateway process fails to start correctly (e.g., because JAVA_HOME isn't set correctly, there's no Spark jar, etc.), right now pyspark fails because of a very difficult-to-understand error, where we try to parse stdout to get the port where Spark started and there's nothing there. We should properly catch the error, print it to the user, and exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
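The proposed fix can be sketched as follows. This is illustrative code, not PySpark's actual `java_gateway` module: read the gateway's first stdout line before trying to parse a port from it, and surface the child's stderr when the line is missing.

```python
# Sketch of the proposed fix (names are illustrative, not PySpark's code):
# fail fast with a clear message if the gateway process dies before
# printing its port, instead of crashing while parsing empty stdout.
import subprocess

def launch_gateway(command):
    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    first_line = proc.stdout.readline()
    if not first_line:
        # The process exited (or printed nothing): show its stderr to the user.
        _, err = proc.communicate()
        raise RuntimeError(
            "Java gateway process exited before sending its port number:\n"
            + err.decode(errors="replace"))
    return int(first_line.strip())
```

With a healthy child process the port is parsed normally; with a broken one (bad JAVA_HOME, missing jar) the user sees the real error rather than an int-parse failure on an empty string.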
[jira] [Updated] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-922: -- Fix Version/s: (was: 1.0.0) 1.1.0 Update Spark AMI to Python 2.7 -- Key: SPARK-922 URL: https://issues.apache.org/jira/browse/SPARK-922 Project: Spark Issue Type: Task Components: EC2, PySpark Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Josh Rosen Priority: Blocker Fix For: 1.1.0 Many Python libraries only support Python 2.7+, so we should make Python 2.7 the default Python on the Spark AMIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-922: -- Priority: Major (was: Blocker) Update Spark AMI to Python 2.7 -- Key: SPARK-922 URL: https://issues.apache.org/jira/browse/SPARK-922 Project: Spark Issue Type: Task Components: EC2, PySpark Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Josh Rosen Fix For: 1.1.0 Many Python libraries only support Python 2.7+, so we should make Python 2.7 the default Python on the Spark AMIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985219#comment-13985219 ] Patrick Wendell commented on SPARK-922: --- This is no longer a blocker now that we've downgraded the python dependency, but would still be nice to have. Update Spark AMI to Python 2.7 -- Key: SPARK-922 URL: https://issues.apache.org/jira/browse/SPARK-922 Project: Spark Issue Type: Task Components: EC2, PySpark Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Josh Rosen Fix For: 1.1.0 Many Python libraries only support Python 2.7+, so we should make Python 2.7 the default Python on the Spark AMIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1682) LogisticRegressionWithSGD should support svmlight data and gradient descent w/o sampling
Dong Wang created SPARK-1682: Summary: LogisticRegressionWithSGD should support svmlight data and gradient descent w/o sampling Key: SPARK-1682 URL: https://issues.apache.org/jira/browse/SPARK-1682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Dong Wang Fix For: 1.0.0 The LogisticRegressionWithSGD example does not expose the following capabilities that already exist inside MLlib:
* reading svmlight data
* regularization with l1 and l2
* adding an intercept
* writing the model to a file
* reading a model and generating predictions
The GradientDescent optimizer does sampling before each gradient step. When the input data is already shuffled beforehand, it is possible to scan the data and take a gradient step for each data instance. This could potentially be more efficient. -- This message was sent by Atlassian JIRA (v6.2#6252)
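The "no sampling" idea in the last paragraph can be sketched in a few lines. This is an illustration of the technique, not MLlib's GradientDescent: assuming the input is already shuffled, make one sequential pass and take a logistic-loss gradient step per instance instead of sampling a minibatch per step.

```python
# Illustrative per-instance SGD for logistic regression (not MLlib code):
# sequential scan over pre-shuffled data, one gradient step per instance.
import math

def sgd_logistic(data, lr=0.1, epochs=5):
    """data: list of (label, [features]) with label in {0, 1}.
    Returns the learned weight vector."""
    w = [0.0] * len(data[0][1])
    for _ in range(epochs):
        for y, x in data:                      # scan, no sampling
            margin = sum(wi * xi for wi, xi in zip(w, x))
            pred = 1.0 / (1.0 + math.exp(-margin))
            grad = pred - y                    # logistic loss gradient factor
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
    return w

# Tiny linearly separable example: positive feature -> label 1.
data = [(1, [1.0]), (0, [-1.0]), (1, [2.0]), (0, [-2.0])]
w = sgd_logistic(data, lr=0.5, epochs=20)
assert w[0] > 0  # a positive weight separates the two classes
```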
[jira] [Updated] (SPARK-1644) The org.datanucleus:* should not be packaged into spark-assembly-*.jar
[ https://issues.apache.org/jira/browse/SPARK-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-1644: --- Summary: The org.datanucleus:* should not be packaged into spark-assembly-*.jar (was: hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") throws an exception) The org.datanucleus:* should not be packaged into spark-assembly-*.jar --- Key: SPARK-1644 URL: https://issues.apache.org/jira/browse/SPARK-1644 Project: Spark Issue Type: Bug Components: SQL Reporter: Guoqiang Li Assignee: Guoqiang Li Attachments: spark.log cat conf/hive-site.xml
{code:xml}
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://bj-java-hugedata1:7432/hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>passwd</value>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://host:8020/user/hive/warehouse</value>
  </property>
</configuration>
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1629) Spark should inline use of commons-lang `SystemUtils.IS_OS_WINDOWS`
[ https://issues.apache.org/jira/browse/SPARK-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985352#comment-13985352 ] Guoqiang Li commented on SPARK-1629: [The PR 569|https://github.com/apache/spark/pull/569] Spark should inline use of commons-lang `SystemUtils.IS_OS_WINDOWS` Key: SPARK-1629 URL: https://issues.apache.org/jira/browse/SPARK-1629 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li Priority: Minor Right now we use this but don't depend on it explicitly (which is wrong). We should probably just inline this function and remove the need to add a dependency. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1629) Spark should inline use of commons-lang `SystemUtils.IS_OS_WINDOWS`
[ https://issues.apache.org/jira/browse/SPARK-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1629. Resolution: Fixed Fix Version/s: 1.0.0 Thanks for this fix Spark should inline use of commons-lang `SystemUtils.IS_OS_WINDOWS` Key: SPARK-1629 URL: https://issues.apache.org/jira/browse/SPARK-1629 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li Priority: Minor Fix For: 1.0.0 Right now we use this but don't depend on it explicitly (which is wrong). We should probably just inline this function and remove the need to add a dependency. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1683) Display filesystem read statistics with each task
Patrick Wendell created SPARK-1683: -- Summary: Display filesystem read statistics with each task Key: SPARK-1683 URL: https://issues.apache.org/jira/browse/SPARK-1683 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Assignee: Kay Ousterhout Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1492) running-on-yarn doc should use spark-submit script for examples
[ https://issues.apache.org/jira/browse/SPARK-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985750#comment-13985750 ] Sandy Ryza commented on SPARK-1492: --- https://github.com/apache/spark/pull/601 running-on-yarn doc should use spark-submit script for examples --- Key: SPARK-1492 URL: https://issues.apache.org/jira/browse/SPARK-1492 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.0 Reporter: Thomas Graves Assignee: Sandy Ryza Priority: Blocker The spark-class script puts out lots of warnings telling users to use the spark-submit script with new options. We should update the running-on-yarn.md docs to have examples using the spark-submit script rather than spark-class. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1684) Merge script should standardize SPARK-XXX prefix
Patrick Wendell created SPARK-1684: -- Summary: Merge script should standardize SPARK-XXX prefix Key: SPARK-1684 URL: https://issues.apache.org/jira/browse/SPARK-1684 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Minor Fix For: 1.1.0 If users write [SPARK-XXX] Issue or SPARK-XXX. Issue or SPARK XXX: Issue we should convert it to SPARK XXX: Issue -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1684) Merge script should standardize SPARK-XXX prefix
[ https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985770#comment-13985770 ] Sean Owen commented on SPARK-1684: -- (Can I be pedantic and suggest standardizing to SPARK-XXX ? this is how issues are reported in other projects, like HIVE-123 and MAPREDUCE-234) Merge script should standardize SPARK-XXX prefix Key: SPARK-1684 URL: https://issues.apache.org/jira/browse/SPARK-1684 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Minor Fix For: 1.1.0 If users write [SPARK-XXX] Issue or SPARK-XXX. Issue or SPARK XXX: Issue we should convert it to SPARK XXX: Issue -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1683) Display filesystem read statistics with each task
[ https://issues.apache.org/jira/browse/SPARK-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985785#comment-13985785 ] Kay Ousterhout commented on SPARK-1683: --- I've already done this and will submit a PR this weekend Display filesystem read statistics with each task - Key: SPARK-1683 URL: https://issues.apache.org/jira/browse/SPARK-1683 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Assignee: Kay Ousterhout Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1684) Merge script should standardize SPARK-XXX prefix
[ https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1684: --- Description: If users write [SPARK-XXX] Issue or SPARK-XXX. Issue or SPARK XXX: Issue we should convert it to SPARK-XXX: Issue (was: If users write [SPARK-XXX] Issue or SPARK-XXX. Issue or SPARK XXX: Issue we should convert it to SPARK XXX: Issue) Merge script should standardize SPARK-XXX prefix Key: SPARK-1684 URL: https://issues.apache.org/jira/browse/SPARK-1684 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Minor Fix For: 1.1.0 If users write [SPARK-XXX] Issue or SPARK-XXX. Issue or SPARK XXX: Issue we should convert it to SPARK-XXX: Issue -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1684) Merge script should standardize SPARK-XXX prefix
[ https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985796#comment-13985796 ] Patrick Wendell commented on SPARK-1684: Just a typo in the description! SPARK-XXX is the correct format ala: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Merge script should standardize SPARK-XXX prefix Key: SPARK-1684 URL: https://issues.apache.org/jira/browse/SPARK-1684 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Minor Fix For: 1.1.0 If users write [SPARK-XXX] Issue or SPARK-XXX. Issue or SPARK XXX: Issue we should convert it to SPARK-XXX: Issue -- This message was sent by Atlassian JIRA (v6.2#6252)
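The normalization the ticket asks for is a small regex rewrite. A possible sketch (this is illustrative, not the actual merge script code) that maps the variants listed in the description onto the confirmed "SPARK-XXX: Issue" form:

```python
# Possible title normalization for the merge script (a sketch, not the
# actual merge script): map "[SPARK-123] Issue", "SPARK-123. Issue",
# and "SPARK 123: Issue" onto the canonical "SPARK-123: Issue".
import re

def standardize_title(title):
    m = re.match(r"^\[?SPARK[- ](\d+)\]?[.:]?\s*(.*)$", title)
    if not m:
        return title  # no JIRA prefix found; leave the title alone
    return "SPARK-%s: %s" % (m.group(1), m.group(2))

assert standardize_title("[SPARK-1684] Fix it") == "SPARK-1684: Fix it"
assert standardize_title("SPARK-1684. Fix it") == "SPARK-1684: Fix it"
assert standardize_title("SPARK 1684: Fix it") == "SPARK-1684: Fix it"
```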
[jira] [Created] (SPARK-1685) retryTimer not canceled on actor restart in Worker and AppClient
Mark Hamstra created SPARK-1685: --- Summary: retryTimer not canceled on actor restart in Worker and AppClient Key: SPARK-1685 URL: https://issues.apache.org/jira/browse/SPARK-1685 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Mark Hamstra Assignee: Mark Hamstra Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster when those Actors start. The attempt at registration is accomplished by starting a retryTimer via the Akka scheduler that will use the registered timeout interval and retry number to make repeated attempts to register with all known Masters before giving up and either marking as dead or calling System.exit. The receive methods of these actors can, however, throw exceptions, which will lead to the actors restarting, registerWithMaster being called again on restart, and another retryTimer being scheduled without canceling the already running retryTimer. Assuming that all of the rest of the restart logic is correct for these actors (which I don't believe is actually a given), having multiple retryTimers running presents at least a condition in which the restarted actor will not be able to make the full number of retry attempts before an earlier retryTimer takes the give up action. Canceling the retryTimer in the actor's postStop hook should suffice. -- This message was sent by Atlassian JIRA (v6.2#6252)
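The proposed fix ("cancel the retryTimer in postStop") can be sketched with Python's threading.Timer standing in for the Akka scheduler. The names here are illustrative, not Spark's actual Worker/AppClient code: keep a handle to the scheduled retry and cancel it in the stop hook, so a restart cannot leave two timers running.

```python
# Illustrative analogue of the fix (not Spark code): hold a reference to
# the retry timer and cancel it in the actor's stop hook, so restarting
# and re-registering never leaves a second timer running.
import threading

class RegistrationClient:
    def __init__(self):
        self.attempts = 0
        self._retry_timer = None

    def register_with_master(self, interval=0.2):
        def attempt():
            self.attempts += 1
            # Reschedule the next retry, as the Akka retryTimer does.
            self._retry_timer = threading.Timer(interval, attempt)
            self._retry_timer.start()
        attempt()

    def post_stop(self):
        # Without this cancel, a restarted actor calling
        # register_with_master again would run two timers concurrently.
        if self._retry_timer is not None:
            self._retry_timer.cancel()
```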
[jira] [Commented] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends
[ https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985880#comment-13985880 ] Diana Carroll commented on SPARK-823: - Yes, please clarify the documentation; I just ran into this. The Configuration guide (http://spark.apache.org/docs/latest/configuration.html) says the default is 8. In testing this on standalone Spark, there actually is no default value for the variable:
{code}
sc.getConf.contains("spark.default.parallelism")
res1: Boolean = false
{code}
It looks like if the variable is not set, then the default behavior is decided in code, e.g. Partitioner.scala:
{code}
if (rdd.context.conf.contains("spark.default.parallelism")) {
  new HashPartitioner(rdd.context.defaultParallelism)
} else {
  new HashPartitioner(bySize.head.partitions.size)
}
{code}
spark.default.parallelism's default is inconsistent across scheduler backends - Key: SPARK-823 URL: https://issues.apache.org/jira/browse/SPARK-823 Project: Spark Issue Type: Bug Components: Documentation, Spark Core Affects Versions: 0.8.0, 0.7.3 Reporter: Josh Rosen Priority: Minor The [0.7.3 configuration guide|http://spark-project.org/docs/latest/configuration.html] says that {{spark.default.parallelism}}'s default is 8, but the default is actually max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos scheduler, and {{threads}} for the local scheduler: https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150 Should this be clarified in the documentation? Should the Mesos scheduler backend's default be revised? -- This message was sent by Atlassian JIRA (v6.2#6252)
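The fallback logic from the Partitioner.scala snippet above, restated in Python for clarity (function and parameter names here are illustrative, not Spark's):

```python
# Illustrative restatement of the Partitioner.scala fallback (not Spark code):
# honor spark.default.parallelism when set; otherwise fall back to the
# largest parent RDD's partition count (bySize.head in the Scala snippet).
def default_partition_count(conf, default_parallelism, parent_partition_counts):
    if "spark.default.parallelism" in conf:
        return default_parallelism
    return max(parent_partition_counts)

# Unset: the 4-partition parent wins, regardless of cluster defaults.
assert default_partition_count({}, 2, [4, 1]) == 4
# Set: the configured value is used.
assert default_partition_count({"spark.default.parallelism": "8"}, 8, [4, 1]) == 8
```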
[jira] [Commented] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends
[ https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985934#comment-13985934 ] Diana Carroll commented on SPARK-823: - Okay, this is definitely more than a documentation bug, because PySpark and Scala work differently if spark.default.parallelism isn't set by the user. I'm testing using wordcount. PySpark: reduceByKey will use the value of sc.defaultParallelism. That value is set to the number of threads when running locally. On my Spark Standalone cluster, which has a single node with a single core, the value is 2. If I set spark.default.parallelism, it will set sc.defaultParallelism to that value and use that. Scala: reduceByKey will use the number of partitions in my file/map stage and ignore the value of sc.defaultParallelism. sc.defaultParallelism is set by the same logic as PySpark (number of threads for local, 2 for my microcluster); it is just ignored. I'm not sure which approach is correct. Scala works as described here: http://spark.apache.org/docs/latest/tuning.html {quote} Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc), and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config property spark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster. 
{quote} spark.default.parallelism's default is inconsistent across scheduler backends - Key: SPARK-823 URL: https://issues.apache.org/jira/browse/SPARK-823 Project: Spark Issue Type: Bug Components: Documentation, Spark Core Affects Versions: 0.8.0, 0.7.3 Reporter: Josh Rosen Priority: Minor The [0.7.3 configuration guide|http://spark-project.org/docs/latest/configuration.html] says that {{spark.default.parallelism}}'s default is 8, but the default is actually max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos scheduler, and {{threads}} for the local scheduler: https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150 Should this be clarified in the documentation? Should the Mesos scheduler backend's default be revised? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends
[ https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Diana Carroll updated SPARK-823: Component/s: PySpark spark.default.parallelism's default is inconsistent across scheduler backends - Key: SPARK-823 URL: https://issues.apache.org/jira/browse/SPARK-823 Project: Spark Issue Type: Bug Components: Documentation, PySpark, Spark Core Affects Versions: 0.8.0, 0.7.3, 0.9.1 Reporter: Josh Rosen Priority: Minor The [0.7.3 configuration guide|http://spark-project.org/docs/latest/configuration.html] says that {{spark.default.parallelism}}'s default is 8, but the default is actually max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos scheduler, and {{threads}} for the local scheduler: https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150 Should this be clarified in the documentation? Should the Mesos scheduler backend's default be revised? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends
[ https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Diana Carroll updated SPARK-823: Affects Version/s: 0.9.1 spark.default.parallelism's default is inconsistent across scheduler backends - Key: SPARK-823 URL: https://issues.apache.org/jira/browse/SPARK-823 Project: Spark Issue Type: Bug Components: Documentation, PySpark, Spark Core Affects Versions: 0.8.0, 0.7.3, 0.9.1 Reporter: Josh Rosen Priority: Minor The [0.7.3 configuration guide|http://spark-project.org/docs/latest/configuration.html] says that {{spark.default.parallelism}}'s default is 8, but the default is actually max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos scheduler, and {{threads}} for the local scheduler: https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150 Should this be clarified in the documentation? Should the Mesos scheduler backend's default be revised? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1685) retryTimer not canceled on actor restart in Worker and AppClient
[ https://issues.apache.org/jira/browse/SPARK-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamstra updated SPARK-1685: Description: Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster when those Actors start. The attempt at registration is accomplished by starting a retryTimer via the Akka scheduler that will use the registered timeout interval and retry number to make repeated attempts to register with all known Masters before giving up and either marking as dead or calling System.exit. The receive methods of these actors can, however, throw exceptions, which will lead to the actor restarting, registerWithMaster being called again on restart, and another retryTimer being scheduled without canceling the already running retryTimer. Assuming that all of the rest of the restart logic is correct for these actors (which I don't believe is actually a given), having multiple retryTimers running presents at least a condition in which the restarted actor will not be able to make the full number of retry attempts before an earlier retryTimer takes the give up action. Canceling the retryTimer in the actor's postStop hook should suffice. was: Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster when those Actors start. The attempt at registration is accomplished by starting a retryTimer via the Akka scheduler that will use the registered timeout interval and retry number to make repeated attempts to register with all known Masters before giving up and either marking as dead or calling System.exit. The receive methods of these actors can, however, throw exceptions, which will lead to the actors restarting, registerWithMaster being called again on restart, and another retryTimer being scheduled without canceling the already running retryTimer. 
Assuming that all of the rest of the restart logic is correct for these actors (which I don't believe is actually a given), having multiple retryTimers running presents at least a condition in which the restarted actor will not be able to make the full number of retry attempts before an earlier retryTimer takes the give up action. Canceling the retryTimer in the actor's postStop hook should suffice. retryTimer not canceled on actor restart in Worker and AppClient Key: SPARK-1685 URL: https://issues.apache.org/jira/browse/SPARK-1685 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Mark Hamstra Assignee: Mark Hamstra Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster when those Actors start. The attempt at registration is accomplished by starting a retryTimer via the Akka scheduler that will use the registered timeout interval and retry number to make repeated attempts to register with all known Masters before giving up and either marking as dead or calling System.exit. The receive methods of these actors can, however, throw exceptions, which will lead to the actor restarting, registerWithMaster being called again on restart, and another retryTimer being scheduled without canceling the already running retryTimer. Assuming that all of the rest of the restart logic is correct for these actors (which I don't believe is actually a given), having multiple retryTimers running presents at least a condition in which the restarted actor will not be able to make the full number of retry attempts before an earlier retryTimer takes the give up action. Canceling the retryTimer in the actor's postStop hook should suffice. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1685) retryTimer not canceled on actor restart in Worker and AppClient
[ https://issues.apache.org/jira/browse/SPARK-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamstra updated SPARK-1685: Description: Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster when those Actors start. The attempt at registration is accomplished by starting a retryTimer via the Akka scheduler that will use the registered timeout interval and retry number to make repeated attempts to register with all known Masters before giving up and either marking as dead or calling System.exit. The receive methods of these actors can, however, throw exceptions, which will lead to the actor restarting, registerWithMaster being called again on restart, and another retryTimer being scheduled without canceling the already running retryTimer. Assuming that all of the rest of the restart logic is correct for these actors (which I don't believe is actually a given), having multiple retryTimers running presents at least a condition in which the restarted actor may not be able to make the full number of retry attempts before an earlier retryTimer takes the give up action. Canceling the retryTimer in the actor's postStop hook should suffice. was: Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster when those Actors start. The attempt at registration is accomplished by starting a retryTimer via the Akka scheduler that will use the registered timeout interval and retry number to make repeated attempts to register with all known Masters before giving up and either marking as dead or calling System.exit. The receive methods of these actors can, however, throw exceptions, which will lead to the actor restarting, registerWithMaster being called again on restart, and another retryTimer being scheduled without canceling the already running retryTimer. 
Assuming that all of the rest of the restart logic is correct for these actors (which I don't believe is actually a given), having multiple retryTimers running presents at least a condition in which the restarted actor will not be able to make the full number of retry attempts before an earlier retryTimer takes the give up action. Canceling the retryTimer in the actor's postStop hook should suffice. retryTimer not canceled on actor restart in Worker and AppClient Key: SPARK-1685 URL: https://issues.apache.org/jira/browse/SPARK-1685 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Mark Hamstra Assignee: Mark Hamstra Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster when those Actors start. The attempt at registration is accomplished by starting a retryTimer via the Akka scheduler that will use the registered timeout interval and retry number to make repeated attempts to register with all known Masters before giving up and either marking as dead or calling System.exit. The receive methods of these actors can, however, throw exceptions, which will lead to the actor restarting, registerWithMaster being called again on restart, and another retryTimer being scheduled without canceling the already running retryTimer. Assuming that all of the rest of the restart logic is correct for these actors (which I don't believe is actually a given), having multiple retryTimers running presents at least a condition in which the restarted actor may not be able to make the full number of retry attempts before an earlier retryTimer takes the give up action. Canceling the retryTimer in the actor's postStop hook should suffice. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (SPARK-1620) Uncaught exception from Akka scheduler
[ https://issues.apache.org/jira/browse/SPARK-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamstra reopened SPARK-1620: - On further investigation, it looks like there really is a problem with exceptions thrown by scheduled functions not being caught by any uncaught exception handler. Uncaught exception from Akka scheduler -- Key: SPARK-1620 URL: https://issues.apache.org/jira/browse/SPARK-1620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 1.0.0 Reporter: Mark Hamstra Priority: Blocker I've been looking at this one in the context of a BlockManagerMaster that OOMs and doesn't respond to heartBeat(), but I suspect that there may be problems elsewhere where we use Akka's scheduler. The basic nature of the problem is that we are expecting exceptions thrown from a scheduled function to be caught in the thread where _ActorSystem_.scheduler.schedule() or scheduleOnce() has been called. In fact, the scheduled function runs on its own thread, so any exceptions that it throws are not caught in the thread that called schedule() -- e.g., unanswered BlockManager heartBeats (scheduled in BlockManager#initialize) that end up throwing exceptions in BlockManagerMaster#askDriverWithReply do not cause those exceptions to be handled by the Executor thread's UncaughtExceptionHandler. -- This message was sent by Atlassian JIRA (v6.2#6252)
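The failure mode Mark describes can be reproduced with plain Python threads as an analogy (not Akka): a try/except around the call that schedules the work cannot catch an exception raised later on the scheduler's own thread; only a handler installed for that thread sees it.

```python
# Analogy to the Akka scheduler problem (not Spark code): an exception
# thrown by a scheduled function escapes on the timer's thread, not in
# the thread that called schedule().
import threading

caught_in_caller = False
uncaught = []

# Record exceptions that escape on other threads (Python 3.8+).
threading.excepthook = lambda args: uncaught.append(args.exc_type)

def heartbeat():
    raise RuntimeError("master did not respond")

try:
    t = threading.Timer(0.01, heartbeat)  # heartbeat runs on t's own thread
    t.start()
    t.join()
except RuntimeError:
    caught_in_caller = True  # never reached: the raise happens on t's thread

assert not caught_in_caller
assert uncaught == [RuntimeError]
```

This mirrors the BlockManager heartBeat case: the exception from askDriverWithReply never reaches the Executor thread's UncaughtExceptionHandler, because it is raised on the scheduler thread.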
[jira] [Created] (SPARK-1686) Master switches thread when ElectedLeader
Mark Hamstra created SPARK-1686: --- Summary: Master switches thread when ElectedLeader Key: SPARK-1686 URL: https://issues.apache.org/jira/browse/SPARK-1686 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 1.0.0 Reporter: Mark Hamstra In deploy.master.Master, the completeRecovery method is the last thing to be called when a standalone Master is recovering from failure. It is responsible for resetting some state, relaunching drivers, and eventually resuming its scheduling duties. There are currently four places in Master.scala where completeRecovery is called. Three of them are from within the actor's receive method, and aren't problems. The last starts from within receive when the ElectedLeader message is received, but the actual completeRecovery() call is made from the Akka scheduler. That means that it will execute on a different scheduler thread, and Master itself will end up running (i.e., schedule() ) from that Akka scheduler thread. Among other things, that means that uncaught exception handling will be different -- https://issues.apache.org/jira/browse/SPARK-1620 -- This message was sent by Atlassian JIRA (v6.2#6252)
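One common remedy for this class of bug (sketched here with hypothetical Python names like `MiniMaster`; this is not Spark's actual patch) is to have the scheduled callback merely post a message back to the actor's mailbox, so that completeRecovery runs on the actor's own thread rather than the scheduler thread:

```python
import queue
import threading

class MiniMaster:
    """Hypothetical actor-like sketch: the timer thread only enqueues a
    message; the real work runs on the thread that drains the mailbox."""

    def __init__(self):
        self.mailbox = queue.Queue()
        self.recovery_thread = None

    def on_elected_leader(self):
        # A buggy variant would call self.complete_recovery() directly in
        # the timer callback, executing it on the scheduler's thread.
        threading.Timer(0.01, lambda: self.mailbox.put("CompleteRecovery")).start()

    def receive_one(self):
        if self.mailbox.get(timeout=1) == "CompleteRecovery":
            self.complete_recovery()  # runs on the receiving thread

    def complete_recovery(self):
        self.recovery_thread = threading.current_thread().name
```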
[jira] [Created] (SPARK-1687) Support NamedTuples in RDDs
Pat McDonough created SPARK-1687: Summary: Support NamedTuples in RDDs Key: SPARK-1687 URL: https://issues.apache.org/jira/browse/SPARK-1687 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Environment: Spark version 1.0.0-SNAPSHOT Python 2.7.5 Reporter: Pat McDonough Add Support for NamedTuples in RDDs. Some sample code is below, followed by the current error that comes from it. Based on a quick conversation with [~ahirreddy], [Dill|https://github.com/uqfoundation/dill] might be a good solution here. {code} In [26]: from collections import namedtuple ... In [33]: Person = namedtuple('Person', 'id firstName lastName') In [34]: jon = Person(1, 'Jon', 'Doe') In [35]: jane = Person(2, 'Jane', 'Doe') In [36]: theDoes = sc.parallelize((jon, jane)) In [37]: theDoes.collect() Out[37]: [Person(id=1, firstName='Jon', lastName='Doe'), Person(id=2, firstName='Jane', lastName='Doe')] In [38]: theDoes.count() PySpark worker failed with exception: PySpark worker failed with exception: Traceback (most recent call last): File /Users/pat/Projects/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 283, in func def func(s, iterator): return f(iterator) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 129, in load_stream yield self._read_with_length(stream) File 
/Users/pat/Projects/spark/python/pyspark/serializers.py, line 146, in _read_with_length return self.loads(obj) AttributeError: 'module' object has no attribute 'Person' Traceback (most recent call last): File /Users/pat/Projects/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 283, in func def func(s, iterator): return f(iterator) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 129, in load_stream yield self._read_with_length(stream) File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 146, in _read_with_length return self.loads(obj) AttributeError: 'module' object has no attribute 'Person' 14/04/30 14:43:53 ERROR Executor: Exception in task ID 23 org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /Users/pat/Projects/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 283, in func def func(s, iterator): return f(iterator) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in 
lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 129, in load_stream yield self._read_with_length(stream) File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 146, in _read_with_length return self.loads(obj) AttributeError: 'module' object has no attribute 'Person' at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:190) at org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:214) at
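The root cause is visible with nothing but the standard library: pickle serializes a namedtuple instance by reference to its class (module name plus class name), so the deserializing side must be able to re-import the class, and a class defined only in an interactive session cannot be resolved on a PySpark worker. Dill, mentioned above, instead serializes the class definition by value. A hedged sketch of the mechanics (the `__module__` reassignment is an artificial way to simulate the worker's missing namespace):

```python
import pickle
from collections import namedtuple

Person = namedtuple('Person', 'id firstName lastName')
jon = Person(1, 'Jon', 'Doe')

# Round-trips fine here: 'Person' is resolvable in this process.
assert pickle.loads(pickle.dumps(jon)) == jon

# Simulate the worker side: point the class at a module that has no
# 'Person' attribute, and pickling by reference can no longer work.
Person.__module__ = 'types'
failure = None
try:
    pickle.dumps(jon)
except Exception as exc:  # pickle.PicklingError in CPython
    failure = type(exc).__name__
```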
[jira] [Commented] (SPARK-1361) DAGScheduler throws NullPointerException occasionally
[ https://issues.apache.org/jira/browse/SPARK-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986174#comment-13986174 ] Kousuke Saruta commented on SPARK-1361: --- Hi [~scwf], are there any updates on this issue? Do you mind if I take over this issue? DAGScheduler throws NullPointerException occasionally - Key: SPARK-1361 URL: https://issues.apache.org/jira/browse/SPARK-1361 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: wangfei Fix For: 1.0.0 DAGScheduler throws this NullPointerException below occasionally when running Spark apps. java.lang.NullPointerException at org.apache.spark.scheduler.DAGScheduler.executorAdded(DAGScheduler.scala:186) at org.apache.spark.scheduler.TaskSchedulerImpl.executorAdded(TaskSchedulerImpl.scala:409) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:210) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:206) at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:206) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:130) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrEls -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1620) Uncaught exception from Akka scheduler
[ https://issues.apache.org/jira/browse/SPARK-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986180#comment-13986180 ] Mark Hamstra commented on SPARK-1620: - And one last instance: In scheduler.TaskSchedulerImpl#start(), checkSpeculatableTasks() is scheduled to be called every SPECULATION_INTERVAL. If checkSpeculatableTasks() throws an exception, that exception will not be caught and no more invocations of checkSpeculatableTasks() will occur. Uncaught exception from Akka scheduler -- Key: SPARK-1620 URL: https://issues.apache.org/jira/browse/SPARK-1620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 1.0.0 Reporter: Mark Hamstra Priority: Blocker I've been looking at this one in the context of a BlockManagerMaster that OOMs and doesn't respond to heartBeat(), but I suspect that there may be problems elsewhere where we use Akka's scheduler. The basic nature of the problem is that we are expecting exceptions thrown from a scheduled function to be caught in the thread where _ActorSystem_.scheduler.schedule() or scheduleOnce() has been called. In fact, the scheduled function runs on its own thread, so any exceptions that it throws are not caught in the thread that called schedule() -- e.g., unanswered BlockManager heartBeats (scheduled in BlockManager#initialize) that end up throwing exceptions in BlockManagerMaster#askDriverWithReply do not cause those exceptions to be handled by the Executor thread's UncaughtExceptionHandler. -- This message was sent by Atlassian JIRA (v6.2#6252)
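A defensive pattern for the checkSpeculatableTasks case (again a plain-Python sketch, not the actual Spark fix) is to wrap each scheduled invocation in a catch-all, so a single failure is recorded instead of silently ending the periodic schedule:

```python
import threading

def schedule_forever(fn, interval, errors, stop_event):
    """Re-arm a timer around a guarded call: an exception in fn is
    appended to errors instead of killing future invocations."""
    def tick():
        if stop_event.is_set():
            return
        try:
            fn()
        except Exception as exc:  # a real system would log this
            errors.append(type(exc).__name__)
        threading.Timer(interval, tick).start()
    tick()
```

With an unguarded body, the first exception would end the timer chain, which mirrors how one failed checkSpeculatableTasks() call stops all future invocations.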
[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986220#comment-13986220 ] Tathagata Das commented on SPARK-1645: -- This makes sense from the integration point of view. Though I wonder, from the POV of Flume's deployment configuration, does it make things more complex? For example, if someone has the Flume system already set up, in the current situation the configuration change to add a new sink seems standard and easy. However, in the proposed model, since Flume's data pushing node has to run a sink, how much more complicated does this configuration process get? Improve Spark Streaming compatibility with Flume Key: SPARK-1645 URL: https://issues.apache.org/jira/browse/SPARK-1645 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Currently the following issues affect Spark Streaming and Flume compatibility: * If a Spark worker goes down, it needs to be restarted on the same node, else Flume cannot send data to it. We can fix this by adding a Flume receiver that polls Flume, and a Flume sink that supports this. * Receiver sends acks to Flume before the driver knows about the data. The new receiver should also handle this case. * Data loss when driver goes down - This is true for any streaming ingest, not just Flume. I will file a separate jira for this and we should work on it there. This is a longer term project and requires considerable development work. I intend to start working on these soon. Any input is appreciated. (It'd be great if someone can add me as a contributor on jira, so I can assign the jira to myself). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1688) PySpark throws unhelpful exception when pyspark cannot be loaded
Andrew Or created SPARK-1688: Summary: PySpark throws unhelpful exception when pyspark cannot be loaded Key: SPARK-1688 URL: https://issues.apache.org/jira/browse/SPARK-1688 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 0.9.1 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1689) AppClient does not respond correctly to RemoveApplication
Aaron Davidson created SPARK-1689: - Summary: AppClient does not respond correctly to RemoveApplication Key: SPARK-1689 URL: https://issues.apache.org/jira/browse/SPARK-1689 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 1.0.0 Reporter: Aaron Davidson Assignee: Aaron Davidson When the Master removes an application (usually due to too many executor failures), it means no future executors will be assigned to that app. Currently, the AppClient just marks the application as disconnected, which is intended as a transient state during a period of reconnection. Thus, RemoveApplication just causes the application to enter a state where it has no executors and it doesn't die. -- This message was sent by Atlassian JIRA (v6.2#6252)
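The distinction the report draws can be sketched as a tiny state machine (hypothetical Python names, not Spark's AppClient code): Disconnected is a transient state that anticipates reconnection, while RemoveApplication must be treated as terminal:

```python
class MiniAppClient:
    """Hypothetical sketch: separate a transient disconnect from the
    terminal RemoveApplication message."""

    def __init__(self):
        self.state = "RUNNING"
        self.reason = None

    def on_disconnected(self):
        # Transient: the Master may come back, so keep waiting.
        if self.state == "RUNNING":
            self.state = "RECONNECTING"

    def on_remove_application(self, reason):
        # The reported bug effectively left the client reconnecting
        # forever; the correct response is a terminal state plus shutdown.
        self.state = "DEAD"
        self.reason = reason
```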
[jira] [Updated] (SPARK-1667) re-fetch fails occasionally
[ https://issues.apache.org/jira/browse/SPARK-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-1667: -- Summary: re-fetch fails occasionally (was: Should re-fetch when intermediate data for shuffle is lost) re-fetch fails occasionally --- Key: SPARK-1667 URL: https://issues.apache.org/jira/browse/SPARK-1667 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.0.0 Reporter: Kousuke Saruta I hit a case where re-fetch didn't occur although it should have. When intermediate data used for shuffle (a physical file on the local file system) is lost from an Executor, FileNotFoundException is thrown and re-fetch does not occur. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1690) RDD.saveAsTextFile throws scala.MatchError if RDD contains empty elements
Glenn K. Lockwood created SPARK-1690: Summary: RDD.saveAsTextFile throws scala.MatchError if RDD contains empty elements Key: SPARK-1690 URL: https://issues.apache.org/jira/browse/SPARK-1690 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0 Environment: Linux/CentOS6, Spark 0.9.1, standalone mode against HDFS from Hadoop 1.2.1 Reporter: Glenn K. Lockwood Priority: Minor The following pyspark code fails with a scala.MatchError exception if sample.txt contains any empty lines: file = sc.textFile('hdfs://gcn-3-45.ibnet0:54310/user/glock/sample.txt') file.saveAsTextFile('hdfs://gcn-3-45.ibnet0:54310/user/glock/sample.out') The resulting stack trace: 14/04/30 17:02:46 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/04/30 17:02:46 WARN scheduler.TaskSetManager: Loss was due to scala.MatchError scala.MatchError: 0 (of class java.lang.Integer) at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:129) at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:119) at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:112) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeToFile$1(PairRDDFunctions.scala:732) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:741) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:741) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109) at org.apache.spark.scheduler.Task.run(Task.scala:53) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41) at 
java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) This can be reproduced with a sample.txt that contains an empty line between foo and bar, and disappears if sample.txt contains only foo and bar with no empty line. -- This message was sent by Atlassian JIRA (v6.2#6252)
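Until the MatchError itself is fixed, one possible workaround is to filter empty elements out before saving. A plain-Python stand-in for that RDD operation (hedged sketch; `rdd` and `path` in the comment are illustrative names):

```python
# Workaround sketch, equivalent in spirit to:
#     rdd.filter(lambda line: line != "").saveAsTextFile(path)
lines = ["foo", "", "bar"]          # stands in for the RDD's contents
non_empty = [line for line in lines if line != ""]
```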
[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986275#comment-13986275 ] Hari Shreedharan commented on SPARK-1645: - Yes, it is better to add new methods rather than reusing the old ones and confusing existing users. In fact, I think we should add a new receiver for the time being and only deprecate the old one initially. We can remove the old one in a later release. Improve Spark Streaming compatibility with Flume Key: SPARK-1645 URL: https://issues.apache.org/jira/browse/SPARK-1645 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Currently the following issues affect Spark Streaming and Flume compatibility: * If a Spark worker goes down, it needs to be restarted on the same node, else Flume cannot send data to it. We can fix this by adding a Flume receiver that polls Flume, and a Flume sink that supports this. * Receiver sends acks to Flume before the driver knows about the data. The new receiver should also handle this case. * Data loss when driver goes down - This is true for any streaming ingest, not just Flume. I will file a separate jira for this and we should work on it there. This is a longer term project and requires considerable development work. I intend to start working on these soon. Any input is appreciated. (It'd be great if someone can add me as a contributor on jira, so I can assign the jira to myself). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986312#comment-13986312 ] Tathagata Das commented on SPARK-1645: -- I agree. A new receiver for the new API is the best way. Improve Spark Streaming compatibility with Flume Key: SPARK-1645 URL: https://issues.apache.org/jira/browse/SPARK-1645 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Currently the following issues affect Spark Streaming and Flume compatibility: * If a Spark worker goes down, it needs to be restarted on the same node, else Flume cannot send data to it. We can fix this by adding a Flume receiver that polls Flume, and a Flume sink that supports this. * Receiver sends acks to Flume before the driver knows about the data. The new receiver should also handle this case. * Data loss when driver goes down - This is true for any streaming ingest, not just Flume. I will file a separate jira for this and we should work on it there. This is a longer term project and requires considerable development work. I intend to start working on these soon. Any input is appreciated. (It'd be great if someone can add me as a contributor on jira, so I can assign the jira to myself). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1678) Compression loses repeated values.
[ https://issues.apache.org/jira/browse/SPARK-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986348#comment-13986348 ] Cheng Lian commented on SPARK-1678: --- Pull request: https://github.com/apache/spark/pull/608 Compression loses repeated values. -- Key: SPARK-1678 URL: https://issues.apache.org/jira/browse/SPARK-1678 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.0.0 Here's a test case: {code} test("all the same strings") { sparkContext.parallelize(1 to 1000).map(_ => StringData("test")).registerAsTable("test1000") assert(sql("SELECT * FROM test1000").count() === 1000) cacheTable("test1000") assert(sql("SELECT * FROM test1000").count() === 1000) } {code} First assert passes, second one fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
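The invariant the test above checks is that columnar compression must round-trip runs of identical values. A minimal run-length-encoding sketch in Python (not Spark's columnar compression code) makes the invariant concrete:

```python
def rle_encode(values):
    """Run-length encode a sequence into [(value, count), ...] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs

def rle_decode(runs):
    """Expand [(value, count), ...] back into the original sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out
```

A correct codec must satisfy `rle_decode(rle_encode(xs)) == xs` even when all 1000 elements are identical, which is precisely the case the failing assert exercises.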
[jira] [Commented] (SPARK-1679) In-Memory compression needs to be configurable.
[ https://issues.apache.org/jira/browse/SPARK-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986347#comment-13986347 ] Cheng Lian commented on SPARK-1679: --- Pull request: https://github.com/apache/spark/pull/608 In-Memory compression needs to be configurable. --- Key: SPARK-1679 URL: https://issues.apache.org/jira/browse/SPARK-1679 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.0.0 Since we are still finding bugs in the compression code I think we should make it configurable in SparkConf and turn it off by default for the 1.0 release. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1691) Support quoted arguments inside of spark-submit
[ https://issues.apache.org/jira/browse/SPARK-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1691: --- Description: Currently due to the way we send arguments on to spark-class, it doesn't work with quoted strings. For instance: {code} ./bin/spark-submit --name "My app" --spark-driver-extraJavaOptions "-Dfoo=x -Dbar=y" {code} was:Currently due to the way we send arguments on to spark-class, it doesn't work with quoted strings. Support quoted arguments inside of spark-submit --- Key: SPARK-1691 URL: https://issues.apache.org/jira/browse/SPARK-1691 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Patrick Wendell Fix For: 1.0.0 Currently due to the way we send arguments on to spark-class, it doesn't work with quoted strings. For instance: {code} ./bin/spark-submit --name "My app" --spark-driver-extraJavaOptions "-Dfoo=x -Dbar=y" {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1691) Support quoted arguments inside of spark-submit
Patrick Wendell created SPARK-1691: -- Summary: Support quoted arguments inside of spark-submit Key: SPARK-1691 URL: https://issues.apache.org/jira/browse/SPARK-1691 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Patrick Wendell Fix For: 1.0.0 Currently due to the way we send arguments on to spark-class, it doesn't work with quoted strings. -- This message was sent by Atlassian JIRA (v6.2#6252)
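The essence of the bug is shell word-splitting during argument forwarding: a quoted string like "My app" must survive as a single argv element. A small Python illustration using `shlex` (the command string reuses the report's own example verbatim, including its `--spark-driver-extraJavaOptions` flag):

```python
import shlex

cmd = './bin/spark-submit --name "My app" --spark-driver-extraJavaOptions "-Dfoo=x -Dbar=y"'

# Shell-style splitting keeps each quoted string as one argv element...
argv = shlex.split(cmd)

# ...while naive whitespace splitting (what re-expanding arguments
# without proper quoting effectively does) breaks them apart.
naive = cmd.split()
```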
[jira] [Updated] (SPARK-975) Spark Replay Debugger
[ https://issues.apache.org/jira/browse/SPARK-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-975: - Attachment: RDD DAG.png Drawback of GraphViz Spark Replay Debugger - Key: SPARK-975 URL: https://issues.apache.org/jira/browse/SPARK-975 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 0.9.0 Reporter: Cheng Lian Labels: arthur, debugger Attachments: RDD DAG.png The Spark debugger was first mentioned as {{rddbg}} in the [RDD technical report|http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf]. [Arthur|https://github.com/mesos/spark/tree/arthur], authored by [Ankur Dave|https://github.com/ankurdave], is an old implementation of the Spark debugger, which demonstrated both the elegance and power behind the RDD abstraction. Unfortunately, the corresponding GitHub branch was not merged into the master branch, and development on it stopped 2 years ago. For more information about Arthur, please refer to [the Spark Debugger Wiki page|https://github.com/mesos/spark/wiki/Spark-Debugger] in the old GitHub repository. As a useful tool for Spark application debugging and analysis, it would be nice to have a complete Spark debugger. In [PR-224|https://github.com/apache/incubator-spark/pull/224], I propose a new implementation of the Spark debugger, the Spark Replay Debugger (SRD). [PR-224|https://github.com/apache/incubator-spark/pull/224] is only a preview for discussion. In the current version, I have only implemented features that illustrate the basic mechanisms. There are still features that appeared in Arthur but are missing from SRD, such as checksum-based nondeterminism detection and single-task debugging with a conventional debugger (like {{jdb}}). However, these features can be easily built upon the current SRD framework. To minimize code review effort, I intentionally didn't include them in the current version. Attached is the visualization of the MLlib ALS application (with 1 iteration) generated by SRD. 
For more information, please refer to [the SRD overview document|http://spark-replay-debugger-overview.readthedocs.org/en/latest/]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1692) Enable external sorting in Spark SQL aggregates
Sudip Roy created SPARK-1692: Summary: Enable external sorting in Spark SQL aggregates Key: SPARK-1692 URL: https://issues.apache.org/jira/browse/SPARK-1692 Project: Spark Issue Type: Improvement Reporter: Sudip Roy Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)