[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates
[ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163921#comment-15163921 ]

Jim Lohse commented on SPARK-732:
---------------------------------

Affects Versions only goes to 1.1.0; presumably this is still an issue? Is it correct that this is only an issue in transformations, but that accumulator updates in actions will work correctly? That idea seems to be supported by the docs under https://spark.apache.org/docs/latest/programming-guide.html#accumulators-a-nameaccumlinka:

"In Java, Spark also supports the more general Accumulable interface to accumulate data where the resulting type is not the same as the elements added (e.g. build a list by collecting together elements). For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware that each task's update may be applied more than once if tasks or job stages are re-executed."

> Recomputation of RDDs may result in duplicated accumulator updates
> -------------------------------------------------------------------
>
>                 Key: SPARK-732
>                 URL: https://issues.apache.org/jira/browse/SPARK-732
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.6.2, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0, 0.8.1, 0.8.2, 0.9.0, 1.0.1, 1.1.0
>            Reporter: Josh Rosen
>            Assignee: Nan Zhu
>            Priority: Blocker
>
> Currently, Spark doesn't guard against duplicated updates to the same accumulator due to recomputations of an RDD. For example:
> {code}
> val acc = sc.accumulator(0)
> data.map(x => acc += 1; f(x))
> data.count()
> // acc should equal data.count() here
> data.foreach{...}
> // Now, acc = 2 * data.count() because the map() was recomputed.
> {code}
> I think that this behavior is incorrect, especially because it allows the addition or removal of a cache() call to affect the outcome of a computation.
> There's an old TODO to fix this duplicate update issue in the [DAGScheduler code|https://github.com/mesos/spark/blob/ec5e553b418be43aa3f0ccc24e0d5ca9d63504b2/core/src/main/scala/spark/scheduler/DAGScheduler.scala#L494].
> I haven't tested whether recomputation due to blocks being dropped from the cache can trigger duplicate accumulator updates.
> Hypothetically someone could be relying on the current behavior to implement performance counters that track the actual number of computations performed (including recomputations). To be safe, we could add an explicit warning in the release notes that documents the change in behavior when we fix this.
> Ignoring duplicate updates shouldn't be too hard, but there are a few subtleties. Currently, we allow accumulators to be used in multiple transformations, so we'd need to detect duplicate updates at the per-transformation level. I haven't dug too deeply into the scheduler internals, but we might also run into problems where pipelining causes what is logically one set of accumulator updates to show up in two different tasks (e.g. rdd.map(accum += x; ...) and rdd.map(accum += x; ...).count() may cause what's logically the same accumulator update to be applied from two different contexts, complicating the detection of duplicate updates).
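The transformation-vs-action distinction quoted above can be seen in a small program. The sketch below uses the Spark 1.x accumulator API in Scala; the object name, the local[2] master and the data sizes are illustrative assumptions, and whether the map stage is actually recomputed (and the accumulator therefore double-counted) depends on caching and scheduling, which is exactly the nondeterminism this issue describes.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorRecomputeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("accum-sketch").setMaster("local[2]"))
    val data = sc.parallelize(1 to 100)

    // Update inside a transformation: may be applied more than once if the
    // uncached map stage is re-executed by a later action.
    val mapAcc = sc.accumulator(0)
    val mapped = data.map { x => mapAcc += 1; x * 2 }
    mapped.count()                 // mapAcc.value is 100 here
    mapped.reduce(_ + _)           // map() recomputed: mapAcc.value may now be 200

    // Update inside an action: per the quoted docs, each task's update is
    // applied only once, even if the task is restarted.
    val actAcc = sc.accumulator(0)
    data.foreach(_ => actAcc += 1) // actAcc.value is 100
    sc.stop()
  }
}
{code}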
[jira] [Comment Edited] (SPARK-10643) Support HDFS application download in client mode spark submit
[ https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158873#comment-15158873 ]

Jim Lohse edited comment on SPARK-10643 at 2/23/16 1:39 PM:
------------------------------------------------------------

I have not taken the time to test this yet; does this report perhaps have a wider application to other deploy modes and job managers? There is this question on SO, http://stackoverflow.com/questions/28739729/spark-submit-not-working-when-application-jar-is-in-hdfs?rq=1, but there the user was trying to use an HDFS app-jar in local mode. If there's a specific bug on that I haven't found it. Thanks.

(also related? https://issues.apache.org/jira/browse/SPARK-8369)

was (Author: jiml):
I have not taken the time to test this yet, is this report perhaps having a wider application to other deploy modes and job managers? There is this question on SO, http://stackoverflow.com/questions/28739729/spark-submit-not-working-when-application-jar-is-in-hdfs?rq=1 but there user was trying to use HDFS app-jar in local mode. If there's a specific bug on that I haven't found it. Thanks.

> Support HDFS application download in client mode spark submit
> --------------------------------------------------------------
>
>                 Key: SPARK-10643
>                 URL: https://issues.apache.org/jira/browse/SPARK-10643
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Submit
>            Reporter: Alan Braithwaite
>            Priority: Minor
>
> When using mesos with docker and marathon, it would be nice to be able to make spark-submit deployable on marathon and have that download a jar from HDFS instead of having to package the jar with the docker image.
> {code}
> $ docker run -it docker.example.com/spark:latest /usr/local/spark/bin/spark-submit --class com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar
> Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>         at java.lang.Class.forName0(Native Method)
>         at java.lang.Class.forName(Class.java:348)
>         at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
>         at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
>         at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>         at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Although I'm aware that we can run in cluster mode with mesos, we've already built some nice tools surrounding marathon for logging and monitoring.
> Code in question:
> https://github.com/apache/spark/blob/132718ad7f387e1002b708b19e471d9cd907e105/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L723-L736
[jira] [Comment Edited] (SPARK-8369) Support dependency jar and files on HDFS in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158872#comment-15158872 ]

Jim Lohse edited comment on SPARK-8369 at 2/23/16 1:39 PM:
-----------------------------------------------------------

Question: I don't see that your spark-submit line explicitly uses standalone cluster mode; is that presumably done elsewhere? Could the clarity of this PR be improved if you add --deploy-mode cluster to your sample spark-submit where it currently says "..."?

I started looking at this because of http://stackoverflow.com/questions/28739729/spark-submit-not-working-when-application-jar-is-in-hdfs?rq=1, but there the user was trying to use an HDFS app-jar in local mode. If there's a specific bug on that I haven't found it. Thanks.

(Also related? https://issues.apache.org/jira/browse/SPARK-10643)

was (Author: jiml):
Question, I don't see your spark-submit line is explicitly using standalone cluster mode, is that presumable done elsewhere? Could the clarity of this PR be improved if you add --deploy-mode cluster to your sample spark-submit where it says ... currently ? I started looking at this because of http://stackoverflow.com/questions/28739729/spark-submit-not-working-when-application-jar-is-in-hdfs?rq=1 but there user was trying to use HDFS app-jar in local mode. If there's a specific bug on that I haven't found it. Thanks.

> Support dependency jar and files on HDFS in standalone cluster mode
> --------------------------------------------------------------------
>
>                 Key: SPARK-8369
>                 URL: https://issues.apache.org/jira/browse/SPARK-8369
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Dong Lei
>
> Currently, in standalone cluster mode, Spark can take care of the app-jar whether it is specified by file:// or hdfs://, but the dependencies specified by --jars and --files do not support an hdfs:// prefix.
> For example:
> spark-submit
>   ...
>   --jars hdfs://path1/1.jar hdfs://path2/2.jar
>   --files hdfs://path3/3.file hdfs://path4/4.file
>   hdfs://path5/app.jar
> Only app.jar will be downloaded to the driver and distributed to executors; the others (1.jar, 2.jar, 3.file, 4.file) will not. I think such a feature is useful for users.
> To support such a feature, I think we can treat the jars and files like the app jar in DriverRunner: we download them and replace the remote addresses with local addresses, and the DriverWrapper will not be aware. The problem is that it's not as easy to replace these addresses as it is to replace the location of the app jar, because we only have a placeholder for the app jar ("<>"). We may need to do some string matching to achieve it.
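Until --jars and --files understand hdfs:// in standalone cluster mode, one partial workaround that is sometimes suggested (untested here; the paths are the placeholders from the description above) is to register the dependencies from inside the application itself. Note that this only makes executors fetch the files when tasks run; it does not put 1.jar on the driver's classpath, which is the part this issue actually asks for.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object HdfsDepsWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-deps-sketch"))

    // Executors download these from HDFS at task launch, so task code can
    // use them; the driver's own classpath is unchanged.
    sc.addJar("hdfs://path1/1.jar")
    sc.addJar("hdfs://path2/2.jar")
    sc.addFile("hdfs://path3/3.file")
    sc.addFile("hdfs://path4/4.file")

    // ... application logic ...
    sc.stop()
  }
}
{code}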
[jira] [Commented] (SPARK-10643) Support HDFS application download in client mode spark submit
[ https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158873#comment-15158873 ]

Jim Lohse commented on SPARK-10643:
-----------------------------------

I have not taken the time to test this yet; does this report perhaps have a wider application to other deploy modes and job managers? There is this question on SO, http://stackoverflow.com/questions/28739729/spark-submit-not-working-when-application-jar-is-in-hdfs?rq=1, but there the user was trying to use an HDFS app-jar in local mode. If there's a specific bug on that I haven't found it. Thanks.

> Support HDFS application download in client mode spark submit
> --------------------------------------------------------------
>
>                 Key: SPARK-10643
>                 URL: https://issues.apache.org/jira/browse/SPARK-10643
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Submit
>            Reporter: Alan Braithwaite
>            Priority: Minor
>
> When using mesos with docker and marathon, it would be nice to be able to make spark-submit deployable on marathon and have that download a jar from HDFS instead of having to package the jar with the docker image.
> {code}
> $ docker run -it docker.example.com/spark:latest /usr/local/spark/bin/spark-submit --class com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar
> Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>         at java.lang.Class.forName0(Native Method)
>         at java.lang.Class.forName(Class.java:348)
>         at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
>         at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
>         at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>         at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Although I'm aware that we can run in cluster mode with mesos, we've already built some nice tools surrounding marathon for logging and monitoring.
> Code in question:
> https://github.com/apache/spark/blob/132718ad7f387e1002b708b19e471d9cd907e105/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L723-L736
[jira] [Commented] (SPARK-8369) Support dependency jar and files on HDFS in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158872#comment-15158872 ]

Jim Lohse commented on SPARK-8369:
----------------------------------

Question: I don't see that your spark-submit line explicitly uses standalone cluster mode; is that presumably done elsewhere? Could the clarity of this PR be improved if you add --deploy-mode cluster to your sample spark-submit where it currently says "..."?

I started looking at this because of http://stackoverflow.com/questions/28739729/spark-submit-not-working-when-application-jar-is-in-hdfs?rq=1, but there the user was trying to use an HDFS app-jar in local mode. If there's a specific bug on that I haven't found it. Thanks.

> Support dependency jar and files on HDFS in standalone cluster mode
> --------------------------------------------------------------------
>
>                 Key: SPARK-8369
>                 URL: https://issues.apache.org/jira/browse/SPARK-8369
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Dong Lei
>
> Currently, in standalone cluster mode, Spark can take care of the app-jar whether it is specified by file:// or hdfs://, but the dependencies specified by --jars and --files do not support an hdfs:// prefix.
> For example:
> spark-submit
>   ...
>   --jars hdfs://path1/1.jar hdfs://path2/2.jar
>   --files hdfs://path3/3.file hdfs://path4/4.file
>   hdfs://path5/app.jar
> Only app.jar will be downloaded to the driver and distributed to executors; the others (1.jar, 2.jar, 3.file, 4.file) will not. I think such a feature is useful for users.
> To support such a feature, I think we can treat the jars and files like the app jar in DriverRunner: we download them and replace the remote addresses with local addresses, and the DriverWrapper will not be aware. The problem is that it's not as easy to replace these addresses as it is to replace the location of the app jar, because we only have a placeholder for the app jar ("<>"). We may need to do some string matching to achieve it.
[jira] [Commented] (SPARK-1792) Missing Spark-Shell Configure Options
[ https://issues.apache.org/jira/browse/SPARK-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15120248#comment-15120248 ]

Jim Lohse commented on SPARK-1792:
----------------------------------

Was there any movement on this? I am seeing an unresolved status and a date of 2014. Does SPARK_SUBMIT_OPTS really exist, and is it not deprecated? Thanks.

BACKGROUND, FWIW: as a newb to Spark for a month or two, I didn't realize there was a SPARK_SUBMIT_OPTS that can be set in spark-env.sh. I found this in this blog page: https://abrv8.wordpress.com/2014/10/06/debugging-java-spark-applications/

That blog page was linked from this SO question: http://stackoverflow.com/questions/29090745/debugging-spark-applications

> Missing Spark-Shell Configure Options
> --------------------------------------
>
>                 Key: SPARK-1792
>                 URL: https://issues.apache.org/jira/browse/SPARK-1792
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, Spark Core
>            Reporter: Joseph E. Gonzalez
>
> The `conf/spark-env.sh.template` does not have configure options for the spark shell. For example, to enable Kryo for GraphX when using the spark shell in standalone mode it appears you must add:
> {code}
> SPARK_SUBMIT_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer "
> SPARK_SUBMIT_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator "
> {code}
> However SPARK_SUBMIT_OPTS is not documented anywhere. Perhaps the spark-shell should have its own options (e.g., SPARK_SHELL_OPTS).
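For a compiled application (as opposed to the spark-shell case this issue is about), the same Kryo settings can be supplied through SparkConf without touching spark-env.sh; spark-shell also accepts the same keys via --conf key=value. A minimal sketch, with the app name assumed:

{code}
import org.apache.spark.{SparkConf, SparkContext}

object KryoConfSketch {
  def main(args: Array[String]): Unit = {
    // Same properties the SPARK_SUBMIT_OPTS example above sets via -D flags.
    val conf = new SparkConf()
      .setAppName("kryo-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "org.apache.spark.graphx.GraphKryoRegistrator")
    val sc = new SparkContext(conf)
    // ... GraphX work that benefits from Kryo serialization ...
    sc.stop()
  }
}
{code}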
[jira] [Commented] (SPARK-11570) ambiguous hostname resolving during startup
[ https://issues.apache.org/jira/browse/SPARK-11570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097274#comment-15097274 ]

Jim Lohse commented on SPARK-11570:
-----------------------------------

[~srowen] Do you mean SPARK_LOCAL_IP? I don't see SPARK_LOCAL_HOSTNAME in spark-env.sh. Thanks; if it exists and needs to be added to spark-env.sh, I will submit a pull request.

> ambiguous hostname resolving during startup
> --------------------------------------------
>
>                 Key: SPARK-11570
>                 URL: https://issues.apache.org/jira/browse/SPARK-11570
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
>         Environment: standalone
>            Reporter: Sergey Soldatov
>
> When the master is running in standalone mode, it expects the hostname to be provided by --host. This is done by the start-master.sh script, which sets --ip to `hostname` (doesn't it look weird?).
> If someone runs the master directly (without start-master.sh), the hostname will be initialized as var host = Utils.localHostName(). Before SPARK-6440 that was exactly the host name where the master was started, but now it returns an ip address instead. That would lead to worker connectivity problems (like you may observe in BIGTOP-2113). If ip addresses are prohibited, then var host should not be initialized in that way.
> Possible solutions: return the logic to set it to the host name, or fail if no --host argument was provided.
[jira] [Commented] (SPARK-11570) ambiguous hostname resolving during startup
[ https://issues.apache.org/jira/browse/SPARK-11570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097276#comment-15097276 ]

Jim Lohse commented on SPARK-11570:
-----------------------------------

Do you mean SPARK_LOCAL_IP? I don't see SPARK_LOCAL_HOSTNAME in spark-env.sh. Thanks; if it exists and needs to be added to spark-env.sh, I will submit a pull request.

> ambiguous hostname resolving during startup
> --------------------------------------------
>
>                 Key: SPARK-11570
>                 URL: https://issues.apache.org/jira/browse/SPARK-11570
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
>         Environment: standalone
>            Reporter: Sergey Soldatov
>
> When the master is running in standalone mode, it expects the hostname to be provided by --host. This is done by the start-master.sh script, which sets --ip to `hostname` (doesn't it look weird?).
> If someone runs the master directly (without start-master.sh), the hostname will be initialized as var host = Utils.localHostName(). Before SPARK-6440 that was exactly the host name where the master was started, but now it returns an ip address instead. That would lead to worker connectivity problems (like you may observe in BIGTOP-2113). If ip addresses are prohibited, then var host should not be initialized in that way.
> Possible solutions: return the logic to set it to the host name, or fail if no --host argument was provided.
[jira] [Issue Comment Deleted] (SPARK-11570) ambiguous hostname resolving during startup
[ https://issues.apache.org/jira/browse/SPARK-11570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Lohse updated SPARK-11570:
------------------------------
    Comment: was deleted

(was: [~srowen] Do you mean SPARK_LOCAL_IP? I don't see SPARK_LOCAL_HOSTNAME in spark-env.sh. Thanks, if it exists and needs to be added to spark-env.sh I will submit a pull request.)

> ambiguous hostname resolving during startup
> --------------------------------------------
>
>                 Key: SPARK-11570
>                 URL: https://issues.apache.org/jira/browse/SPARK-11570
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
>         Environment: standalone
>            Reporter: Sergey Soldatov
>
> When the master is running in standalone mode, it expects the hostname to be provided by --host. This is done by the start-master.sh script, which sets --ip to `hostname` (doesn't it look weird?).
> If someone runs the master directly (without start-master.sh), the hostname will be initialized as var host = Utils.localHostName(). Before SPARK-6440 that was exactly the host name where the master was started, but now it returns an ip address instead. That would lead to worker connectivity problems (like you may observe in BIGTOP-2113). If ip addresses are prohibited, then var host should not be initialized in that way.
> Possible solutions: return the logic to set it to the host name, or fail if no --host argument was provided.
[jira] [Commented] (SPARK-12528) Make Apache Spark’s gateway hidden REST API (in standalone cluster mode) public API
[ https://issues.apache.org/jira/browse/SPARK-12528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076572#comment-15076572 ]

Jim Lohse commented on SPARK-12528:
-----------------------------------

This PR is relevant: https://issues.apache.org/jira/browse/SPARK-12528 "Provide a stable application submission gateway in standalone cluster mode"

> Make Apache Spark's gateway hidden REST API (in standalone cluster mode) public API
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-12528
>                 URL: https://issues.apache.org/jira/browse/SPARK-12528
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy
>    Affects Versions: 2.0.0
>            Reporter: Youcef HILEM
>            Priority: Minor
>
> Spark has a hidden REST API which handles application submission, status checking and cancellation (https://issues.apache.org/jira/browse/SPARK-5388).
> There is enough interest in using this API to justify making it public:
> - https://github.com/ywilkof/spark-jobs-rest-client
> - https://github.com/yohanliyanage/jenkins-spark-deploy
> - https://github.com/spark-jobserver/spark-jobserver
> - http://stackoverflow.com/questions/28992802/triggering-spark-jobs-with-rest
> - http://stackoverflow.com/questions/34225879/how-to-submit-a-job-via-rest-api
> - http://arturmkrtchyan.com/apache-spark-hidden-rest-api
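For reference, the last link above (the arturmkrtchyan.com post) describes the hidden gateway as a JSON POST to port 6066 of the standalone master. The sketch below only illustrates that description: the host, jar path, class name and Spark version are placeholders, and the exact endpoint and field names are an internal, version-dependent detail, which is precisely why this issue asks for a public, stable API.

{code}
import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}

object RestSubmitSketch {
  def main(args: Array[String]): Unit = {
    // Request body modelled on the blog post linked above; all values are placeholders.
    val payload =
      """{
        |  "action": "CreateSubmissionRequest",
        |  "appResource": "hdfs://hdfs/tmp/application.jar",
        |  "clientSparkVersion": "1.6.0",
        |  "mainClass": "com.example.spark.streaming.EventHandler",
        |  "appArgs": [],
        |  "environmentVariables": { "SPARK_ENV_LOADED": "1" },
        |  "sparkProperties": {
        |    "spark.app.name": "EventHandler",
        |    "spark.master": "spark://master-host:6066",
        |    "spark.jars": "hdfs://hdfs/tmp/application.jar",
        |    "spark.submit.deployMode": "cluster"
        |  }
        |}""".stripMargin

    // POST the submission request to the standalone master's REST port.
    val conn = new URL("http://master-host:6066/v1/submissions/create")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json;charset=UTF-8")
    conn.setDoOutput(true)
    val out = new OutputStreamWriter(conn.getOutputStream, "UTF-8")
    out.write(payload)
    out.close()
    // A successful submission returns HTTP 200 and a JSON body with a submissionId.
    println(s"HTTP ${conn.getResponseCode}")
  }
}
{code}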