Re: Configuring Spark Memory
So this is good information for standalone mode, but how is memory distributed within Mesos? There's coarse-grained mode, where the executor stays active, and fine-grained mode, where each task appears to run as its own process in Mesos. How do memory allocations work in these two cases? Thanks!

On Thu, Jul 24, 2014 at 12:14 PM, Martin Goodson mar...@skimlinks.com wrote:
Great - thanks for the clarification, Aaron. The offer stands for me to write some documentation and an example that covers this without leaving *any* room for ambiguity.
-- Martin Goodson | VP Data Science, (0)20 3397 1240

On Thu, Jul 24, 2014 at 6:09 PM, Aaron Davidson ilike...@gmail.com wrote:
Whoops, I was mistaken in my original post last year. By default, there is one executor per node per SparkContext, as you said. spark.executor.memory is the amount of memory that the application requests for each of its executors. SPARK_WORKER_MEMORY is the amount of memory a Spark Worker is willing to allocate to executors. So if you were to set SPARK_WORKER_MEMORY to 8g everywhere on your cluster, and spark.executor.memory to 4g, you would be able to run two simultaneous SparkContexts, each getting 4g per node. Similarly, if spark.executor.memory were 8g, you could only run one SparkContext at a time on the cluster, but it would get all the cluster's memory.

On Thu, Jul 24, 2014 at 7:25 AM, Martin Goodson mar...@skimlinks.com wrote:
Thank you, Nishkam, I have read your code. So, for the sake of my understanding, it seems that for each SparkContext there is one executor per node? Can anyone confirm this?
-- Martin Goodson | VP Data Science, (0)20 3397 1240

On Thu, Jul 24, 2014 at 6:12 AM, Nishkam Ravi nr...@cloudera.com wrote:
See if this helps: https://github.com/nishkamravi2/SparkAutoConfig/
It's a very simple tool for auto-configuring default parameters in Spark. It takes as input high-level parameters (like number of nodes, cores per node, memory per node, etc.) and spits out a default configuration, user advice, and a command line. Compile (javac SparkConfigure.java) and run (java SparkConfigure). Also cc'ing dev in case others are interested in helping evolve this over time (by refining the heuristics and adding more parameters).

On Wed, Jul 23, 2014 at 8:31 AM, Martin Goodson mar...@skimlinks.com wrote:
Thanks, Andrew. So if there is only one SparkContext, there is only one executor per machine? This seems to contradict Aaron's message from the link above:

"If each machine has 16 GB of RAM and 4 cores, for example, you might set spark.executor.memory between 2 and 3 GB, totaling 8-12 GB used by Spark."

Am I reading this incorrectly? Anyway, our configuration is 21 machines (one master and 20 slaves), each with 60 GB. We would like to use 4 cores per machine. This is PySpark, so we want to leave, say, 16 GB on each machine for Python processes. Thanks again for the advice!
-- Martin Goodson | VP Data Science, (0)20 3397 1240

On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com wrote:
Hi Martin,
In standalone mode, each SparkContext you initialize gets its own set of executors across the cluster. So, for example, if you have two shells open, there will be two executor JVMs on each worker machine in the cluster (one per context).
As far as the other docs go, you can configure the total number of cores requested for the SparkContext, the amount of memory for the executor JVM on each machine, the amount of memory for the Master/Worker daemons (little is needed, since the actual work happens in executors), and several other settings. Which of those are you interested in? What hardware spec do you have, and how do you want to configure it?
Andrew

On Wed, Jul 23, 2014 at 6:10 AM, Martin Goodson mar...@skimlinks.com wrote:
We are having difficulties configuring Spark, partly because we still don't understand some key concepts. For instance, how many executors are there per machine in standalone mode? This is after having closely read the documentation several times:

http://spark.apache.org/docs/latest/configuration.html
http://spark.apache.org/docs/latest/spark-standalone.html
http://spark.apache.org/docs/latest/tuning.html
http://spark.apache.org/docs/latest/cluster-overview.html

The cluster overview has some information about executors, but it is ambiguous about whether there is a single executor or multiple executors on each machine. The message from Aaron Davidson quoted above implies that the executor memory should be set to the total available memory on the machine divided by the number of cores.
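Putting Aaron's explanation together with the cluster Martin describes (21 machines, 60 GB and 4 cores each, ~16 GB reserved for Python), a minimal sketch of the resulting standalone configuration might look like the following. The specific numbers are illustrative assumptions, not recommendations from this thread:

    # spark-env.sh on each worker (standalone mode) -- hypothetical values
    # Leave ~16 GB of the 60 GB machine for Python workers and the OS.
    SPARK_WORKER_MEMORY=44g     # memory the Worker may hand out to executors
    SPARK_WORKER_CORES=4        # cores the Worker may hand out

    # Per-application setting (e.g. in spark-defaults.conf):
    # one executor per node per SparkContext, so a single context's
    # executor can take everything the Worker offers...
    spark.executor.memory 44g
    # ...while two simultaneous contexts would each need half:
    # spark.executor.memory 22g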
Re: Suggestion for SPARK-1825
I have a similar issue with SPARK-1767. There are basically three ways to resolve the issue:

1. Use reflection to access classes newer than 0.21 (or whatever the oldest version of Hadoop is that Spark supports).
2. Add a build variant (in Maven this would be a profile) that deals with this.
3. Auto-detect which classes are available and use those.

#1 is the easiest for end users, but it can lead to some ugly code. #2 makes the code look nicer but requires some effort on the part of people building Spark. It can also lead to headaches for IDEs if people don't remember to select the new profile. (For example, in IntelliJ, you can't see any of the YARN classes when you import the project from Maven without the YARN profile selected.) #3 is something that... I don't know how to do in sbt or Maven. I've been told that an antrun task might work here, but it seems like it could get really tricky. Overall, I'd lean more towards #2 here.
best,
Colin

On Tue, Jul 22, 2014 at 12:47 AM, innowireless TaeYun Kim taeyun@innowireless.co.kr wrote:
(I'm resending this mail since it seems it was not sent. Sorry if this was already sent.)
Hi,
A couple of months ago, I made a pull request to fix https://issues.apache.org/jira/browse/SPARK-1825. My pull request is here: https://github.com/apache/spark/pull/899
But that pull request has problems:
- It is Hadoop 2.4.0+ only. It won't compile on versions below that.
- The related Hadoop API is marked as '@Unstable'.
Here is an idea to remedy the problems: a new Spark configuration variable, perhaps named spark.yarn.submit.crossplatform. If it is set to true (default: false), the related Spark code can use hard-coded strings that are the same as those the Hadoop API provides, thus avoiding the compile error on Hadoop versions below 2.4.0. Can someone implement this feature, if this idea is acceptable? Currently my knowledge of the Spark source code and Scala is too limited to implement it myself. For the right person, the modification should be trivial. You can refer to the source code changes in my pull request. Thanks.
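For what it's worth, a minimal sketch of the proposed flag, under the assumptions in the message above (the property name comes from the proposal; the "<CPS>" literal mirrors the platform-agnostic classpath separator that the @Unstable Hadoop 2.4.0+ API returns, hard-coded so the code still compiles against older Hadoop versions):

    import org.apache.spark.SparkConf

    // Hypothetical sketch of spark.yarn.submit.crossplatform -- not a tested patch.
    val sparkConf = new SparkConf()
    val classPathSeparator =
      if (sparkConf.getBoolean("spark.yarn.submit.crossplatform", false))
        "<CPS>"                        // expanded by YARN on the cluster side
      else
        java.io.File.pathSeparator     // client-platform separator (";" on Windows)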
Kryo Issue on Spark 1.0.1, Mesos 0.18.2
After upgrading to Spark 1.0.1 from 0.9.1, everything seemed to be going well. Then, looking at the Mesos slave logs, I noticed:

    ERROR KryoSerializer: Failed to run spark.kryo.registrator
    java.lang.ClassNotFoundException: com/mediacrossing/verrazano/kryo/MxDataRegistrator

My spark-env.sh has the following when I run the Spark shell:

    export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
    export MASTER=mesos://zk://n-01:2181,n-02:2181,n-03:2181/masters
    export ADD_JARS=/opt/spark/mx-lib/verrazano-assembly.jar
    # -XX:+UseCompressedOops must be disabled to use more than 32GB RAM
    SPARK_JAVA_OPTS="-Xss2m -XX:+UseCompressedOops -Dspark.local.dir=/opt/mesos-tmp -Dspark.executor.memory=4g -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -Dspark.kryo.registrator=com.mediacrossing.verrazano.kryo.MxDataRegistrator -Dspark.kryoserializer.buffer.mb=16 -Dspark.akka.askTimeout=30"

I was able to verify that our custom jar was being copied to each worker, but for some reason it is not finding my registrator class. Is anyone else struggling with Kryo on the 1.0.x branch?
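For reference, a registrator for Spark 1.0.x looks roughly like the sketch below (the class and package names are stand-ins for MX's actual ones). The registrator runs inside each executor, so one common cause of this ClassNotFoundException is the jar not being visible to the executor's classloader, rather than a problem with the registrator itself:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    // Hypothetical stand-in for com.mediacrossing.verrazano.kryo.MxDataRegistrator
    class MxDataRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // Register every class that crosses the wire or is cached with Kryo.
        kryo.register(classOf[Array[Byte]])
        // kryo.register(classOf[MxRecord])  // your domain classes here
      }
    }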
Re: GraphX graph partitioning strategy
Hi Larry,
GraphX's graph constructor leaves the edges in their original partitions by default. To support arbitrary multipass graph partitioning, one idea is to take advantage of that by partitioning the graph externally to GraphX (though possibly using information from GraphX, such as the degrees), then passing the partitioned edges to GraphX. For example, if you had an edge-partitioning function that needed the full triplet to assign a partition, you could do this as follows:

    val unpartitionedGraph: Graph[Int, Int] = ...
    val numPartitions: Int = 128
    def getTripletPartition(e: EdgeTriplet[Int, Int]): PartitionID = ...

    // Get the triplets using GraphX, then use Spark to repartition them
    val partitionedEdges = unpartitionedGraph.triplets
      .map(e => (getTripletPartition(e), e))
      .partitionBy(new HashPartitioner(numPartitions))

    val partitionedGraph = Graph(unpartitionedGraph.vertices, partitionedEdges)

A multipass partitioning algorithm could store its results in the edge attribute, and then you could use the code above to do the partitioning.
Ankur http://www.ankurdave.com/

On Wed, Jul 23, 2014 at 11:59 PM, Larry Xiao xia...@sjtu.edu.cn wrote:
Hi all,
I'm implementing graph partitioning strategies for GraphX, learning from research on graph computing. I have two questions:
- A specific implementation question: in the current design, only the vertex IDs of src and dst are provided (PartitionStrategy.scala), while some strategies require knowledge about the graph (like degrees) and can consist of more than one pass to finally produce the partition ID. So I'm changing the PartitionStrategy.getPartition API to provide more info, but I don't want to make it complex (the current one looks very clean).
- An open question: what advice would you give on partitioning, considering the procedure Spark adopts for graph processing?
Any advice is much appreciated.
Best Regards,
Larry Xiao

References:
Bipartite-oriented Distributed Graph Partitioning for Big Learning.
PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs
Re: GraphX graph partitioning strategy
Oops, the code should be (the change is the final .map, which converts the triplets back to edges):

    val unpartitionedGraph: Graph[Int, Int] = ...
    val numPartitions: Int = 128
    def getTripletPartition(e: EdgeTriplet[Int, Int]): PartitionID = ...

    // Get the triplets using GraphX, then use Spark to repartition them
    val partitionedEdges = unpartitionedGraph.triplets
      .map(e => (getTripletPartition(e), e))
      .partitionBy(new HashPartitioner(numPartitions))
      .map(pair => Edge(pair._2.srcId, pair._2.dstId, pair._2.attr))

    val partitionedGraph = Graph(unpartitionedGraph.vertices, partitionedEdges)

Ankur http://www.ankurdave.com/
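To make the multipass pattern concrete, here is a hedged sketch of one way getTripletPartition could be filled in, continuing from Ankur's snippet above: precompute degrees, store them in the vertex attributes so the triplets carry them, then route each edge to the partition of its higher-degree endpoint. The strategy itself is purely illustrative, not something proposed in this thread:

    import org.apache.spark.graphx._

    // Pass 1: put each vertex's degree in its attribute so triplets carry it.
    val degreeGraph: Graph[Int, Int] =
      unpartitionedGraph.outerJoinVertices(unpartitionedGraph.degrees) {
        (_, _, deg) => deg.getOrElse(0)
      }

    // Pass 2: route each edge to the partition of its higher-degree endpoint.
    def getTripletPartition(e: EdgeTriplet[Int, Int]): PartitionID = {
      val anchor: VertexId = if (e.srcAttr >= e.dstAttr) e.srcId else e.dstId
      (anchor.hashCode & Int.MaxValue) % numPartitions
    }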
Re: [VOTE] Release Apache Spark 1.0.2 (RC1)
That query is looking at Fix Version, not Target Version. The fact that the first one is still open is only because the bug is not resolved in master; it is fixed in 1.0.2. The second one is partially fixed in 1.0.2 but is not worth blocking the release for.

On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
TD, there are a couple of unresolved issues slated for 1.0.2: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
Should they be edited somehow?

On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das tathagata.das1...@gmail.com wrote:
Please vote on releasing the following candidate as Apache Spark version 1.0.2. This release fixes a number of bugs in Spark 1.0.1. Some of the notable ones are:
- SPARK-2452: known issue in Spark 1.0.1 caused by an attempted fix for SPARK-1199. The fix was reverted for 1.0.2.
- SPARK-2576: NoClassDefFoundError when executing a Spark SQL query on an HDFS CSV file.
The full list is at http://s.apache.org/9NJ

The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
The release files, including signatures, digests, etc., can be found at:
http://people.apache.org/~tdas/spark-1.0.2-rc1/
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc
The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1024/
The documentation corresponding to this release can be found at:
http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/

Please vote on releasing this package as Apache Spark 1.0.2! The vote is open until Tuesday, July 29, at 23:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.0.2
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
Re: [VOTE] Release Apache Spark 1.0.2 (RC1)
The most important issue in this release is actually an amendment to an earlier fix. The original fix caused a deadlock, which was a regression from 1.0.0 to 1.0.1:

Issue: https://issues.apache.org/jira/browse/SPARK-1097
1.0.1 fix: https://github.com/apache/spark/pull/1273/files (had a deadlock)
1.0.2 fix: https://github.com/apache/spark/pull/1409/files

I failed to correctly label this on JIRA, but I've updated it!

On Fri, Jul 25, 2014 at 5:35 PM, Michael Armbrust mich...@databricks.com wrote:
That query is looking at Fix Version, not Target Version. [...]
Re: Suggestion for SPARK-1825
Yeah, I agree reflection is the best solution. Whenever we do reflection, we should clearly document in the code which YARN API version corresponds to which code path. I'm guessing that since YARN is adding new features, we'll just have to do this over time.
- Patrick

On Fri, Jul 25, 2014 at 3:35 PM, Reynold Xin r...@databricks.com wrote:
Actually, reflection is probably a better, lighter-weight approach for this. An extra project brings more overhead for something simple.

On Fri, Jul 25, 2014 at 3:09 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote:
So, I'm leaning more towards using reflection for this. Maven profiles could work, but it's tough since we have new stuff coming in 2.4, 2.5, etc., and the number of profiles will multiply quickly if we have to do it that way. Reflection is the approach HBase took in a similar situation.
best,
Colin

On Fri, Jul 25, 2014 at 11:23 AM, Colin McCabe cmcc...@alumni.cmu.edu wrote:
I have a similar issue with SPARK-1767. There are basically three ways to resolve the issue: reflection, a build variant, or auto-detection. [...]
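As an illustration of the reflection approach Patrick describes, a sketch only: the field looked up here is the Hadoop 2.4 constant discussed in the SPARK-1825 thread, and the fallback is an assumption about what older versions should get. Note the comment documenting which Hadoop version each code path serves:

    // Look up a constant that only exists in newer Hadoop versions;
    // fall back gracefully when compiled/run against an older one.
    def classPathSeparator: String =
      try {
        Class.forName("org.apache.hadoop.yarn.api.ApplicationConstants")
          .getField("CLASS_PATH_SEPARATOR")
          .get(null).toString                 // Hadoop 2.4.0+ code path
      } catch {
        case _: ClassNotFoundException | _: NoSuchFieldException =>
          java.io.File.pathSeparator          // pre-2.4.0: local separator
      }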
Re: [VOTE] Release Apache Spark 1.0.2 (RC1)
HADOOP-10456 is fixed in Hadoop 2.4.1. Does this mean that the synchronization on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK can be bypassed for Hadoop 2.4.1?
Cheers

On Fri, Jul 25, 2014 at 6:00 PM, Patrick Wendell pwend...@gmail.com wrote:
The most important issue in this release is actually an amendment to an earlier fix. The original fix caused a deadlock, which was a regression from 1.0.0 to 1.0.1. [...]
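Presumably only when the application is known to run against 2.4.1+, since Spark would still need the lock for older Hadoop versions; any change would have to be version-gated. A hedged sketch of what such gating might look like (not actual Spark code; a local object stands in for HadoopRDD's lock, and the version check shown is deliberately naive):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.util.VersionInfo

    // Configuration's constructor was not thread-safe before HADOOP-10456
    // (fixed in Hadoop 2.4.1), so instantiation must be serialized there.
    object ConfLock  // stand-in for HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK

    def newConfiguration(): Configuration =
      if (VersionInfo.getVersion >= "2.4.1")  // naive; real code would parse the version
        new Configuration()                    // safe without the lock on 2.4.1+
      else
        ConfLock.synchronized { new Configuration() }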