Re: Configuring Spark Memory

2014-07-25 Thread John Omernik
So this is good information for standalone mode, but how is memory distributed
within Mesos?  There's coarse-grained mode, where the executor stays active, and
there's fine-grained mode, where it appears each task is its own process in
Mesos. How do memory allocations work in these cases? Thanks!
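
A minimal sketch of the knobs involved, assuming the Spark 1.0-era property names
(spark.mesos.coarse toggles between the two modes; the master URL and values below are
placeholders, not an answer from the thread):

import org.apache.spark.{SparkConf, SparkContext}

// Coarse-grained mode: one long-lived executor per slave holds spark.executor.memory
// for the lifetime of the context. With spark.mesos.coarse=false (fine-grained mode),
// each Spark task is launched as its own Mesos task instead.
val conf = new SparkConf()
  .setMaster("mesos://host:5050")        // placeholder Mesos master URL
  .setAppName("memory-demo")
  .set("spark.mesos.coarse", "true")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)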



On Thu, Jul 24, 2014 at 12:14 PM, Martin Goodson mar...@skimlinks.com
wrote:

 Great - thanks for the clarification Aaron. The offer stands for me to
 write some documentation and an example that covers this without leaving
 *any* room for ambiguity.




 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240


 On Thu, Jul 24, 2014 at 6:09 PM, Aaron Davidson ilike...@gmail.com
 wrote:

 Whoops, I was mistaken in my original post last year. By default, there
 is one executor per node per Spark Context, as you said.
 spark.executor.memory is the amount of memory that the application
 requests for each of its executors. SPARK_WORKER_MEMORY is the amount of
 memory a Spark Worker is willing to allocate in executors.

 So if you were to set SPARK_WORKER_MEMORY to 8g everywhere on your
 cluster, and spark.executor.memory to 4g, you would be able to run 2
 simultaneous Spark Contexts that each get 4g per node. Similarly, if
 spark.executor.memory were 8g, you could only run 1 Spark Context at a time
 on the cluster, but it would get all the cluster's memory.
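
 As a concrete sketch of that example (an illustration only; the master URL and app
 name are placeholders): with SPARK_WORKER_MEMORY=8g in spark-env.sh on every worker,
 each of two applications could be started like this and get a 4g executor per node.

 import org.apache.spark.{SparkConf, SparkContext}

 // Each application requests 4g per executor; a worker offering 8g
 // (SPARK_WORKER_MEMORY=8g) can host executors for two such contexts.
 val conf = new SparkConf()
   .setMaster("spark://master:7077")   // placeholder standalone master URL
   .setAppName("app-a")
   .set("spark.executor.memory", "4g")
 val sc = new SparkContext(conf)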


 On Thu, Jul 24, 2014 at 7:25 AM, Martin Goodson mar...@skimlinks.com
 wrote:

 Thank you Nishkam,
 I have read your code. So, for the sake of my understanding, it seems
 that for each Spark context there is one executor per node? Can anyone
 confirm this?


 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240


 On Thu, Jul 24, 2014 at 6:12 AM, Nishkam Ravi nr...@cloudera.com
 wrote:

 See if this helps:

 https://github.com/nishkamravi2/SparkAutoConfig/

 It's a very simple tool for auto-configuring default parameters in
 Spark. It takes as input high-level parameters (like number of nodes, cores
 per node, memory per node, etc.) and spits out a default configuration, user
 advice, and a command line. Compile (javac SparkConfigure.java) and run (java
 SparkConfigure).
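
 To give a flavor of what such a tool computes, a toy heuristic (for illustration only,
 not the actual logic in SparkAutoConfig) might reserve some memory for the OS before
 suggesting an executor size:

 // Toy heuristic, illustration only.
 def suggestExecutorMemory(memPerNodeGb: Int, reservedForOsGb: Int = 8): String =
   s"${math.max(1, memPerNodeGb - reservedForOsGb)}g"

 println(suggestExecutorMemory(60))  // prints "52g" for a 60 GB node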

 Also cc'ing dev in case others are interested in helping evolve this
 over time (by refining the heuristics and adding more parameters).


  On Wed, Jul 23, 2014 at 8:31 AM, Martin Goodson mar...@skimlinks.com
 wrote:

 Thanks Andrew,

 So if there is only one SparkContext there is only one executor per
 machine? This seems to contradict Aaron's message from the link above:

 If each machine has 16 GB of RAM and 4 cores, for example, you might
 set spark.executor.memory between 2 and 3 GB, totaling 8-12 GB used by
 Spark.)

 Am I reading this incorrectly?

 Anyway, our configuration is 21 machines (one master and 20 slaves),
 each with 60 GB. We would like to use 4 cores per machine. This is PySpark,
 so we want to leave, say, 16 GB on each machine for Python processes.

 Thanks again for the advice!



 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240


 On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com
 wrote:

 Hi Martin,

 In standalone mode, each SparkContext you initialize gets its own set
 of executors across the cluster.  So, for example, if you have two shells
 open, each worker machine in the cluster will run two executor JVMs, one per shell.

 As for the other settings, you can configure the total number of cores
 requested for the SparkContext, the amount of memory for the executor JVM
 on each machine, the amount of memory for the Master/Worker daemons (little
 is needed since the work is done in executors), and several other settings.
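
 For reference, a sketch of two of those per-application settings as Spark 1.0 properties
 (placeholder values; daemon memory is a separate environment variable):

 import org.apache.spark.SparkConf

 val conf = new SparkConf()
   .set("spark.cores.max", "40")        // total cores this SparkContext may take across the cluster
   .set("spark.executor.memory", "8g")  // heap for each executor JVM
 // Master/Worker daemon memory is set in spark-env.sh, e.g. SPARK_DAEMON_MEMORY=1g.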

 Which of those are you interested in?  What spec hardware do you have
 and how do you want to configure it?

 Andrew


 On Wed, Jul 23, 2014 at 6:10 AM, Martin Goodson mar...@skimlinks.com
  wrote:

 We are having difficulties configuring Spark, partly because we
 still don't understand some key concepts. For instance, how many 
 executors
 are there per machine in standalone mode? This is after having
 closely read the documentation several times:

 http://spark.apache.org/docs/latest/configuration.html
 http://spark.apache.org/docs/latest/spark-standalone.html
 http://spark.apache.org/docs/latest/tuning.html
 http://spark.apache.org/docs/latest/cluster-overview.html

 The cluster overview has some information about executors, but it is
 ambiguous about whether there is a single executor or multiple executors
 on each machine.

  This message from Aaron Davidson implies that the executor memory
 should be set to total available memory on the machine divided by the
 number of cores:
 

Re: Suggestion for SPARK-1825

2014-07-25 Thread Colin McCabe
I have a similar issue with SPARK-1767.  There are basically three ways to
resolve the issue:

1. Use reflection to access classes newer than 0.21 (or whatever the oldest
version of Hadoop is that Spark supports)
2. Add a build variant (in Maven this would be a profile) that deals with
this.
3. Auto-detect which classes are available and use those.

#1 is the easiest for end-users, but it can lead to some ugly code.
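
For what it's worth, the reflection approach usually reduces to a probe along these lines
(a sketch only; the class name below is made up for illustration):

// Probe for a newer Hadoop/YARN class and fall back when it is absent.
def hasNewYarnApi: Boolean =
  try {
    Class.forName("org.apache.hadoop.yarn.some.NewApiClass")  // hypothetical class name
    true
  } catch {
    case _: ClassNotFoundException => false
  }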

#2 makes the code look nicer, but requires some effort on the part of
people building Spark.  This can also lead to headaches for IDEs, if people
don't remember to select the new profile.  (For example, in IntelliJ, you
can't see any of the YARN classes when you import the project from Maven
without the YARN profile selected.)

#3 is something that... I don't know how to do in sbt or Maven.  I've been
told that an antrun task might work here, but it seems like it could get
really tricky.

Overall, I'd lean more towards #2 here.

best,
Colin


On Tue, Jul 22, 2014 at 12:47 AM, innowireless TaeYun Kim 
taeyun@innowireless.co.kr wrote:

 (I'm resending this mail since it seems that it was not sent. Sorry if this
 was already sent.)

 Hi,



 A couple of months ago, I made a pull request to fix
 https://issues.apache.org/jira/browse/SPARK-1825.

 My pull request is here: https://github.com/apache/spark/pull/899



 But that pull request has problems:

 - It is Hadoop 2.4.0+ only. It won't compile on the versions below it.

 - The related Hadoop API is marked as '@Unstable'.



 Here is an idea to remedy the problems: a new Spark configuration variable.

 Maybe it could be named spark.yarn.submit.crossplatform.

 If it is set to true (the default is false), the related Spark code can use
 hard-coded strings that are the same as the ones the Hadoop API provides, thus
 avoiding compile errors on Hadoop versions below 2.4.0.
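
 For illustration, the flag might be consumed roughly like this (a sketch only; the
 literal "<CPS>" is a guess at the string the 2.4.0+ API provides, not verified):

 import org.apache.spark.SparkConf

 // Sketch: pick the classpath separator based on the proposed flag, without
 // referencing the @Unstable Hadoop 2.4.0+ API at compile time.
 def classPathSeparator(conf: SparkConf): String =
   if (conf.getBoolean("spark.yarn.submit.crossplatform", false)) "<CPS>"
   else java.io.File.pathSeparator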



 Can someone implement this feature, if this idea is acceptable?

 Currently my knowledge of the Spark source code and Scala is too limited
 for me to implement it myself.

 For the right person, the modification should be trivial.

 You can refer to the source code changes of my pull request.



 Thanks.






Kryo Issue on Spark 1.0.1, Mesos 0.18.2

2014-07-25 Thread Gary Malouf
After upgrading to Spark 1.0.1 from 0.9.1 everything seemed to be going
well.  Looking at the Mesos slave logs, I noticed:

ERROR KryoSerializer: Failed to run spark.kryo.registrator
java.lang.ClassNotFoundException:
com/mediacrossing/verrazano/kryo/MxDataRegistrator

My spark-env.sh has the following when I run the Spark Shell:

export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so

export MASTER=mesos://zk://n-01:2181,n-02:2181,n-03:2181/masters

export ADD_JARS=/opt/spark/mx-lib/verrazano-assembly.jar


# -XX:+UseCompressedOops must be disabled to use more than 32GB RAM

SPARK_JAVA_OPTS="-Xss2m -XX:+UseCompressedOops \
  -Dspark.local.dir=/opt/mesos-tmp -Dspark.executor.memory=4g \
  -Dspark.serializer=org.apache.spark.serializer.KryoSerializer \
  -Dspark.kryo.registrator=com.mediacrossing.verrazano.kryo.MxDataRegistrator \
  -Dspark.kryoserializer.buffer.mb=16 -Dspark.akka.askTimeout=30"


I was able to verify that our custom jar was being copied to each worker,
but for some reason it is not finding my registrator class.  Is anyone else
struggling with Kryo on the 1.0.x branch?
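
For comparison, a registrator of this kind is usually just a small class like the following
sketch (the class name and registered types are placeholders, not the MediaCrossing code):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Array[Byte]])  // register application classes here
  }
}

The ClassNotFoundException above usually means the jar containing the registrator is not on
the executor's classpath, however it was shipped.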


Re: GraphX graph partitioning strategy

2014-07-25 Thread Ankur Dave
Hi Larry,

GraphX's graph constructor leaves the edges in their original partitions by
default. To support arbitrary multipass graph partitioning, one idea is to
take advantage of that by partitioning the graph externally to GraphX
(though possibly using information from GraphX such as the degrees), then
pass the partitioned edges to GraphX.

For example, if you had an edge partitioning function that needed the full
triplet to assign a partition, you could do this as follows:

val unpartitionedGraph: Graph[Int, Int] = ...
val numPartitions: Int = 128
def getTripletPartition(e: EdgeTriplet[Int, Int]): PartitionID = ...

// Get the triplets using GraphX, then use Spark to repartition them
val partitionedEdges = unpartitionedGraph.triplets
  .map(e => (getTripletPartition(e), e))
  .partitionBy(new HashPartitioner(numPartitions))
val partitionedGraph = Graph(unpartitionedGraph.vertices, partitionedEdges)


A multipass partitioning algorithm could store its results in the edge
attribute, and then you could use the code above to do the partitioning.
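
Concretely, that could look something like the following sketch, reusing the values above
(the assignment rule here is a trivial stand-in for a real multipass algorithm):

// Stand-in "algorithm": write a chosen PartitionID into each edge attribute...
val graphWithAssignments: Graph[Int, PartitionID] =
  unpartitionedGraph.mapEdges(e => (e.srcId % numPartitions).toInt)
// ...then the partition function above simply reads it back out.
def getPartitionFromAttr(e: EdgeTriplet[Int, PartitionID]): PartitionID = e.attr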

Ankur http://www.ankurdave.com/


On Wed, Jul 23, 2014 at 11:59 PM, Larry Xiao xia...@sjtu.edu.cn wrote:

 Hi all,

 I'm implementing graph partitioning strategies for GraphX, learning from
 research on graph computing.

 I have two questions:

 - a specific implementation question:
 In the current design, only the vertex IDs of src and dst are provided
 (PartitionStrategy.scala).
 Some strategies require knowledge about the graph (like degrees) and
 can take more than one pass to finally produce the partition ID.
 So I'm changing the PartitionStrategy.getPartition API to provide more
 info, but I don't want to make it complex (the current one looks very
 clean).

 - an open question:
 What advice would you give on partitioning, considering the
 procedure Spark adopts for graph processing?

 Any advice is much appreciated.

 Best Regards,
 Larry Xiao

 Reference

 Bipartite-oriented Distributed Graph Partitioning for Big Learning.
 PowerLyra: Differentiated Graph Computation and Partitioning on Skewed
 Graphs



Re: GraphX graph partitioning strategy

2014-07-25 Thread Ankur Dave
Oops, the code should be:

val unpartitionedGraph: Graph[Int, Int] = ...
val numPartitions: Int = 128
def getTripletPartition(e: EdgeTriplet[Int, Int]): PartitionID = ...

// Get the triplets using GraphX, then use Spark to repartition them
val partitionedEdges = unpartitionedGraph.triplets
  .map(e => (getTripletPartition(e), e))
  .partitionBy(new HashPartitioner(numPartitions))
  .map(pair => Edge(pair._2.srcId, pair._2.dstId, pair._2.attr))  // the step missing above
val partitionedGraph = Graph(unpartitionedGraph.vertices, partitionedEdges)


Ankur http://www.ankurdave.com/


Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-25 Thread Michael Armbrust
That query is looking at Fix Version not Target Version.  The fact that
the first one is still open is only because the bug is not resolved in
master.  It is fixed in 1.0.2.  The second one is partially fixed in 1.0.2,
but is not worth blocking the release for.


On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 TD, there are a couple of unresolved issues slated for 1.0.2
 
 https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
 .
 Should they be edited somehow?


 On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das 
 tathagata.das1...@gmail.com
 wrote:

  Please vote on releasing the following candidate as Apache Spark version
  1.0.2.
 
  This release fixes a number of bugs in Spark 1.0.1.
  Some of the notable ones are
  - SPARK-2452: Known issue in Spark 1.0.1 caused by an attempted fix for
  SPARK-1199. The fix was reverted for 1.0.2.
  - SPARK-2576: NoClassDefFoundError when executing Spark QL query on
  HDFS CSV file.
  The full list is at http://s.apache.org/9NJ
 
  The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
 
  The release files, including signatures, digests, etc can be found at:
  http://people.apache.org/~tdas/spark-1.0.2-rc1/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/tdas.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1024/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
 
  Please vote on releasing this package as Apache Spark 1.0.2!
 
  The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
  [ ] +1 Release this package as Apache Spark 1.0.2
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 



Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-25 Thread Patrick Wendell
The most important issue in this release is actually an amendment to
an earlier fix. The original fix caused a deadlock, which was a
regression from 1.0.0 to 1.0.1:

Issue:
https://issues.apache.org/jira/browse/SPARK-1097

1.0.1 Fix:
https://github.com/apache/spark/pull/1273/files (had a deadlock)

1.0.2 Fix:
https://github.com/apache/spark/pull/1409/files

I failed to correctly label this on JIRA, but I've updated it!

On Fri, Jul 25, 2014 at 5:35 PM, Michael Armbrust
mich...@databricks.com wrote:
 That query is looking at Fix Version not Target Version.  The fact that
 the first one is still open is only because the bug is not resolved in
 master.  It is fixed in 1.0.2.  The second one is partially fixed in 1.0.2,
 but is not worth blocking the release for.


 On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 TD, there are a couple of unresolved issues slated for 1.0.2
 
 https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
 .
 Should they be edited somehow?


 On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das 
 tathagata.das1...@gmail.com
 wrote:

  Please vote on releasing the following candidate as Apache Spark version
  1.0.2.
 
  This release fixes a number of bugs in Spark 1.0.1.
  Some of the notable ones are
  - SPARK-2452: Known issue is Spark 1.0.1 caused by attempted fix for
  SPARK-1199. The fix was reverted for 1.0.2.
  - SPARK-2576: NoClassDefFoundError when executing Spark QL query on
  HDFS CSV file.
  The full list is at http://s.apache.org/9NJ
 
  The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
 
  The release files, including signatures, digests, etc can be found at:
  http://people.apache.org/~tdas/spark-1.0.2-rc1/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/tdas.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1024/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
 
  Please vote on releasing this package as Apache Spark 1.0.2!
 
  The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
  [ ] +1 Release this package as Apache Spark 1.0.2
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 



Re: Suggestion for SPARK-1825

2014-07-25 Thread Patrick Wendell
Yeah I agree reflection is the best solution. Whenever we do
reflection we should clearly document in the code which YARN API
version corresponds to which code path. I'm guessing since YARN is
adding new features... we'll just have to do this over time.

- Patrick

On Fri, Jul 25, 2014 at 3:35 PM, Reynold Xin r...@databricks.com wrote:
 Actually reflection is probably a better, lighter weight process for this.
 An extra project brings more overhead for something simple.





 On Fri, Jul 25, 2014 at 3:09 PM, Colin McCabe cmcc...@alumni.cmu.edu
 wrote:

 So, I'm leaning more towards using reflection for this.  Maven profiles
 could work, but it's tough since we have new stuff coming in in 2.4, 2.5,
 etc.  and the number of profiles will multiply quickly if we have to do it
 that way.  Reflection is the approach HBase took in a similar situation.

 best,
 Colin


 On Fri, Jul 25, 2014 at 11:23 AM, Colin McCabe cmcc...@alumni.cmu.edu
 wrote:

  I have a similar issue with SPARK-1767.  There are basically three ways
 to
  resolve the issue:
 
  1. Use reflection to access classes newer than 0.21 (or whatever the
  oldest version of Hadoop is that Spark supports)
  2. Add a build variant (in Maven this would be a profile) that deals with
  this.
  3. Auto-detect which classes are available and use those.
 
  #1 is the easiest for end-users, but it can lead to some ugly code.
 
  #2 makes the code look nicer, but requires some effort on the part of
  people building spark.  This can also lead to headaches for IDEs, if
 people
  don't remember to select the new profile.  (For example, in IntelliJ, you
  can't see any of the yarn classes when you import the project from Maven
  without the YARN profile selected.)
 
  #3 is something that... I don't know how to do in sbt or Maven.  I've
 been
  told that an antrun task might work here, but it seems like it could get
  really tricky.
 
  Overall, I'd lean more towards #2 here.
 
  best,
  Colin
 
 
  On Tue, Jul 22, 2014 at 12:47 AM, innowireless TaeYun Kim 
  taeyun@innowireless.co.kr wrote:
 
  (I'm resending this mail since it seems that it was not sent. Sorry if
  this
  was already sent.)
 
  Hi,
 
 
 
  A couple of months ago, I made a pull request to fix
  https://issues.apache.org/jira/browse/SPARK-1825.
 
  My pull request is here: https://github.com/apache/spark/pull/899
 
 
 
  But that pull request has problems:
 
  - It is Hadoop 2.4.0+ only. It won't compile on the versions below it.
 
  - The related Hadoop API is marked as '@Unstable'.
 
 
 
  Here is an idea to remedy the problems: a new Spark configuration
  variable.
 
  Maybe it can be named as spark.yarn.submit.crossplatform.
 
  If it is set to true(default is false), the related Spark code can use
  the
  hard-coded strings that is the same as the Hadoop API provides, thus
  avoiding compile error on the Hadoop versions below 2.4.0.
 
 
 
  Can someone implement this feature, if this idea is acceptable?
 
  Currently my knowledge on Spark source code and Scala is limited to
  implement it myself.
 
  To the right person, the modification should be trivial.
 
  You can refer to the source code changes of my pull request.
 
 
 
  Thanks.
 
 
 
 
 



Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-25 Thread Ted Yu
HADOOP-10456 is fixed in Hadoop 2.4.1.

Does this mean that synchronization
on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK can be bypassed for Hadoop
2.4.1?

Cheers


On Fri, Jul 25, 2014 at 6:00 PM, Patrick Wendell pwend...@gmail.com wrote:

 The most important issue in this release is actually an amendment to
 an earlier fix. The original fix caused a deadlock, which was a
 regression from 1.0.0 to 1.0.1:

 Issue:
 https://issues.apache.org/jira/browse/SPARK-1097

 1.0.1 Fix:
 https://github.com/apache/spark/pull/1273/files (had a deadlock)

 1.0.2 Fix:
 https://github.com/apache/spark/pull/1409/files

 I failed to correctly label this on JIRA, but I've updated it!

 On Fri, Jul 25, 2014 at 5:35 PM, Michael Armbrust
 mich...@databricks.com wrote:
  That query is looking at Fix Version not Target Version.  The fact
 that
  the first one is still open is only because the bug is not resolved in
  master.  It is fixed in 1.0.2.  The second one is partially fixed in
 1.0.2,
  but is not worth blocking the release for.
 
 
  On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  TD, there are a couple of unresolved issues slated for 1.0.2
  
 
 https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
  .
  Should they be edited somehow?
 
 
  On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das 
  tathagata.das1...@gmail.com
  wrote:
 
   Please vote on releasing the following candidate as Apache Spark
 version
   1.0.2.
  
   This release fixes a number of bugs in Spark 1.0.1.
   Some of the notable ones are
   - SPARK-2452: Known issue is Spark 1.0.1 caused by attempted fix for
   SPARK-1199. The fix was reverted for 1.0.2.
   - SPARK-2576: NoClassDefFoundError when executing Spark QL query on
   HDFS CSV file.
   The full list is at http://s.apache.org/9NJ
  
   The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
  
  
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
  
   The release files, including signatures, digests, etc can be found at:
   http://people.apache.org/~tdas/spark-1.0.2-rc1/
  
   Release artifacts are signed with the following key:
   https://people.apache.org/keys/committer/tdas.asc
  
   The staging repository for this release can be found at:
  
 https://repository.apache.org/content/repositories/orgapachespark-1024/
  
   The documentation corresponding to this release can be found at:
   http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
  
   Please vote on releasing this package as Apache Spark 1.0.2!
  
   The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
   a majority of at least 3 +1 PMC votes are cast.
   [ ] +1 Release this package as Apache Spark 1.0.2
   [ ] -1 Do not release this package because ...
  
   To learn more about Apache Spark, please see
   http://spark.apache.org/