OK, I got a head start. Here is what I did:

1. Check out the git source at 14719b93ff4ea7c3234a9389621be3c97fa278b9 (the first release, so that I could at least build it), then build it according to README.md.
2. Set up Eclipse with scala-ide and create a new Scala project. Set the project directory to SCALA_SOURCE_HOME/core instead of the Eclipse default, and remove the test sources from the project.
3. Copy all the jars from SCALA_SOURCE_HOME/lib_managed into a separate directory, then add them all in Eclipse as external jars.
4. Set your Scala project runtime to 2.10.5 (the one that comes with Spark seems to be 2.10.4; the Eclipse default is 2.9-something).
5. There will be 2 compile errors. One is due to Tuple(): change it to Tuple2. The other is "currentThread": change it to Thread.currentThread(). After that it builds fine.

I pasted the hello-world from the docs. Since the "getting started" doc is for the latest version, I had to make some minor changes:

    package spark

    import spark.SparkContext
    import spark.SparkContext._

    object Tryout {
      def main(args: Array[String]) {
        val logFile = "../README.md" // Should be some file on your system
        val sc = new SparkContext("local", "tryout", ".",
          List(System.getenv("SPARK_EXAMPLES_JAR")))
        val logData = sc.textFile(logFile, 2).cache()
        // val logData = scala.io.Source.fromFile(args(0)).getLines().toArray
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
      }
    }

Then I debugged through this and it became fairly clear.

On Sun, Jul 19, 2015 at 10:13 PM, Yang <teddyyyy...@gmail.com> wrote:

> thanks, my point is that earlier versions are normally much simpler so
> it's easier to follow. and the basic structure should at least bear great
> similarity with latest version
>
> On Sun, Jul 19, 2015 at 9:27 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> e5c4cd8a5e188592f8786a265 was from 2011.
>>
>> Not sure why you started with such an early commit.
>>
>> Spark project has evolved quite fast.
>>
>> I suggest you clone Spark project from github.com/apache/spark/ and
>> start with core/src/main/scala/org/apache/spark/rdd/RDD.scala
>>
>> Cheers
>>
>> On Sun, Jul 19, 2015 at 7:44 PM, Yang <teddyyyy...@gmail.com> wrote:
>>
>>> I'm trying to understand how spark works under the hood, so I tried to
>>> read the source code.
>>>
>>> as I normally do, I downloaded the git source code, reverted to the very
>>> first version (actually e5c4cd8a5e188592f8786a265c0cd073c69ac886 since the
>>> first version even lacked the definition of RDD.scala)
>>>
>>> but the code looks "too simple" and I can't find where the "magic"
>>> happens, i.e. a transformation/computation is scheduled on a machine,
>>> bytes stored etc.
>>>
>>> it would be great if someone could show me a path in which the different
>>> source files are involved, so that I could read each of them in turn.
>>>
>>> thanks!
>>> yang
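
P.S. For anyone repeating the setup at the top of this message, the two compile fixes can be sketched as below. This is only an illustration of the two substitutions in isolation, not code from the Spark source: the object name CompileFixes and the helpers pair and threadName are made up for the example, and the actual call sites in that old commit will look different.

```scala
// Illustrative only: CompileFixes, pair and threadName are hypothetical names,
// not identifiers from the Spark source tree.
object CompileFixes {
  // Fix 1: where the old code built a pair with Tuple(key, value),
  // use Tuple2(key, value), which compiles under Scala 2.10.
  def pair(key: String, value: Int): (String, Int) = Tuple2(key, value)

  // Fix 2: where the old code used a bare currentThread,
  // go through java.lang.Thread explicitly.
  def threadName: String = Thread.currentThread().getName

  def main(args: Array[String]): Unit = {
    println(pair("a", 1))
    println(threadName)
  }
}
```

Tuple2(key, value) is the same value as the literal (key, value), so either spelling should work at the call site that fails to compile.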