Thanks all! (And thanks Matei for the developer link!) I was able to build
using Maven[1], but `./sbt/sbt assembly` results in build errors. (I'm not
familiar enough with the build to know why; in the past, sbt worked for me
and Maven did not.)
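For anyone else hitting this: the clean rebuild that Michael suggests below
would look something like the following. This is just a sketch on my end;
the Hadoop profile and version flags simply mirror my Maven invocation in
[1] and may need adjusting for your setup.

    # Clean sbt build (wipes stale classes before re-assembling)
    ./sbt/sbt clean assembly

    # Equivalent clean Maven build
    mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package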
I was able to run the master version of pyspark, which was what I wanted,
though I discovered a bug when trying to read spark-pickled data from HDFS.
(It looks similar to https://spark-project.atlassian.net/browse/SPARK-1034
from my naive point of view.) For the curious:

Code (with the imports the script needs):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf()
    conf.set('spark.local.dir', '/nail/tmp')
    conf.set('spark.executor.memory', '28g')
    conf.set('spark.app.name', 'test')
    sc = SparkContext(conf=conf)

    sc.parallelize(range(10)).saveAsPickleFile("hdfs://host:9000/test_pickle")
    unpickled_rdd = sc.pickleFile("hdfs://host:9000/test_pickle")
    print unpickled_rdd.takeSample(False, 3)

Traceback:

    Traceback (most recent call last):
      File "/path/to/my/home/spark-master/tast.py", line 33, in <module>
        print unpickled_rdd.takeSample(False, 3)
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 391, in takeSample
        initialCount = self.count()
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 791, in count
        return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 782, in sum
        return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 703, in reduce
        vals = self.mapPartitions(func).collect()
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 667, in collect
        bytesInJava = self._jrdd.collect().iterator()
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 1600, in _jrdd
        class_tag)
      File "/path/to/my/home/spark-master/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 669, in __call__
      File "/path/to/my/home/spark-master/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 304, in get_return_value
    py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonRDD. Trace:
    py4j.Py4JException: Constructor org.apache.spark.api.python.PythonRDD([class org.apache.spark.rdd.FlatMappedRDD, class [B, class java.util.HashMap, class java.util.ArrayList, class java.lang.Boolean, class java.lang.String, class java.util.ArrayList, class org.apache.spark.Accumulator, class scala.reflect.ManifestFactory$$anon$2]) does not exist
            at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:184)
            at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:202)
            at py4j.Gateway.invoke(Gateway.java:213)
            at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
            at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
            at py4j.GatewayConnection.run(GatewayConnection.java:207)
            at java.lang.Thread.run(Thread.java:662)

[1] mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

On Wed, Jul 16, 2014 at 8:39 PM, Michael Armbrust <mich...@databricks.com>
wrote:
> You should try cleaning and then building. We have recently hit a bug in
> the scala compiler that sometimes causes non-clean builds to fail.
>
>
> On Wed, Jul 16, 2014 at 7:56 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>>
>> Yeah, we try to have a regular 3-month release cycle; see
>> https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage for the
>> current window.
>>
>> Matei
>>
>> On Jul 16, 2014, at 4:21 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>> You should expect master to compile and run: patches aren't merged unless
>> they build and pass tests on Jenkins.
>>
>> You shouldn't expect new features to be added to stable code in
>> maintenance releases (e.g. 1.0.1).
>>
>> AFAIK, we're still on track with Spark 1.1.0 development, which means
>> that it should be released sometime in the second half of next month (or
>> shortly thereafter).
>>
>>
>> On Wed, Jul 16, 2014 at 4:03 PM, Paul Wais <pw...@yelp.com> wrote:
>>>
>>> Dear List,
>>>
>>> The version of pyspark on master has a lot of nice new features, e.g.
>>> SequenceFile reading, pickle i/o, etc:
>>> https://github.com/apache/spark/blob/master/python/pyspark/context.py#L353
>>>
>>> I downloaded the recent 1.0.1 release and was surprised to see that the
>>> distribution did not include these changes from master. (I've tried
>>> pulling master [ 9c249743ea ] and compiling from source, but I get a
>>> build failure in TestSQLContext.scala, FWIW.)
>>>
>>> Is an updated pyspark scheduled for the next release? (Also, am I wrong
>>> in expecting that HEAD on master should compile and run?)
>>>
>>> Best Regards,
>>> -Paul Wais
>>
>>
>
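P.S. For anyone stuck on the same Py4J constructor error: here is a rough
workaround sketch I'd try while the bug stands. It skips
saveAsPickleFile()/pickleFile() entirely and does the pickling by hand over
plain text files. This is untested against this exact build, and I can't
promise it dodges the failing constructor path; the HDFS path and app name
below are placeholders.

    # Workaround sketch (untested): store one base64-encoded pickle per
    # line of a plain text file instead of using the built-in pickle i/o.
    import base64
    import cPickle

    from pyspark import SparkConf, SparkContext

    conf = SparkConf()
    conf.set('spark.app.name', 'pickle-workaround')  # placeholder name
    sc = SparkContext(conf=conf)

    path = "hdfs://host:9000/test_pickle_txt"  # placeholder path

    # Write: pickle each element, then base64-encode so the bytes are
    # newline-safe in a text file.
    sc.parallelize(range(10)) \
      .map(lambda x: base64.b64encode(cPickle.dumps(x, cPickle.HIGHEST_PROTOCOL))) \
      .saveAsTextFile(path)

    # Read back: decode and unpickle each line.
    restored = sc.textFile(path) \
                 .map(lambda s: cPickle.loads(base64.b64decode(s)))
    print restored.takeSample(False, 3)

Obviously less compact on disk than the batched pickle serializer, but it
only relies on textFile()/saveAsTextFile(), which otherwise work for me on
this build.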