Re: Release date for new pyspark
Thanks all! (And thanks Matei for the developer link!) I was able to build using Maven [1], but `./sbt/sbt assembly` results in build errors. (I'm not familiar enough with the build to know why; in the past sbt worked for me and Maven did not.) I was able to run the master version of pyspark, which was what I wanted, though I discovered a bug when trying to read Spark-pickled data from HDFS. (Looks similar to https://spark-project.atlassian.net/browse/SPARK-1034 from my naive point of view.)

For the curious:

Code:

    conf = SparkConf()
    conf.set('spark.local.dir', '/nail/tmp')
    conf.set('spark.executor.memory', '28g')
    conf.set('spark.app.name', 'test')
    sc = SparkContext(conf=conf)
    sc.parallelize(range(10)).saveAsPickleFile('hdfs://host:9000/test_pickle')
    unpickled_rdd = sc.pickleFile('hdfs://host:9000/test_pickle')
    print unpickled_rdd.takeSample(False, 3)

Traceback:

    Traceback (most recent call last):
      File "/path/to/my/home/spark-master/tast.py", line 33, in <module>
        print unpickled_rdd.takeSample(False, 3)
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 391, in takeSample
        initialCount = self.count()
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 791, in count
        return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 782, in sum
        return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 703, in reduce
        vals = self.mapPartitions(func).collect()
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 667, in collect
        bytesInJava = self._jrdd.collect().iterator()
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 1600, in _jrdd
        class_tag)
      File "/path/to/my/home/spark-master/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 669, in __call__
      File "/path/to/my/home/spark-master/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 304, in get_return_value
    py4j.protocol.Py4JError: An error occurred while calling
    None.org.apache.spark.api.python.PythonRDD. Trace:
    py4j.Py4JException: Constructor org.apache.spark.api.python.PythonRDD([class org.apache.spark.rdd.FlatMappedRDD, class [B, class java.util.HashMap, class java.util.ArrayList, class java.lang.Boolean, class java.lang.String, class java.util.ArrayList, class org.apache.spark.Accumulator, class scala.reflect.ManifestFactory$$anon$2]) does not exist
        at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:184)
        at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:202)
        at py4j.Gateway.invoke(Gateway.java:213)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:662)

[1] mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

On Wed, Jul 16, 2014 at 8:39 PM, Michael Armbrust mich...@databricks.com wrote:

> You should try cleaning and then building. We have recently hit a bug in
> the Scala compiler that sometimes causes non-clean builds to fail.
>
> On Wed, Jul 16, 2014 at 7:56 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
>
>> Yeah, we try to have a regular 3-month release cycle; see
>> https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage for the
>> current window.
>>
>> Matei
>>
>> On Jul 16, 2014, at 4:21 PM, Mark Hamstra m...@clearstorydata.com wrote:
>>
>>> You should expect master to compile and run: patches aren't merged
>>> unless they build and pass tests on Jenkins. You shouldn't expect new
>>> features to be added to stable code in maintenance releases (e.g. 1.0.1).
>>> AFAIK, we're still on track with Spark 1.1.0 development, which means
>>> that it should be released sometime in the second half of next month (or
>>> shortly thereafter).
>>>
>>> On Wed, Jul 16, 2014 at 4:03 PM, Paul Wais pw...@yelp.com wrote:
>>>
>>>> Dear List,
>>>>
>>>> The version of pyspark on master has a lot of nice new features, e.g.
>>>> SequenceFile reading, pickle i/o, etc:
>>>> https://github.com/apache/spark/blob/master/python/pyspark/context.py#L353
>>>>
>>>> I downloaded the recent 1.0.1 release and was surprised to see that the
>>>> distribution did not include these changes from master. (I've tried
>>>> pulling master [ 9c249743ea ] and compiling from source, but I get a
>>>> build failure in TestSQLContext.scala, FWIW.)
>>>>
>>>> Is an updated pyspark scheduled for the next release? (Also, am I wrong
>>>> in expecting that HEAD on master should probably compile and run?)
>>>>
>>>> Best Regards,
>>>> -Paul Wais
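For readers unfamiliar with the pickle i/o feature under discussion: conceptually, `saveAsPickleFile` serializes records in pickled batches and `pickleFile` reads them back. A minimal, Spark-free sketch of that batched round trip, using only the stdlib `pickle` module (the function names here are illustrative, not PySpark internals):

```python
import pickle

def save_as_pickle_batches(elements, batch_size=10):
    """Pickle elements in fixed-size batches, roughly how PySpark's
    saveAsPickleFile groups records before writing (illustrative only)."""
    batches = []
    for i in range(0, len(elements), batch_size):
        batches.append(pickle.dumps(elements[i:i + batch_size]))
    return batches

def load_pickle_batches(batches):
    """Unpickle each batch and flatten, like the pickleFile read path."""
    out = []
    for blob in batches:
        out.extend(pickle.loads(blob))
    return out

blobs = save_as_pickle_batches(list(range(10)), batch_size=4)
print(load_pickle_batches(blobs))  # round-trips the original list
```

The real implementation writes these blobs into a Hadoop SequenceFile on HDFS rather than keeping them in memory, which is why a mismatch between the Python side and the JVM side (as in the traceback above) surfaces only at read time.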
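Following Michael's clean-then-build suggestion, the two build invocations from this thread would look something like the following (a sketch only; the Maven flags are the ones quoted above, with the `hadoop.version` property spelled correctly):

```shell
# Maven: clean first, then package, skipping tests
mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

# sbt: clean before assembling, to work around the incremental-compile bug
./sbt/sbt clean assembly
```

Cleaning forces a full recompile, which sidesteps the Scala compiler bug Michael mentions that can make incremental (non-clean) builds fail.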