Re: Release date for new pyspark

2014-07-17 Thread Paul Wais
Thanks all!  (And thanks, Matei, for the developer link!)  I was able to
build using Maven[1], but `./sbt/sbt assembly` results in build errors.
(I'm not familiar enough with the build to know why; in the past sbt
worked for me and Maven did not.)

I was able to run the master version of pyspark, which was what I
wanted, though I discovered a bug when trying to read spark-pickled
data from HDFS.  (It looks similar to
https://spark-project.atlassian.net/browse/SPARK-1034 from my naive
point of view.)  For the curious:

Code:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.set('spark.local.dir', '/nail/tmp')
conf.set('spark.executor.memory', '28g')
conf.set('spark.app.name', 'test')

sc = SparkContext(conf=conf)

# Round-trip a pickled RDD through HDFS, then sample it
sc.parallelize(range(10)).saveAsPickleFile('hdfs://host:9000/test_pickle')
unpickled_rdd = sc.pickleFile('hdfs://host:9000/test_pickle')
print unpickled_rdd.takeSample(False, 3)

Traceback (most recent call last):
  File "/path/to/my/home/spark-master/tast.py", line 33, in <module>
    print unpickled_rdd.takeSample(False, 3)
  File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 391, in takeSample
    initialCount = self.count()
  File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 791, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 782, in sum
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 703, in reduce
    vals = self.mapPartitions(func).collect()
  File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 667, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 1600, in _jrdd
    class_tag)
  File "/path/to/my/home/spark-master/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 669, in __call__
  File "/path/to/my/home/spark-master/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 304, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonRDD. Trace:
py4j.Py4JException: Constructor org.apache.spark.api.python.PythonRDD([class org.apache.spark.rdd.FlatMappedRDD, class [B, class java.util.HashMap, class java.util.ArrayList, class java.lang.Boolean, class java.lang.String, class java.util.ArrayList, class org.apache.spark.Accumulator, class scala.reflect.ManifestFactory$$anon$2]) does not exist
        at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:184)
        at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:202)
        at py4j.Gateway.invoke(Gateway.java:213)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:662)


[1] mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package


Release date for new pyspark

2014-07-16 Thread Paul Wais
Dear List,

The version of pyspark on master has a lot of nice new features, e.g.
SequenceFile reading, pickle I/O, etc.:
https://github.com/apache/spark/blob/master/python/pyspark/context.py#L353
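
For instance, a minimal sketch of the two (the paths, key/value classes,
and app name below are made up for illustration, and this assumes a
build of master, since neither API is in 1.0.1):

from pyspark import SparkContext

sc = SparkContext(appName='new-api-demo')

# Read a Hadoop SequenceFile of (Text, IntWritable) pairs into a pair RDD
pairs = sc.sequenceFile('hdfs://host:9000/data/seq',
                        keyClass='org.apache.hadoop.io.Text',
                        valueClass='org.apache.hadoop.io.IntWritable')

# Pickle I/O: write an RDD out as pickled data, then read it back
sc.parallelize(range(100)).saveAsPickleFile('hdfs://host:9000/data/pkl')
restored = sc.pickleFile('hdfs://host:9000/data/pkl')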

I downloaded the recent 1.0.1 release and was surprised to see the
distribution did not include these changes in master.  (I've tried pulling
master [ 9c249743ea ] and compiling from source, but I get a build failure
in TestSQLContext.scala FWIW).

Is an updated pyspark scheduled for the next release?  (Also, am I wrong
to expect that HEAD on master should compile and run?)

Best Regards,
-Paul Wais


Re: Release date for new pyspark

2014-07-16 Thread Mark Hamstra
You should expect master to compile and run: patches aren't merged unless
they build and pass tests on Jenkins.

You shouldn't expect new features to be added to stable code in maintenance
releases (e.g. 1.0.1).

AFAIK, we're still on track with Spark 1.1.0 development, which means that
it should be released sometime in the second half of next month (or shortly
thereafter).


Re: Release date for new pyspark

2014-07-16 Thread Matei Zaharia
Yeah, we try to have a regular 3-month release cycle; see
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage for the
current window.

Matei


Re: Release date for new pyspark

2014-07-16 Thread Michael Armbrust
You should try cleaning and then building.  We have recently hit a bug in
the Scala compiler that sometimes causes non-clean builds to fail.
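
For example, something along these lines from the top of the checkout
(the exact flags/profiles depend on your Hadoop version; these are just
illustrative):

mvn -DskipTests clean package
# or, with sbt:
./sbt/sbt clean assembly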

