Thanks all! (And thanks Matei for the developer link!) I was able to build
using Maven[1], but `./sbt/sbt assembly` results in build errors. (I'm not
familiar enough with the build to know why; in the past, sbt worked for me
and Maven did not.)
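For anyone else hitting this: the clean rebuild that Michael suggests below
would look something like the following. This is just a sketch on my end;
the Hadoop profile and version flags simply mirror my Maven invocation in
[1] and may need adjusting for your setup.

    # Clean sbt build (wipes stale classes before re-assembling)
    ./sbt/sbt clean assembly

    # Equivalent clean Maven build
    mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package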
I was able to run the master version of pyspark, which was what I wanted,
though I discovered a bug when trying to read spark-pickled data from HDFS.
(It looks similar to https://spark-project.atlassian.net/browse/SPARK-1034
from my naive point of view.) For the curious:

Code (with the imports the script needs):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf()
    conf.set('spark.local.dir', '/nail/tmp')
    conf.set('spark.executor.memory', '28g')
    conf.set('spark.app.name', 'test')
    sc = SparkContext(conf=conf)

    sc.parallelize(range(10)).saveAsPickleFile("hdfs://host:9000/test_pickle")
    unpickled_rdd = sc.pickleFile("hdfs://host:9000/test_pickle")
    print unpickled_rdd.takeSample(False, 3)

Traceback:

    Traceback (most recent call last):
      File "/path/to/my/home/spark-master/tast.py", line 33, in <module>
        print unpickled_rdd.takeSample(False, 3)
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 391, in takeSample
        initialCount = self.count()
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 791, in count
        return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 782, in sum
        return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 703, in reduce
        vals = self.mapPartitions(func).collect()
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 667, in collect
        bytesInJava = self._jrdd.collect().iterator()
      File "/path/to/my/home/spark-master/python/pyspark/rdd.py", line 1600, in _jrdd
        class_tag)
      File "/path/to/my/home/spark-master/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 669, in __call__
      File "/path/to/my/home/spark-master/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 304, in get_return_value
    py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonRDD. Trace:
    py4j.Py4JException: Constructor org.apache.spark.api.python.PythonRDD([class org.apache.spark.rdd.FlatMappedRDD, class [B, class java.util.HashMap, class java.util.ArrayList, class java.lang.Boolean, class java.lang.String, class java.util.ArrayList, class org.apache.spark.Accumulator, class scala.reflect.ManifestFactory$$anon$2]) does not exist
            at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:184)
            at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:202)
            at py4j.Gateway.invoke(Gateway.java:213)
            at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
            at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
            at py4j.GatewayConnection.run(GatewayConnection.java:207)
            at java.lang.Thread.run(Thread.java:662)

[1] mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

On Wed, Jul 16, 2014 at 8:39 PM, Michael Armbrust <mich...@databricks.com>
wrote:
> You should try cleaning and then building. We have recently hit a bug in
> the scala compiler that sometimes causes non-clean builds to fail.
>
>
> On Wed, Jul 16, 2014 at 7:56 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>>
>> Yeah, we try to have a regular 3-month release cycle; see
>> https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage for the
>> current window.
>>
>> Matei
>>
>> On Jul 16, 2014, at 4:21 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>> You should expect master to compile and run: patches aren't merged unless
>> they build and pass tests on Jenkins.
>>
>> You shouldn't expect new features to be added to stable code in
>> maintenance releases (e.g. 1.0.1).
>>
>> AFAIK, we're still on track with Spark 1.1.0 development, which means
>> that it should be released sometime in the second half of next month (or
>> shortly thereafter).
>>
>>
>> On Wed, Jul 16, 2014 at 4:03 PM, Paul Wais <pw...@yelp.com> wrote:
>>>
>>> Dear List,
>>>
>>> The version of pyspark on master has a lot of nice new features, e.g.
>>> SequenceFile reading, pickle i/o, etc:
>>> https://github.com/apache/spark/blob/master/python/pyspark/context.py#L353
>>>
>>> I downloaded the recent 1.0.1 release and was surprised to see that the
>>> distribution did not include these changes from master. (I've tried
>>> pulling master [ 9c249743ea ] and compiling from source, but I get a
>>> build failure in TestSQLContext.scala, FWIW.)
>>>
>>> Is an updated pyspark scheduled for the next release? (Also, am I wrong
>>> in expecting that HEAD on master should compile and run?)
>>>
>>> Best Regards,
>>> -Paul Wais
>>
>>
>
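P.S. For anyone stuck on the same Py4J constructor error: here is a rough
workaround sketch I'd try while the bug stands. It skips
saveAsPickleFile()/pickleFile() entirely and does the pickling by hand over
plain text files. This is untested against this exact build, and I can't
promise it dodges the failing constructor path; the HDFS path and app name
below are placeholders.

    # Workaround sketch (untested): store one base64-encoded pickle per
    # line of a plain text file instead of using the built-in pickle i/o.
    import base64
    import cPickle

    from pyspark import SparkConf, SparkContext

    conf = SparkConf()
    conf.set('spark.app.name', 'pickle-workaround')  # placeholder name
    sc = SparkContext(conf=conf)

    path = "hdfs://host:9000/test_pickle_txt"  # placeholder path

    # Write: pickle each element, then base64-encode so the bytes are
    # newline-safe in a text file.
    sc.parallelize(range(10)) \
      .map(lambda x: base64.b64encode(cPickle.dumps(x, cPickle.HIGHEST_PROTOCOL))) \
      .saveAsTextFile(path)

    # Read back: decode and unpickle each line.
    restored = sc.textFile(path) \
                 .map(lambda s: cPickle.loads(base64.b64decode(s)))
    print restored.takeSample(False, 3)

Obviously less compact on disk than the batched pickle serializer, but it
only relies on textFile()/saveAsTextFile(), which otherwise work for me on
this build.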