Nick,

I have encountered strange things like this before (usually when
programming with mutable structures and side effects), and for me the
answer was lazy evaluation: until an action like .count() or .first() is
called, your variable 'a' refers only to a set of instructions, which get
executed to form the object you expect when you ask something of it.
Back before I was using side-effect-free techniques on immutable data
structures, I had to call .first() or .count() to force the work to happen
and get the behavior I wanted.  There are still special cases where I have
to purposefully "collapse" the RDD for one reason or another.  This may
not be new information to you, but I've encountered similar behavior
before and strongly suspect it is playing a role here.
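
To illustrate, here's a minimal sketch of how the laziness plays out (the
path and the map function below are made up for the example):

    # Transformations only record a recipe; nothing executes yet.
    a = sc.textFile('s3n://some-bucket/some-prefix')   # hypothetical path
    a = a.repartition(8)
    a = a.map(lambda line: (line.split(',')[0], 1))    # hypothetical mapper

    # Actions force the recipe to run across the cluster.
    n = a.count()       # materializes the result once
    c = a.countByKey()  # re-runs the whole lineage from the source

If recomputing the lineage on every action is a concern, calling a.cache()
before the first action will keep the computed partitions around for
subsequent actions.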


On Mon, May 5, 2014 at 5:52 PM, Nicholas Chammas <nicholas.cham...@gmail.com
> wrote:

> I’m running into something very strange today. I’m getting an error on the
> following innocuous operations.
>
> a = sc.textFile('s3n://...')
> a = a.repartition(8)
> a = a.map(...)
> c = a.countByKey() # ERRORs out on this action. See below for traceback. [1]
>
> If I add a count() right after the repartition(), this error magically
> goes away.
>
> a = sc.textFile('s3n://...')
> a = a.repartition(8)
> print a.count()
> a = a.map(...)
> c = a.countByKey() # A-OK! WTF?
>
> To top it off, this “fix” is inconsistent. Sometimes, I still get this
> error.
>
> This is strange. How do I get to the bottom of this?
>
> Nick
>
> [1] Here’s the traceback:
>
> Traceback (most recent call last):
>   File "<stdin>", line 7, in <module>
>   File "file.py", line 187, in function_blah
>     c = a.countByKey()
>   File "/root/spark/python/pyspark/rdd.py", line 778, in countByKey
>     return self.map(lambda x: x[0]).countByValue()
>   File "/root/spark/python/pyspark/rdd.py", line 624, in countByValue
>     return self.mapPartitions(countPartition).reduce(mergeMaps)
>   File "/root/spark/python/pyspark/rdd.py", line 505, in reduce
>     vals = self.mapPartitions(func).collect()
>   File "/root/spark/python/pyspark/rdd.py", line 469, in collect
>     bytesInJava = self._jrdd.collect().iterator()
>   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
>   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o46.collect.
>
>
