Well, it is now...
The RDD had a repartition call on it.
When I removed the repartition, it would work.
When I kept the repartition but first called
rdd.partitions.length on it, it would also work!
I looked into the partitions method: it lazily initializes some instance
variables, so saying RDDs are immutable is only true at a "logical" level.
It seems I ran into https://issues.apache.org/jira/browse/SPARK-28917
and it looks like this change fixed it:
https://github.com/apache/spark/blame/485145326a9c97ede260b0e267ee116f182cfd56/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L298
But since we're using an old version, that does not really help us.
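For anyone hitting the same thing, here is a minimal sketch in plain Scala (no Spark; `LazyHolder` and its fields are made-up stand-ins, not Spark's actual code) of the pattern involved: an unsynchronized lazily-initialized field read from two threads, and the workaround of forcing the initialization on the main thread before the workers start.

```scala
// Hypothetical stand-in for an RDD whose partitions array is computed
// lazily on first access, with no @volatile or synchronization
// (the hazard SPARK-28917 addressed).
class LazyHolder {
  private var cached: Array[Int] = null // plain var: unsafe publication

  def partitions: Array[Int] = {
    // First call initializes; concurrent first calls can race.
    if (cached == null) cached = Array.tabulate(4)(identity)
    cached
  }
}

object Demo {
  val holder = new LazyHolder

  def main(args: Array[String]): Unit = {
    // Workaround from the thread above: touch partitions once on the
    // main thread, so worker threads only ever read an already
    // initialized value.
    holder.partitions

    val workers = Seq.fill(2)(new Thread(() => {
      // Both threads now see the same fully initialized array.
      println(holder.partitions.length)
    }))
    workers.foreach(_.start())
    workers.foreach(_.join())
  }
}
```

On an older Spark version, the analogous move is calling rdd.partitions (or rdd.partitions.length) once before handing the RDD to multiple threads.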
On Fri, 22 Jan 2021 at 15:34, Sean Owen wrote:
> RDDs are immutable, and Spark itself is thread-safe. This should be fine.
> Something else is going on in your code.
>
> On Fri, Jan 22, 2021 at 7:59 AM jelmer wrote:
>
>> Hi,
>>
>> I have a piece of code in which an RDD is created from a main method.
>> It then does work on this RDD from 2 different threads running in
>> parallel.
>>
>> When running this code as part of a test with a local master it will
>> sometimes make spark hang ( 1 task will never get completed)
>>
>> If I make a copy of the RDD, the job will complete fine.
>>
>> I suspect it's a bad idea to use the same RDD from two threads, but I
>> could not find any documentation on the subject.
>>
>> Should it be possible to do this, and if not, can anyone point me to
>> documentation stating that this is not supported?
>>
>> --jelmer
>>
>