Yes, I actually do use mapPartitions already.

On Feb 11, 2015 7:55 PM, "Charles Feduke" <charles.fed...@gmail.com> wrote:
> If you use mapPartitions to iterate the lookup_tables, does that improve
> the performance?
>
> This link is to the Spark 1.1 docs because both latest and 1.2 for Python
> give me a 404:
> http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions
>
> On Wed Feb 11 2015 at 1:48:42 PM rok <rokros...@gmail.com> wrote:
>
>> I was having trouble with memory exceptions when broadcasting a large
>> lookup table, so I've resorted to processing it iteratively -- but how
>> can I modify an RDD iteratively?
>>
>> I'm trying something like:
>>
>> rdd = sc.parallelize(...)
>> lookup_tables = {...}
>>
>> for lookup_table in lookup_tables:
>>     rdd = rdd.map(lambda x: func(x, lookup_table))
>>
>> If I leave it as is, then only the last "lookup_table" is applied instead
>> of all the maps being strung together. However, if I add a .cache() to
>> the .map, then it seems to work fine.
>>
>> A second problem is that the runtime roughly doubles at each iteration,
>> so this clearly doesn't seem to be the way to do it. What is the
>> preferred way of doing such repeated modifications to an RDD, and how can
>> the accumulation of overhead be minimized?
>>
>> Thanks!
>>
>> Rok
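
The "only the last lookup_table is applied" symptom looks like the standard
Python late-binding closure pitfall rather than anything Spark-specific: each
lambda captures the loop variable by reference, and because RDD
transformations are lazy, every map reads lookup_table's final value when an
action eventually runs. A minimal sketch of the usual default-argument fix
(func, lookup_tables, and the data below are placeholder assumptions, not
code from this thread):

    from pyspark import SparkContext

    sc = SparkContext(appName="iterative-rdd-maps")

    # Placeholder data and per-element function standing in for the
    # original post's func and lookup_tables (assumed shapes).
    rdd = sc.parallelize(range(100))
    lookup_tables = [{0: "a"}, {1: "b"}]

    def func(x, table):
        return table.get(x, x)

    for lookup_table in lookup_tables:
        # lt=lookup_table binds the *current* table at definition time.
        # A bare lambda would close over the loop variable by reference,
        # so every map would see only the last table at action time.
        rdd = rdd.map(lambda x, lt=lookup_table: func(x, lt))

    print(rdd.take(5))

The per-iteration slowdown is a separate issue: each map extends the RDD's
lineage, and RDD.checkpoint() (after sc.setCheckpointDir(...)) is the stock
way to truncate a lineage that grows across iterations.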