Yes, I actually do use mapPartitions already.

On Feb 11, 2015 7:55 PM, "Charles Feduke" <charles.fed...@gmail.com> wrote:
> If you use mapPartitions to iterate the lookup_tables, does that improve
> the performance?
>
> This link is to the Spark 1.1 docs because both latest and 1.2 for Python
> give me a 404:
> http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions
>
> On Wed Feb 11 2015 at 1:48:42 PM rok <rokros...@gmail.com> wrote:
>
>> I was having trouble with memory exceptions when broadcasting a large
>> lookup table, so I've resorted to processing it iteratively -- but how
>> can I modify an RDD iteratively?
>>
>> I'm trying something like:
>>
>> rdd = sc.parallelize(...)
>> lookup_tables = {...}
>>
>> for lookup_table in lookup_tables:
>>     rdd = rdd.map(lambda x: func(x, lookup_table))
>>
>> If I leave it as is, then only the last "lookup_table" is applied instead
>> of all the maps being strung together. However, if I add a .cache() to
>> the .map, then it seems to work fine.
>>
>> A second problem is that the runtime roughly doubles at each iteration,
>> so this clearly doesn't seem to be the way to do it. What is the
>> preferred way of doing such repeated modifications to an RDD, and how can
>> the accumulation of overhead be minimized?
>>
>> Thanks!
>>
>> Rok
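
The "only the last lookup_table is applied" symptom looks like the standard
Python late-binding closure pitfall rather than anything Spark-specific: each
lambda captures the loop variable by reference, and because RDD
transformations are lazy, every map reads lookup_table's final value when an
action eventually runs. A minimal sketch of the usual default-argument fix
(func, lookup_tables, and the data below are placeholder assumptions, not
code from this thread):

    from pyspark import SparkContext

    sc = SparkContext(appName="iterative-rdd-maps")

    # Placeholder data and per-element function standing in for the
    # original post's func and lookup_tables (assumed shapes).
    rdd = sc.parallelize(range(100))
    lookup_tables = [{0: "a"}, {1: "b"}]

    def func(x, table):
        return table.get(x, x)

    for lookup_table in lookup_tables:
        # lt=lookup_table binds the *current* table at definition time.
        # A bare lambda would close over the loop variable by reference,
        # so every map would see only the last table at action time.
        rdd = rdd.map(lambda x, lt=lookup_table: func(x, lt))

    print(rdd.take(5))

The per-iteration slowdown is a separate issue: each map extends the RDD's
lineage, and RDD.checkpoint() (after sc.setCheckpointDir(...)) is the stock
way to truncate a lineage that grows across iterations.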