Hi all,

where is the data that is passed to sc.parallelize stored? Or, put differently: when the DAG is executed, where is the data for the base RDD fetched from, if that RDD was constructed via sc.parallelize?

I am reading a CSV file via the Python csv module and am feeding the parsed data chunk by chunk to sc.parallelize, because the whole file would not fit into memory on the driver. Reading the file with sc.textFile first is not an option, as there might be line breaks inside the CSV fields, which prevents me from parsing the file line by line.
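For context, here is a minimal sketch of what I am doing (names like `read_chunks` and `chunk_size` are just illustrative; the Spark calls are shown as comments so the snippet stands alone):

```python
import csv
import io

def read_chunks(f, chunk_size=1000):
    # csv.reader correctly handles line breaks inside quoted fields,
    # which is why sc.textFile cannot be used to split the file first.
    reader = csv.reader(f)
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Feeding each chunk to Spark on the driver (illustrative, not run here):
# rdds = [sc.parallelize(chunk) for chunk in read_chunks(open("data.csv"))]
# rdd = sc.union(rdds)

# Small demonstration with an embedded line break in a quoted field:
data = 'a,"line1\nline2"\nb,c\nd,e\n'
chunks = list(read_chunks(io.StringIO(data), chunk_size=2))
```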

The problem I am facing right now is that even though I am feeding only one chunk at a time to Spark, the driver eventually runs out of memory anyway.

Thanks in advance!

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
