Hi all,

where is the data that is passed to sc.parallelize stored? Or, put differently: when the DAG is executed, where is the data for the base RDD fetched from, if that RDD was constructed via sc.parallelize?

I am reading a CSV file via the Python csv module and am feeding the parsed data chunk by chunk to sc.parallelize, because the whole file would not fit into memory on the driver. Reading the file with sc.textFile first is not an option, as there might be line breaks inside the CSV fields, which prevents me from parsing the file line by line.
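For context, here is a minimal sketch of what I am doing (names like `read_chunks` and `chunk_size` are just illustrative; the Spark calls are shown as comments so the snippet stands alone):

```python
import csv
import io

def read_chunks(f, chunk_size=1000):
    # csv.reader correctly handles line breaks inside quoted fields,
    # which is why sc.textFile cannot be used to split the file first.
    reader = csv.reader(f)
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Feeding each chunk to Spark on the driver (illustrative, not run here):
# rdds = [sc.parallelize(chunk) for chunk in read_chunks(open("data.csv"))]
# rdd = sc.union(rdds)

# Small demonstration with an embedded line break in a quoted field:
data = 'a,"line1\nline2"\nb,c\nd,e\n'
chunks = list(read_chunks(io.StringIO(data), chunk_size=2))
```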

The problem I am facing right now is that even though I am feeding only one chunk at a time to Spark, the driver eventually runs out of memory anyway.

Thanks in advance!

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
