Full disclosure, I am *brand* new to Spark. 

I am trying to use [Py]SparkSQL standalone to pre-process a bunch of *local*
(non-HDFS) Parquet files. I have several thousand files and want to dispatch
as many workers as my machine can handle to process the data in parallel,
either per file or per record (or per batch of records) within a single
file.
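
To make that concrete, this is roughly how I'm reading a single local file
(a simplified sketch using the Spark 1.x SQLContext API; the path and app
name are just placeholders):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[8]", "parquet-preprocess")
    sqlContext = SQLContext(sc)

    # Local (non-HDFS) files can be addressed with a file:// URI;
    # this path stands in for one of my several thousand files.
    df = sqlContext.parquetFile("file:///data/parquet/part-00001.parquet")
    print(df.count())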

My question is: how can this be achieved in a standalone scenario? I have
plenty of cores and RAM, yet when I do `sc = SparkContext("local[8]")` in my
standalone script I see no speedup compared to, say, `local[1]`. I've also
tried something like `distData = sc.parallelize(data)` followed by
`distData.foreach(myFunction)` after starting with `local[N]`, yet that
seems to return immediately without producing the expected side effects from
`myFunction` (file output).
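
For reference, here's a stripped-down version of that attempt (the file list
and the body of `myFunction` are placeholders for my real pre-processing,
which writes one output file per input):

    import glob
    from pyspark import SparkContext

    sc = SparkContext("local[8]", "parquet-preprocess")

    # Distribute the list of local Parquet paths rather than the data itself.
    data = glob.glob("/data/parquet/*.parquet")  # placeholder location

    def myFunction(path):
        # Placeholder: the real function reads the Parquet file at `path`
        # and writes a pre-processed output file to local disk.
        pass

    distData = sc.parallelize(data)
    distData.foreach(myFunction)  # an action: runs myFunction once per path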

I realize parallelizing Python code on a single-node cluster is not what
Spark was designed for, but it seems to integrate Parquet and Python so well
that it's my only option. :)


Thanks,
Kyle


