On Fri, Oct 24, 2014 at 1:37 PM, xuhongnever <xuhongne...@gmail.com> wrote:
> Thank you very much.
> Changing to groupByKey works; it runs much faster.
>
> By the way, could you explain the following configurations? After
> reading the official documentation, I'm still confused. What's the
> relationship between them? Is there any memory overlap between them?
>
> *spark.python.worker.memory
> spark.executor.memory
> spark.driver.memory*

spark.driver.memory sets the JVM heap of the process that runs
together with your local Python script (called the driver).
spark.executor.memory sets the JVM heap of each JVM in the Spark
cluster (called a slave or executor).

In local mode, the driver and executor share the same JVM, so only
spark.driver.memory applies.
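
For what it's worth, a minimal sketch (the app name and the 2g/4g
values here are made up for illustration):

    from pyspark import SparkConf, SparkContext

    # spark.executor.memory: JVM heap for each executor in the cluster
    conf = (SparkConf()
            .setAppName("memory-example")
            .set("spark.executor.memory", "2g"))
    sc = SparkContext(conf=conf)

Note that spark.driver.memory is read when the driver JVM starts, so
it should be set with spark-submit --driver-memory 4g (or in
spark-defaults.conf) rather than in the script itself.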

spark.python.worker.memory is used by the Python workers in the
executor. Because of the GIL, PySpark uses multiple Python processes
in each executor, one per task. spark.python.worker.memory tells a
Python worker when to start spilling data to disk. It's not a hard
limit, so the memory actually used by a Python worker may be a little
higher than it. If you have enough memory in the executor, increasing
spark.python.worker.memory lets the Python workers use more memory
during shuffle operations (like groupBy()), which will improve
performance.
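
As a rough sketch of tuning that spill threshold (the 512m value and
the toy data are just placeholders):

    from pyspark import SparkConf, SparkContext

    # Let each Python worker buffer up to ~512m before it starts
    # spilling shuffle data to disk (a soft limit, as described above)
    conf = (SparkConf()
            .setAppName("python-worker-memory")
            .set("spark.python.worker.memory", "512m"))
    sc = SparkContext(conf=conf)

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    # groupByKey() shuffles all values for a key to one Python worker;
    # more worker memory here means less spilling during the shuffle
    grouped = pairs.groupByKey().mapValues(list).collect()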
