[jira] [Commented] (SPARK-26679) Deconflict spark.executor.pyspark.memory and spark.python.worker.memory

2019-01-21 Thread Hyukjin Kwon (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748307#comment-16748307 ]

Hyukjin Kwon commented on SPARK-26679:
--

I'm fine with having two different configurations if there's a possible case we
can think of.

Do you have any case in mind? If so, I think we can go for renaming. If both are
expected to be the same, I think it's better to use
{{spark.executor.pyspark.memory}} for both cases.

> Deconflict spark.executor.pyspark.memory and spark.python.worker.memory
> ---
>
> Key: SPARK-26679
> URL: https://issues.apache.org/jira/browse/SPARK-26679
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Ryan Blue
>Priority: Major
>
> In 2.4.0, spark.executor.pyspark.memory was added to limit the total memory 
> space of a Python worker. There is another RDD setting, 
> spark.python.worker.memory, that controls when Spark decides to spill data to 
> disk. These are currently similar, but not related to one another.
> PySpark should probably use spark.executor.pyspark.memory to limit or default 
> the setting of spark.python.worker.memory, because the latter property 
> controls spilling and should be lower than the total memory limit. Renaming 
> spark.python.worker.memory would also help clarity, because it sounds like it 
> should control the limit but is more like the JVM setting 
> spark.memory.fraction.
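
For context, a minimal sketch of how the two settings are passed today through
SparkConf (the values are illustrative, not recommendations):

{code:python}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.memory", "4g")           # JVM heap per executor
        .set("spark.executor.pyspark.memory", "2g")   # total memory limit for a Python worker (2.4.0+)
        .set("spark.python.worker.memory", "512m"))   # aggregation memory per worker before spilling to disk
sc = SparkContext(conf=conf)
{code}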






[jira] [Commented] (SPARK-26679) Deconflict spark.executor.pyspark.memory and spark.python.worker.memory

2019-01-22 Thread Hyukjin Kwon (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749557#comment-16749557 ]

Hyukjin Kwon commented on SPARK-26679:
--

[~rdblue], if there's no case we can currently come up with (I can't think of
one), let's combine both. We can think of this like the JVM memory limit: it
spills when it fails to allocate more memory, and it goes OOM if it tries to
use more memory than the limit.
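
A rough sketch of that combined behaviour, assuming the single limit comes from
{{spark.executor.pyspark.memory}} (the names, numbers, and spill fraction are
illustrative, not the actual worker implementation):

{code:python}
import resource

PYSPARK_MEMORY_MB = 2048   # e.g. spark.executor.pyspark.memory
SPILL_FRACTION = 0.6       # hypothetical soft threshold for the aggregation code

def used_memory_mb():
    # ru_maxrss is reported in KB on Linux; close enough for a sketch
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss >> 10

def should_spill():
    # spill to disk once usage crosses the soft threshold ...
    return used_memory_mb() > PYSPARK_MEMORY_MB * SPILL_FRACTION

# ... while a hard cap on the worker (e.g. applied via
# resource.setrlimit(resource.RLIMIT_AS, ...)) makes allocations beyond
# PYSPARK_MEMORY_MB fail outright, i.e. the OOM case.
{code}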







[jira] [Commented] (SPARK-26679) Deconflict spark.executor.pyspark.memory and spark.python.worker.memory

2019-01-23 Thread Imran Rashid (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750504#comment-16750504 ]

Imran Rashid commented on SPARK-26679:
--

I agree the old name "spark.python.worker.memory" is very confusing.  But I
also don't see how you'd combine them.  There are two extreme cases: (1) an app
which does a ton of work in Python and uses a lot of Python memory from user
code, but makes no use of the sort machinery, and (2) an app which uses the
sort machinery within Python, but allocates very little memory from user code.
I don't think you have enough hooks into Python's memory allocation to
automatically spill if the user suddenly tries to allocate more memory from
Python.

I agree it might make more sense to do something like spark.memory.fraction.
I'm not sure if we should reuse that config for deciding what fraction of the
pyspark memory goes to the pyspark shuffle machinery, or if there should be a
new config, spark.memory.pyspark.fraction.  (I can't think of a use case for
keeping those separate.)
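
For example, with a hypothetical {{spark.memory.pyspark.fraction}} the split
could be derived roughly like this (a sketch only; neither the config name nor
the numbers are settled):

{code:python}
pyspark_memory_mb = 2048    # spark.executor.pyspark.memory
pyspark_fraction = 0.6      # hypothetical spark.memory.pyspark.fraction

# budget for the PySpark sort/shuffle machinery; it would start spilling past this
shuffle_budget_mb = int(pyspark_memory_mb * pyspark_fraction)   # 1228

# headroom left for user code; exceeding the overall 2048 MB cap still means OOM
user_headroom_mb = pyspark_memory_mb - shuffle_budget_mb        # 820
{code}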







[jira] [Commented] (SPARK-26679) Deconflict spark.executor.pyspark.memory and spark.python.worker.memory

2019-01-23 Thread Hyukjin Kwon (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750594#comment-16750594 ]

Hyukjin Kwon commented on SPARK-26679:
--

{quote}
There are two extreme cases: (1) an app which does a ton of work in Python and
uses a lot of Python memory from user code, but makes no use of the sort
machinery, and (2) an app which uses the sort machinery within Python, but
allocates very little memory from user code.
{quote}

Yes, but I was wondering if we already have Spark configurations for both cases
somewhere, for instance in core (correct me if I am mistaken). The current sort
machinery in Python appears to check the given hard limit against a shared
memory space (for instance, if the limit is set to 500MB and Python somehow
uses 300MB for other purposes, Spark will use 200MB and then spill). So I guess
it won't matter how much memory the user code uses. The configuration
{{spark.python.worker.memory}} looks like it was initially introduced with a
similar goal in mind but for a different purpose.
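
Put as arithmetic, the example above works out to:

{code:python}
hard_limit_mb = 500      # e.g. spark.executor.pyspark.memory
other_usage_mb = 300     # memory Python already uses for other purposes

# what the sort machinery can consume before it has to spill to disk
available_for_sort_mb = hard_limit_mb - other_usage_mb   # 200
{code}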









[jira] [Commented] (SPARK-26679) Deconflict spark.executor.pyspark.memory and spark.python.worker.memory

2019-01-23 Thread Imran Rashid (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750660#comment-16750660 ]

Imran Rashid commented on SPARK-26679:
--

The Java side has the same problem.  Spark has no idea how much memory the user
needs for their own code.  That's why there is a spark.memory.fraction option.
I think the new option "spark.executor.pyspark.memory" is close to
"spark.executor.memory" (which sets -Xmx for each executor) on the Java side.
Most users don't mess with spark.memory.fraction, but it's there to cover those
extreme cases.
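
For comparison, the JVM-side split works roughly like this (the 300 MB reserve
and the 0.6 default come from the unified memory manager; the executor size is
illustrative):

{code:python}
executor_memory_mb = 4096   # spark.executor.memory, i.e. -Xmx per executor
reserved_mb = 300           # memory Spark reserves for itself
memory_fraction = 0.6       # spark.memory.fraction (default)

# portion Spark manages for execution + storage; the rest is left to user code
spark_managed_mb = int((executor_memory_mb - reserved_mb) * memory_fraction)   # 2277
user_code_mb = executor_memory_mb - reserved_mb - spark_managed_mb             # 1519
{code}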







[jira] [Commented] (SPARK-26679) Deconflict spark.executor.pyspark.memory and spark.python.worker.memory

2019-01-23 Thread Hyukjin Kwon (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750689#comment-16750689 ]

Hyukjin Kwon commented on SPARK-26679:
--

What exactly does the current {{spark.memory.fraction}} control? In the case of
PySpark, that might be different, since all the memory is for execution, not
for caching, storage, extra calculation, or anything else. It might ser/de some
big instances, but that's also counted as part of execution.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26679) Deconflict spark.executor.pyspark.memory and spark.python.worker.memory

2019-01-23 Thread Hyukjin Kwon (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750694#comment-16750694 ]

Hyukjin Kwon commented on SPARK-26679:
--

Also, if it controls a ratio relative to the whole available (or limited)
memory space, I guess the ratio here
(https://github.com/apache/spark/pull/21977#issuecomment-456163212) should be
fixed and made configurable via {{spark.memory.fraction}}, not
{{spark.python.worker.memory}}.







[jira] [Commented] (SPARK-26679) Deconflict spark.executor.pyspark.memory and spark.python.worker.memory

2019-01-23 Thread Felix Cheung (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750696#comment-16750696 ]

Felix Cheung commented on SPARK-26679:
--

I find the fraction configs very confusing; there is a lot of misinformation
about them, and some wrongly hardcoded configs out there. Anyway...



