[jira] [Comment Edited] (SPARK-34115) Long runtime on many environment variables

2021-01-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267161#comment-17267161
 ] 

Hyukjin Kwon edited comment on SPARK-34115 at 1/18/21, 10:42 AM:
-

I see, okay. That makes sense. Can you try to open a PR? See also 
http://spark.apache.org/contributing.html


was (Author: hyukjin.kwon):
I see, okay. That makes sense. Can you try to open a PR? See also 
http://spark.apache.org/contributing.html
If the lazy val approach does not work for any reason, a more conservative fix 
is to switch from:
{{sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")}}
to:
{{sys.props.contains("spark.testing") || sys.env.contains("SPARK_TESTING")}}
so that the cheap system-property lookup runs first and can short-circuit the 
expensive environment lookup.
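A minimal sketch of what that lazy val approach could look like, folding in the 
reordered check as well (only an illustration, not a tested patch; in Spark the 
method lives in org.apache.spark.util.Utils):

{code:scala}
object Utils {
  // Evaluated once on first access and cached afterwards, instead of
  // re-reading the whole environment on every call; the cheap
  // system-property lookup also short-circuits before the environment scan.
  lazy val isTesting: Boolean =
    sys.props.contains("spark.testing") || sys.env.contains("SPARK_TESTING")
}
{code}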

> Long runtime on many environment variables
> --
>
> Key: SPARK-34115
> URL: https://issues.apache.org/jira/browse/SPARK-34115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 2.4.7, 3.0.1
> Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
>Reporter: Norbert Schultz
>Priority: Major
> Attachments: spark-bug-34115.tar.gz
>
>
> I am not sure if this is a bug report or a feature request. The code is 
> the same in current versions of Spark, and maybe this ticket saves someone 
> some time debugging.
> We migrated some older code to Spark 2.4.0, and suddenly the integration 
> tests on our build machine were much slower than expected.
> On local machines they ran perfectly.
> In the end it turned out that Spark was wasting CPU cycles during DataFrame 
> analysis in the following functions:
>  * AnalysisHelper.assertNotAnalysisRule, which calls
>  * Utils.isTesting
> Utils.isTesting traverses all environment variables.
> The offending build machine was a Kubernetes Pod which automatically exposed 
> all services as environment variables, so it had more than 3000 environment 
> variables.
> Utils.isTesting is called very often through 
> AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
> transformUp), so this cost added up quickly.
>  
> Of course we will restrict the number of environment variables; on the other 
> hand, Utils.isTesting could also use a lazy val for
>  
> {code:java}
> sys.env.contains("SPARK_TESTING") {code}
>  
> so that it is not that expensive.
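For background on why this lookup is so expensive: scala.sys.env is defined as 
a def that copies the entire process environment into a fresh immutable Map on 
every access, so each Utils.isTesting call is linear in the number of 
environment variables. A minimal sketch of the difference (iteration count and 
names are only illustrative):

{code:scala}
object EnvLookupCost {
  private def time(label: String)(body: => Int): Unit = {
    val start = System.nanoTime()
    val hits = body
    println(s"$label: ${(System.nanoTime() - start) / 1000000} ms (hits=$hits)")
  }

  def main(args: Array[String]): Unit = {
    val iterations = 100000

    // Rebuilds the full environment Map on every iteration, which is
    // effectively what repeated sys.env.contains calls do.
    time("sys.env per call") {
      var hits = 0
      for (_ <- 1 to iterations) if (sys.env.contains("SPARK_TESTING")) hits += 1
      hits
    }

    // Copies the environment once and reuses the cached Map, which is
    // what a lazy val would give Utils.isTesting.
    val cachedEnv = sys.env
    time("cached Map") {
      var hits = 0
      for (_ <- 1 to iterations) if (cachedEnv.contains("SPARK_TESTING")) hits += 1
      hits
    }
  }
}
{code}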






[jira] [Comment Edited] (SPARK-34115) Long runtime on many environment variables

2021-01-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267163#comment-17267163
 ] 

Hyukjin Kwon edited comment on SPARK-34115 at 1/18/21, 10:42 AM:
-

Let's see if the tests pass first. We can also get some more feedback from 
other people easily with a PR.


was (Author: hyukjin.kwon):
Let's see if the tests pass first. We can also get some more feedback from 
other people easily.







[jira] [Comment Edited] (SPARK-34115) Long runtime on many environment variables

2021-01-15 Thread Norbert Schultz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265834#comment-17265834
 ] 

Norbert Schultz edited comment on SPARK-34115 at 1/15/21, 8:58 AM:
---

Added demonstration code based on Spark 2.4.7.

Call
 * show_fast.sh for the regular running time
 * show_slow.sh for a run with a lot of environment variables

Running times (locally)
 * fast: 4000 ms
 * slow: 11303 ms

The calculation performed is completely useless, but it should give Spark SQL 
something to optimize (see the sketch below).

(Also tried Spark 3.0.1, which shows the same behaviour.)
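A hypothetical sketch of the kind of workload involved (not the attached demo 
code; names and sizes are made up): it chains many cheap transformations so 
that analysis, and with it Utils.isTesting, runs very often:

{code:scala}
import org.apache.spark.sql.SparkSession

object SlowAnalysisRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("SPARK-34115-repro")
      .getOrCreate()
    import spark.implicits._

    var df = Seq(1, 2, 3).toDF("x")
    for (i <- 1 to 200) {
      // Each withColumn creates a new Dataset, which eagerly analyzes the
      // growing logical plan and calls Utils.isTesting many times via
      // AnalysisHelper.assertNotAnalysisRule.
      df = df.withColumn(s"c$i", $"x" + i)
    }
    // The result is useless; the interesting number is how much longer the
    // analysis takes when thousands of environment variables are set.
    println(df.count())
    spark.stop()
  }
}
{code}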


was (Author: nob13):
Added demonstration code based on Spark 2.4.7.

Call
 * show_fast.sh for the regular running time
 * show_slow.sh for a run with a lot of environment variables

Running times (locally)
 * fast: 4000 ms
 * slow: 11303 ms

The calculation performed is completely useless, but it should give Spark SQL 
something to optimize.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org