[jira] [Commented] (SPARK-21943) When I use rest Api (/ applications / [app-id] / jobs / [job-id]) to view some of the jobs that are running jobs, the returned json information is missing the “description” field
[ https://issues.apache.org/jira/browse/SPARK-21943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160814#comment-16160814 ]

Saisai Shao commented on SPARK-21943:
-------------------------------------

From the code, the job description is taken from the description of the last
stage in the job. So it looks like there is no last-stage description for job
id "7". You would need to trace the code to find out why, but I suspect this
is intended behavior.

> When I use rest Api (/ applications / [app-id] / jobs / [job-id]) to view
> some of the jobs that are running jobs, the returned json information is
> missing the “description” field.
> -------------------------------------------------------------------------
>
>                 Key: SPARK-21943
>                 URL: https://issues.apache.org/jira/browse/SPARK-21943
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.1.1
>            Reporter: xianquan
>            Priority: Minor
>              Labels: rest_api
>
> When I use the rest Api (/ applications / [app-id] / jobs / [job-id]) to
> view some of the jobs that are running, some of the returned json entries
> are missing the “description” field.
> The returned json results are as follows:
> [ {
>   "jobId" : 7,
>   "name" : "run at AccessController.java:0",
>   "submissionTime" : "2017-09-07T09:44:53.632GMT",
>   "stageIds" : [ 19, 17, 18 ],
>   "jobGroup" : "cee1fb91-56bd-4d53-aed4-d409d21809da",
>   "status" : "RUNNING",
>   "numTasks" : 202,
>   "numActiveTasks" : 1,
>   "numCompletedTasks" : 0,
>   "numSkippedTasks" : 0,
>   "numFailedTasks" : 0,
>   "numActiveStages" : 2,
>   "numCompletedStages" : 0,
>   "numSkippedStages" : 0,
>   "numFailedStages" : 0
> }, {
>   "jobId" : 6,
>   "name" : "run at AccessController.java:0",
>   "description" : "select * from test",
>   "submissionTime" : "2017-09-07T09:54:09.532GMT",
>   "stageIds" : [ 24 ],
>   "jobGroup" : "de8071d7-cb09-47af-a343-3d84946c2aff",
>   "status" : "RUNNING",
>   "numTasks" : 1,
>   "numActiveTasks" : 0,
>   "numCompletedTasks" : 0,
>   "numSkippedTasks" : 0,
>   "numFailedTasks" : 0,
>   "numActiveStages" : 1,
>   "numCompletedStages" : 0,
>   "numSkippedStages" : 0,
>   "numFailedStages" : 0
> } ]

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
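[Editor's note: per the comment above, the "description" field is populated from the job's last stage and can legitimately be absent, so REST clients should treat it as optional. A minimal sketch in plain Python (not Spark client code), with the JSON abbreviated from the samples in this issue:]

```python
import json

# Abbreviated from the JSON in this issue: job 7 has no "description",
# job 6 does.
payload = """[
  {"jobId": 7, "name": "run at AccessController.java:0", "status": "RUNNING"},
  {"jobId": 6, "name": "run at AccessController.java:0",
   "description": "select * from test", "status": "RUNNING"}
]"""

def job_description(job):
    # Treat "description" as optional rather than assuming it is present.
    return job.get("description", "<no description>")

for job in json.loads(payload):
    print(job["jobId"], job_description(job))
```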
[jira] [Closed] (SPARK-21906) No need to runAsSparkUser to switch UserGroupInformation in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-21906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saisai Shao closed SPARK-21906.
-------------------------------
    Resolution: Not A Problem

> No need to runAsSparkUser to switch UserGroupInformation in YARN mode
> ---------------------------------------------------------------------
>
>                 Key: SPARK-21906
>                 URL: https://issues.apache.org/jira/browse/SPARK-21906
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, YARN
>    Affects Versions: 2.2.0
>            Reporter: Kent Yao
>
> 1. The YARN application's ugi is determined by the ugi that launches it.
> 2. runAsSparkUser switches to a ugi that is the same as the current one,
> because we have already set
> {code:java}
> env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName()
> {code}
> in the AM container context:
> {code:java}
> def runAsSparkUser(func: () => Unit) {
>   val user = Utils.getCurrentUserName()  // get the current user
>   logDebug("running as user: " + user)
>   val ugi = UserGroupInformation.createRemoteUser(user)  // new ugi for the same user
>   transferCredentials(UserGroupInformation.getCurrentUser(), ugi)  // copy its own credentials
>   ugi.doAs(new PrivilegedExceptionAction[Unit] {  // doAs as itself
>     def run: Unit = func()
>   })
> }
> {code}
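[Editor's note: the redundancy the ticket describes can be modeled in a few lines. This is an illustrative sketch, not Hadoop's UserGroupInformation API; the `Ugi` class and `run_as_spark_user` below are hypothetical stand-ins.]

```python
class Ugi:
    """Hypothetical stand-in for Hadoop's UserGroupInformation."""
    def __init__(self, user, credentials=None):
        self.user = user
        self.credentials = set(credentials or ())

def run_as_spark_user(current, func):
    # Mirrors the Scala above: build a new ugi for the *current* user name
    # and copy the current credentials into it...
    ugi = Ugi(current.user)
    ugi.credentials |= current.credentials
    # ...so the ugi we doAs under is indistinguishable from the one we
    # already have, which is why the switch is unnecessary in YARN mode.
    assert (ugi.user, ugi.credentials) == (current.user, current.credentials)
    return func()

run_as_spark_user(Ugi("spark", {"hdfs-delegation-token"}), lambda: None)
```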
[jira] [Commented] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160758#comment-16160758 ]

Apache Spark commented on SPARK-21972:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/19186

> Allow users to control input data persistence in ML Estimators via a
> handlePersistence ml.Param
> ---------------------------------------------------------------------
>
>                 Key: SPARK-21972
>                 URL: https://issues.apache.org/jira/browse/SPARK-21972
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 2.2.0
>            Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans,
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately
> ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b)
> check the persistence level of the input dataset but not any of its parents.
> These issues can result in unwanted double-caching of input data & degraded
> performance (see
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
>
> This ticket proposes adding a boolean {{handlePersistence}} param
> (org.apache.spark.ml.param) so that users can specify whether an ML
> algorithm should try to cache un-cached input data. {{handlePersistence}}
> will be {{true}} by default, corresponding to existing behavior (always
> persisting uncached input), but users can achieve finer-grained control over
> input persistence by setting {{handlePersistence}} to {{false}}.
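[Editor's note: the proposed behavior can be sketched as follows. This is a hedged sketch in plain Python with stub classes, not Spark's ML API; `fit_with_persistence` and the `Dataset` stub are hypothetical names introduced for illustration.]

```python
class Dataset:
    """Stub standing in for a Spark Dataset; tracks only persistence."""
    def __init__(self, persisted=False):
        self.persisted = persisted
    def cache(self):
        self.persisted = True
    def unpersist(self):
        self.persisted = False

def fit_with_persistence(dataset, handle_persistence=True):
    """Persist uncached input only when handlePersistence is true.

    True (the default) matches existing behavior: always cache uncached
    input. False leaves persistence entirely to the caller, avoiding the
    double-caching described in SPARK-21799.
    Returns whether the algorithm cached the input itself.
    """
    cached_here = handle_persistence and not dataset.persisted
    if cached_here:
        dataset.cache()
    # ... run the training iterations ...
    if cached_here:
        dataset.unpersist()  # clean up only what we cached ourselves
    return cached_here
```

With `handle_persistence=False`, a user who has already persisted a parent of the input (a case the current check misses) can prevent the estimator from caching a second copy.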
[jira] [Assigned] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21972:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Assigned] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21972:
------------------------------------

    Assignee: Apache Spark
[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Murching updated SPARK-21972:
---------------------------------------
    Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call {{cache()}} on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents. These issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache un-cached input data. {{handlePersistence}} will be {{true}} by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting {{handlePersistence}} to {{false}}.

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call {{cache()}} on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents. These issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache un-cached input data. {{handlePersistence}} will be {{true}} by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting {{handlePersistence}} to {{false}}.
[jira] [Commented] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160624#comment-16160624 ]

Siddharth Murching commented on SPARK-21972:
--------------------------------------------

Work has already begun on this in this PR:
[https://github.com/apache/spark/pull/17014]
[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Murching updated SPARK-21972:
---------------------------------------
    Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call {{cache()}} on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents. These issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache un-cached input data. {{handlePersistence}} will be {{true}} by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting {{handlePersistence}} to {{false}}.

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call {{cache()}} on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents. These issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache un-cached input data. {{handlePersistence}} will be {{true}} by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting {{handlePersistence}} to {{false}} (algorithms will not try to persist uncached input).
[jira] [Comment Edited] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160624#comment-16160624 ]

Siddharth Murching edited comment on SPARK-21972 at 9/11/17 3:46 AM:
---------------------------------------------------------------------

This issue is being worked on in this PR:
[https://github.com/apache/spark/pull/17014]

  was (Author: siddharth murching):
Work has already begun on this in this PR:
[https://github.com/apache/spark/pull/17014]
[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Murching updated SPARK-21972:
---------------------------------------
    Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call {{cache()}} on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents. These issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache un-cached input data. {{handlePersistence}} will be {{true}} by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting {{handlePersistence}} to {{false}} (algorithms will not try to persist uncached input).

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call {{cache()}} on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents; these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache un-cached input data. {{handlePersistence}} will be {{true}} by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting {{handlePersistence}} to {{false}}.
[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Murching updated SPARK-21972:
---------------------------------------
    Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call {{cache()}} on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents; these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache un-cached input data. {{handlePersistence}} will be {{true}} by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting {{handlePersistence}} to {{false}}.

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call `cache()` on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents; these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean `handlePersistence` param (org.apache.spark.ml.param) to the abovementioned estimators so that users can specify whether an ML algorithm should try to cache un-cached input data. `handlePersistence` will be `true` by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting `handlePersistence` to `false`.
[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Murching updated SPARK-21972:
---------------------------------------
    Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call `cache()` on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents; these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean `handlePersistence` param (org.apache.spark.ml.param) to the abovementioned estimators so that users can specify whether an ML algorithm should try to cache un-cached input data. `handlePersistence` will be `true` by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting `handlePersistence` to `false`.

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call `cache()` on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (as described in [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents; both of these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean `handlePersistence` param (org.apache.spark.ml.param) to the abovementioned estimators so that users can specify whether an ML algorithm should try to cache un-cached input data. `handlePersistence` will be `true` by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting `handlePersistence` to `false`.
[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Murching updated SPARK-21972: --- Description: Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call `cache()` on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents; these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). This ticket proposes adding a boolean `handlePersistence` param (org.apache.spark.ml.param) to the abovementioned estimators so that users can specify whether an ML algorithm should try to cache un-cached input data. `handlePersistence` will be `true` by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting `handlePersistence` to `false`. was: Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call `cache()` on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents; these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]. This ticket proposes adding a boolean `handlePersistence` param (org.apache.spark.ml.param) to the abovementioned estimators so that users can specify whether an ML algorithm should try to cache un-cached input data. 
`handlePersistence` will be `true` by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting `handlePersistence` to `false`. > Allow users to control input data persistence in ML Estimators via a > handlePersistence ml.Param > --- > > Key: SPARK-21972 > URL: https://issues.apache.org/jira/browse/SPARK-21972 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Siddharth Murching > > Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, > etc) call `cache()` on uncached input datasets to improve performance. > Unfortunately, these algorithms a) check input persistence inaccurately (see > [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check > the persistence level of the input dataset but not any of its parents; these > issues can result in unwanted double-caching of input data & degraded > performance (see > [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). > This ticket proposes adding a boolean `handlePersistence` param > (org.apache.spark.ml.param) to the abovementioned estimators so that users > can specify whether an ML algorithm should try to cache un-cached input data. > `handlePersistence` will be `true` by default, corresponding to existing > behavior (always persisting uncached input), but users can achieve > finer-grained control over input persistence by setting `handlePersistence` > to `false`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
Siddharth Murching created SPARK-21972: -- Summary: Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param Key: SPARK-21972 URL: https://issues.apache.org/jira/browse/SPARK-21972 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.2.0 Reporter: Siddharth Murching Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc.) call `cache()` on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (as described in [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not of any of its parents; both of these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). This ticket proposes adding a boolean `handlePersistence` param (org.apache.spark.ml.param) to the above-mentioned estimators so that users can specify whether an ML algorithm should try to cache uncached input data. `handlePersistence` will be `true` by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting `handlePersistence` to `false`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
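To make the proposal concrete, here is a minimal plain-Java sketch of the intended decision logic. All names here (ToyDataset, isPersistedAnywhere, fitWouldCache) are illustrative stand-ins, not Spark's actual API; the point is only that the cache decision should consult both the user's `handlePersistence` choice and the persistence state of the whole lineage, not just the input dataset itself.

```java
// Hypothetical sketch of the handlePersistence decision (toy types,
// not Spark classes). A dataset's lineage is modeled as a parent chain.
public class PersistenceSketch {

    // Stand-in for a Dataset with a chain of parent datasets.
    static class ToyDataset {
        final ToyDataset parent;
        boolean cached;
        ToyDataset(ToyDataset parent) { this.parent = parent; }
    }

    // Parent-aware persistence check: true if this dataset or any
    // ancestor in its lineage is already cached.
    static boolean isPersistedAnywhere(ToyDataset ds) {
        for (ToyDataset cur = ds; cur != null; cur = cur.parent) {
            if (cur.cached) return true;
        }
        return false;
    }

    // An estimator's fit() would cache only when the user opted in
    // AND nothing in the lineage is cached, avoiding double-caching.
    static boolean fitWouldCache(ToyDataset input, boolean handlePersistence) {
        return handlePersistence && !isPersistedAnywhere(input);
    }

    public static void main(String[] args) {
        ToyDataset parent = new ToyDataset(null);
        parent.cached = true;                      // user cached the parent
        ToyDataset child = new ToyDataset(parent); // derived, uncached

        System.out.println(fitWouldCache(child, true));                  // false: parent already cached
        System.out.println(fitWouldCache(new ToyDataset(null), true));   // true: default behavior, cache it
        System.out.println(fitWouldCache(new ToyDataset(null), false));  // false: user opted out
    }
}
```

With `handlePersistence=true` (the proposed default) the sketch reproduces today's behavior for truly uncached input, while the lineage walk prevents the double-caching described in SPARK-21799.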
[jira] [Updated] (SPARK-21955) OneForOneStreamManager may leak memory when network is poor
[ https://issues.apache.org/jira/browse/SPARK-21955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] poseidon updated SPARK-21955: - Description: While tracing how streams, chunks, and blocks work in the Netty transport layer, I found a nasty case. Processing an OpenBlocks message registers a stream in OneForOneStreamManager: org.apache.spark.network.server.OneForOneStreamManager#registerStream fills a StreamState with the app ID and buffers. Processing a ChunkFetchRequest registers the channel: org.apache.spark.network.server.OneForOneStreamManager#registerChannel fills the StreamState with the channel. In org.apache.spark.network.shuffle.OneForOneBlockFetcher#start, OpenBlocks -> ChunkFetchRequest arrive in sequence. From spark-1.6.1\network\common\src\main\java\org\apache\spark\network\server\OneForOneStreamManager.java:
{code}
@Override
public void registerChannel(Channel channel, long streamId) {
  if (streams.containsKey(streamId)) {
    streams.get(streamId).associatedChannel = channel;
  }
}
{code}
This is the only place where associatedChannel is set.
{code}
public void connectionTerminated(Channel channel) {
  // Close all streams which have been associated with the channel.
  for (Map.Entry entry : streams.entrySet()) {
    StreamState state = entry.getValue();
    if (state.associatedChannel == channel) {
      streams.remove(entry.getKey());
      // Release all remaining buffers.
      while (state.buffers.hasNext()) {
        state.buffers.next().release();
      }
    }
  }
}
{code}
This is the only place where state.buffers is released. If the network goes down while the OpenBlocks message is being processed, no ChunkFetchRequest ever arrives, so the channel is never set, and leaked buffers become visible in OneForOneStreamManager: !screenshot-1.png! If org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel is never set then, from what I can find in the code, the stream state remains in memory forever, which may lead to OOM in the NodeManager, because the only ways to release it are a channel close or someone reading the last piece of the block. 
In OneForOneStreamManager#registerStream we can set the channel as well, to guard against this case: if we set the channel when we register the stream, the buffers can always be released. was: just in my way to know how stream , chunk , block works in netty found some nasty case. process OpenBlocks message registerStream Stream in OneForOneStreamManager org.apache.spark.network.server.OneForOneStreamManager#registerStream fill with streamState with app & buber process ChunkFetchRequest registerChannel org.apache.spark.network.server.OneForOneStreamManager#registerChannel fill with streamState with channel In org.apache.spark.network.shuffle.OneForOneBlockFetcher#start OpenBlocks -> ChunkFetchRequest come in sequnce. spark-1.6.1\network\common\src\main\java\org\apache\spark\network\server\OneForOneStreamManager.java @Override public void registerChannel(Channel channel, long streamId) { if (streams.containsKey(streamId)) { streams.get(streamId).associatedChannel = channel; } } this is only chance associatedChannel is set public void connectionTerminated(Channel channel) { // Close all streams which have been associated with the channel. for (Map.Entry entry: streams.entrySet()) { StreamState state = entry.getValue(); if (state.associatedChannel == channel) { streams.remove(entry.getKey()); // Release all remaining buffers. while (state.buffers.hasNext()) { state.buffers.next().release(); } } } this is only chance state.buffers is released. If network down in OpenBlocks process, no more ChunkFetchRequest message then. So, channel can not be set. So, we can see some leaked Buffer in OneForOneStreamManager !screenshot-1.png! if org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel is not set, then after search the code , it will remain in memory forever. Which may lead to OOM in NodeManager. Because the only way to release it was in channel close , or someone read the last piece of block. 
OneForOneStreamManager#registerStream we can set channel in this method, just in case of this case. > OneForOneStreamManager may leak memory when network is poor > --- > > Key: SPARK-21955 > URL: https://issues.apache.org/jira/browse/SPARK-21955 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: hdp 2.4.2.0-258 > spark 1.6 >Reporter: poseidon > Attachments: screenshot-1.png > > > just in my way to know how stream , chunk , block works in netty found some > nasty case. > process OpenBlocks message registerStream Stream in OneForOneStreamManager > org.apache.spark.network.server.OneForOneStreamManager#registerStream > fill
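The proposed fix can be illustrated with a minimal plain-Java sketch. This is a toy model, not Spark's real OneForOneStreamManager (the channel is a String, buffers are a counter): it only demonstrates that if the channel association happens at registerStream() time, connectionTerminated() can find and release the stream's buffers even when the connection dies before any ChunkFetchRequest arrives.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy stand-in for OneForOneStreamManager demonstrating the proposed fix:
// associate the channel at registerStream() time, not at registerChannel().
public class StreamManagerSketch {

    static class StreamState {
        final String channel;   // stand-in for a Netty Channel
        int unreleasedBuffers;  // stand-in for the ManagedBuffer iterator
        StreamState(String channel, int buffers) {
            this.channel = channel;
            this.unreleasedBuffers = buffers;
        }
    }

    final Map<Long, StreamState> streams = new ConcurrentHashMap<>();
    private long nextStreamId = 0;

    // Fix: the channel is recorded here, when OpenBlocks is processed,
    // instead of waiting for a ChunkFetchRequest that may never come.
    long registerStream(String channel, int buffers) {
        long id = nextStreamId++;
        streams.put(id, new StreamState(channel, buffers));
        return id;
    }

    // Every stream opened on this channel is now found and released,
    // so buffers cannot leak when the network dies early.
    int connectionTerminated(String channel) {
        int released = 0;
        Iterator<Map.Entry<Long, StreamState>> it = streams.entrySet().iterator();
        while (it.hasNext()) {
            StreamState state = it.next().getValue();
            if (state.channel.equals(channel)) {
                released += state.unreleasedBuffers;  // "release" the buffers
                state.unreleasedBuffers = 0;
                it.remove();
            }
        }
        return released;
    }

    public static void main(String[] args) {
        StreamManagerSketch m = new StreamManagerSketch();
        m.registerStream("ch-1", 3);  // OpenBlocks handled, then network dies
        System.out.println(m.connectionTerminated("ch-1"));  // 3 buffers released
        System.out.println(m.streams.size());                // 0, nothing leaked
    }
}
```

In the original code the same termination scenario leaves the StreamState in the map forever, because its associatedChannel is still null and never matches the terminated channel.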
[jira] [Updated] (SPARK-21955) OneForOneStreamManager may leak memory when network is poor
[ https://issues.apache.org/jira/browse/SPARK-21955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] poseidon updated SPARK-21955: - Description: just in my way to know how stream , chunk , block works in netty found some nasty case. process OpenBlocks message registerStream Stream in OneForOneStreamManager org.apache.spark.network.server.OneForOneStreamManager#registerStream fill with streamState with app & buber process ChunkFetchRequest registerChannel org.apache.spark.network.server.OneForOneStreamManager#registerChannel fill with streamState with channel In org.apache.spark.network.shuffle.OneForOneBlockFetcher#start OpenBlocks -> ChunkFetchRequest come in sequnce. spark-1.6.1\network\common\src\main\java\org\apache\spark\network\server\OneForOneStreamManager.java @Override public void registerChannel(Channel channel, long streamId) { if (streams.containsKey(streamId)) { streams.get(streamId).associatedChannel = channel; } } this is only chance associatedChannel is set public void connectionTerminated(Channel channel) { // Close all streams which have been associated with the channel. for (Map.Entry entry: streams.entrySet()) { StreamState state = entry.getValue(); if (state.associatedChannel == channel) { streams.remove(entry.getKey()); // Release all remaining buffers. while (state.buffers.hasNext()) { state.buffers.next().release(); } } } this is only chance state.buffers is released. If network down in OpenBlocks process, no more ChunkFetchRequest message then. So, channel can not be set. So, we can see some leaked Buffer in OneForOneStreamManager !screenshot-1.png! if org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel is not set, then after search the code , it will remain in memory forever. Because the only way to release it was in channel close , or someone read the last piece of block. OneForOneStreamManager#registerStream we can set channel in this method, just in case of this case. 
was: just in my way to know how stream , chunk , block works in netty found some nasty case. process OpenBlocks message registerStream Stream in OneForOneStreamManager org.apache.spark.network.server.OneForOneStreamManager#registerStream fill with streamState with app & buber process ChunkFetchRequest registerChannel org.apache.spark.network.server.OneForOneStreamManager#registerChannel fill with streamState with channel In org.apache.spark.network.shuffle.OneForOneBlockFetcher#start OpenBlocks -> ChunkFetchRequest come in sequnce. If network down in OpenBlocks process, no more ChunkFetchRequest message then. So, we can see some leaked Buffer in OneForOneStreamManager !attachment-name.jpg|thumbnail! if org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel is not set, then after search the code , it will remain in memory forever. Because the only way to release it was in channel close , or someone read the last piece of block. OneForOneStreamManager#registerStream we can set channel in this method, just in case of this case. > OneForOneStreamManager may leak memory when network is poor > --- > > Key: SPARK-21955 > URL: https://issues.apache.org/jira/browse/SPARK-21955 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: hdp 2.4.2.0-258 > spark 1.6 >Reporter: poseidon > Attachments: screenshot-1.png > > > just in my way to know how stream , chunk , block works in netty found some > nasty case. > process OpenBlocks message registerStream Stream in OneForOneStreamManager > org.apache.spark.network.server.OneForOneStreamManager#registerStream > fill with streamState with app & buber > process ChunkFetchRequest registerChannel > org.apache.spark.network.server.OneForOneStreamManager#registerChannel > fill with streamState with channel > In > org.apache.spark.network.shuffle.OneForOneBlockFetcher#start > OpenBlocks -> ChunkFetchRequest come in sequnce. 
> spark-1.6.1\network\common\src\main\java\org\apache\spark\network\server\OneForOneStreamManager.java > @Override > public void registerChannel(Channel channel, long streamId) { > if (streams.containsKey(streamId)) { > streams.get(streamId).associatedChannel = channel; > } > } > this is only chance associatedChannel is set > public void connectionTerminated(Channel channel) { > // Close all streams which have been associated with the channel. > for (Map.Entry entry: streams.entrySet()) { > StreamState state = entry.getValue(); > if (state.associatedChannel == channel) { > streams.remove(entry.getKey()); > // Release al
[jira] [Updated] (SPARK-21955) OneForOneStreamManager may leak memory when network is poor
[ https://issues.apache.org/jira/browse/SPARK-21955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] poseidon updated SPARK-21955: - Description: just in my way to know how stream , chunk , block works in netty found some nasty case. process OpenBlocks message registerStream Stream in OneForOneStreamManager org.apache.spark.network.server.OneForOneStreamManager#registerStream fill with streamState with app & buber process ChunkFetchRequest registerChannel org.apache.spark.network.server.OneForOneStreamManager#registerChannel fill with streamState with channel In org.apache.spark.network.shuffle.OneForOneBlockFetcher#start OpenBlocks -> ChunkFetchRequest come in sequnce. spark-1.6.1\network\common\src\main\java\org\apache\spark\network\server\OneForOneStreamManager.java @Override public void registerChannel(Channel channel, long streamId) { if (streams.containsKey(streamId)) { streams.get(streamId).associatedChannel = channel; } } this is only chance associatedChannel is set public void connectionTerminated(Channel channel) { // Close all streams which have been associated with the channel. for (Map.Entry entry: streams.entrySet()) { StreamState state = entry.getValue(); if (state.associatedChannel == channel) { streams.remove(entry.getKey()); // Release all remaining buffers. while (state.buffers.hasNext()) { state.buffers.next().release(); } } } this is only chance state.buffers is released. If network down in OpenBlocks process, no more ChunkFetchRequest message then. So, channel can not be set. So, we can see some leaked Buffer in OneForOneStreamManager !screenshot-1.png! if org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel is not set, then after search the code , it will remain in memory forever. Which may lead to OOM in NodeManager. Because the only way to release it was in channel close , or someone read the last piece of block. 
OneForOneStreamManager#registerStream we can set channel in this method, just in case of this case. was: just in my way to know how stream , chunk , block works in netty found some nasty case. process OpenBlocks message registerStream Stream in OneForOneStreamManager org.apache.spark.network.server.OneForOneStreamManager#registerStream fill with streamState with app & buber process ChunkFetchRequest registerChannel org.apache.spark.network.server.OneForOneStreamManager#registerChannel fill with streamState with channel In org.apache.spark.network.shuffle.OneForOneBlockFetcher#start OpenBlocks -> ChunkFetchRequest come in sequnce. spark-1.6.1\network\common\src\main\java\org\apache\spark\network\server\OneForOneStreamManager.java @Override public void registerChannel(Channel channel, long streamId) { if (streams.containsKey(streamId)) { streams.get(streamId).associatedChannel = channel; } } this is only chance associatedChannel is set public void connectionTerminated(Channel channel) { // Close all streams which have been associated with the channel. for (Map.Entry entry: streams.entrySet()) { StreamState state = entry.getValue(); if (state.associatedChannel == channel) { streams.remove(entry.getKey()); // Release all remaining buffers. while (state.buffers.hasNext()) { state.buffers.next().release(); } } } this is only chance state.buffers is released. If network down in OpenBlocks process, no more ChunkFetchRequest message then. So, channel can not be set. So, we can see some leaked Buffer in OneForOneStreamManager !screenshot-1.png! if org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel is not set, then after search the code , it will remain in memory forever. Because the only way to release it was in channel close , or someone read the last piece of block. OneForOneStreamManager#registerStream we can set channel in this method, just in case of this case. 
> OneForOneStreamManager may leak memory when network is poor > --- > > Key: SPARK-21955 > URL: https://issues.apache.org/jira/browse/SPARK-21955 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: hdp 2.4.2.0-258 > spark 1.6 >Reporter: poseidon > Attachments: screenshot-1.png > > > just in my way to know how stream , chunk , block works in netty found some > nasty case. > process OpenBlocks message registerStream Stream in OneForOneStreamManager > org.apache.spark.network.server.OneForOneStreamManager#registerStream > fill with streamState with app & buber > process ChunkFetchRequest registerChannel > org.apache.spark.network.server.
[jira] [Updated] (SPARK-21955) OneForOneStreamManager may leak memory when network is poor
[ https://issues.apache.org/jira/browse/SPARK-21955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] poseidon updated SPARK-21955: - Attachment: screenshot-1.png > OneForOneStreamManager may leak memory when network is poor > --- > > Key: SPARK-21955 > URL: https://issues.apache.org/jira/browse/SPARK-21955 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: hdp 2.4.2.0-258 > spark 1.6 >Reporter: poseidon > Attachments: screenshot-1.png > > > just in my way to know how stream , chunk , block works in netty found some > nasty case. > process OpenBlocks message registerStream Stream in OneForOneStreamManager > org.apache.spark.network.server.OneForOneStreamManager#registerStream > fill with streamState with app & buber > process ChunkFetchRequest registerChannel > org.apache.spark.network.server.OneForOneStreamManager#registerChannel > fill with streamState with channel > In > org.apache.spark.network.shuffle.OneForOneBlockFetcher#start > OpenBlocks -> ChunkFetchRequest come in sequnce. > If network down in OpenBlocks process, no more ChunkFetchRequest message > then. > So, we can see some leaked Buffer in OneForOneStreamManager > !attachment-name.jpg|thumbnail! > if > org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel > is not set, then after search the code , it will remain in memory forever. > Because the only way to release it was in channel close , or someone read the > last piece of block. > OneForOneStreamManager#registerStream we can set channel in this method, just > in case of this case. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21971) Too many open files in Spark due to concurrent files being opened
[ https://issues.apache.org/jira/browse/SPARK-21971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21971: -- Affects Version/s: 2.2.0 > Too many open files in Spark due to concurrent files being opened > - > > Key: SPARK-21971 > URL: https://issues.apache.org/jira/browse/SPARK-21971 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Rajesh Balamohan >Priority: Minor > > When running Q67 of TPC-DS at 1 TB dataset on multi node cluster, it > consistently fails with "too many open files" exception. > {noformat} > O scheduler.TaskSetManager: Finished task 25.0 in stage 844.0 (TID 243786) in > 394 ms on machine111.xyz (executor 2) (189/200) > 17/08/20 10:33:45 INFO scheduler.TaskSetManager: Finished task 172.0 in stage > 844.0 (TID 243932) in 11996 ms on cn116-10.l42scl.hortonworks.com (executor > 6) (190/200) > 17/08/20 10:37:40 WARN scheduler.TaskSetManager: Lost task 144.0 in stage > 844.0 (TID 243904, machine1.xyz, executor 1): > java.nio.file.FileSystemException: > /grid/3/hadoop/yarn/local/usercache/rbalamohan/appcache/application_1490656001509_7207/blockmgr-5180e3f0-f7ed-44bb-affc-8f99f09ba7bc/28/temp_local_690afbf7-172d-4fdb-8492-3e2ebd8d5183: > Too many open files > at > sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) > at > sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) > at > sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) > at > sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177) > at java.nio.channels.FileChannel.open(FileChannel.java:287) > at java.nio.channels.FileChannel.open(FileChannel.java:335) > at > org.apache.spark.io.NioBufferedFileInputStream.(NioBufferedFileInputStream.java:43) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.(UnsafeSorterSpillReader.java:75) > at > 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getIterator(UnsafeExternalSorter.java:607) > at > org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:169) > at > org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:173) > {noformat} > The cluster was configured with multiple cores per executor. > The window function uses "spark.sql.windowExec.buffer.spill.threshold=4096", which > causes a large number of spills on larger datasets. With multiple cores per > executor, this reproduces easily. > {{UnsafeExternalSorter::getIterator()}} invokes {{spillWriter.getReader}} for > all the available spillWriters. {{UnsafeSorterSpillReader}} opens the file > in its constructor and only closes it later as part of its close() call. > This causes the "too many open files" issue. > Note that this is not a file leak, but rather a question of how many files are > open concurrently at any given time, which depends on the dataset being processed. > One option could be to increase "spark.sql.windowExec.buffer.spill.threshold" > so that fewer spill files are generated, but it is hard to determine the > sweet spot for all workloads. Another option is to set the file ulimit to "unlimited", > but that would not be a good production setting. It would be good > to consider reducing the number of files opened concurrently by > {{UnsafeExternalSorter::getIterator}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
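One way to bound the number of concurrently open descriptors is to defer opening each spill file until it is actually read. The sketch below is a hypothetical helper, not Spark's actual UnsafeSorterSpillReader (which, as noted above, opens the file in its constructor); it only illustrates the lazy-open idea, under which the open-descriptor count tracks active readers rather than the total number of spill files handed out by getIterator().

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical lazy spill reader: no file descriptor is consumed until
// the first read() call, so constructing many readers is cheap.
public class LazySpillReader implements AutoCloseable {
    private final Path spillFile;
    private InputStream in;  // opened lazily, not in the constructor

    public LazySpillReader(Path spillFile) {
        this.spillFile = spillFile;  // no descriptor opened here
    }

    public boolean isOpen() { return in != null; }

    public int read() throws IOException {
        if (in == null) {
            in = Files.newInputStream(spillFile);  // open on first use
        }
        return in.read();
    }

    @Override
    public void close() throws IOException {
        if (in != null) { in.close(); in = null; }
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("spill", ".bin");
        Files.write(f, new byte[] {42});
        try (LazySpillReader r = new LazySpillReader(f)) {
            System.out.println(r.isOpen());  // false: constructor opened nothing
            System.out.println(r.read());    // 42
            System.out.println(r.isOpen());  // true: opened on first read
        }
        Files.delete(f);
    }
}
```

This is a mitigation sketch only; whether it fits UnsafeSorterSpillReader's actual buffering and decompression path would need to be assessed against the real code.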
[jira] [Assigned] (SPARK-21854) Python interface for MLOR summary
[ https://issues.apache.org/jira/browse/SPARK-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21854: Assignee: Apache Spark > Python interface for MLOR summary > - > > Key: SPARK-21854 > URL: https://issues.apache.org/jira/browse/SPARK-21854 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Apache Spark > > Python interface for MLOR summary -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21854) Python interface for MLOR summary
[ https://issues.apache.org/jira/browse/SPARK-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21854: Assignee: (was: Apache Spark) > Python interface for MLOR summary > - > > Key: SPARK-21854 > URL: https://issues.apache.org/jira/browse/SPARK-21854 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Python interface for MLOR summary -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21854) Python interface for MLOR summary
[ https://issues.apache.org/jira/browse/SPARK-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160616#comment-16160616 ] Apache Spark commented on SPARK-21854: -- User 'jmwdpk' has created a pull request for this issue: https://github.com/apache/spark/pull/19185 > Python interface for MLOR summary > - > > Key: SPARK-21854 > URL: https://issues.apache.org/jira/browse/SPARK-21854 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Python interface for MLOR summary -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21955) OneForOneStreamManager may leak memory when network is poor
[ https://issues.apache.org/jira/browse/SPARK-21955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] poseidon updated SPARK-21955: - Description: just in my way to know how stream , chunk , block works in netty found some nasty case. process OpenBlocks message registerStream Stream in OneForOneStreamManager org.apache.spark.network.server.OneForOneStreamManager#registerStream fill with streamState with app & buber process ChunkFetchRequest registerChannel org.apache.spark.network.server.OneForOneStreamManager#registerChannel fill with streamState with channel In org.apache.spark.network.shuffle.OneForOneBlockFetcher#start OpenBlocks -> ChunkFetchRequest come in sequnce. If network down in OpenBlocks process, no more ChunkFetchRequest message then. So, we can see some leaked Buffer in OneForOneStreamManager !attachment-name.jpg|thumbnail! if org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel is not set, then after search the code , it will remain in memory forever. Because the only way to release it was in channel close , or someone read the last piece of block. OneForOneStreamManager#registerStream we can set channel in this method, just in case of this case. was: just in my way to know how stream , chunk , block works in netty found some nasty case. process OpenBlocks message registerStream Stream in OneForOneStreamManager org.apache.spark.network.server.OneForOneStreamManager#registerStream fill with streamState with app & buber process ChunkFetchRequest registerChannel org.apache.spark.network.server.OneForOneStreamManager#registerChannel fill with streamState with channel In org.apache.spark.network.shuffle.OneForOneBlockFetcher#start OpenBlocks -> ChunkFetchRequest come in sequnce. If network down in OpenBlocks process, no more ChunkFetchRequest message then. 
So, we can see some leaked Buffer in OneForOneStreamManager if org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel is not set, then after search the code , it will remain in memory forever. Because the only way to release it was in channel close , or someone read the last piece of block. OneForOneStreamManager#registerStream we can set channel in this method, just in case of this case. > OneForOneStreamManager may leak memory when network is poor > --- > > Key: SPARK-21955 > URL: https://issues.apache.org/jira/browse/SPARK-21955 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: hdp 2.4.2.0-258 > spark 1.6 >Reporter: poseidon > > just in my way to know how stream , chunk , block works in netty found some > nasty case. > process OpenBlocks message registerStream Stream in OneForOneStreamManager > org.apache.spark.network.server.OneForOneStreamManager#registerStream > fill with streamState with app & buber > process ChunkFetchRequest registerChannel > org.apache.spark.network.server.OneForOneStreamManager#registerChannel > fill with streamState with channel > In > org.apache.spark.network.shuffle.OneForOneBlockFetcher#start > OpenBlocks -> ChunkFetchRequest come in sequnce. > If network down in OpenBlocks process, no more ChunkFetchRequest message > then. > So, we can see some leaked Buffer in OneForOneStreamManager > !attachment-name.jpg|thumbnail! > if > org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel > is not set, then after search the code , it will remain in memory forever. > Because the only way to release it was in channel close , or someone read the > last piece of block. > OneForOneStreamManager#registerStream we can set channel in this method, just > in case of this case. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21955) OneForOneStreamManager may leak memory when network is poor
[ https://issues.apache.org/jira/browse/SPARK-21955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] poseidon updated SPARK-21955: - Description: While tracing how streams, chunks, and blocks work in the Netty transport layer, I found a nasty case. Processing an OpenBlocks message registers the stream in OneForOneStreamManager: org.apache.spark.network.server.OneForOneStreamManager#registerStream populates a StreamState with the app ID and buffers. Processing a ChunkFetchRequest then calls org.apache.spark.network.server.OneForOneStreamManager#registerChannel, which fills in the StreamState's channel. In org.apache.spark.network.shuffle.OneForOneBlockFetcher#start, OpenBlocks -> ChunkFetchRequest arrive in sequence. If the network goes down while OpenBlocks is being processed, no ChunkFetchRequest message ever follows, so leaked buffers can be seen in OneForOneStreamManager. If org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel is never set then, from what I found searching the code, the state remains in memory forever, because the only ways to release it are the channel closing or someone reading the last chunk of the stream. We could set the channel in OneForOneStreamManager#registerStream itself to guard against this case. > OneForOneStreamManager may leak memory when network is poor > --- > > Key: SPARK-21955 > URL: https://issues.apache.org/jira/browse/SPARK-21955 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: hdp 2.4.2.0-258 > spark 1.6 >Reporter: poseidon > > While tracing how streams, chunks, and blocks work in the Netty transport layer, > I found a nasty case. > Processing an OpenBlocks message registers the stream in OneForOneStreamManager: > org.apache.spark.network.server.OneForOneStreamManager#registerStream > populates a StreamState with the app ID and buffers. > Processing a ChunkFetchRequest then calls > org.apache.spark.network.server.OneForOneStreamManager#registerChannel, > which fills in the StreamState's channel. > In > org.apache.spark.network.shuffle.OneForOneBlockFetcher#start, > OpenBlocks -> ChunkFetchRequest arrive in sequence. > If the network goes down while OpenBlocks is being processed, no ChunkFetchRequest > message ever follows, so leaked buffers can be seen in OneForOneStreamManager. > If > org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel > is never set then, from what I found searching the code, the state remains in memory > forever, because the only ways to release it are the channel closing or someone reading > the last chunk of the stream. > We could set the channel in OneForOneStreamManager#registerStream itself to guard > against this case. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
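The leak scenario above can be sketched as a minimal, self-contained simulation (plain Python stand-ins; the class and method names mirror OneForOneStreamManager but this is not Spark's actual code):

```python
# Minimal sketch of the reported leak (plain Python stand-ins, not the
# actual org.apache.spark.network classes).
class StreamState:
    def __init__(self, app_id, buffers):
        self.app_id = app_id
        self.buffers = buffers
        self.associated_channel = None   # normally set later by registerChannel

class OneForOneStreamManager:
    def __init__(self):
        self.streams = {}
        self.next_stream_id = 0

    def register_stream(self, app_id, buffers, channel=None):
        # Proposed fix: accept the channel here, at registration time,
        # instead of waiting for the first ChunkFetchRequest.
        stream_id = self.next_stream_id
        self.next_stream_id += 1
        state = StreamState(app_id, buffers)
        state.associated_channel = channel
        self.streams[stream_id] = state
        return stream_id

    def connection_terminated(self, channel):
        # Only streams whose channel was recorded can ever be freed here.
        for sid in [s for s, st in self.streams.items()
                    if st.associated_channel is channel]:
            del self.streams[sid]

ch = object()   # stand-in for a Netty channel

# Old behavior: OpenBlocks registers the stream without a channel, then the
# network dies before any ChunkFetchRequest -> the buffers are never freed.
mgr = OneForOneStreamManager()
mgr.register_stream("app-1", ["buffer"])
mgr.connection_terminated(ch)
leaked = len(mgr.streams)    # 1: the StreamState stays in memory

# With the fix: the channel is associated at registration, so channel
# teardown can clean the stream up even if no chunk was ever fetched.
mgr2 = OneForOneStreamManager()
mgr2.register_stream("app-1", ["buffer"], channel=ch)
mgr2.connection_terminated(ch)
fixed = len(mgr2.streams)    # 0
```

The simulation only illustrates why associating the channel at registration time closes the window between OpenBlocks and the first ChunkFetchRequest.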
[jira] [Comment Edited] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160550#comment-16160550 ] xinzhang edited comment on SPARK-21067 at 9/11/17 2:02 AM: --- [~dricard] Thanks for your reply. We do the same and use Parquet. But there is another problem: SQL like "insert overwrite table a partition(pt='2') select" will also cause the Thrift server to fail. Do you happen to have the same problem? It only happens with partitioned tables; "insert overwrite table a select" works fine on Parquet tables without partitions. was (Author: zhangxin0112zx): [~dricard] Thanks for your reply. We do the same and use Parquet. But there is another problem: SQL like "insert overwrite table a partition(pt='2') select" will also cause the Thrift server to fail. Do you happen to have the same problem? > Thrift Server - CTAS fail with Unable to move source > > > Key: SPARK-21067 > URL: https://issues.apache.org/jira/browse/SPARK-21067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 > Environment: Yarn > Hive MetaStore > HDFS (HA) >Reporter: Dominic Ricard > > After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS > would fail, sometimes... > Most of the time, the CTAS would work only once, after starting the Thrift > server.
After that, dropping the table and re-issuing the same CTAS would > fail with the following message (sometimes it fails right away, sometimes it > works for a long period of time): > {noformat} > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > We have already found the following Jira > (https://issues.apache.org/jira/browse/SPARK-11021) which states that the > {{hive.exec.stagingdir}} had to be added in order for Spark to be able to > handle CREATE TABLE properly as of 2.0. As you can see in the error, we have > ours set to "/tmp/hive-staging/\{user.name\}" > Same issue with INSERT statements: > {noformat} > CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE > dricard.test SELECT 1; > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > This worked fine in 1.6.2, which we currently run in our Production > Environment, but since 2.0+ we haven't been able to CREATE TABLE consistently > on the cluster. > SQL to reproduce issue: > {noformat} > DROP SCHEMA IF EXISTS dricard CASCADE; > CREATE SCHEMA dricard; > CREATE TABLE dricard.test (col1 int); > INSERT INTO TABLE dricard.test SELECT 1; > SELECT * from dricard.test; > DROP TABLE dricard.test; > CREATE TABLE dricard.test AS select 1 as `col1`; > SELECT * from dricard.test > {noformat} > Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't > encounter the same issue. > Full stack Trace: > {noformat} > 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query, currentState RUNNING, > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 > to desti > nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(
[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160550#comment-16160550 ] xinzhang commented on SPARK-21067: -- [~dricard] Thanks for your reply. We do the same and use Parquet. But there is another problem: SQL like "insert overwrite table a partition(pt='2') select" will also cause the Thrift server to fail. Do you happen to have the same problem? > Thrift Server - CTAS fail with Unable to move source > > > Key: SPARK-21067 > URL: https://issues.apache.org/jira/browse/SPARK-21067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 > Environment: Yarn > Hive MetaStore > HDFS (HA) >Reporter: Dominic Ricard >
> Full stack Trace: > {noformat} > 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query, currentState RUNNING, > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 > to desti > nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(Query
[jira] [Assigned] (SPARK-21971) Too many open files in Spark due to concurrent files being opened
[ https://issues.apache.org/jira/browse/SPARK-21971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21971: Assignee: (was: Apache Spark) > Too many open files in Spark due to concurrent files being opened > - > > Key: SPARK-21971 > URL: https://issues.apache.org/jira/browse/SPARK-21971 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Rajesh Balamohan >Priority: Minor > > When running Q67 of TPC-DS at 1 TB dataset on multi node cluster, it > consistently fails with "too many open files" exception. > {noformat} > O scheduler.TaskSetManager: Finished task 25.0 in stage 844.0 (TID 243786) in > 394 ms on machine111.xyz (executor 2) (189/200) > 17/08/20 10:33:45 INFO scheduler.TaskSetManager: Finished task 172.0 in stage > 844.0 (TID 243932) in 11996 ms on cn116-10.l42scl.hortonworks.com (executor > 6) (190/200) > 17/08/20 10:37:40 WARN scheduler.TaskSetManager: Lost task 144.0 in stage > 844.0 (TID 243904, machine1.xyz, executor 1): > java.nio.file.FileSystemException: > /grid/3/hadoop/yarn/local/usercache/rbalamohan/appcache/application_1490656001509_7207/blockmgr-5180e3f0-f7ed-44bb-affc-8f99f09ba7bc/28/temp_local_690afbf7-172d-4fdb-8492-3e2ebd8d5183: > Too many open files > at > sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) > at > sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) > at > sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) > at > sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177) > at java.nio.channels.FileChannel.open(FileChannel.java:287) > at java.nio.channels.FileChannel.open(FileChannel.java:335) > at > org.apache.spark.io.NioBufferedFileInputStream.(NioBufferedFileInputStream.java:43) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.(UnsafeSorterSpillReader.java:75) > at > 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getIterator(UnsafeExternalSorter.java:607) > at > org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:169) > at > org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:173) > {noformat} > The cluster was configured with multiple cores per executor. > The window function uses "spark.sql.windowExec.buffer.spill.threshold=4096", which > causes a large number of spills on larger datasets. With multiple cores per > executor, this reproduces easily. > {{UnsafeExternalSorter::getIterator()}} invokes {{spillWriter.getReader}} for > all the available spillWriters. {{UnsafeSorterSpillReader}} opens the file > in its constructor and closes it later as part of its close() call. > This causes the "too many open files" issue. > Note that this is not a file leak, but rather a matter of how many files are open > concurrently at any given time, depending on the dataset being processed. > One option could be to increase "spark.sql.windowExec.buffer.spill.threshold" > so that fewer spill files are generated, but it is hard to determine the > sweet spot for all workloads. Another option is to raise the file ulimit to > "unlimited", but that would not be a good production setting. It would be good > to consider reducing the number of files opened concurrently by > {{UnsafeExternalSorter::getIterator()}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21971) Too many open files in Spark due to concurrent files being opened
[ https://issues.apache.org/jira/browse/SPARK-21971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21971: Assignee: Apache Spark > Too many open files in Spark due to concurrent files being opened > - > > Key: SPARK-21971 > URL: https://issues.apache.org/jira/browse/SPARK-21971 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Rajesh Balamohan >Assignee: Apache Spark >Priority: Minor -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21971) Too many open files in Spark due to concurrent files being opened
[ https://issues.apache.org/jira/browse/SPARK-21971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160547#comment-16160547 ] Apache Spark commented on SPARK-21971: -- User 'rajeshbalamohan' has created a pull request for this issue: https://github.com/apache/spark/pull/19184 > Too many open files in Spark due to concurrent files being opened > - > > Key: SPARK-21971 > URL: https://issues.apache.org/jira/browse/SPARK-21971 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Rajesh Balamohan >Priority: Minor -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20098) DataType's typeName method returns with 'StructF' in case of StructField
[ https://issues.apache.org/jira/browse/SPARK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-20098: --- Assignee: Peter Szalai > DataType's typeName method returns with 'StructF' in case of StructField > > > Key: SPARK-20098 > URL: https://issues.apache.org/jira/browse/SPARK-20098 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Peter Szalai >Assignee: Peter Szalai > Fix For: 2.2.1, 2.3.0 > > > Currently, if you ask a DataType for its typeName and the type is a > `StructField`, you get `StructF`. > http://spark.apache.org/docs/2.1.0/api/python/_modules/pyspark/sql/types.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
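The reported truncation can be reproduced without Spark. PySpark's typeName appears to be derived by stripping a trailing "Type" suffix (four characters) from the class name, which misfires for StructField; the following is a simplified re-implementation for illustration, not the actual pyspark code:

```python
# Simplified re-implementation of the suspect classmethod (not the actual
# pyspark code): typeName assumes every class name ends in "Type" and
# strips those four characters before lowercasing.
class DataType:
    @classmethod
    def typeName(cls):
        return cls.__name__[:-4].lower()

class DateType(DataType):
    pass

class StructField(DataType):     # name does NOT end in "Type"
    pass

print(DateType.typeName())       # "date" -- as intended
print(StructField.typeName())    # "structf" -- the reported bug
```

Since "StructField" ends in "ield" rather than "Type", slicing off the last four characters yields "StructF", matching the behavior described in the issue.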
[jira] [Created] (SPARK-21971) Too many open files in Spark due to concurrent files being opened
Rajesh Balamohan created SPARK-21971: Summary: Too many open files in Spark due to concurrent files being opened Key: SPARK-21971 URL: https://issues.apache.org/jira/browse/SPARK-21971 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 2.1.0 Reporter: Rajesh Balamohan Priority: Minor When running Q67 of TPC-DS at 1 TB dataset on multi node cluster, it consistently fails with "too many open files" exception. {noformat} O scheduler.TaskSetManager: Finished task 25.0 in stage 844.0 (TID 243786) in 394 ms on machine111.xyz (executor 2) (189/200) 17/08/20 10:33:45 INFO scheduler.TaskSetManager: Finished task 172.0 in stage 844.0 (TID 243932) in 11996 ms on cn116-10.l42scl.hortonworks.com (executor 6) (190/200) 17/08/20 10:37:40 WARN scheduler.TaskSetManager: Lost task 144.0 in stage 844.0 (TID 243904, machine1.xyz, executor 1): java.nio.file.FileSystemException: /grid/3/hadoop/yarn/local/usercache/rbalamohan/appcache/application_1490656001509_7207/blockmgr-5180e3f0-f7ed-44bb-affc-8f99f09ba7bc/28/temp_local_690afbf7-172d-4fdb-8492-3e2ebd8d5183: Too many open files at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177) at java.nio.channels.FileChannel.open(FileChannel.java:287) at java.nio.channels.FileChannel.open(FileChannel.java:335) at org.apache.spark.io.NioBufferedFileInputStream.(NioBufferedFileInputStream.java:43) at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.(UnsafeSorterSpillReader.java:75) at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getIterator(UnsafeExternalSorter.java:607) at 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:169) at org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:173) {noformat} The cluster was configured with multiple cores per executor. The window function uses "spark.sql.windowExec.buffer.spill.threshold=4096", which causes a large number of spills on larger datasets. With multiple cores per executor, this reproduces easily. {{UnsafeExternalSorter::getIterator()}} invokes {{spillWriter.getReader}} for all the available spillWriters. {{UnsafeSorterSpillReader}} opens the file in its constructor and closes it later as part of its close() call. This causes the "too many open files" issue. Note that this is not a file leak, but rather a matter of how many files are open concurrently at any given time, depending on the dataset being processed. One option could be to increase "spark.sql.windowExec.buffer.spill.threshold" so that fewer spill files are generated, but it is hard to determine the sweet spot for all workloads. Another option is to raise the file ulimit to "unlimited", but that would not be a good production setting. It would be good to consider reducing the number of files opened concurrently by {{UnsafeExternalSorter::getIterator()}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
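The "concurrent open files, not a leak" point can be illustrated with a small stand-alone simulation (plain Python, hypothetical stand-ins for the sorter classes): opening a reader per spill file up front holds every descriptor at once, while a lazy chained iterator holds at most one.

```python
import os
import tempfile

# Create a handful of fake "spill files".
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(5):
    path = os.path.join(tmpdir, "spill-%d" % i)
    with open(path, "w") as f:
        f.write("%d\n" % i)
    paths.append(path)

# Eager (the behavior in the report): build a reader per spill file before
# iteration starts, so all descriptors are held open at the same time.
eager_readers = [open(p) for p in paths]
peak_eager = len(eager_readers)          # 5 descriptors open at once
rows_eager = [line for r in eager_readers for line in r]
for r in eager_readers:
    r.close()

# Lazy alternative: open each file only when the iterator reaches it, so
# at most one descriptor is open at any moment.
def lazy_rows(paths):
    for p in paths:
        with open(p) as f:
            yield from f

rows_lazy = list(lazy_rows(paths))
assert rows_eager == rows_lazy           # same data, smaller open-file peak
```

With thousands of spill files per task and multiple tasks per executor, the eager pattern multiplies open descriptors the same way the report describes; the lazy pattern trades that for reopening cost.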
[jira] [Commented] (SPARK-21960) Spark Streaming Dynamic Allocation should respect spark.executor.instances
[ https://issues.apache.org/jira/browse/SPARK-21960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160491#comment-16160491 ] Apache Spark commented on SPARK-21960: -- User 'karth295' has created a pull request for this issue: https://github.com/apache/spark/pull/19183 > Spark Streaming Dynamic Allocation should respect spark.executor.instances > -- > > Key: SPARK-21960 > URL: https://issues.apache.org/jira/browse/SPARK-21960 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Karthik Palaniappan >Priority: Minor > > This check enforces that spark.executor.instances (aka --num-executors) is > either unset or explicitly set to 0. > https://github.com/apache/spark/blob/v2.2.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ExecutorAllocationManager.scala#L207 > If spark.executor.instances is unset, the check is fine, and the property > defaults to 2. Spark requests the cluster manager for 2 executors to start > with, then adds/removes executors appropriately. > However, if you explicitly set it to 0, the check also succeeds, but Spark > never asks the cluster manager for any executors. 
When running on YARN, I > repeatedly saw: > {code:java} > 17/08/22 19:35:21 WARN org.apache.spark.scheduler.cluster.YarnScheduler: > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources > 17/08/22 19:35:36 WARN org.apache.spark.scheduler.cluster.YarnScheduler: > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources > 17/08/22 19:35:51 WARN org.apache.spark.scheduler.cluster.YarnScheduler: > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources > {code} > I noticed that at least Google Dataproc and Ambari explicitly set > spark.executor.instances to a positive number, meaning that to use dynamic > allocation, you would have to edit spark-defaults.conf to remove the > property. That's obnoxious. > In addition, in Spark 2.3, spark-submit will refuse to accept "0" as a value > for --num-executors or --conf spark.executor.instances: > https://github.com/apache/spark/commit/0fd84b05dc9ac3de240791e2d4200d8bdffbb01a#diff-63a5d817d2d45ae24de577f6a1bd80f9 > It is much more reasonable for Streaming DRA to use spark.executor.instances, > just like Core DRA. I'll open a pull request to remove the check if there are > no objections. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
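The check described above can be restated as a small sketch (a hypothetical validate_streaming_dra helper, not Spark's actual ExecutorAllocationManager code):

```python
# Hypothetical restatement of the validation described in the report:
# streaming dynamic allocation requires spark.executor.instances to be
# unset or explicitly 0 (validate_streaming_dra is not a real Spark API).
def validate_streaming_dra(conf):
    instances = conf.get("spark.executor.instances")   # None when unset
    if instances is not None and int(instances) != 0:
        raise ValueError(
            "streaming dynamic allocation requires spark.executor.instances "
            "to be unset or 0")

validate_streaming_dra({})                                  # ok: unset, default of 2 applies
validate_streaming_dra({"spark.executor.instances": "0"})   # ok: explicit 0
try:
    # A Dataproc/Ambari-style positive default is rejected, forcing users
    # to edit spark-defaults.conf to remove the property.
    validate_streaming_dra({"spark.executor.instances": "2"})
    rejected = False
except ValueError:
    rejected = True
```

The proposed change is effectively to drop this rejection of positive values so that streaming DRA starts from spark.executor.instances, just as core dynamic allocation does.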
[jira] [Assigned] (SPARK-21960) Spark Streaming Dynamic Allocation should respect spark.executor.instances
[ https://issues.apache.org/jira/browse/SPARK-21960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21960: Assignee: Apache Spark > Spark Streaming Dynamic Allocation should respect spark.executor.instances > -- > > Key: SPARK-21960 > URL: https://issues.apache.org/jira/browse/SPARK-21960 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Karthik Palaniappan >Assignee: Apache Spark >Priority: Minor -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21960) Spark Streaming Dynamic Allocation should respect spark.executor.instances
[ https://issues.apache.org/jira/browse/SPARK-21960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21960: Assignee: (was: Apache Spark) > Spark Streaming Dynamic Allocation should respect spark.executor.instances > -- > > Key: SPARK-21960 > URL: https://issues.apache.org/jira/browse/SPARK-21960 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Karthik Palaniappan >Priority: Minor > > This check enforces that spark.executor.instances (aka --num-executors) is > either unset or explicitly set to 0. > https://github.com/apache/spark/blob/v2.2.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ExecutorAllocationManager.scala#L207 > If spark.executor.instances is unset, the check is fine, and the property > defaults to 2. Spark requests the cluster manager for 2 executors to start > with, then adds/removes executors appropriately. > However, if you explicitly set it to 0, the check also succeeds, but Spark > never asks the cluster manager for any executors. 
When running on YARN, I > repeatedly saw: > {code:java} > 17/08/22 19:35:21 WARN org.apache.spark.scheduler.cluster.YarnScheduler: > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources > 17/08/22 19:35:36 WARN org.apache.spark.scheduler.cluster.YarnScheduler: > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources > 17/08/22 19:35:51 WARN org.apache.spark.scheduler.cluster.YarnScheduler: > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources > {code} > I noticed that at least Google Dataproc and Ambari explicitly set > spark.executor.instances to a positive number, meaning that to use dynamic > allocation, you would have to edit spark-defaults.conf to remove the > property. That's obnoxious. > In addition, in Spark 2.3, spark-submit will refuse to accept "0" as a value > for --num-executors or --conf spark.executor.instances: > https://github.com/apache/spark/commit/0fd84b05dc9ac3de240791e2d4200d8bdffbb01a#diff-63a5d817d2d45ae24de577f6a1bd80f9 > It is much more reasonable for Streaming DRA to use spark.executor.instances, > just like Core DRA. I'll open a pull request to remove the check if there are > no objections. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21970) Do a Project Wide Sweep for Redundant Throws Declarations
[ https://issues.apache.org/jira/browse/SPARK-21970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21970: Assignee: (was: Apache Spark) > Do a Project Wide Sweep for Redundant Throws Declarations > - > > Key: SPARK-21970 > URL: https://issues.apache.org/jira/browse/SPARK-21970 > Project: Spark > Issue Type: Bug > Components: Examples, Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Armin Braun >Priority: Trivial > Labels: cleanup > > Unfortunately, redundant throws declarations are not caught by Checkstyle and > there are quite a few in the current Java codebase. > In one case `ShuffleExternalSorter#closeAndGetSpills` this hides some dead > code too. > I think it's worthwhile to do a sweep for these and remove them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21970) Do a Project Wide Sweep for Redundant Throws Declarations
[ https://issues.apache.org/jira/browse/SPARK-21970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21970: Assignee: Apache Spark > Do a Project Wide Sweep for Redundant Throws Declarations > - > > Key: SPARK-21970 > URL: https://issues.apache.org/jira/browse/SPARK-21970 > Project: Spark > Issue Type: Bug > Components: Examples, Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Armin Braun >Assignee: Apache Spark >Priority: Trivial > Labels: cleanup > > Unfortunately, redundant throws declarations are not caught by Checkstyle and > there are quite a few in the current Java codebase. > In one case `ShuffleExternalSorter#closeAndGetSpills` this hides some dead > code too. > I think it's worthwhile to do a sweep for these and remove them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21970) Do a Project Wide Sweep for Redundant Throws Declarations
[ https://issues.apache.org/jira/browse/SPARK-21970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160470#comment-16160470 ] Apache Spark commented on SPARK-21970: -- User 'original-brownbear' has created a pull request for this issue: https://github.com/apache/spark/pull/19182 > Do a Project Wide Sweep for Redundant Throws Declarations > - > > Key: SPARK-21970 > URL: https://issues.apache.org/jira/browse/SPARK-21970 > Project: Spark > Issue Type: Bug > Components: Examples, Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Armin Braun >Priority: Trivial > Labels: cleanup > > Unfortunately, redundant throws declarations are not caught by Checkstyle and > there are quite a few in the current Java codebase. > In one case `ShuffleExternalSorter#closeAndGetSpills` this hides some dead > code too. > I think it's worthwhile to do a sweep for these and remove them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21970) Do a Project Wide Sweep for Redundant Throws Declarations
Armin Braun created SPARK-21970: --- Summary: Do a Project Wide Sweep for Redundant Throws Declarations Key: SPARK-21970 URL: https://issues.apache.org/jira/browse/SPARK-21970 Project: Spark Issue Type: Bug Components: Examples, Spark Core, SQL Affects Versions: 2.3.0 Reporter: Armin Braun Priority: Trivial Unfortunately, redundant throws declarations are not caught by Checkstyle and there are quite a few in the current Java codebase. In one case, `ShuffleExternalSorter#closeAndGetSpills`, this hides some dead code too. I think it's worthwhile to do a sweep for these and remove them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
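To make the ticket concrete, here is a hypothetical example (not taken from Spark's codebase) of the pattern being swept: a throws clause for a checked exception that the body can never raise, which Checkstyle does not flag:

```java
import java.io.IOException;

// Illustration of the cleanup this ticket proposes (hypothetical method,
// not actual Spark code).
public class RedundantThrows {
    // Before: declares IOException although nothing in the body throws it.
    // Callers are forced into needless try/catch, and any catch block for
    // the exception becomes dead code.
    static int sizeBefore(String s) throws IOException {
        return s.length();
    }

    // After the sweep: the redundant clause is simply dropped.
    static int sizeAfter(String s) {
        return s.length();
    }
}
```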
[jira] [Updated] (SPARK-21969) CommandUtils.updateTableStats should call refreshTable
[ https://issues.apache.org/jira/browse/SPARK-21969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Raducanu updated SPARK-21969: Description: The table is cached so even though statistics are removed, they will still be used by the existing sessions. {code} spark.range(100).write.saveAsTable("tab1") sql("analyze table tab1 compute statistics") sql("explain cost select distinct * from tab1").show(false) {code} Produces: {code} Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none) {code} {code} spark.range(100).write.mode("append").saveAsTable("tab1") sql("explain cost select distinct * from tab1").show(false) {code} After appending something, the same stats are used {code} Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none) {code} Manually refreshing the table removes the stats {code} spark.sessionState.catalog.refreshTable(TableIdentifier("tab1")) sql("explain cost select distinct * from tab1").show(false) {code} {code} Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none) {code} was: The table is cached so even though statistics are removed, they will still be used by the existing sessions. 
{{ spark.range(100).write.saveAsTable("tab1") sql("analyze table tab1 compute statistics") sql("explain cost select distinct * from tab1").show(false) }} Produces: {{ Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none) }} {{ spark.range(100).write.mode("append").saveAsTable("tab1") sql("explain cost select distinct * from tab1").show(false) }} After append something, the same stats are used {{ Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none) }} Manually refreshing the table removes the stats {{ spark.sessionState.catalog.refreshTable(TableIdentifier("tab1")) sql("explain cost select distinct * from tab1").show(false) }} {{ Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none) }} > CommandUtils.updateTableStats should call refreshTable > -- > > Key: SPARK-21969 > URL: https://issues.apache.org/jira/browse/SPARK-21969 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bogdan Raducanu > > The table is cached so even though statistics are removed, they will still be > used by the existing sessions. 
> {code} > spark.range(100).write.saveAsTable("tab1") > sql("analyze table tab1 compute statistics") > sql("explain cost select distinct * from tab1").show(false) > {code} > Produces: > {code} > Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, > hints=none) > {code} > {code} > spark.range(100).write.mode("append").saveAsTable("tab1") > sql("explain cost select distinct * from tab1").show(false) > {code} > After appending something, the same stats are used > {code} > Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, > hints=none) > {code} > Manually refreshing the table removes the stats > {code} > spark.sessionState.catalog.refreshTable(TableIdentifier("tab1")) > sql("explain cost select distinct * from tab1").show(false) > {code} > {code} > Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
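The mechanism behind the stale statistics above can be sketched without Spark: a per-session cache that is populated once and never invalidated keeps serving the old value until an explicit refresh, which is what calling refreshTable from CommandUtils.updateTableStats would provide. A minimal, hedged Java illustration (hypothetical names, not the catalog API):

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the bug's mechanism, not Spark's catalog code: a
// session-level cache serves old statistics until explicitly refreshed.
public class StatsCache {
    private final Map<String, Long> rowCounts = new HashMap<>();

    long cachedRowCount(String table, long freshValue) {
        // Models the cached relation: once populated, later lookups
        // ignore the fresh value entirely.
        return rowCounts.computeIfAbsent(table, t -> freshValue);
    }

    // The proposed fix corresponds to calling refreshTable after updating
    // stats: drop the cached entry so the next lookup recomputes from
    // current metadata.
    void refresh(String table) {
        rowCounts.remove(table);
    }
}
```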
[jira] [Updated] (SPARK-21969) CommandUtils.updateTableStats should call refreshTable
[ https://issues.apache.org/jira/browse/SPARK-21969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Raducanu updated SPARK-21969: Description: The table is cached so even though statistics are removed, they will still be used by the existing sessions. {{ spark.range(100).write.saveAsTable("tab1") sql("analyze table tab1 compute statistics") sql("explain cost select distinct * from tab1").show(false) }} Produces: {{ Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none) }} {{ spark.range(100).write.mode("append").saveAsTable("tab1") sql("explain cost select distinct * from tab1").show(false) }} After append something, the same stats are used {{ Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none) }} Manually refreshing the table removes the stats {{ spark.sessionState.catalog.refreshTable(TableIdentifier("tab1")) sql("explain cost select distinct * from tab1").show(false) }} {{ Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none) }} was: The table is cached so even though statistics are removed, they will still be used by the existing sessions. 
{{code}} spark.range(100).write.saveAsTable("tab1") sql("analyze table tab1 compute statistics") sql("explain cost select distinct * from tab1").show(false) {{code}} Produces: {{code}} Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none) {{code}} {{code}} spark.range(100).write.mode("append").saveAsTable("tab1") sql("explain cost select distinct * from tab1").show(false) {{code}} After append something, the same stats are used {{code}} Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none) {{code}} Manually refreshing the table removes the stats {{code}} spark.sessionState.catalog.refreshTable(TableIdentifier("tab1")) sql("explain cost select distinct * from tab1").show(false) {{code}} {{code}} Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none) {{code}} > CommandUtils.updateTableStats should call refreshTable > -- > > Key: SPARK-21969 > URL: https://issues.apache.org/jira/browse/SPARK-21969 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bogdan Raducanu > > The table is cached so even though statistics are removed, they will still be > used by the existing sessions. 
> {{ > spark.range(100).write.saveAsTable("tab1") > sql("analyze table tab1 compute statistics") > sql("explain cost select distinct * from tab1").show(false) > }} > Produces: > {{ > Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, > hints=none) > }} > {{ > spark.range(100).write.mode("append").saveAsTable("tab1") > sql("explain cost select distinct * from tab1").show(false) > }} > After append something, the same stats are used > {{ > Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, > hints=none) > }} > Manually refreshing the table removes the stats > {{ > spark.sessionState.catalog.refreshTable(TableIdentifier("tab1")) > sql("explain cost select distinct * from tab1").show(false) > }} > {{ > Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none) > }} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21969) CommandUtils.updateTableStats should call refreshTable
Bogdan Raducanu created SPARK-21969: --- Summary: CommandUtils.updateTableStats should call refreshTable Key: SPARK-21969 URL: https://issues.apache.org/jira/browse/SPARK-21969 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Bogdan Raducanu The table is cached so even though statistics are removed, they will still be used by the existing sessions. {{code}} spark.range(100).write.saveAsTable("tab1") sql("analyze table tab1 compute statistics") sql("explain cost select distinct * from tab1").show(false) {{code}} Produces: {{code}} Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none) {{code}} {{code}} spark.range(100).write.mode("append").saveAsTable("tab1") sql("explain cost select distinct * from tab1").show(false) {{code}} After append something, the same stats are used {{code}} Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none) {{code}} Manually refreshing the table removes the stats {{code}} spark.sessionState.catalog.refreshTable(TableIdentifier("tab1")) sql("explain cost select distinct * from tab1").show(false) {{code}} {{code}} Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none) {{code}} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21968) Improved KernelDensity support
Brian created SPARK-21968: - Summary: Improved KernelDensity support Key: SPARK-21968 URL: https://issues.apache.org/jira/browse/SPARK-21968 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.2.0 Reporter: Brian Related to SPARK-7753. The KernelDensity API still does not provide a way to specify a kernel as described in the 7753 ticket, and requires the client to calculate their own optimal bandwidth. Specifying a kernel could be something like: def setKernel(kernel: Function2[Double,Double]): KernelDensity.this.type There could be something providing the user with a few options for kernels they could pass here so they don't need to implement each kernel themselves. Here are some example kernels: https://en.wikipedia.org/wiki/Kernel_(statistics)#Kernel_functions_in_common_use Functions could also be provided to get more optimal bandwidth settings without the user needing to calculate it themselves, e.g. the "rule of thumb" and/or "solve the equation" bandwidth described here: https://en.wikipedia.org/wiki/Kernel_density_estimation#Bandwidth_selection -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
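For reference, the kernels and bandwidth rules the ticket points at are simple closed-form functions. A hedged Java sketch (illustrative math only, not a proposed MLlib API) of two common kernels and Silverman's rule-of-thumb bandwidth:

```java
// Hedged sketch of the building blocks the ticket asks for; class and
// method names are hypothetical, not part of MLlib.
public class Kernels {
    // Gaussian kernel: K(u) = exp(-u^2 / 2) / sqrt(2 * pi)
    static double gaussian(double u) {
        return Math.exp(-0.5 * u * u) / Math.sqrt(2.0 * Math.PI);
    }

    // Epanechnikov kernel: K(u) = 0.75 * (1 - u^2) for |u| <= 1, else 0
    static double epanechnikov(double u) {
        return Math.abs(u) <= 1.0 ? 0.75 * (1.0 - u * u) : 0.0;
    }

    // "Rule of thumb" bandwidth: h = 1.06 * sigma * n^(-1/5),
    // where sigma is the sample standard deviation and n the sample size.
    static double ruleOfThumb(double sigma, long n) {
        return 1.06 * sigma * Math.pow(n, -0.2);
    }
}
```

Exposing a small menu of such functions alongside a setKernel-style setter would let users avoid re-deriving them per application.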
[jira] [Commented] (SPARK-20684) expose createOrReplaceGlobalTempView/createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160420#comment-16160420 ] Felix Cheung commented on SPARK-20684: -- I'm making this the primary JIRA for tracking this issue and keeping it open. Please see the discussion in the PR. > expose createOrReplaceGlobalTempView/createGlobalTempView and > dropGlobalTempView in SparkR > -- > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages on a single Spark application. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20684) expose createOrReplaceGlobalTempView/createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20684: - Summary: expose createOrReplaceGlobalTempView/createGlobalTempView and dropGlobalTempView in SparkR (was: expose createGlobalTempView and dropGlobalTempView in SparkR) > expose createOrReplaceGlobalTempView/createGlobalTempView and > dropGlobalTempView in SparkR > -- > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages on a single Spark application. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung reopened SPARK-20684: -- > expose createGlobalTempView and dropGlobalTempView in SparkR > > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages on a single Spark application. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted
[ https://issues.apache.org/jira/browse/SPARK-18591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160388#comment-16160388 ] Takeshi Yamamuro commented on SPARK-18591: -- Just a heads-up for the discussion on this thread: since we already have LogicalPlanVisitor, we could fairly easily realise bottom-up transformation in SparkStrategies, as in https://github.com/apache/spark/compare/master...maropu:SPARK-18591. I'm not sure now is the right time to change the transformation approach (many committers and qualified developers are probably spending much of their time on Dataset API v2 reviews and other work), but I think it'd be worth doing in the future, because bottom-up transformation lets Catalyst select better physical plans based on the conditions of the bottom sub-trees (costs and partition/sort conditions); e.g., we could then easily fix SPARK-12978 and this ticket. cc: [~smilegator] > Replace hash-based aggregates with sort-based ones if inputs already sorted > --- > > Key: SPARK-18591 > URL: https://issues.apache.org/jira/browse/SPARK-18591 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 >Reporter: Takeshi Yamamuro > > Spark currently uses sort-based aggregates only under a limited condition: the > cases where Spark cannot use partial aggregates and hash-based ones. > However, if the input ordering already satisfies the requirements of > sort-based aggregates, sort-based ones seem to be faster than hash-based ones. 
> {code} > ./bin/spark-shell --conf spark.sql.shuffle.partitions=1 > val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS > value").sort($"key").cache > def timer[R](block: => R): R = { > val t0 = System.nanoTime() > val result = block > val t1 = System.nanoTime() > // convert nanoseconds to seconds > println("Elapsed time: " + ((t1 - t0) / 1e9) + "s") > result > } > timer { > df.groupBy("key").count().count > } > // codegen'd hash aggregate > Elapsed time: 7.116962977s > // non-codegen'd sort aggregate > Elapsed time: 3.088816662s > {code} > If codegen'd sort-based aggregates are supported in SPARK-16844, this seems > to make the performance gap bigger; > {code} > - codegen'd sort aggregate > Elapsed time: 1.645234684s > {code} > Therefore, it'd be better to use sort-based ones in this case. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
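The intuition behind the numbers above: when the input is already sorted by the grouping key, aggregation is a single streaming pass over runs of equal keys, with no hash table to build or probe. A hedged Java sketch of that one-pass shape (illustrative only, not Spark's SortAggregateExec):

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of sort-based aggregation over pre-sorted input,
// not Spark's operator code.
public class SortAgg {
    // Count per key, assuming keys[] is already sorted by key.
    static List<long[]> countSorted(long[] keys) {
        List<long[]> out = new ArrayList<>();
        int i = 0;
        while (i < keys.length) {
            long key = keys[i];
            long count = 0;
            // Consume the run of equal keys; sortedness guarantees this
            // key never reappears later, so the group can be emitted now.
            while (i < keys.length && keys[i] == key) {
                count++;
                i++;
            }
            out.add(new long[]{key, count});
        }
        return out;
    }
}
```

A hash aggregate must instead maintain a table of all groups and probe it per row, which is the overhead that pre-sorted input makes unnecessary.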
[jira] [Assigned] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()
[ https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21907: Assignee: Apache Spark > NullPointerException in UnsafeExternalSorter.spill() > > > Key: SPARK-21907 > URL: https://issues.apache.org/jira/browse/SPARK-21907 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Juliusz Sompolski >Assignee: Apache Spark > > I see NPE during sorting with the following stacktrace: > {code} > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685) > at > org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker
[jira] [Assigned] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()
[ https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21907: Assignee: (was: Apache Spark) > NullPointerException in UnsafeExternalSorter.spill() > > > Key: SPARK-21907 > URL: https://issues.apache.org/jira/browse/SPARK-21907 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Juliusz Sompolski > > I see NPE during sorting with the following stacktrace: > {code} > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685) > at > org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.j
[jira] [Commented] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()
[ https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160364#comment-16160364 ]

Apache Spark commented on SPARK-21907:
--------------------------------------

User 'eyalfa' has created a pull request for this issue:
https://github.com/apache/spark/pull/19181

> NullPointerException in UnsafeExternalSorter.spill()
> ----------------------------------------------------
>
>                 Key: SPARK-21907
>                 URL: https://issues.apache.org/jira/browse/SPARK-21907
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Juliusz Sompolski
>
> I see NPE during sorting with the following stacktrace:
> {code}
> java.lang.NullPointerException
>     at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63)
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43)
>     at org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>     at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>     at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345)
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
>     at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>     at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>     at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173)
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221)
>     at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>     at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>     at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349)
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400)
>     at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>     at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>     at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778)
>     at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685)
>     at org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259)
>     at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>     at org.apache.spark.scheduler.Task.run(Task.scala:108)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Th
[jira] [Commented] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()
[ https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160366#comment-16160366 ]

Eyal Farago commented on SPARK-21907:
-------------------------------------

opened PR: https://github.com/apache/spark/pull/19181
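The stack trace above shows spill() re-entering itself: the first spill resets the in-memory sorter, the reset allocates a new pointer array, the allocation puts the task under memory pressure and triggers a second spill, and that second spill touches a pointer array that has already been freed. The sketch below models only that hazard; `Allocator` and `Sorter` are illustrative stand-ins (not Spark's `TaskMemoryManager`/`UnsafeExternalSorter`), and the re-entrancy guard shown is one generic mitigation, not necessarily the fix taken in the linked PR.

```java
// Minimal model of a re-entrant spill. Under memory pressure, an allocation
// made *inside* spill() asks the same consumer to spill again.
class Allocator {
    boolean underPressure = false;
    Sorter owner;

    long[] allocate(int size) {
        if (underPressure && owner != null) {
            owner.spill(); // re-enters the consumer mid-reset
        }
        return new long[size];
    }
}

class Sorter {
    final Allocator alloc;
    long[] pointerArray = new long[]{3, 1, 2};
    boolean spilling = false; // guard against re-entrant spill

    Sorter(Allocator a) { alloc = a; a.owner = this; }

    // Returns the number of slots spilled, or 0 if the call was re-entrant.
    int spill() {
        if (spilling) {
            return 0; // pointerArray may be null at this moment
        }
        spilling = true;
        try {
            java.util.Arrays.sort(pointerArray); // ~ getSortedIterator()
            int released = pointerArray.length;
            pointerArray = null;                 // ~ reset(): old array freed
            pointerArray = alloc.allocate(3);    // may trigger spill() again
            return released;
        } finally {
            spilling = false;
        }
    }
}
```

Removing the `spilling` guard lets the nested call reach `Arrays.sort(null)` and throw a NullPointerException, mirroring the shape of the trace above.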
[jira] [Assigned] (SPARK-21967) org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at a Time for Better Performance
[ https://issues.apache.org/jira/browse/SPARK-21967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21967:
------------------------------------

    Assignee: (was: Apache Spark)

> org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at
> a Time for Better Performance
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-21967
>                 URL: https://issues.apache.org/jira/browse/SPARK-21967
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Armin Braun
>            Priority: Minor
>              Labels: perfomance
>
> org.apache.spark.unsafe.types.UTF8String#compareTo contains the following TODO:
> {code}
> int len = Math.min(numBytes, other.numBytes);
> // TODO: compare 8 bytes as unsigned long
> for (int i = 0; i < len; i ++) {
>   // In UTF-8, the byte should be unsigned, so we should compare them as unsigned int.
> {code}
> The todo should be resolved by comparing the maximum number of 64bit words possible in this method, before falling back to unsigned int comparison.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21967) org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at a Time for Better Performance
[ https://issues.apache.org/jira/browse/SPARK-21967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160338#comment-16160338 ]

Apache Spark commented on SPARK-21967:
--------------------------------------

User 'original-brownbear' has created a pull request for this issue:
https://github.com/apache/spark/pull/19180
[jira] [Assigned] (SPARK-21967) org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at a Time for Better Performance
[ https://issues.apache.org/jira/browse/SPARK-21967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21967:
------------------------------------

    Assignee: Apache Spark
[jira] [Created] (SPARK-21967) org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at a Time for Better Performance
Armin Braun created SPARK-21967:
-----------------------------------

             Summary: org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at a Time for Better Performance
                 Key: SPARK-21967
                 URL: https://issues.apache.org/jira/browse/SPARK-21967
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: Armin Braun
            Priority: Minor
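The TODO quoted in the SPARK-21967 messages above can be resolved roughly as follows: compare eight bytes at a time as unsigned 64-bit words, then fall back to unsigned byte comparison for the tail. This is a hedged sketch, not Spark's actual implementation from the linked PR; it assumes big-endian word loads (so unsigned long order matches lexicographic byte order) and uses `ByteBuffer` for clarity where real code would read words directly from memory.

```java
import java.nio.ByteBuffer;

// Sketch of word-at-a-time unsigned lexicographic comparison of byte arrays,
// the optimization the compareTo TODO asks for. Utf8Compare is a hypothetical
// helper name, not a Spark class.
final class Utf8Compare {
    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        int i = 0;
        // Compare 8 bytes at a time: a big-endian unsigned 64-bit comparison
        // orders exactly like comparing those 8 bytes one by one, unsigned.
        for (; i + 8 <= len; i += 8) {
            long wa = ByteBuffer.wrap(a, i, 8).getLong(); // big-endian default
            long wb = ByteBuffer.wrap(b, i, 8).getLong();
            if (wa != wb) {
                return Long.compareUnsigned(wa, wb) < 0 ? -1 : 1;
            }
        }
        // Tail: UTF-8 bytes must be compared as unsigned ints.
        for (; i < len; i++) {
            int res = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (res != 0) return res;
        }
        return a.length - b.length; // shared prefix: shorter array sorts first
    }
}
```

The big-endian load is what makes the word comparison agree with byte-wise order; with little-endian loads the bytes within each word would be weighted in the wrong order.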
[jira] [Assigned] (SPARK-21966) ResolveMissingReference rule should not ignore the Union operator
[ https://issues.apache.org/jira/browse/SPARK-21966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21966:
------------------------------------

    Assignee: (was: Apache Spark)

> ResolveMissingReference rule should not ignore the Union operator
> -----------------------------------------------------------------
>
>                 Key: SPARK-21966
>                 URL: https://issues.apache.org/jira/browse/SPARK-21966
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.1.1, 2.2.0
>            Reporter: Feng Zhu
>
> The below example will fail.
> {code:java}
> val df1 = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
> val df2 = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 3))).toDF("a", "b")
> val df3 = df1.cube("a").sum("b")
> val df4 = df2.cube("a").sum("b")
> val df5 = df3.union(df4).filter("grouping_id()=0").show()
> {code}
> It will throw an Exception:
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`spark_grouping_id`' given input columns: [a, sum(b)];;
> 'Filter ('spark_grouping_id > 0)
> +- Union
>    :- Aggregate [a#17, spark_grouping_id#15], [a#17, sum(cast(b#6 as bigint)) AS sum(b)#14L]
>    :  +- Expand [List(a#5, b#6, a#16, 0), List(a#5, b#6, null, 1)], [a#5, b#6, a#17, spark_grouping_id#15]
>    :     +- Project [a#5, b#6, a#5 AS a#16]
>    :        +- Project [_1#0 AS a#5, _2#1 AS b#6]
>    :           +- LocalRelation [_1#0, _2#1]
>    +- Aggregate [a#30, spark_grouping_id#28], [a#30, sum(cast(b#6 as bigint)) AS sum(b)#27L]
>       +- Expand [List(a#5, b#6, a#29, 0), List(a#5, b#6, null, 1)], [a#5, b#6, a#30, spark_grouping_id#28]
>          +- Project [a#5, b#6, a#5 AS a#29]
>             +- Project [_1#0 AS a#5, _2#1 AS b#6]
>                +- LocalRelation [_1#0, _2#1]
>     at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>     at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
>     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
>     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>     at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
>     at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
>     at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:72)
>     at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:71)
>     at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:77)
>     at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:77)
>     at org.apache.spark.sql.execution.QueryExecution.<init>(QueryExecution.scal
[jira] [Commented] (SPARK-21966) ResolveMissingReference rule should not ignore the Union operator
[ https://issues.apache.org/jira/browse/SPARK-21966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160279#comment-16160279 ]

Apache Spark commented on SPARK-21966:
--------------------------------------

User 'DonnyZone' has created a pull request for this issue:
https://github.com/apache/spark/pull/19178
[jira] [Assigned] (SPARK-21966) ResolveMissingReference rule should not ignore the Union operator
[ https://issues.apache.org/jira/browse/SPARK-21966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21966:
------------------------------------

    Assignee: Apache Spark
[jira] [Created] (SPARK-21966) ResolveMissingReference rule should not ignore the Union operator
Feng Zhu created SPARK-21966:
--------------------------------

             Summary: ResolveMissingReference rule should not ignore the Union operator
                 Key: SPARK-21966
                 URL: https://issues.apache.org/jira/browse/SPARK-21966
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0, 2.1.1, 2.1.0
            Reporter: Feng Zhu
[jira] [Closed] (SPARK-21965) Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR
[ https://issues.apache.org/jira/browse/SPARK-21965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang closed SPARK-21965. --- Resolution: Duplicate > Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR > --- > > Key: SPARK-21965 > URL: https://issues.apache.org/jira/browse/SPARK-21965 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR.
[jira] [Commented] (SPARK-20098) DataType's typeName method returns with 'StructF' in case of StructField
[ https://issues.apache.org/jira/browse/SPARK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160276#comment-16160276 ] Hyukjin Kwon commented on SPARK-20098: -- [~jerryshao], I am sorry for asking this again, but would you mind setting the role for this user - https://issues.apache.org/jira/secure/ViewProfile.jspa?name=szalai1 - and assigning this user to this JIRA? > DataType's typeName method returns with 'StructF' in case of StructField > > > Key: SPARK-20098 > URL: https://issues.apache.org/jira/browse/SPARK-20098 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Peter Szalai > Fix For: 2.2.1, 2.3.0 > > > Currently, if you want to get the name of a DateType and the DateType is a > `StructField`, you get `StructF`. > http://spark.apache.org/docs/2.1.0/api/python/_modules/pyspark/sql/types.html
[jira] [Resolved] (SPARK-20098) DataType's typeName method returns with 'StructF' in case of StructField
[ https://issues.apache.org/jira/browse/SPARK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20098. -- Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 Issue resolved by pull request 17435 [https://github.com/apache/spark/pull/17435]