[jira] [Updated] (SPARK-31935) Hadoop file system config should be effective in data source options
[ https://issues.apache.org/jira/browse/SPARK-31935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-31935:
    Fix Version/s: 2.4.7

> Hadoop file system config should be effective in data source options
> --------------------------------------------------------------------
>
>                 Key: SPARK-31935
>                 URL: https://issues.apache.org/jira/browse/SPARK-31935
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0
>            Reporter: Gengliang Wang
>            Assignee: Gengliang Wang
>            Priority: Major
>             Fix For: 2.4.7, 3.0.1, 3.1.0
>
> Data source options should be propagated into the Hadoop configuration of
> the method `checkAndGlobPathIfNecessary`.
> From org.apache.hadoop.fs.FileSystem.java:
> {code:java}
> public static FileSystem get(URI uri, Configuration conf) throws IOException {
>   String scheme = uri.getScheme();
>   String authority = uri.getAuthority();
>   if (scheme == null && authority == null) {     // use default FS
>     return get(conf);
>   }
>   if (scheme != null && authority == null) {     // no authority
>     URI defaultUri = getDefaultUri(conf);
>     if (scheme.equals(defaultUri.getScheme())    // if scheme matches default
>         && defaultUri.getAuthority() != null) {  // & default has authority
>       return get(defaultUri, conf);              // return default
>     }
>   }
>   String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
>   if (conf.getBoolean(disableCacheName, false)) {
>     return createFileSystem(uri, conf);
>   }
>   return CACHE.get(uri, conf);
> }
> {code}
> With this, we can specify URI scheme- and authority-related configurations for
> scanning file systems.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
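The cache decision in the quoted `FileSystem.get` can be sketched in plain Scala, with no Hadoop dependency; here a `Map[String, String]` stands in for `org.apache.hadoop.conf.Configuration`, and the function name is hypothetical, not part of any real API:

```scala
// Sketch of the cache check in FileSystem.get: a Map stands in for the Hadoop
// Configuration object; usesFileSystemCache is a hypothetical name.
def usesFileSystemCache(scheme: String, hadoopConf: Map[String, String]): Boolean = {
  // Mirrors String.format("fs.%s.impl.disable.cache", scheme) in the Java code.
  val disableCacheName = s"fs.$scheme.impl.disable.cache"
  // Mirrors conf.getBoolean(disableCacheName, false): cache unless disabled.
  !hadoopConf.get(disableCacheName).exists(_.trim.equalsIgnoreCase("true"))
}
```

Once data source options reach this configuration, a per-read option such as `fs.s3a.impl.disable.cache` (an illustrative key for the s3a scheme) would make the read bypass the shared `CACHE` and build a fresh `FileSystem` with the per-read settings.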
[jira] [Created] (SPARK-32144) Retain EXTERNAL property in hive table properties
ulysses you created SPARK-32144:
---
             Summary: Retain EXTERNAL property in hive table properties
                 Key: SPARK-32144
                 URL: https://issues.apache.org/jira/browse/SPARK-32144
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: ulysses you
[jira] [Resolved] (SPARK-32142) Keep the original tests and codes to avoid potential conflicts in dev in ParquetFilterSuite and ParquetIOSuite
[ https://issues.apache.org/jira/browse/SPARK-32142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-32142.
    Fix Version/s: 3.1.0
                   3.0.1
       Resolution: Fixed

Issue resolved by pull request 28955
[https://github.com/apache/spark/pull/28955]

> Keep the original tests and codes to avoid potential conflicts in dev in
> ParquetFilterSuite and ParquetIOSuite
>
>                 Key: SPARK-32142
>                 URL: https://issues.apache.org/jira/browse/SPARK-32142
>             Project: Spark
>          Issue Type: Test
>          Components: SQL, Tests
>    Affects Versions: 3.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 3.0.1, 3.1.0
>
> There are some unnecessary diffs at
> https://github.com/apache/spark/pull/27728#discussion_r397655390, which could
> cause confusion. They will also create more diff when other people match
> their code against it.
> It is best to address the comment and minimize the diff against other
> branches.
[jira] [Assigned] (SPARK-32142) Keep the original tests and codes to avoid potential conflicts in dev in ParquetFilterSuite and ParquetIOSuite
[ https://issues.apache.org/jira/browse/SPARK-32142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-32142:
    Assignee: Hyukjin Kwon
[jira] [Updated] (SPARK-32131) Fix AnalysisException messages at UNION/EXCEPT/MINUS operations
[ https://issues.apache.org/jira/browse/SPARK-32131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32131:
    Summary: Fix AnalysisException messages at UNION/EXCEPT/MINUS operations (was: union and set operations have wrong exception infomation)

> Fix AnalysisException messages at UNION/EXCEPT/MINUS operations
> ---------------------------------------------------------------
>
>                 Key: SPARK-32131
>                 URL: https://issues.apache.org/jira/browse/SPARK-32131
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>            Reporter: philipse
>            Priority: Minor
>
> Union and set operations can only be performed on tables with compatible
> column types. When there are more than two columns, however, the error
> message reports the wrong column index. Steps to reproduce:
> Step 1: prepare test data
> {code:java}
> drop table if exists test1;
> drop table if exists test2;
> drop table if exists test3;
> create table if not exists test1(id int, age int, name timestamp);
> create table if not exists test2(id int, age timestamp, name timestamp);
> create table if not exists test3(id int, age int, name int);
> insert into test1 select 1,2,'2020-01-01 01:01:01';
> insert into test2 select 1,'2020-01-01 01:01:01','2020-01-01 01:01:01';
> insert into test3 select 1,3,4;
> {code}
> Step 2: run the queries
> {code:java}
> Query1:
> select * from test1 except select * from test2;
> Result1:
> Error: org.apache.spark.sql.AnalysisException: Except can only be performed
> on tables with the compatible column types. timestamp <> int at the second
> column of the second table;; 'Except false :- Project [id#620, age#621,
> name#622] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation
> `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
> [id#620, age#621, name#622] +- Project [id#623, age#624, name#625] +-
> SubqueryAlias `default`.`test2` +- HiveTableRelation `default`.`test2`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#623, age#624,
> name#625] (state=,code=0)
> Query2:
> select * from test1 except select * from test3;
> Result2:
> Error: org.apache.spark.sql.AnalysisException: Except can only be performed
> on tables with the compatible column types. int <> timestamp at the 2th
> column of the second table;; 'Except false :- Project [id#632, age#633,
> name#634] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation
> `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
> [id#632, age#633, name#634] +- Project [id#635, age#636, name#637] +-
> SubqueryAlias `default`.`test3` +- HiveTableRelation `default`.`test3`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#635, age#636,
> name#637] (state=,code=0)
> {code}
> The result of Query1 is correct, while Query2 reports the wrong column: the
> mismatch is at the third column, but the message says "2th".
> Current (wrong column index):
> +Error: org.apache.spark.sql.AnalysisException: Except can only be performed
> on tables with the compatible column types. int <> timestamp at the *2th*
> column of the second table+
> Proposed:
> +Error: org.apache.spark.sql.AnalysisException: Except can only be performed
> on tables with the compatible column types. int <> timestamp at the *third*
> column of the second table+
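The fix the report asks for can be sketched as a small ordinal formatter. This is a hypothetical helper for illustration, not Spark's actual code; it maps a 0-based column index to the word used in the error message:

```scala
// Hypothetical sketch: render a 0-based column index as an ordinal word for
// the AnalysisException message ("first", "second", "third", ...).
def ordinalWord(index: Int): String = index match {
  case 0 => "first"
  case 1 => "second"
  case 2 => "third"
  case n => s"${n + 1}th"  // a complete fix would also handle 21st, 22nd, etc.
}
```

With this, the `name` column (index 2) in Query2 would be reported as "the third column" instead of "the 2th column".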
[jira] [Commented] (SPARK-32131) union and set operations have wrong exception infomation
[ https://issues.apache.org/jira/browse/SPARK-32131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149108#comment-17149108 ]

Dongjoon Hyun commented on SPARK-32131:
---
I also verified that this bug exists at 2.1.3 ~ 2.3.7 and updated the affected versions.
[jira] [Updated] (SPARK-32131) union and set operations have wrong exception infomation
[ https://issues.apache.org/jira/browse/SPARK-32131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32131:
    Affects Version/s: 2.1.3
[jira] [Updated] (SPARK-32131) union and set operations have wrong exception infomation
[ https://issues.apache.org/jira/browse/SPARK-32131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32131:
    Affects Version/s: 2.2.3
[jira] [Updated] (SPARK-32131) union and set operations have wrong exception infomation
[ https://issues.apache.org/jira/browse/SPARK-32131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32131:
    Affects Version/s: 2.3.4
[jira] [Updated] (SPARK-32131) union and set operations have wrong exception infomation
[ https://issues.apache.org/jira/browse/SPARK-32131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32131:
    Affects Version/s: 2.4.6
[jira] [Commented] (SPARK-32136) Spark producing incorrect groupBy results when key is a struct with nullable properties
[ https://issues.apache.org/jira/browse/SPARK-32136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149104#comment-17149104 ]

L. C. Hsieh commented on SPARK-32136:
---
Thanks for the ping. Will look at this.

> Spark producing incorrect groupBy results when key is a struct with nullable
> properties
>
>                 Key: SPARK-32136
>                 URL: https://issues.apache.org/jira/browse/SPARK-32136
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Jason Moore
>            Priority: Major
>
> I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm
> noticing a behaviour change in a particular aggregation we're doing, and I
> think I've tracked it down to how Spark now treats nullable properties
> within the column being grouped by.
>
> Here's a simple test I've been able to set up to reproduce it:
>
> {code:scala}
> case class B(c: Option[Double])
> case class A(b: Option[B])
> val df = Seq(
>   A(None),
>   A(Some(B(None))),
>   A(Some(B(Some(1.0))))
> ).toDF
> val res = df.groupBy("b").agg(count("*"))
> {code}
> Spark 2.4.6 has the expected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       1|
> | null|       1|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],1]
> [null,1]
> [[1.0],1]
> {noformat}
> But Spark 3.0.0 has an unexpected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       2|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],2]
> [[1.0],1]
> {noformat}
> Notice how it has keyed one of the values in `b` as `[null]`; that is, an
> instance of B with a null value for the `c` property instead of a null for
> the overall value itself.
> Is this an intended change?
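The distinction the report relies on can be shown in plain Scala, without Spark: a missing struct (`None`) and a struct whose field is missing (`Some(B(None))`) are different grouping keys, so each should count once, as Spark 2.4.6 does:

```scala
// Plain-Scala grouping over the same three values as the repro above.
// None (no struct) and Some(B(None)) (struct with a null field) are distinct keys.
case class B(c: Option[Double])
val keys = Seq(Option.empty[B], Some(B(None)), Some(B(Some(1.0))))
val counts: Map[Option[B], Int] =
  keys.groupBy(identity).map { case (k, v) => (k, v.size) }
```

Here every key has count 1; the Spark 3.0.0 output above instead merges the first two keys into a single `[null]` key with count 2.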
[jira] [Commented] (SPARK-32136) Spark producing incorrect groupBy results when key is a struct with nullable properties
[ https://issues.apache.org/jira/browse/SPARK-32136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149103#comment-17149103 ]

Wenchen Fan commented on SPARK-32136:
---
cc [~viirya]
[jira] [Commented] (SPARK-32143) Fast fail when the AQE skew join produce too many splits
[ https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149098#comment-17149098 ]

Apache Spark commented on SPARK-32143:
---
User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/28961

> Fast fail when the AQE skew join produce too many splits
>
>                 Key: SPARK-32143
>                 URL: https://issues.apache.org/jira/browse/SPARK-32143
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Lantao Jin
>            Priority: Major
>
> When handling a skewed SortMergeJoin in which the matching partitions on the
> left and right sides are both skewed and are split into too many parts, plan
> generation may take a very long time.
> Even after falling back to a normal SMJ, the query cannot succeed either, so
> we should fail such a query fast.
> The logs below show that it took over an hour to generate the plan in AQE
> when handling a skewed join that produced too many splits:
> {quote}
> 20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 :
> 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split it into *39150* parts.
> 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, split it into *17022* parts.
> 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split it into 17 parts.
> ...
> 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
> {quote}
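The guard the issue proposes can be sketched as a simple check; the function and threshold names are hypothetical, not the PR's actual code. When both sides of a skewed partition are split, each left part is joined with each right part, so the number of sub-joins is the product of the split counts:

```scala
// Hypothetical fast-fail guard: reject skew handling when the sub-join count
// (leftParts x rightParts) for a partition pair exceeds a configured limit.
def skewSplitsWithinLimit(leftParts: Int, rightParts: Int, maxSubJoins: Long): Boolean =
  leftParts.toLong * rightParts <= maxSubJoins  // toLong avoids Int overflow
```

With the counts from the log above, partition 1 would produce 39150 x 17022 (over 666 million) sub-joins, which such a guard would reject up front instead of spending an hour on plan generation.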
[jira] [Assigned] (SPARK-32143) Fast fail when the AQE skew join produce too many splits
[ https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32143:
    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-32143) Fast fail when the AQE skew join produce too many splits
[ https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149095#comment-17149095 ] Apache Spark commented on SPARK-32143: -- User 'LantaoJin' has created a pull request for this issue: https://github.com/apache/spark/pull/28961
[jira] [Assigned] (SPARK-32143) Fast fail when the AQE skew join produce too many splits
[ https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32143: Assignee: Apache Spark
[jira] [Updated] (SPARK-32143) Fast fail when the AQE skew join produce too many splits
[ https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-32143: --- Description: In handling skewed SortMergeJoin, when matching partitions from the left side and the right side both have skew and split too many partitions, the plan generation may take a very long time. Even fallback to normal SMJ, the query cannot success either. So we should fast fail this query. In below logs we can see that it took over 1 hour to generate the plan in AQE when handle a skewed join which produced too many splits. {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split it into *39150* parts. 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, split it into *17022* parts. 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split it into 17 parts. ... 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000 {quote} was: In handling skewed SortMergeJoin, when matching partitions from the left side and the right side both have skew, the plan generation may take a very long time. Even fallback to normal SMJ, the query cannot success either. So we should fast fail this query. In below logs we can see that it took over 1 hour to generate the plan in AQE when handle a skewed join which produced too many splits. 
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split it into *39150* parts. 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, split it into *17022* parts. 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split it into 17 parts. ... 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000 {quote}
[jira] [Updated] (SPARK-32143) Fast fail when the AQE skew join produce too many splits
[ https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-32143: --- Description: In handling skewed SortMergeJoin, when matching partitions from the left side and the right side both have skew, the plan generation may take a very long time. Even fallback to normal SMJ, the query cannot success either. So we should fast fail this query. In below logs we can see that it took over 1 hour to generate the plan in AQE when handle a skewed join which produced too many splits. {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split it into *39150* parts. 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, split it into *17022* parts. 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split it into 17 parts. ... 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000 {quote} was: In handling skewed SortMergeJoin, when matching partitions from the left side and the right side both have skew, the the plan generation may take a very long time. Even fallback to normal SMJ, the query cannot success either. So we should fast fail this query. In below logs we can see that it took over 1 hour to generate the plan in AQE when handle a skewed join which produced too many splits. 
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split it into *39150* parts. 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, split it into *17022* parts. 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split it into 17 parts. ... 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000 {quote}
[jira] [Commented] (SPARK-32143) Fast fail when the AQE skew join produce too many splits
[ https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149084#comment-17149084 ] Lantao Jin commented on SPARK-32143: A PR will be submitted soon.
[jira] [Updated] (SPARK-32143) Fast fail when the AQE skew join produce too many splits
[ https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-32143: --- Description: In handling skewed SortMergeJoin, when matching partitions from the left side and the right side both have skew, the the plan generation may take a very long time. Even fallback to normal SMJ, the query cannot success either. So we should fast fail this query. In below logs we can see that it took over 1 hour to generate the plan in AQE when handle a skewed join which produced too many splits. {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split it into *39150* parts. 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, split it into *17022* parts. 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split it into 17 parts. ... 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000 {quote} was: In handling skewed SortMergeJoin, when matching partitions from the left side and the right side both have skew, the the plan generation may take a very long time. Actually, this query cannot success. In below logs we can see that it took over 1 hour to generate the plan in AQE when handle a skewed join which produced too many splits. {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split it into *39150* parts. 
20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, split it into *17022* parts. 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split it into 17 parts. ... 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000 {quote}
[jira] [Created] (SPARK-32143) Fast fail when the AQE skew join produce too many splits
Lantao Jin created SPARK-32143: -- Summary: Fast fail when the AQE skew join produce too many splits Key: SPARK-32143 URL: https://issues.apache.org/jira/browse/SPARK-32143 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Lantao Jin In handling skewed SortMergeJoin, when matching partitions from the left side and the right side both have skew, the the plan generation may take a very long time. Actually, this query cannot success. In below logs we can see that it took over 1 hour to generate the plan in AQE when handle a skewed join which produced too many splits. {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split it into *39150* parts. 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, split it into *17022* parts. 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split it into 17 parts. ... 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000 {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
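The "fast fail" proposed above can be sketched as a simple guard on the number of split combinations a skewed partition pair would expand into. This is a hypothetical illustration, not Spark's actual implementation; the function name and the threshold are assumptions.

```python
# Hypothetical sketch of the proposed fast-fail guard (names and limit are
# illustrative, not Spark's real AQE code).

def check_skew_join_splits(left_splits, right_splits, max_splits=100_000):
    """Fail fast when a skewed partition pair would expand into too many
    split combinations, instead of spending an hour planning them all."""
    total = left_splits * right_splits
    if total > max_splits:
        raise RuntimeError(
            f"AQE skew join would produce {total} splits "
            f"(limit: {max_splits}); failing fast.")
    return total

# The partition-1 numbers from the logs above would be rejected immediately:
# 39150 * 17022 = 666,411,300 candidate split pairs.
```

With such a guard, the query from the logs would fail within milliseconds with a clear message instead of spending over an hour in plan generation and then failing anyway.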
[jira] [Commented] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade
[ https://issues.apache.org/jira/browse/SPARK-27780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149083#comment-17149083 ] Jason Moore commented on SPARK-27780: - I encounter this on 3.0.0 running with a much older shuffle service (I think like 2.1.1). Is there any documentation on which shuffle services 3.0.0 will work with? > Shuffle server & client should be versioned to enable smoother upgrade > -- > > Key: SPARK-27780 > URL: https://issues.apache.org/jira/browse/SPARK-27780 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Imran Rashid >Priority: Major > > The external shuffle service is often upgraded at a different time than spark > itself. However, this causes problems when the protocol changes between the > shuffle service and the spark runtime -- this forces users to upgrade > everything simultaneously. > We should add versioning to the shuffle client & server, so they know what > messages the other will support. This would allow better handling of mixed > versions, from better error msgs to allowing some mismatched versions (with > reduced capabilities). > This originally came up in a discussion here: > https://github.com/apache/spark/pull/24565#issuecomment-493496466 > There are a few ways we could do the versioning which we still need to > discuss: > 1) Version specified by config. This allows for mixed versions across the > cluster and rolling upgrades. It also will let a spark 3.0 client talk to a > 2.4 shuffle service. But, may be a nuisance for users to get this right. > 2) Auto-detection during registration with local shuffle service. This makes > the versioning easy for the end user, and can even handle a 2.4 shuffle > service though it does not support the new versioning. However, it will not > handle a rolling upgrade correctly -- if the local shuffle service has been > upgraded, but other nodes in the cluster have not, it will get the version > wrong. 
> 3) Exchange versions per-connection. When a connection is opened, the server > & client could first exchange messages with their versions, so they know how > to continue communication after that. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
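Option 3 above amounts to a small negotiation step when the connection opens. A minimal sketch of that idea follows; the "advertise supported versions, continue with the highest in common" rule is an assumption for illustration, not Spark's actual shuffle protocol.

```python
# Illustrative per-connection version handshake (hypothetical, not the real
# shuffle wire protocol): each side advertises the protocol versions it
# supports, and both continue with the highest version in common.

def negotiate_version(client_versions, server_versions):
    """Return the highest mutually supported version, or None on mismatch."""
    common = set(client_versions) & set(server_versions)
    return max(common) if common else None
```

A `None` result maps to the "better error msgs" case: the server can reject the client with an explicit version-mismatch message at connection time rather than failing mid-shuffle with an opaque error.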
[jira] [Commented] (SPARK-32136) Spark producing incorrect groupBy results when key is a struct with nullable properties
[ https://issues.apache.org/jira/browse/SPARK-32136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149073#comment-17149073 ] Jason Moore commented on SPARK-32136: - Here is a similar test, and why it's a problem for what I'm needing to do: {noformat} case class C(d: Double) case class B(c: Option[C]) case class A(b: Option[B]) val df = Seq( A(None), A(Some(B(None))), A(Some(B(Some(C(1.0))))) ).toDF val res = df.groupBy("b").agg(count("*")) > res.show +-------+--------+ | b|count(1)| +-------+--------+ | [[]]| 2| |[[1.0]]| 1| +-------+--------+ > res.as[(Option[B], Long)].collect java.lang.RuntimeException: Error while decoding: java.lang.NullPointerException: Null value appeared in non-nullable field: - field (class: "scala.Double", name: "d") - option value class: "C" - field (class: "scala.Option", name: "c") - option value class: "B" - field (class: "scala.Option", name: "_1") - root class: "scala.Tuple2" If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). newInstance(class scala.Tuple2) {noformat} Interestingly, and potentially useful to know: using an Int instead of a Double above works as expected. > Spark producing incorrect groupBy results when key is a struct with nullable > properties > --- > > Key: SPARK-32136 > URL: https://issues.apache.org/jira/browse/SPARK-32136 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jason Moore >Priority: Major > > I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm > noticing a behaviour change in a particular aggregation we're doing, and I > think I've tracked it down to how Spark is now treating nullable properties > within the column being grouped by. 
> > Here's a simple test I've been able to set up to repro it: > > {code:scala} > case class B(c: Option[Double]) > case class A(b: Option[B]) > val df = Seq( > A(None), > A(Some(B(None))), > A(Some(B(Some(1.0)))) > ).toDF > val res = df.groupBy("b").agg(count("*")) > {code} > Spark 2.4.6 has the expected result: > {noformat} > > res.show > +-----+--------+ > | b|count(1)| > +-----+--------+ > | []| 1| > | null| 1| > |[1.0]| 1| > +-----+--------+ > > res.collect.foreach(println) > [[null],1] > [null,1] > [[1.0],1] > {noformat} > But Spark 3.0.0 has an unexpected result: > {noformat} > > res.show > +-----+--------+ > | b|count(1)| > +-----+--------+ > | []| 2| > |[1.0]| 1| > +-----+--------+ > > res.collect.foreach(println) > [[null],2] > [[1.0],1] > {noformat} > Notice how it has keyed one of the values in b as `[null]`; that is, an > instance of B with a null value for the `c` property instead of a null for > the overall value itself. > Is this an intended change? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
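Stripped of Spark, the reported behaviour boils down to the grouping key conflating a null struct with a struct whose fields are all null. A plain-Python illustration of why that merges two logically distinct groups (the key functions and the tuple encoding of `B` are mine, for illustration only):

```python
from collections import Counter

# Model the A(b) rows as: None == A(None), (None,) == A(Some(B(None))),
# and (1.0,) == A(Some(B(Some(1.0)))).
rows = [None, (None,), (1.0,)]

def spark24_key(b):
    # A null struct and a struct-of-null stay distinct -> three groups.
    return b

def spark30_key(b):
    # The behaviour described in the report: a null struct is keyed as a
    # struct whose fields are all null, so it collides with B(None).
    return (None,) if b is None else b

assert Counter(map(spark24_key, rows)) == {None: 1, (None,): 1, (1.0,): 1}
assert Counter(map(spark30_key, rows)) == {(None,): 2, (1.0,): 1}
```

The second assertion reproduces the digest's `[[null],2]` row: two different inputs landed under one key, and decoding that key back into `Option[B]` then trips over a null in a non-nullable `Double` field.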
[jira] [Commented] (SPARK-32088) test of pyspark.sql.functions.timestamp_seconds failed if non-american timezone setting
[ https://issues.apache.org/jira/browse/SPARK-32088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149044#comment-17149044 ] huangtianhua commented on SPARK-32088: -- the test succeeds with the modification https://github.com/apache/spark/pull/28959, [~maxgekk], thanks very much:) > test of pyspark.sql.functions.timestamp_seconds failed if non-american > timezone setting > --- > > Key: SPARK-32088 > URL: https://issues.apache.org/jira/browse/SPARK-32088 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: huangtianhua >Assignee: philipse >Priority: Major > Fix For: 3.1.0 > > > The python test failed for the aarch64 job, see > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/405/console > since the commit > https://github.com/apache/spark/commit/f0e6d0ec13d9cdadf341d1b976623345bcdb1028#diff-c8de34467c555857b92875bf78bf9d49 > merged: > ** > File > "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/sql/functions.py", > line 1435, in pyspark.sql.functions.timestamp_seconds > Failed example: > time_df.select(timestamp_seconds(time_df.unix_time).alias('ts')).collect() > Expected: > [Row(ts=datetime.datetime(2008, 12, 25, 7, 30))] > Got: > [Row(ts=datetime.datetime(2008, 12, 25, 23, 30))] > ** >1 of 3 in pyspark.sql.functions.timestamp_seconds > ***Test Failed*** 1 failures. > But this is not an arm64-related issue; I ran the test on an x86 instance with > the timezone set to UTC and it failed too, so I think the expected > datetime assumes an American/** timezone, but it seems we have not set the timezone when > running these timezone-sensitive python tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
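The underlying flakiness is generic: converting epoch seconds to a wall-clock datetime without pinning a timezone makes a doctest's expected value depend on the machine's TZ setting. A small stdlib sketch of the failure mode and the fix; the epoch value matches the doctest's `unix_time`, but the repro itself is illustrative, not the actual PySpark test code.

```python
from datetime import datetime, timedelta, timezone

unix_time = 1230219000  # the value used by the timestamp_seconds doctest

# Naive conversion: the result depends on the machine's timezone setting,
# which is why the expected value held under America/Los_Angeles but not
# under UTC (or UTC+8, which yields the 23:30 seen in the failure).
machine_local = datetime.fromtimestamp(unix_time)  # varies per machine

# Deterministic conversion: pin the zone explicitly.
pst = timezone(timedelta(hours=-8))  # US Pacific standard time
assert datetime.fromtimestamp(unix_time, tz=pst) == \
    datetime(2008, 12, 25, 7, 30, tzinfo=pst)
assert datetime.fromtimestamp(unix_time, tz=timezone.utc) == \
    datetime(2008, 12, 25, 15, 30, tzinfo=timezone.utc)
```

Setting the session/test timezone explicitly, as the linked PR does on the Spark side, is the same idea: the expected value stops depending on where Jenkins happens to run.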
[jira] [Assigned] (SPARK-32142) Keep the original tests and codes to avoid potential conflicts in dev in ParquetFilterSuite and ParquetIOSuite
[ https://issues.apache.org/jira/browse/SPARK-32142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32142: Assignee: (was: Apache Spark) > Keep the original tests and codes to avoid potential conflicts in dev in > ParquetFilterSuite and ParquetIOSuite > -- > > Key: SPARK-32142 > URL: https://issues.apache.org/jira/browse/SPARK-32142 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > There are some unnecessary diff at > https://github.com/apache/spark/pull/27728#discussion_r397655390, which could > cause potential confusions. It will make more diff when other people match > the codes with it. > It might be best to address the comment and make the diff minimized against > other branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32142) Keep the original tests and codes to avoid potential conflicts in dev in ParquetFilterSuite and ParquetIOSuite
[ https://issues.apache.org/jira/browse/SPARK-32142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32142: Assignee: Apache Spark
[jira] [Commented] (SPARK-32142) Keep the original tests and codes to avoid potential conflicts in dev in ParquetFilterSuite and ParquetIOSuite
[ https://issues.apache.org/jira/browse/SPARK-32142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149035#comment-17149035 ] Apache Spark commented on SPARK-32142: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28955
[jira] [Created] (SPARK-32142) Keep the original tests and codes to avoid potential conflicts in dev in ParquetFilterSuite and ParquetIOSuite
Hyukjin Kwon created SPARK-32142: Summary: Keep the original tests and codes to avoid potential conflicts in dev in ParquetFilterSuite and ParquetIOSuite Key: SPARK-32142 URL: https://issues.apache.org/jira/browse/SPARK-32142 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Hyukjin Kwon There are some unnecessary diff at https://github.com/apache/spark/pull/27728#discussion_r397655390, which could cause potential confusions. It will make more diff when other people match the codes with it. It might be best to address the comment and make the diff minimized against other branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32141) Repartition leads to out of memory
Lekshmi Nair created SPARK-32141: Summary: Repartition leads to out of memory Key: SPARK-32141 URL: https://issues.apache.org/jira/browse/SPARK-32141 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 2.4.4 Reporter: Lekshmi Nair We have an application that aggregates on 7 columns. To avoid shuffles, we thought of repartitioning on those 7 columns. It works well with 1 to 4 TB of data. When the data grows beyond 4 TB, the job fails with OOM or out-of-disk errors. Do we have a better approach to reduce the shuffle? For our biggest dataset, the Spark job never completed with repartition. We are out of options. We have a 24-node cluster of r5.24X machines with 1 TB of disk each. Our shuffle partition count is set to 6912. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
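The partition-count question above can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch (my own rule of thumb, not from the ticket), assuming a common target of roughly 128 MB of shuffle data per partition:

```python
import math

def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * 1024**2):
    """Return a partition count aiming at roughly target-sized partitions."""
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))

# 4 TB of shuffle data at ~128 MB per partition:
four_tb = 4 * 1024**4
print(suggest_shuffle_partitions(four_tb))  # 32768, well above the 6912 in use
```

By this crude estimate, 6912 partitions puts each partition at roughly 600 MB for a 4 TB shuffle, which is consistent with the reported OOM and disk-space failures once the data grows past that point.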
[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149019#comment-17149019 ] Jungtaek Lim commented on SPARK-32130: -- There might be some tricks to make type inference for timestamps fail faster, but that would depend on the value and will be same (even worse for adding overhead on applying tricks) for worst case. Unless the performance becomes on par or similar on every cases, turning it on by default doesn't seem to be ideal. [~dongjoon] [~cloud_fan] [~hyukjin.kwon] [~maxgekk] Could we please revisit this? > Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4 > -- > > Key: SPARK-32130 > URL: https://issues.apache.org/jira/browse/SPARK-32130 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.0 > Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, > sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using > 10.0.0.8 instead (on interface en0) > 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > Spark context Web UI available at http://10.0.0.8:4041 > Spark context available as 'sc' (master = local[*], app id = > local-1593442346864). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_251) > Type in expressions to have them evaluated. 
> Type :help for more information. >Reporter: Sanjeev Mishra >Priority: Critical > Attachments: SPARK 32130 - replication and findings.ipynb, > small-anon.tar > > > We are planning to move to Spark 3 but the read performance of our json files > is unacceptable. Following is the performance numbers when compared to Spark > 2.4 > > Spark 2.4 > scala> spark.time(spark.read.json("/data/20200528")) > Time taken: {color:#ff}19691 ms{color} > res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res61.count()) > Time taken: {color:#ff}7113 ms{color} > res64: Long = 2605349 > Spark 3.0 > scala> spark.time(spark.read.json("/data/20200528")) > 20/06/29 08:06:53 WARN package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > Time taken: {color:#ff}849652 ms{color} > res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res0.count()) > Time taken: {color:#ff}8201 ms{color} > res2: Long = 2605349 > > > I am attaching a sample data (please delete is once you are able to > reproduce the issue) that is much smaller than the actual size but the > performance comparison can still be verified. 
> The sample tar contains bunch of json.gz files, each line of the file is self > contained json doc as shown below > To reproduce the issue please untar the attachment - it will have multiple > .json.gz files whose contents will look similar to following > > {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS > HNC IGD","Annex F > Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports > Arris FastPath Speed > Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS > HNC IGD >
[jira] [Commented] (SPARK-32140) Add summary to FMClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148985#comment-17148985 ] Apache Spark commented on SPARK-32140: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28960 > Add summary to FMClassificationModel > > > Key: SPARK-32140 > URL: https://issues.apache.org/jira/browse/SPARK-32140 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32140) Add summary to FMClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32140: Assignee: (was: Apache Spark) > Add summary to FMClassificationModel > > > Key: SPARK-32140 > URL: https://issues.apache.org/jira/browse/SPARK-32140 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32140) Add summary to FMClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148984#comment-17148984 ] Apache Spark commented on SPARK-32140: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28960 > Add summary to FMClassificationModel > > > Key: SPARK-32140 > URL: https://issues.apache.org/jira/browse/SPARK-32140 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32140) Add summary to FMClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32140: Assignee: Apache Spark > Add summary to FMClassificationModel > > > Key: SPARK-32140 > URL: https://issues.apache.org/jira/browse/SPARK-32140 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28664) ORDER BY in aggregate function
[ https://issues.apache.org/jira/browse/SPARK-28664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148952#comment-17148952 ] Will Zimmerman commented on SPARK-28664: [~yumwang] - Would this allow changing the NULL ordering (e.g. NULLS FIRST or NULLS LAST)? Currently, when evaluating the MIN() of a struct, NULL appears to be treated as the minimum possible value, whereas it would be nice if the NULL ordering could be changed, or if evaluation of structs behaved like MIN() of a scalar value (where NULLs are ignored unless they are the only option). Also, I can't find in the Spark documentation what the expected behavior of NULL values within a struct is. The documentation I'm referring to is [https://spark.apache.org/docs/3.0.0-preview/sql-ref-null-semantics.html]. {code:java} SELECT ID ,COLLECT_SET(STRUCT(x,y)) AS collection ,MIN(x) AS min_of_x ,MIN(y) AS min_of_y ,MIN(STRUCT(x,y)) AS min_of_collection ,MAX(STRUCT(x,y)) AS max_of_collection FROM (values(1234390, 12.0, 'string_1'), (1234390, 37.4, 'string_2'), (1234390, 6.9, NULL), (1234390, 3.1, 'string_3'), (1234390, NULL, 'string_4'), (1234390, NULL, NULL)) AS d(ID,x,y) GROUP BY 1 {code} The result of the Spark SQL query in Spark 2.4.4 can be seen below; the desired outcome would be (3.1, string_3). !image-2020-06-30-15-49-46-796.png! 
> ORDER BY in aggregate function > -- > > Key: SPARK-28664 > URL: https://issues.apache.org/jira/browse/SPARK-28664 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: image-2020-06-30-15-49-46-796.png > > > {code:sql} > SELECT min(x ORDER BY y) FROM (VALUES(1, NULL)) AS d(x,y); > SELECT min(x ORDER BY y) FROM (VALUES(1, 2)) AS d(x,y); > {code} > https://github.com/postgres/postgres/blob/44e95b5728a4569c494fa4ea4317f8a2f50a206b/src/test/regress/sql/aggregates.sql#L978-L982 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
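The NULL-skipping semantics the comment above asks for can be illustrated outside SQL. A minimal Python sketch (an illustration of the desired behavior, not Spark's actual implementation): rows whose leading struct field is NULL are skipped, mirroring how MIN() of a plain column ignores NULLs:

```python
# The (x, y) pairs from the query in the comment, with None standing in for NULL.
rows = [
    (12.0, 'string_1'), (37.4, 'string_2'), (6.9, None),
    (3.1, 'string_3'), (None, 'string_4'), (None, None),
]

# Skip rows whose leading field is NULL, the way MIN() does for scalar columns.
non_null = [r for r in rows if r[0] is not None]
print(min(non_null))  # (3.1, 'string_3') -- the outcome the comment asks for
```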
[jira] [Updated] (SPARK-28664) ORDER BY in aggregate function
[ https://issues.apache.org/jira/browse/SPARK-28664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Zimmerman updated SPARK-28664: --- Attachment: image-2020-06-30-15-49-46-796.png > ORDER BY in aggregate function > -- > > Key: SPARK-28664 > URL: https://issues.apache.org/jira/browse/SPARK-28664 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: image-2020-06-30-15-49-46-796.png > > > {code:sql} > SELECT min(x ORDER BY y) FROM (VALUES(1, NULL)) AS d(x,y); > SELECT min(x ORDER BY y) FROM (VALUES(1, 2)) AS d(x,y); > {code} > https://github.com/postgres/postgres/blob/44e95b5728a4569c494fa4ea4317f8a2f50a206b/src/test/regress/sql/aggregates.sql#L978-L982 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20249) Add summary for LinearSVCModel
[ https://issues.apache.org/jira/browse/SPARK-20249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-20249: --- Parent: SPARK-32139 Issue Type: Sub-task (was: Improvement) > Add summary for LinearSVCModel > -- > > Key: SPARK-20249 > URL: https://issues.apache.org/jira/browse/SPARK-20249 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Jeff Zhang >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.1.0 > > > I'd like to get the ObjectiveHistory of each iteration. So I want to add > summary for LinearSVCModel as LogisticRegression. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23631) Add summary to RandomForestClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-23631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-23631: --- Parent: SPARK-32139 Issue Type: Sub-task (was: New Feature) > Add summary to RandomForestClassificationModel > -- > > Key: SPARK-23631 > URL: https://issues.apache.org/jira/browse/SPARK-23631 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: Evan Zamir >Priority: Major > Labels: bulk-closed > > I'm using the RandomForestClassificationModel and noticed that there is no > summary attribute like there is for LogisticRegressionModel. Specifically, > I'd like to have the roc and pr curves. Is that on the Spark roadmap > anywhere? Is there a reason it hasn't been implemented? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32140) Add summary to FMClassificationModel
Huaxin Gao created SPARK-32140: -- Summary: Add summary to FMClassificationModel Key: SPARK-32140 URL: https://issues.apache.org/jira/browse/SPARK-32140 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.1.0 Reporter: Huaxin Gao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31893) Add a generic ClassificationSummary trait
[ https://issues.apache.org/jira/browse/SPARK-31893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-31893: --- Parent: SPARK-32139 Issue Type: Sub-task (was: Improvement) > Add a generic ClassificationSummary trait > - > > Key: SPARK-31893 > URL: https://issues.apache.org/jira/browse/SPARK-31893 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.1.0 > > > Add a generic ClassificationSummary trait, so all the classification models > can use it to implement model summary. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32139) Unify Classification Training Summary
Huaxin Gao created SPARK-32139: -- Summary: Unify Classification Training Summary Key: SPARK-32139 URL: https://issues.apache.org/jira/browse/SPARK-32139 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.1.0 Reporter: Huaxin Gao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148936#comment-17148936 ] Sean R. Owen commented on SPARK-32130: -- So, is the issue that it's trying and failing to parse millions of things as timestamps, as part of schema inference, and almost all of that is wasted because they're not timestamps? is it possible to make the type inference for timestamps fail way faster on inputs that are obviously not dates (e.g. don't start with a number?) > Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4 > -- > > Key: SPARK-32130 > URL: https://issues.apache.org/jira/browse/SPARK-32130 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.0 > Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, > sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using > 10.0.0.8 instead (on interface en0) > 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > Spark context Web UI available at http://10.0.0.8:4041 > Spark context available as 'sc' (master = local[*], app id = > local-1593442346864). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_251) > Type in expressions to have them evaluated. > Type :help for more information. 
>Reporter: Sanjeev Mishra >Priority: Critical > Attachments: SPARK 32130 - replication and findings.ipynb, > small-anon.tar > > > We are planning to move to Spark 3 but the read performance of our json files > is unacceptable. Following is the performance numbers when compared to Spark > 2.4 > > Spark 2.4 > scala> spark.time(spark.read.json("/data/20200528")) > Time taken: {color:#ff}19691 ms{color} > res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res61.count()) > Time taken: {color:#ff}7113 ms{color} > res64: Long = 2605349 > Spark 3.0 > scala> spark.time(spark.read.json("/data/20200528")) > 20/06/29 08:06:53 WARN package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > Time taken: {color:#ff}849652 ms{color} > res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res0.count()) > Time taken: {color:#ff}8201 ms{color} > res2: Long = 2605349 > > > I am attaching a sample data (please delete is once you are able to > reproduce the issue) that is much smaller than the actual size but the > performance comparison can still be verified. 
> The sample tar contains bunch of json.gz files, each line of the file is self > contained json doc as shown below > To reproduce the issue please untar the attachment - it will have multiple > .json.gz files whose contents will look similar to following > > {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS > HNC IGD","Annex F > Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports > Arris FastPath Speed > Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS > HNC IGD > EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service > Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx
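Sean Owen's suggestion above — make timestamp inference fail fast on values that obviously aren't timestamps — can be sketched in a few lines of Python (hypothetical helper names; Spark's actual inference is Scala code in its JSON schema-inference path). A cheap first-character and length check rejects most field values before the expensive parse is ever attempted:

```python
from datetime import datetime

def looks_like_timestamp(s):
    # Cheap rejection: timestamps like "2020-06-29T08:06:53" start with a
    # digit and are at least date-length, so most strings never reach strptime.
    return len(s) >= 10 and s[0].isdigit()

def try_parse_timestamp(s, fmt="%Y-%m-%dT%H:%M:%S"):
    if not looks_like_timestamp(s):  # fast path: skip obvious non-timestamps
        return None
    try:
        return datetime.strptime(s, fmt)
    except ValueError:
        return None

print(try_parse_timestamp("ARRIS HNC IGD"))        # None, rejected cheaply
print(try_parse_timestamp("2020-06-29T08:06:53"))  # datetime(2020, 6, 29, 8, 6, 53)
```

As the comment at SPARK-32130 notes, such tricks only help on inputs the pre-check can reject; in the worst case the full parse still runs for every value.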
[jira] [Commented] (SPARK-32088) test of pyspark.sql.functions.timestamp_seconds failed if non-american timezone setting
[ https://issues.apache.org/jira/browse/SPARK-32088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148913#comment-17148913 ] Apache Spark commented on SPARK-32088: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28959 > test of pyspark.sql.functions.timestamp_seconds failed if non-american > timezone setting > --- > > Key: SPARK-32088 > URL: https://issues.apache.org/jira/browse/SPARK-32088 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: huangtianhua >Assignee: philipse >Priority: Major > Fix For: 3.1.0 > > > The python test failed for aarch64 job, see > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/405/console > since the commit > https://github.com/apache/spark/commit/f0e6d0ec13d9cdadf341d1b976623345bcdb1028#diff-c8de34467c555857b92875bf78bf9d49 > merged: > ** > File > "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/sql/functions.py", > line 1435, in pyspark.sql.functions.timestamp_seconds > Failed example: > time_df.select(timestamp_seconds(time_df.unix_time).alias('ts')).collect() > Expected: > [Row(ts=datetime.datetime(2008, 12, 25, 7, 30))] > Got: > [Row(ts=datetime.datetime(2008, 12, 25, 23, 30))] > ** >1 of 3 in pyspark.sql.functions.timestamp_seconds > ***Test Failed*** 1 failures. > But this is not arm64-related issue, I took test on x86 instance with > timezone setting of UTC, then the test failed too, so I think the expected > datetime is timezone American/**, but seems we have not set the timezone when > doing these timezone sensitive python tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
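The failure mode described in SPARK-32088 — the same epoch rendering as different wall-clock times under different local time zones — is easy to reproduce in plain Python. A minimal sketch of the general pitfall (not the PySpark doctest itself):

```python
from datetime import datetime, timezone, timedelta

epoch = 1230219000  # the value used in the timestamp_seconds doctest

# The rendered wall-clock time depends on which zone the conversion uses --
# the root cause of the doctest disagreeing between American and UTC settings.
pst = timezone(timedelta(hours=-8))
print(datetime.fromtimestamp(epoch, tz=pst))           # 2008-12-25 07:30:00-08:00
print(datetime.fromtimestamp(epoch, tz=timezone.utc))  # 2008-12-25 15:30:00+00:00
```

Pinning an explicit `tz` (or setting the process time zone in the test harness) makes the result deterministic regardless of the machine's locale.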
[jira] [Updated] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Pan updated SPARK-32121: -- Issue Type: Bug (was: Test) > ExternalShuffleBlockResolverSuite failed on Windows > --- > > Key: SPARK-32121 > URL: https://issues.apache.org/jira/browse/SPARK-32121 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0, 3.0.1 > Environment: Windows 10 >Reporter: Cheng Pan >Priority: Minor > > The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} > should consider the Windows file separator. > {code} > [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 > s <<< FAILURE! - in > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite > [ERROR] > testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite) > Time elapsed: 0 s <<< FAILURE! > org.junit.ComparisonFailure: expected: but > was: > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
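The fix the report points at — making path normalization separator-aware — can be sketched as follows (a hypothetical Python analogue; Spark's `ExecutorDiskUtils.createNormalizedInternedPathname` is Java):

```python
import re

def normalized_pathname(parent, child, basename):
    # Collapse repeated separators and accept both '/' and '\' so the
    # result is stable on Windows as well as POSIX systems.
    joined = "/".join([parent, child, basename])
    return re.sub(r"[/\\]+", "/", joined)

print(normalized_pathname("/foo", "bar", "baz"))      # /foo/bar/baz
print(normalized_pathname(r"C:\foo\\", "bar", "baz")) # C:/foo/bar/baz
```

A normalizer that only collapses forward slashes leaves backslash-separated Windows paths untouched, which is the kind of mismatch the failing assertion reports.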
[jira] [Resolved] (SPARK-31336) Support Oracle Kerberos login in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-31336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31336. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28863 [https://github.com/apache/spark/pull/28863] > Support Oracle Kerberos login in JDBC connector > --- > > Key: SPARK-31336 > URL: https://issues.apache.org/jira/browse/SPARK-31336 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31935) Hadoop file system config should be effective in data source options
[ https://issues.apache.org/jira/browse/SPARK-31935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-31935: --- Affects Version/s: (was: 3.0.1) (was: 3.1.0) 2.4.6 3.0.0 > Hadoop file system config should be effective in data source options > - > > Key: SPARK-31935 > URL: https://issues.apache.org/jira/browse/SPARK-31935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > Data source options should be propagated into the hadoop configuration of > method `checkAndGlobPathIfNecessary` > From org.apache.hadoop.fs.FileSystem.java: > {code:java} > public static FileSystem get(URI uri, Configuration conf) throws > IOException { > String scheme = uri.getScheme(); > String authority = uri.getAuthority(); > if (scheme == null && authority == null) { // use default FS > return get(conf); > } > if (scheme != null && authority == null) { // no authority > URI defaultUri = getDefaultUri(conf); > if (scheme.equals(defaultUri.getScheme())// if scheme matches > default > && defaultUri.getAuthority() != null) { // & default has authority > return get(defaultUri, conf); // return default > } > } > > String disableCacheName = String.format("fs.%s.impl.disable.cache", > scheme); > if (conf.getBoolean(disableCacheName, false)) { > return createFileSystem(uri, conf); > } > return CACHE.get(uri, conf); > } > {code} > With this, we can specify URI schema and authority related configurations for > scanning file systems. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
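The caching behavior in the quoted `FileSystem.get` can be mirrored in a small Python model (illustrative only, not Hadoop code): a per-scheme `fs.<scheme>.impl.disable.cache` flag decides between a cached instance and a fresh one, which is why per-query options must reach this configuration to take effect:

```python
_cache = {}

def get_filesystem(uri_scheme, uri_authority, conf):
    """Model of FileSystem.get: honor fs.<scheme>.impl.disable.cache."""
    disable_key = f"fs.{uri_scheme}.impl.disable.cache"
    if conf.get(disable_key, False):
        return ("fresh", uri_scheme, uri_authority)   # createFileSystem(uri, conf)
    key = (uri_scheme, uri_authority)
    if key not in _cache:                             # CACHE.get(uri, conf)
        _cache[key] = ("cached", uri_scheme, uri_authority)
    return _cache[key]

fs1 = get_filesystem("s3a", "bucket", {})
fs2 = get_filesystem("s3a", "bucket", {})
print(fs1 is fs2)  # True: one instance per (scheme, authority)
print(get_filesystem("s3a", "bucket", {"fs.s3a.impl.disable.cache": True})[0])  # fresh
```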
[jira] [Assigned] (SPARK-31336) Support Oracle Kerberos login in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-31336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31336: - Assignee: Gabor Somogyi > Support Oracle Kerberos login in JDBC connector > --- > > Key: SPARK-31336 > URL: https://issues.apache.org/jira/browse/SPARK-31336 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32132) Thriftserver interval returns "4 weeks 2 days" in 2.4 and "30 days" in 3.0
[ https://issues.apache.org/jira/browse/SPARK-32132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148848#comment-17148848 ] Juliusz Sompolski commented on SPARK-32132: --- Also, 2.4 adds "interval" at the start, while 3.0 does not. E.g. "interval 3 days" in 2.4 and "3 days" in 3.0. I actually think the new 3.0 results are better / more standard, and I haven't heard of anyone complaining that it broke the way they parse it. Edit: [~cloud_fan] posting now the above comment that I thought I had posted yesterday, but it had stayed unsent in an open tab. It causes some issues with unit tests, but I think it shouldn't cause real-world problems, and in any case the new format is likely better for the future. Thanks for explaining. > Thriftserver interval returns "4 weeks 2 days" in 2.4 and "30 days" in 3.0 > -- > > Key: SPARK-32132 > URL: https://issues.apache.org/jira/browse/SPARK-32132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Juliusz Sompolski >Priority: Minor > > In https://github.com/apache/spark/pull/26418, a setting > spark.sql.dialect.intervalOutputStyle was implemented, to control interval > output style. This PR also removed "toString" from CalendarInterval. This > change got reverted in https://github.com/apache/spark/pull/27304, and the > CalendarInterval.toString got implemented back in > https://github.com/apache/spark/pull/26572. > But it behaves differently now: In 2.4 "4 weeks 2 days" are returned, and 3.0 > returns "30 days". > Thriftserver uses HiveResults.toHiveString, which uses > CalendarInterval.toString to return interval results as string. The results > are now different in 3.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
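The formatting difference under discussion can be pinned down with a toy formatter. A hedged sketch (illustrative only, not Spark's CalendarInterval code) showing the 2.4-style weeks-plus-days rendering next to the 3.0-style plain day count:

```python
def format_24_style(days):
    # Spark 2.4 style: split into weeks + days, prefixed with "interval"
    weeks, rem = divmod(days, 7)
    parts = []
    if weeks:
        parts.append(f"{weeks} weeks")
    if rem:
        parts.append(f"{rem} days")
    return "interval " + " ".join(parts)

def format_30_style(days):
    # Spark 3.0 style: plain day count, no "interval" prefix
    return f"{days} days"

print(format_24_style(30))  # interval 4 weeks 2 days
print(format_30_style(30))  # 30 days
```

Any consumer that parses the string representation (rather than a typed interval value) sees the two formats as incompatible, which is why the unit tests mentioned in the comment break.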
[jira] [Updated] (SPARK-31935) Hadoop file system config should be effective in data source options
[ https://issues.apache.org/jira/browse/SPARK-31935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31935: Issue Type: Bug (was: Improvement) > Hadoop file system config should be effective in data source options > - > > Key: SPARK-31935 > URL: https://issues.apache.org/jira/browse/SPARK-31935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > Data source options should be propagated into the hadoop configuration of > method `checkAndGlobPathIfNecessary` > From org.apache.hadoop.fs.FileSystem.java: > {code:java} > public static FileSystem get(URI uri, Configuration conf) throws > IOException { > String scheme = uri.getScheme(); > String authority = uri.getAuthority(); > if (scheme == null && authority == null) { // use default FS > return get(conf); > } > if (scheme != null && authority == null) { // no authority > URI defaultUri = getDefaultUri(conf); > if (scheme.equals(defaultUri.getScheme())// if scheme matches > default > && defaultUri.getAuthority() != null) { // & default has authority > return get(defaultUri, conf); // return default > } > } > > String disableCacheName = String.format("fs.%s.impl.disable.cache", > scheme); > if (conf.getBoolean(disableCacheName, false)) { > return createFileSystem(uri, conf); > } > return CACHE.get(uri, conf); > } > {code} > With this, we can specify URI schema and authority related configurations for > scanning file systems. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32132) Thriftserver interval returns "4 weeks 2 days" in 2.4 and "30 days" in 3.0
[ https://issues.apache.org/jira/browse/SPARK-32132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148836#comment-17148836 ] Wenchen Fan commented on SPARK-32132: - We did it intentionally in https://github.com/apache/spark/commit/fb60c2a170b40e80285c1255f0635d5ab319cd35 . INTERVAL is an internal data type and can't be written out, and we don't guarantee the string representation stability of interval values. It's indeed hard to read the interval string like "4 weeks 2 days", and I think it's better to use "30 days". Does it cause issues at thriftserver side? > Thriftserver interval returns "4 weeks 2 days" in 2.4 and "30 days" in 3.0 > -- > > Key: SPARK-32132 > URL: https://issues.apache.org/jira/browse/SPARK-32132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Juliusz Sompolski >Priority: Minor > > In https://github.com/apache/spark/pull/26418, a setting > spark.sql.dialect.intervalOutputStyle was implemented, to control interval > output style. This PR also removed "toString" from CalendarInterval. This > change got reverted in https://github.com/apache/spark/pull/27304, and the > CalendarInterval.toString got implemented back in > https://github.com/apache/spark/pull/26572. > But it behaves differently now: In 2.4 "4 weeks 2 days" are returned, and 3.0 > returns "30 days". > Thriftserver uses HiveResults.toHiveString, which uses > CalendarInterval.toString to return interval results as string. The results > are now different in 3.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32026) Add PrometheusServletSuite
[ https://issues.apache.org/jira/browse/SPARK-32026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eren Avsarogullari updated SPARK-32026: --- Description: This Jira aims to add _PrometheusServletSuite_. *Note:* This Jira was created to propose a _Prometheus_ driver metric format change to be consistent with the executor metric format. However, currently, _PrometheusServlet_ follows the _Spark 3.0 JMX Sink + Prometheus JMX Exporter_ format, so this Jira's coverage has been converted to just _PrometheusServletSuite_ in light of the discussion on https://github.com/apache/spark/pull/28865. was: Spark 3.0 introduces a native Prometheus Sink for both driver and executor metrics. However, they need a consistent format (e.g. `applicationId`). Currently, the driver embeds `applicationId` in the metric name. If this can be extracted into a label, as in the executor metric format, it improves consistency and makes querying easier. *Driver* {code:java} metrics_local_1592242896665_driver_BlockManager_memory_memUsed_MB_Value{type="gauges"} 0{code} *Executor* {code:java} metrics_executor_memoryUsed_bytes{application_id="local-1592242896665", application_name="apache-spark-fundamentals", executor_id="driver"} 24356{code} *Proposed Driver Format* {code:java} metrics_driver_BlockManager_memory_memUsed_MB_Value{application_id="local-1592242896665", type="gauges"} 0{code} > Add PrometheusServletSuite > -- > > Key: SPARK-32026 > URL: https://issues.apache.org/jira/browse/SPARK-32026 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Eren Avsarogullari >Priority: Major > > This Jira aims to add _PrometheusServletSuite_. > *Note:* This Jira was created to propose a _Prometheus_ driver metric format > change to be consistent with the executor metric format. 
However, currently, > _PrometheusServlet_ follows the _Spark 3.0 JMX Sink + Prometheus JMX Exporter_ > format, so this Jira's coverage has been converted to just > _PrometheusServletSuite_ in light of the discussion on > https://github.com/apache/spark/pull/28865. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
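The metric lines quoted in this issue follow the Prometheus text exposition format: a metric name, a `{label="value", ...}` label set, and a sample value. A minimal, hedged sketch of rendering such lines (the helper name `prom_line` is illustrative, not a Spark or Prometheus client API):

```python
# Render one Prometheus exposition-format sample line, carrying
# application_id as a label (the proposed driver format) rather than
# embedding it in the metric name.
def prom_line(name, labels, value):
    body = ", ".join('%s="%s"' % (k, v) for k, v in labels.items())
    return "%s{%s} %s" % (name, body, value)
```

Keeping `application_id` as a label lets the same metric name be queried across applications, which is the consistency argument made in the original description.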
[jira] [Commented] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148790#comment-17148790 ] Kousuke Saruta commented on SPARK-32119: Yeah, I know it works with extraClassPath, but as you mention, it's a limitation. > ExecutorPlugin doesn't work with Standalone Cluster > --- > > Key: SPARK-32119 > URL: https://issues.apache.org/jira/browse/SPARK-32119 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > ExecutorPlugin can't work with Standalone Cluster (and possibly with other cluster > managers except YARN) > when a jar containing plugins, and files used by those plugins, are added via the > --jars and --files options of spark-submit. > This is because jars and files added by --jars and --files are not loaded on > Executor initialization. > I confirmed it works with YARN because jars/files are distributed via the > distributed cache. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in codes and documentation
[ https://issues.apache.org/jira/browse/SPARK-32138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32138: Assignee: Apache Spark > Drop Python 2, 3.4 and 3.5 in codes and documentation > - > > Key: SPARK-32138 > URL: https://issues.apache.org/jira/browse/SPARK-32138 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in codes and documentation
[ https://issues.apache.org/jira/browse/SPARK-32138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32138: Assignee: (was: Apache Spark) > Drop Python 2, 3.4 and 3.5 in codes and documentation > - > > Key: SPARK-32138 > URL: https://issues.apache.org/jira/browse/SPARK-32138 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in codes and documentation
[ https://issues.apache.org/jira/browse/SPARK-32138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148766#comment-17148766 ] Apache Spark commented on SPARK-32138: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28957 > Drop Python 2, 3.4 and 3.5 in codes and documentation > - > > Key: SPARK-32138 > URL: https://issues.apache.org/jira/browse/SPARK-32138 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in codes and documentation
[ https://issues.apache.org/jira/browse/SPARK-32138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32138: - Summary: Drop Python 2, 3.4 and 3.5 in codes and documentation (was: Drop Python 2, 3.4 and 3.5 in the main and dev codes) > Drop Python 2, 3.4 and 3.5 in codes and documentation > - > > Key: SPARK-32138 > URL: https://issues.apache.org/jira/browse/SPARK-32138 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29802) Update remaining python scripts in repo to python3 shebang
[ https://issues.apache.org/jira/browse/SPARK-29802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29802: - Summary: Update remaining python scripts in repo to python3 shebang (was: update remaining python scripts in repo to python3 shebang) > Update remaining python scripts in repo to python3 shebang > -- > > Key: SPARK-29802 > URL: https://issues.apache.org/jira/browse/SPARK-29802 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > > there are a bunch of scripts in the repo that need to have their shebang > updated to python3: > {noformat} > dev/create-release/releaseutils.py:#!/usr/bin/env python > dev/create-release/generate-contributors.py:#!/usr/bin/env python > dev/create-release/translate-contributors.py:#!/usr/bin/env python > dev/github_jira_sync.py:#!/usr/bin/env python > dev/merge_spark_pr.py:#!/usr/bin/env python > python/pyspark/version.py:#!/usr/bin/env python > python/pyspark/find_spark_home.py:#!/usr/bin/env python > python/setup.py:#!/usr/bin/env python{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
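The mechanical change the list above describes is a one-line shebang rewrite per file. A hedged sketch of that transformation (the helper name `update_shebang` is illustrative; the file paths in the issue are the real targets):

```python
# Rewrite a "#!/usr/bin/env python" shebang to python3, leaving files
# without that exact shebang (including ones already on python3) untouched.
def update_shebang(text):
    old = "#!/usr/bin/env python\n"
    new = "#!/usr/bin/env python3\n"
    if text.startswith(old):
        return new + text[len(old):]
    return text
```

Note the trailing newline in the match: `"#!/usr/bin/env python3\n..."` does not start with `old`, so applying the rewrite twice is harmless.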
[jira] [Updated] (SPARK-29919) Remove python2 test execution in Jenkins environment
[ https://issues.apache.org/jira/browse/SPARK-29919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29919: - Summary: Remove python2 test execution in Jenkins environment (was: remove python2 test execution) > Remove python2 test execution in Jenkins environment > > > Key: SPARK-29919 > URL: https://issues.apache.org/jira/browse/SPARK-29919 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > remove python2.7 (including pypy2) test executables from 'python/run-tests.py' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148755#comment-17148755 ] Hyukjin Kwon edited comment on SPARK-29803 at 6/30/20, 3:07 PM: I will do it at SPARK-32138 was (Author: hyukjin.kwon): I will do it at SPARK-29909 > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > there are 135 python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in the main and dev codes
Hyukjin Kwon created SPARK-32138: Summary: Drop Python 2, 3.4 and 3.5 in the main and dev codes Key: SPARK-32138 URL: https://issues.apache.org/jira/browse/SPARK-32138 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.1.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29919) remove python2 test execution
[ https://issues.apache.org/jira/browse/SPARK-29919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29919: - Parent: SPARK-29909 Issue Type: Sub-task (was: Improvement) > remove python2 test execution > - > > Key: SPARK-29919 > URL: https://issues.apache.org/jira/browse/SPARK-29919 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > remove python2.7 (including pypy2) test executables from 'python/run-tests.py' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29919) remove python2 test execution
[ https://issues.apache.org/jira/browse/SPARK-29919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29919: - Parent: (was: SPARK-29909) Issue Type: Improvement (was: Sub-task) > remove python2 test execution > - > > Key: SPARK-29919 > URL: https://issues.apache.org/jira/browse/SPARK-29919 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > remove python2.7 (including pypy2) test executables from 'python/run-tests.py' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148755#comment-17148755 ] Hyukjin Kwon commented on SPARK-29803: -- I will do it at SPARK-29909 > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Sub-task > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > there are 135 python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
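The cleanup described above is a simple line filter: `from __future__ import print_function` is a no-op on Python 3 and can be dropped wherever it appears. A hedged sketch (the helper name is illustrative):

```python
# Remove "from __future__ import print_function" lines from a source string,
# preserving all other lines and their line endings.
def strip_future_print(source):
    keep = [line for line in source.splitlines(True)
            if line.strip() != "from __future__ import print_function"]
    return "".join(keep)
```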
[jira] [Updated] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29803: - Parent: (was: SPARK-29909) Issue Type: Improvement (was: Sub-task) > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > there are 135 python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29803. -- Resolution: Duplicate > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Sub-task > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > there are 135 python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148746#comment-17148746 ] Thomas Graves commented on SPARK-32119: --- You can specify the jars in extraClassPath but it requires them to be on the node. > ExecutorPlugin doesn't work with Standalone Cluster > --- > > Key: SPARK-32119 > URL: https://issues.apache.org/jira/browse/SPARK-32119 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster > manager too except YARN. ) > when a jar which contains plugins and files used by the plugins are added by > --jars and --files option with spark-submit. > This is because jars and files added by --jars and --files are not loaded on > Executor initialization. > I confirmed it works with YARN because jars/files are distributed as > distributed cache. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32135) Show Spark Driver name on Spark history web page
[ https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148742#comment-17148742 ] Thomas Graves commented on SPARK-32135: --- [~gaurangi] can you please clarify what you mean by "driver host" and where you want to see it? The history server is an independent process and could be on a separate host from everything else. If you mean within an application, you can see it by going to the executors page. > Show Spark Driver name on Spark history web page > > > Key: SPARK-32135 > URL: https://issues.apache.org/jira/browse/SPARK-32135 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Gaurangi Saxena >Priority: Minor > > We would like to see spark driver host on the history server web page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31816) Create high level description about JDBC connection providers for users/developers
[ https://issues.apache.org/jira/browse/SPARK-31816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148712#comment-17148712 ] Gabor Somogyi commented on SPARK-31816: --- Originally I thought it would be enough to add a developer description, but since I've received several questions from users about what's supported, I've reconsidered my original idea. I'm planning to add both user and developer descriptions. > Create high level description about JDBC connection providers for > users/developers > -- > > Key: SPARK-31816 > URL: https://issues.apache.org/jira/browse/SPARK-31816 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31816) Create high level description about JDBC connection providers for users/developers
[ https://issues.apache.org/jira/browse/SPARK-31816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-31816: -- Summary: Create high level description about JDBC connection providers for users/developers (was: Create high level description about JDBC connection providers for developers) > Create high level description about JDBC connection providers for > users/developers > -- > > Key: SPARK-31816 > URL: https://issues.apache.org/jira/browse/SPARK-31816 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32068) Spark 3 UI task launch time show in error time zone
[ https://issues.apache.org/jira/browse/SPARK-32068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-32068: - Assignee: JinxinTang > Spark 3 UI task launch time show in error time zone > --- > > Key: SPARK-32068 > URL: https://issues.apache.org/jira/browse/SPARK-32068 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Smith Cruise >Assignee: JinxinTang >Priority: Major > Labels: easyfix > Fix For: 3.0.1, 3.1.0 > > Attachments: correct.png, incorrect.png > > > For example, > In this link: history/app-20200623133209-0015/stages/ , stage submit time is > correct (CST) > > But in this link: > history/app-20200623133209-0015/stages/stage/?id=0=0 , task launch > time is incorrect(UTC) > > The same problem exists in port 4040 Web UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32068) Spark 3 UI task launch time show in error time zone
[ https://issues.apache.org/jira/browse/SPARK-32068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-32068. --- Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed > Spark 3 UI task launch time show in error time zone > --- > > Key: SPARK-32068 > URL: https://issues.apache.org/jira/browse/SPARK-32068 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Smith Cruise >Priority: Major > Labels: easyfix > Fix For: 3.0.1, 3.1.0 > > Attachments: correct.png, incorrect.png > > > For example, > In this link: history/app-20200623133209-0015/stages/ , stage submit time is > correct (CST) > > But in this link: > history/app-20200623133209-0015/stages/stage/?id=0=0 , task launch > time is incorrect(UTC) > > The same problem exists in port 4040 Web UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31797) Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions
[ https://issues.apache.org/jira/browse/SPARK-31797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148694#comment-17148694 ] Apache Spark commented on SPARK-31797: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/28956 > Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions > --- > > Key: SPARK-31797 > URL: https://issues.apache.org/jira/browse/SPARK-31797 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 > Environment: [PR link|https://github.com/apache/spark/pull/28534] >Reporter: JinxinTang >Assignee: JinxinTang >Priority: Major > Fix For: 3.1.0 > > > 1.Add and register three new functions: {{TIMESTAMP_SECONDS}}, > {{TIMESTAMP_MILLIS}} and {{TIMESTAMP_MICROS}} > Reference: > [BigQuery|https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds] > 2.People will have convenient way to get timestamps from seconds,milliseconds > and microseconds. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31797) Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions
[ https://issues.apache.org/jira/browse/SPARK-31797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148696#comment-17148696 ] Apache Spark commented on SPARK-31797: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/28956 > Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions > --- > > Key: SPARK-31797 > URL: https://issues.apache.org/jira/browse/SPARK-31797 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 > Environment: [PR link|https://github.com/apache/spark/pull/28534] >Reporter: JinxinTang >Assignee: JinxinTang >Priority: Major > Fix For: 3.1.0 > > > 1.Add and register three new functions: {{TIMESTAMP_SECONDS}}, > {{TIMESTAMP_MILLIS}} and {{TIMESTAMP_MICROS}} > Reference: > [BigQuery|https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds] > 2.People will have convenient way to get timestamps from seconds,milliseconds > and microseconds. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
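The semantics of the three functions named above can be sketched in plain Python. This is a hedged illustration of the intended conversions (epoch seconds/milliseconds/microseconds to a UTC timestamp), not the Spark SQL implementation:

```python
# Build UTC timestamps from epoch counts at three resolutions, mirroring
# TIMESTAMP_SECONDS / TIMESTAMP_MILLIS / TIMESTAMP_MICROS.
from datetime import datetime, timezone

def timestamp_seconds(s):
    return datetime.fromtimestamp(s, tz=timezone.utc)

def timestamp_millis(ms):
    return timestamp_seconds(ms / 1000.0)

def timestamp_micros(us):
    return timestamp_seconds(us / 1_000_000.0)
```

(The float division is fine for a sketch; the real implementation works on integral microseconds to avoid precision loss.)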
[jira] [Comment Edited] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148663#comment-17148663 ] Sanjeev Mishra edited comment on SPARK-32130 at 6/30/20, 1:35 PM: -- I tried to load entire dataset using above suggestions and time did come close but Spark 2.4 still beats Spark 3 by ~3+ sec (may be in acceptable range but question is why?). I would definitely agree above comment that if the default behavior is changed the community MUST be notified otherwise it is a huge time waste for all. Here are the various results *Spark 2.4* spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false"){color}.json("/data/20200528/").count) Time taken: {color:#ff}29706 ms{color} res0: Long = 2605349 spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false").option("prefersDecimal","false"{color}).json("/data/20200528/").count) Time taken: {color:#de350b}31431 ms{color} res0: Long = 2605349 *Spark 3.0* spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false"{color}).json("/data/20200528/").count) Time taken: {color:#de350b}32826 ms{color} res0: Long = 2605349 spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false").option("prefersDecimal","false"{color}).json("/data/20200528/").count) Time taken: {color:#de350b}34011 ms{color} res0: Long = 2605349 {color:#de350b}*Note:*{color} # Make sure {color:#de350b}you never turn on prefersDecimal to true{color} even when inferTimestamp is false, it again takes huge amount of time. # Spark 3.0 + JDK 11 is slower than Spark 3.0 + JDK 8 by almost 6 sec. was (Author: lotus2you): I tried to load entire dataset using above suggestions and time did come close but Spark 2.4 still beats Spark 3 by ~3+ sec (may be in acceptable range but question is why?). I would definitely agree above comment that if the default behavior is changed the community MUST be notified otherwise it is a huge time waste for all. 
Here are the various results *Spark 2.4* spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false"){color}.json("/data/20200528/").count) Time taken: {color:#FF}29706 ms{color} res0: Long = 2605349 spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false").option("prefersDecimal","false"{color}).json("/data/20200528/").count) Time taken: {color:#de350b}31431 ms{color} res0: Long = 2605349 *Spark 3.0* spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false"{color}).json("/data/20200528/").count) Time taken: {color:#de350b}32826 ms{color} res0: Long = 2605349 spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false").option("prefersDecimal","false"{color}).json("/data/20200528/").count) Time taken: {color:#de350b}34011 ms{color} res0: Long = 2605349 {color:#de350b}*Note:*{color} Make sure {color:#de350b}you never turn on prefersDecimal to true{color} even when inferTimestamp is false, it again takes huge amount of time. > Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4 > -- > > Key: SPARK-32130 > URL: https://issues.apache.org/jira/browse/SPARK-32130 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.0 > Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, > sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using > 10.0.0.8 instead (on interface en0) > 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. 
> Spark context Web UI available at http://10.0.0.8:4041 > Spark context available as 'sc' (master = local[*], app id = > local-1593442346864). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_251) > Type in expressions to have them evaluated. > Type :help for more information. >Reporter: Sanjeev Mishra >Priority: Critical > Attachments: SPARK 32130 - replication and findings.ipynb, > small-anon.tar > > > We are planning to move to Spark 3 but the read performance of our json files > is unacceptable. Following is the performance numbers
[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148663#comment-17148663 ] Sanjeev Mishra commented on SPARK-32130: I tried to load entire dataset using above suggestions and time did come close but Spark 2.4 still beats Spark 3 by ~3+ sec (may be in acceptable range but question is why?). I would definitely agree above comment that if the default behavior is changed the community MUST be notified otherwise it is a huge time waste for all. Here are the various results *Spark 2.4* spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false"){color}.json("/data/20200528/").count) Time taken: {color:#FF}29706 ms{color} res0: Long = 2605349 spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false").option("prefersDecimal","false"{color}).json("/data/20200528/").count) Time taken: {color:#de350b}31431 ms{color} res0: Long = 2605349 *Spark 3.0* spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false"{color}).json("/data/20200528/").count) Time taken: {color:#de350b}32826 ms{color} res0: Long = 2605349 spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false").option("prefersDecimal","false"{color}).json("/data/20200528/").count) Time taken: {color:#de350b}34011 ms{color} res0: Long = 2605349 {color:#de350b}*Note:*{color} Make sure {color:#de350b}you never turn on prefersDecimal to true{color} even when inferTimestamp is false, it again takes huge amount of time. 
> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4 > -- > > Key: SPARK-32130 > URL: https://issues.apache.org/jira/browse/SPARK-32130 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.0 > Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, > sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using > 10.0.0.8 instead (on interface en0) > 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > Spark context Web UI available at http://10.0.0.8:4041 > Spark context available as 'sc' (master = local[*], app id = > local-1593442346864). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_251) > Type in expressions to have them evaluated. > Type :help for more information. >Reporter: Sanjeev Mishra >Priority: Critical > Attachments: SPARK 32130 - replication and findings.ipynb, > small-anon.tar > > > We are planning to move to Spark 3 but the read performance of our json files > is unacceptable. Following is the performance numbers when compared to Spark > 2.4 > > Spark 2.4 > scala> spark.time(spark.read.json("/data/20200528")) > Time taken: {color:#ff}19691 ms{color} > res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 
5 > more fields] > scala> spark.time(res61.count()) > Time taken: {color:#ff}7113 ms{color} > res64: Long = 2605349 > Spark 3.0 > scala> spark.time(spark.read.json("/data/20200528")) > 20/06/29 08:06:53 WARN package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > Time taken: {color:#ff}849652 ms{color} > res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res0.count()) > Time taken: {color:#ff}8201 ms{color} > res2: Long = 2605349 > > > I am attaching a sample data (please delete is once you are able to > reproduce the issue) that is much smaller than the actual size but the > performance comparison can still be verified. > The sample tar contains bunch of json.gz files, each line of the file is self > contained json doc as shown below > To reproduce the issue please untar the attachment - it will have multiple > .json.gz files whose contents will look similar to following > > {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS > HNC IGD","Annex F >
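For reference, the reads being timed above can be reproduced in spark-shell with a sketch like the following (the path is the one from the report; substitute local data):

```scala
// Sketch: time the same JSON read with timestamp inference on (the Spark 3.0
// default) and explicitly off, to isolate the cost of inferTimestamp.
val path = "/data/20200528"   // path from the report; illustrative

spark.time(spark.read.json(path).count())   // Spark 3.0: inferTimestamp is on by default
spark.time(spark.read
  .option("inferTimestamp", "false")        // opt out of timestamp inference
  .option("prefersDecimal", "false")        // already the default; shown for completeness
  .json(path).count())
```

The two timings differ only in schema inference; the count itself is comparable across versions, which is why the reporter's numbers isolate the regression to the read path.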
[jira] [Updated] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sanjeev Mishra updated SPARK-32130: --- Description: We are planning to move to Spark 3, but the read performance of our json files is unacceptable. The following are the performance numbers compared to Spark 2.4.

Spark 2.4
scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349

I am attaching sample data (please delete it once you are able to reproduce the issue) that is much smaller than the actual size, but the performance comparison can still be verified. 
The sample tar contains bunch of json.gz files, each line of the file is self contained json doc as shown below To reproduce the issue please untar the attachment - it will have multiple .json.gz files whose contents will look similar to following {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[\\{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":\\{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":\\{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":\\{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":\\{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":\\{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":\\{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":\\{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}{color} {quote} {quote}{color:#741b47}{"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":\{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports Arris FastPath Speed Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD
[jira] [Updated] (SPARK-29919) remove python2 test execution
[ https://issues.apache.org/jira/browse/SPARK-29919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29919: - Parent: (was: SPARK-27884) Issue Type: Bug (was: Sub-task) > remove python2 test execution > - > > Key: SPARK-29919 > URL: https://issues.apache.org/jira/browse/SPARK-29919 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > remove python2.7 (including pypy2) test executables from 'python/run-tests.py' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29919) remove python2 test execution
[ https://issues.apache.org/jira/browse/SPARK-29919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29919: - Parent: SPARK-29909 Issue Type: Sub-task (was: Bug) > remove python2 test execution > - > > Key: SPARK-29919 > URL: https://issues.apache.org/jira/browse/SPARK-29919 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > remove python2.7 (including pypy2) test executables from 'python/run-tests.py' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31060) Handle column names containing `dots` in data source `Filter`
[ https://issues.apache.org/jira/browse/SPARK-31060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148594#comment-17148594 ] Apache Spark commented on SPARK-31060: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28955 > Handle column names containing `dots` in data source `Filter` > - > > Key: SPARK-31060 > URL: https://issues.apache.org/jira/browse/SPARK-31060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31026) Parquet predicate pushdown on columns with dots
[ https://issues.apache.org/jira/browse/SPARK-31026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148593#comment-17148593 ] Apache Spark commented on SPARK-31026: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28955 > Parquet predicate pushdown on columns with dots > --- > > Key: SPARK-31026 > URL: https://issues.apache.org/jira/browse/SPARK-31026 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > > Parquet predicate pushdown on columns with dots is disabled in -SPARK-20364- > due to Parquet's APIs don't support it. A new set of APIs is purposed in > PARQUET-1809 to generalize the support of nested cols which can address this > issue. This implementation will be merged into Spark repo first until we get > a new release from Parquet community. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31060) Handle column names containing `dots` in data source `Filter`
[ https://issues.apache.org/jira/browse/SPARK-31060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148592#comment-17148592 ] Apache Spark commented on SPARK-31060: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28955 > Handle column names containing `dots` in data source `Filter` > - > > Key: SPARK-31060 > URL: https://issues.apache.org/jira/browse/SPARK-31060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31026) Parquet predicate pushdown on columns with dots
[ https://issues.apache.org/jira/browse/SPARK-31026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148590#comment-17148590 ] Apache Spark commented on SPARK-31026: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28955 > Parquet predicate pushdown on columns with dots > --- > > Key: SPARK-31026 > URL: https://issues.apache.org/jira/browse/SPARK-31026 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > > Parquet predicate pushdown on columns with dots is disabled in -SPARK-20364- > due to Parquet's APIs don't support it. A new set of APIs is purposed in > PARQUET-1809 to generalize the support of nested cols which can address this > issue. This implementation will be merged into Spark repo first until we get > a new release from Parquet community. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
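To make the limitation concrete, here is a sketch of the kind of query these linked PRs address (the path and column name are hypothetical): a column whose name literally contains a dot must be referenced with backticks so the dot is not parsed as struct-field access, and before this change the resulting `Filter` could not be pushed down to Parquet.

```scala
import org.apache.spark.sql.functions.col

// Hypothetical Parquet file with a top-level column literally named "a.b".
val df = spark.read.parquet("/path/to/dotted-cols")

// Backticks keep `a.b` as one column name instead of field b of struct a.
// With the fix, the scan node's PushedFilters should include this predicate.
df.filter(col("`a.b`") > 4).explain()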
[jira] [Commented] (SPARK-17636) Parquet predicate pushdown for nested fields
[ https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148589#comment-17148589 ] Apache Spark commented on SPARK-17636: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28955 > Parquet predicate pushdown for nested fields > > > Key: SPARK-17636 > URL: https://issues.apache.org/jira/browse/SPARK-17636 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 1.6.2, 1.6.3, 2.0.2 >Reporter: Mitesh >Assignee: DB Tsai >Priority: Critical > Fix For: 3.0.0 > > > There's a *PushedFilters* for a simple numeric field, but not for a numeric > field inside a struct. Not sure if this is a Spark limitation because of > Parquet, or only a Spark limitation. > {noformat} > scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", > "sale_id") > res5: org.apache.spark.sql.DataFrame = [day_timestamp: > struct, sale_id: bigint] > scala> res5.filter("sale_id > 4").queryExecution.executedPlan > res9: org.apache.spark.sql.execution.SparkPlan = > Filter[23814] [args=(sale_id#86324L > > 4)][outPart=UnknownPartitioning(0)][outOrder=List()] > +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: > s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)] > scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan > res10: org.apache.spark.sql.execution.SparkPlan = > Filter[23815] [args=(day_timestamp#86302.timestamp > > 4)][outPart=UnknownPartitioning(0)][outOrder=List()] > +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: > s3a://some/parquet/file > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17636) Parquet predicate pushdown for nested fields
[ https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148588#comment-17148588 ] Apache Spark commented on SPARK-17636: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28955 > Parquet predicate pushdown for nested fields > > > Key: SPARK-17636 > URL: https://issues.apache.org/jira/browse/SPARK-17636 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 1.6.2, 1.6.3, 2.0.2 >Reporter: Mitesh >Assignee: DB Tsai >Priority: Critical > Fix For: 3.0.0 > > > There's a *PushedFilters* for a simple numeric field, but not for a numeric > field inside a struct. Not sure if this is a Spark limitation because of > Parquet, or only a Spark limitation. > {noformat} > scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", > "sale_id") > res5: org.apache.spark.sql.DataFrame = [day_timestamp: > struct, sale_id: bigint] > scala> res5.filter("sale_id > 4").queryExecution.executedPlan > res9: org.apache.spark.sql.execution.SparkPlan = > Filter[23814] [args=(sale_id#86324L > > 4)][outPart=UnknownPartitioning(0)][outOrder=List()] > +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: > s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)] > scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan > res10: org.apache.spark.sql.execution.SparkPlan = > Filter[23815] [args=(day_timestamp#86302.timestamp > > 4)][outPart=UnknownPartitioning(0)][outOrder=List()] > +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: > s3a://some/parquet/file > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25556) Predicate Pushdown for Nested fields
[ https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148587#comment-17148587 ] Apache Spark commented on SPARK-25556: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28955 > Predicate Pushdown for Nested fields > > > Key: SPARK-25556 > URL: https://issues.apache.org/jira/browse/SPARK-25556 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > > This is an umbrella JIRA to support predicate pushdown for nested fields. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32083) Unnecessary tasks are launched when input is empty with AQE
[ https://issues.apache.org/jira/browse/SPARK-32083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148585#comment-17148585 ] Apache Spark commented on SPARK-32083: -- User 'manuzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/28954 > Unnecessary tasks are launched when input is empty with AQE > --- > > Key: SPARK-32083 > URL: https://issues.apache.org/jira/browse/SPARK-32083 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Minor > > [https://github.com/apache/spark/pull/28226] meant to avoid launching > unnecessary tasks for 0-size partitions when AQE is enabled. However, when > all partitions are empty, the number of partitions will be > `spark.sql.adaptive.coalescePartitions.initialPartitionNum` and (a lot of) > unnecessary tasks are launched in this case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
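Until the linked PR lands, one mitigation consistent with the description above is to lower the initial partition number so that an all-empty input cannot fan out into that many tasks (a sketch; the value 8 is illustrative, not a recommendation):

```scala
// With AQE enabled, partition coalescing starts from initialPartitionNum;
// if every input partition is empty, that many tasks can still be launched,
// so capping the number limits the wasted scheduling.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "8")
```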
[jira] [Commented] (SPARK-32083) Unnecessary tasks are launched when input is empty with AQE
[ https://issues.apache.org/jira/browse/SPARK-32083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148584#comment-17148584 ] Apache Spark commented on SPARK-32083: -- User 'manuzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/28954 > Unnecessary tasks are launched when input is empty with AQE > --- > > Key: SPARK-32083 > URL: https://issues.apache.org/jira/browse/SPARK-32083 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Minor > > [https://github.com/apache/spark/pull/28226] meant to avoid launching > unnecessary tasks for 0-size partitions when AQE is enabled. However, when > all partitions are empty, the number of partitions will be > `spark.sql.adaptive.coalescePartitions.initialPartitionNum` and (a lot of) > unnecessary tasks are launched in this case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32137) AttributeError: Can only use .dt accessor with datetimelike values
[ https://issues.apache.org/jira/browse/SPARK-32137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lacalle Castillo updated SPARK-32137: --- Priority: Critical (was: Major) > AttributeError: Can only use .dt accessor with datetimelike values > -- > > Key: SPARK-32137 > URL: https://issues.apache.org/jira/browse/SPARK-32137 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.5 >Reporter: David Lacalle Castillo >Priority: Critical > > I was using a pandas udf with a dataframe containing a date object. I was > using the lastversion of pyarrow, 0.17.0. > I setup this variable on zeppelin spark interpreter: > ARROW_PRE_0_15_IPC_FORMAT=1 > > However, I was getting the following error: > Job aborted due to stage failure: Task 0 in stage 19.0 failed 4 times, most > recent failure: Lost task 0.3 in stage 19.0 (TID 1619, 10.20.0.5, executor > 1): org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main > process() > File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in > process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 290, > in dump_stream > for series in iterator: > File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 311, > in load_stream > yield [self.arrow_to_pandas(c) for c in > pa.Table.from_batches([batch]).itercolumns()] > File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 311, > in > yield [self.arrow_to_pandas(c) for c in > pa.Table.from_batches([batch]).itercolumns()] > File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 278, > in arrow_to_pandas > s = _check_series_convert_date(s, from_arrow_type(arrow_column.type)) > File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1692, in > _check_series_convert_date > return 
series.dt.date > File "/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py", line > 5270, in getattr > return object.getattribute(self, name) > File "/usr/local/lib/python3.7/dist-packages/pandas/core/accessor.py", line > 187, in get > accessor_obj = self._accessor(obj) > File > "/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/accessors.py", > line 338, in new > raise AttributeError("Can only use .dt accessor with datetimelike values") > AttributeError: Can only use .dt accessor with datetimelike values > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at >
[jira] [Created] (SPARK-32137) AttributeError: Can only use .dt accessor with datetimelike values
David Lacalle Castillo created SPARK-32137: -- Summary: AttributeError: Can only use .dt accessor with datetimelike values Key: SPARK-32137 URL: https://issues.apache.org/jira/browse/SPARK-32137 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.4.5 Reporter: David Lacalle Castillo I was using a pandas udf with a dataframe containing a date object. I was using the lastversion of pyarrow, 0.17.0. I setup this variable on zeppelin spark interpreter: ARROW_PRE_0_15_IPC_FORMAT=1 However, I was getting the following error: Job aborted due to stage failure: Task 0 in stage 19.0 failed 4 times, most recent failure: Lost task 0.3 in stage 19.0 (TID 1619, 10.20.0.5, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main process() File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 290, in dump_stream for series in iterator: File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 311, in load_stream yield [self.arrow_to_pandas(c) for c in pa.Table.from_batches([batch]).itercolumns()] File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 311, in yield [self.arrow_to_pandas(c) for c in pa.Table.from_batches([batch]).itercolumns()] File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 278, in arrow_to_pandas s = _check_series_convert_date(s, from_arrow_type(arrow_column.type)) File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1692, in _check_series_convert_date return series.dt.date File "/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py", line 5270, in getattr return object.getattribute(self, name) File "/usr/local/lib/python3.7/dist-packages/pandas/core/accessor.py", line 187, in get accessor_obj = 
self._accessor(obj) File "/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/accessors.py", line 338, in new raise AttributeError("Can only use .dt accessor with datetimelike values") AttributeError: Can only use .dt accessor with datetimelike values at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172) at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
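The failure in the traceback above can be reproduced without Spark at all: it is pandas' `.dt` accessor rejecting an object-dtype series, which is what an Arrow-deserialized date column can come back as. A minimal pandas-only sketch (the column values are illustrative):

```python
# The .dt accessor only works on datetimelike (e.g. datetime64) series; on an
# object-dtype series it raises the AttributeError seen in the stack trace.
import pandas as pd

s_obj = pd.Series(["2020-06-29", "2020-06-30"])  # object dtype, like a raw date column
try:
    s_obj.dt.date
    failed = False
except AttributeError:
    failed = True  # "Can only use .dt accessor with datetimelike values"

s_dt = pd.to_datetime(s_obj)  # convert to datetime64 first, then .dt works
dates = s_dt.dt.date
```

This is why the error surfaces inside `_check_series_convert_date` in pyspark's serializers: the series handed to it is not datetimelike.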
[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148570#comment-17148570 ] Jungtaek Lim commented on SPARK-32130: -- Looks like we already saw the difference, but failed to treat it as a performance regression. Please look at the table in the PR description of https://github.com/apache/spark/pull/23653:
- inferTimestamp=default (true) & prefersDecimal=default (false) took 6.1 minutes
- inferTimestamp=false & prefersDecimal=default (false) took 49 seconds
The difference came solely from inferTimestamp, and it was noticeable. > Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4 > -- > > Key: SPARK-32130 > URL: https://issues.apache.org/jira/browse/SPARK-32130 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.0 > Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, > sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using > 10.0.0.8 instead (on interface en0) > 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > Spark context Web UI available at http://10.0.0.8:4041 > Spark context available as 'sc' (master = local[*], app id = > local-1593442346864). > Spark session available as 'spark'. 
> Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_251) > Type in expressions to have them evaluated. > Type :help for more information. >Reporter: Sanjeev Mishra >Priority: Critical > Attachments: SPARK 32130 - replication and findings.ipynb, > small-anon.tar > > > We are planning to move to Spark 3 but the read performance of our json files > is unacceptable. Following is the performance numbers when compared to Spark > 2.4 > > Spark 2.4 > scala> spark.time(spark.read.json("/data/20200528")) > Time taken: {color:#ff}19691 ms{color} > res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res61.count()) > Time taken: {color:#ff}7113 ms{color} > res64: Long = 2605349 > Spark 3.0 > scala> spark.time(spark.read.json("/data/20200528")) > 20/06/29 08:06:53 WARN package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > Time taken: {color:#ff}849652 ms{color} > res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res0.count()) > Time taken: {color:#ff}8201 ms{color} > res2: Long = 2605349 > > > I am attaching a sample data (please delete is once you are able to > reproduce the issue) that is much smaller than the actual size but the > performance comparison can still be verified. 
> The sample tar contains bunch of json.gz files, each line of the file is self > contained json doc as shown below > > {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS > HNC IGD","Annex F > Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports > Arris FastPath Speed > Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS > HNC IGD > EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service > Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First >
[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148559#comment-17148559 ] Jungtaek Lim commented on SPARK-32130: -- So anyone can reproduce this by running spark-shell on Spark 3.0.0, running the first query, shutting down, relaunching spark-shell, and running the second query. {code} spark.time(spark.read.option("inferTimestamp", "false").json("datadir").count()) {code} {code} spark.time(spark.read.option("inferTimestamp", "true").json("datadir").count()) {code} {quote} it might be of use to consult the entire community by writing to the user group if we are changing the default behaviour. Because it might affect a lot of production code. {quote} The default behavior was changed in Spark 3.0.0 - I agree it's a bad user experience to flip it back and forth, but if it silently degrades performance, IMHO the option should ideally be turned off by default so that end users turn it on by their own intention.
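The two spark-shell runs above can also be timed in one session; a rough sketch (the option name `inferTimestamp` is from the comments above, "datadir" is a placeholder path, and a fresh JVM per run would give cleaner numbers):

```scala
// Hedged sketch for spark-shell: time a JSON count with and without
// timestamp inference. `spark` is the SparkSession spark-shell provides.
def timed[T](label: String)(f: => T): T = {
  val start = System.nanoTime()
  val result = f
  println(s"$label: ${(System.nanoTime() - start) / 1000000} ms")
  result
}

val withInference = timed("inferTimestamp=true") {
  spark.read.option("inferTimestamp", "true").json("datadir").count()
}
val withoutInference = timed("inferTimestamp=false") {
  spark.read.option("inferTimestamp", "false").json("datadir").count()
}
// The option only affects schema inference, not the row count.
assert(withInference == withoutInference)
```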
[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148546#comment-17148546 ] Jungtaek Lim commented on SPARK-32130: -- For me it's reproduced consistently. Please make sure you run the query in spark-shell with the default settings of Spark 2.4.5 vs Spark 3.0.0. To remove any variance, don't add the "multiLine" option. spark.time(spark.read.json("datadir").count()) Spark 2.4.5: 3215 ms Spark 3.0.0: 13317 ms Adding `.option("inferTimestamp", "false")` before .json brings the elapsed time closer - 4716 ms - but a difference remains.
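Since the comparison points at schema inference as the dominant cost, supplying an explicit schema sidesteps inference entirely; a hedged sketch (field names are taken from the issue description's output, the real data has more fields):

```scala
import org.apache.spark.sql.types._

// Hedged sketch: pass an explicit schema so spark.read.json does not
// have to scan the files to infer one. Only two of the reported
// columns are shown here for illustration.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("created", LongType)
))

// With a schema supplied, the read itself returns almost immediately;
// inferTimestamp never comes into play because nothing is inferred.
val df = spark.read.schema(schema).json("datadir")
df.count()
```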
[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148501#comment-17148501 ] JinxinTang commented on SPARK-32130: [~gourav.sengupta] Nice notebook. It seems the row count() is 33447 instead of 11; a ~3s difference could be explained by JVM warm-up if the record count were 11, but not with 33447 records.
[jira] [Comment Edited] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148459#comment-17148459 ] Gourav edited comment on SPARK-32130 at 6/30/20, 9:02 AM: -- [~JinxinTang] and [~lotus2you] I think that it is only the first time that the time taken by Spark 3.x is higher; in subsequent runs it is lower, as the schema has already been determined. Setting inferTimestamp to false does not improve the performance either. [^SPARK 32130 - replication and findings.ipynb] Notebook with details and full reproducibility is attached. [~kabhwan] it might be of use to consult the entire community by writing to the user group if we are changing the default behaviour, because it might affect a lot of production code. was (Author: gourav.sengupta): [~JinxinTang] and [~lotus2you] I think that it is only the first time that the time taken by SPARK 3.x is more, and in the subsequent times, its less, as the schema would have already been determined. Setting the inferTimestamp to be False does not improve the performance as well.[^SPARK 32130 - replication and findings.ipynb] Note book with details are attached. [~kabhwan] it might be of use to consult the entire community by writing to the user group if we are changing the default behaviour. Because it might affect a lot of production code.
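If the cost really is one-off schema inference, a common mitigation is to infer the schema once on a sample and reuse it for later reads; a hedged sketch ("datadir" is a placeholder, and `samplingRatio` controls what fraction of the input JSON is scanned during inference):

```scala
// Hedged sketch: pay the inference cost once, on a sample, then reuse
// the resulting schema so subsequent reads skip inference entirely.
val inferred = spark.read
  .option("samplingRatio", "0.01")  // infer from ~1% of the input
  .json("datadir")
  .schema

// Later reads supply the cached schema and return without re-scanning.
val df = spark.read.schema(inferred).json("datadir")
```

Note this trades inference accuracy for speed: fields absent from the sample will be missing from the schema.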
[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148459#comment-17148459 ] Gourav commented on SPARK-32130: [~JinxinTang] and [~lotus2you] I think that it is only the first time that the time taken by Spark 3.x is higher; in subsequent runs it is lower, as the schema has already been determined. Setting inferTimestamp to false does not improve the performance either. [^SPARK 32130 - replication and findings.ipynb] Notebook with details is attached. [~kabhwan] it might be of use to consult the entire community by writing to the user group if we are changing the default behaviour, because it might affect a lot of production code.
[jira] [Updated] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gourav updated SPARK-32130: --- Attachment: SPARK 32130 - replication and findings.ipynb
[jira] [Updated] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gourav updated SPARK-32130: --- Attachment: (was: SPARK 32130 - replication and findings.ipynb)
[jira] [Updated] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gourav updated SPARK-32130: --- Attachment: SPARK 32130 - replication and findings.ipynb
> Following are the performance numbers when compared to Spark 2.4:
>
> Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
> Time taken: 19691 ms
> res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
> scala> spark.time(res61.count())
> Time taken: 7113 ms
> res64: Long = 2605349
>
> Spark 3.0
> scala> spark.time(spark.read.json("/data/20200528"))
> 20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> Time taken: 849652 ms
> res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
> scala> spark.time(res0.count())
> Time taken: 8201 ms
> res2: Long = 2605349
>
> I am attaching sample data (please delete it once you are able to reproduce the issue) that is much smaller than the actual size, but the performance comparison can still be verified.
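[Editor's note: the shape of the numbers above — schema inference roughly 40x slower while count() is essentially unchanged — is consistent with extra per-value work during inference. Spark 3.0's JSON reader added an `inferTimestamp` option that makes schema inference attempt to parse string values as timestamps, which 2.4 did not do; whether that is the cause here is an assumption to verify against your build. A plain-Python sketch of that cost model, not Spark code:]

```python
# Illustrative sketch (assumption, not Spark source): schema inference that
# additionally tries a timestamp parse on every string value pays for each
# failed attempt. Hash-like IDs such as those in the sample data never parse
# as timestamps, so the attempt is pure overhead for this dataset.
from datetime import datetime

def infer_type(value, try_timestamp):
    """Classify a JSON string value, optionally attempting a timestamp parse."""
    if try_timestamp:
        try:
            datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
            return "timestamp"
        except ValueError:
            pass  # not a timestamp; fall through to plain string
    return "string"

print(infer_type("954e7819e91a11e981f60050569979b6", True))   # string
print(infer_type("2019-10-07T14:33:19", True))                # timestamp
```

[If this is indeed the mechanism, `spark.read.option("inferTimestamp", "false").json(...)` would be the thing to time against the numbers above.]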
> The sample tar contains a bunch of json.gz files; each line of a file is a self-contained json doc, as shown below:
>
> {quote}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First
[jira] [Updated] (SPARK-32136) Spark producing incorrect groupBy results when key is a struct with nullable properties
[ https://issues.apache.org/jira/browse/SPARK-32136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Moore updated SPARK-32136:
--------------------------------
    Description:
I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm noticing a behaviour change in a particular aggregation we're doing, and I think I've tracked it down to how Spark is now treating nullable properties within the column being grouped by.

Here's a simple test I've been able to set up to repro it:

{code:scala}
case class B(c: Option[Double])
case class A(b: Option[B])

val df = Seq(
  A(None),
  A(Some(B(None))),
  A(Some(B(Some(1.0))))
).toDF

val res = df.groupBy("b").agg(count("*"))
{code}

Spark 2.4.6 has the expected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       1|
| null|       1|
|[1.0]|       1|
+-----+--------+

> res.collect.foreach(println)
[[null],1]
[null,1]
[[1.0],1]
{noformat}

But Spark 3.0.0 has an unexpected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       2|
|[1.0]|       1|
+-----+--------+

> res.collect.foreach(println)
[[null],2]
[[1.0],1]
{noformat}

Notice how it has keyed one of the values in `b` as `[null]`; that is, an instance of B with a null value for the `c` property instead of a null for the overall value itself. Is this an intended change?

  was:
I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm noticing a behaviour change in a particular aggregation we're doing, and I think I've tracked it down to how Spark is now treating nullable properties within the column being grouped by.

Here's a simple test I've been able to set up to repro it:

{code:scala}
case class B(c: Option[Double])
case class A(b: Option[B])

val df = Seq(
  A(None),
  A(Some(B(None))),
  A(Some(B(Some(1.0))))
).toDF

val res = df.groupBy("b").agg(count("*"))
{code}

Spark 2.4.6 has the expected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       1|
| null|       1|
|[1.0]|       1|
+-----+--------+

> res.collect.foreach(println)
[[null],1]
[null,1]
[[1.0],1]
{noformat}

But Spark 3.0.0 has an unexpected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       2|
|[1.0]|       1|
+-----+--------+

> res.collect.foreach(println)
[[null],2]
[[1.0],1]
{noformat}

Notice how it has keyed one of the values in `b` as `[null]`; that is, an instance of B with a null value for the `c` property instead of a null for the overall value itself. Is this an intended change?


> Spark producing incorrect groupBy results when key is a struct with nullable properties
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-32136
>                 URL: https://issues.apache.org/jira/browse/SPARK-32136
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Jason Moore
>            Priority: Major
>
> I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm noticing a behaviour change in a particular aggregation we're doing, and I think I've tracked it down to how Spark is now treating nullable properties within the column being grouped by.
>
> Here's a simple test I've been able to set up to repro it:
>
> {code:scala}
> case class B(c: Option[Double])
> case class A(b: Option[B])
>
> val df = Seq(
>   A(None),
>   A(Some(B(None))),
>   A(Some(B(Some(1.0))))
> ).toDF
>
> val res = df.groupBy("b").agg(count("*"))
> {code}
>
> Spark 2.4.6 has the expected result:
>
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       1|
> | null|       1|
> |[1.0]|       1|
> +-----+--------+
>
> > res.collect.foreach(println)
> [[null],1]
> [null,1]
> [[1.0],1]
> {noformat}
>
> But Spark 3.0.0 has an unexpected result:
>
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       2|
> |[1.0]|       1|
> +-----+--------+
>
> > res.collect.foreach(println)
> [[null],2]
> [[1.0],1]
> {noformat}
>
> Notice how it has keyed one of the values in `b` as `[null]`; that is, an instance of B with a null value for the `c` property instead of a null for the overall value itself.
> Is this an intended change?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
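[Editor's note: to make the distinction in the report concrete, here is a plain-Python model of the grouping semantics at issue (an illustration only, not Spark code). Rows model A(b), where b is either null or a struct B with a nullable field c, encoded here as None or a one-element tuple:]

```python
# Plain-Python stand-in for groupBy("b").agg(count("*")):
# a null struct (None) and a struct whose field is null ((None,)) are
# distinct hashable keys, so they must land in different groups.
from collections import Counter

rows = [
    None,      # A(None)              -> b itself is null
    (None,),   # A(Some(B(None)))     -> b is a struct whose field c is null
    (1.0,),    # A(Some(B(Some(1.0))))
]

counts = Counter(rows)

# The Spark 2.4.6 behaviour the report calls expected: three groups of one.
print(len(counts))      # 3
print(counts[None])     # 1
print(counts[(None,)])  # 1
```

[The Spark 3.0.0 output above corresponds to the first two keys being merged into one group of two, which is what the reporter is flagging as incorrect.]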
[jira] [Created] (SPARK-32136) Spark producing incorrect groupBy results when key is a struct with nullable properties
Jason Moore created SPARK-32136:
------------------------------------

             Summary: Spark producing incorrect groupBy results when key is a struct with nullable properties
                 Key: SPARK-32136
                 URL: https://issues.apache.org/jira/browse/SPARK-32136
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Jason Moore


I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm noticing a behaviour change in a particular aggregation we're doing, and I think I've tracked it down to how Spark is now treating nullable properties within the column being grouped by.

Here's a simple test I've been able to set up to repro it:

{code:scala}
case class B(c: Option[Double])
case class A(b: Option[B])

val df = Seq(
  A(None),
  A(Some(B(None))),
  A(Some(B(Some(1.0))))
).toDF

val res = df.groupBy("b").agg(count("*"))
{code}

Spark 2.4.6 has the expected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       1|
| null|       1|
|[1.0]|       1|
+-----+--------+

> res.collect.foreach(println)
[[null],1]
[null,1]
[[1.0],1]
{noformat}

But Spark 3.0.0 has an unexpected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       2|
|[1.0]|       1|
+-----+--------+

> res.collect.foreach(println)
[[null],2]
[[1.0],1]
{noformat}

Notice how it has keyed one of the values in `b` as `[null]`; that is, an instance of B with a null value for the `c` property instead of a null for the overall value itself. Is this an intended change?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org