[jira] [Updated] (SPARK-31935) Hadoop file system config should be effective in data source options

2020-06-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31935:

Fix Version/s: 2.4.7

> Hadoop file system config should be effective in data source options 
> -
>
> Key: SPARK-31935
> URL: https://issues.apache.org/jira/browse/SPARK-31935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> Data source options should be propagated into the Hadoop configuration used by 
> the method `checkAndGlobPathIfNecessary`.
> From org.apache.hadoop.fs.FileSystem.java:
> {code:java}
>   public static FileSystem get(URI uri, Configuration conf) throws IOException {
>     String scheme = uri.getScheme();
>     String authority = uri.getAuthority();
> 
>     if (scheme == null && authority == null) {     // use default FS
>       return get(conf);
>     }
> 
>     if (scheme != null && authority == null) {     // no authority
>       URI defaultUri = getDefaultUri(conf);
>       if (scheme.equals(defaultUri.getScheme())    // if scheme matches default
>           && defaultUri.getAuthority() != null) {  // & default has authority
>         return get(defaultUri, conf);              // return default
>       }
>     }
> 
>     String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
>     if (conf.getBoolean(disableCacheName, false)) {
>       return createFileSystem(uri, conf);
>     }
> 
>     return CACHE.get(uri, conf);
>   }
> {code}
> With this, we can specify URI scheme- and authority-related configurations for 
> the file systems being scanned.
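> A minimal usage sketch (not part of the original report; the option keys and the 
> `accessKey`/`secretKey` values are illustrative hadoop-aws settings): once the 
> options reach the Hadoop configuration used by `checkAndGlobPathIfNecessary`, 
> per-read file system settings can be supplied without touching the global 
> configuration.
> {code:scala}
> // Hedged sketch: the keys below are ordinary Hadoop settings, shown only to
> // illustrate scheme/authority-related configuration flowing through
> // DataFrameReader options into FileSystem.get(uri, conf).
> val df = spark.read
>   .option("fs.s3a.access.key", accessKey)   // hypothetical credential values
>   .option("fs.s3a.secret.key", secretKey)
>   .parquet("s3a://some-bucket/some/path")
> {code}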






[jira] [Created] (SPARK-32144) Retain EXTERNAL property in hive table properties

2020-06-30 Thread ulysses you (Jira)
ulysses you created SPARK-32144:
---

 Summary: Retain EXTERNAL property in hive table properties
 Key: SPARK-32144
 URL: https://issues.apache.org/jira/browse/SPARK-32144
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: ulysses you









[jira] [Resolved] (SPARK-32142) Keep the original tests and codes to avoid potential conflicts in dev in ParquetFilterSuite and ParquetIOSuite

2020-06-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32142.
--
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 28955
[https://github.com/apache/spark/pull/28955]

> Keep the original tests and codes to avoid potential conflicts in dev in 
> ParquetFilterSuite and ParquetIOSuite
> --
>
> Key: SPARK-32142
> URL: https://issues.apache.org/jira/browse/SPARK-32142
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> There is some unnecessary diff at 
> https://github.com/apache/spark/pull/27728#discussion_r397655390, which could 
> cause confusion. It will create more diff when other people align their code 
> with it.
> It would be best to address the comment and minimize the diff against 
> other branches.






[jira] [Assigned] (SPARK-32142) Keep the original tests and codes to avoid potential conflicts in dev in ParquetFilterSuite and ParquetIOSuite

2020-06-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32142:


Assignee: Hyukjin Kwon

> Keep the original tests and codes to avoid potential conflicts in dev in 
> ParquetFilterSuite and ParquetIOSuite
> --
>
> Key: SPARK-32142
> URL: https://issues.apache.org/jira/browse/SPARK-32142
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> There is some unnecessary diff at 
> https://github.com/apache/spark/pull/27728#discussion_r397655390, which could 
> cause confusion. It will create more diff when other people align their code 
> with it.
> It would be best to address the comment and minimize the diff against 
> other branches.






[jira] [Updated] (SPARK-32131) Fix AnalysisException messages at UNION/EXCEPT/MINUS operations

2020-06-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32131:
--
Summary: Fix AnalysisException messages at UNION/EXCEPT/MINUS operations  
(was: union and set operations have wrong exception infomation)

> Fix AnalysisException messages at UNION/EXCEPT/MINUS operations
> ---
>
> Key: SPARK-32131
> URL: https://issues.apache.org/jira/browse/SPARK-32131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: philipse
>Priority: Minor
>
> Union and set operations can only be performed on tables with compatible 
> column types, but when there are more than two columns, the error message 
> reports the wrong column index. Steps to reproduce:
> Step 1: prepare test data
> {code:java}
> drop table if exists test1; 
> drop table if exists test2; 
> drop table if exists test3;
> create table if not exists test1(id int, age int, name timestamp);
> create table if not exists test2(id int, age timestamp, name timestamp);
> create table if not exists test3(id int, age int, name int);
> insert into test1 select 1,2,'2020-01-01 01:01:01';
> insert into test2 select 1,'2020-01-01 01:01:01','2020-01-01 01:01:01'; 
> insert into test3 select 1,3,4;
> {code}
> Step 2: run the queries:
> {code:java}
> Query1:
> select * from test1 except select * from test2;
> Result1:
> Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. timestamp <> int at the second 
> column of the second table;; 'Except false :- Project [id#620, age#621, 
> name#622] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation 
> `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [id#620, age#621, name#622] +- Project [id#623, age#624, name#625] +- 
> SubqueryAlias `default`.`test2` +- HiveTableRelation `default`.`test2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#623, age#624, 
> name#625] (state=,code=0)
> Query2:
> select * from test1 except select * from test3;
> Result2:
> Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the 2th 
> column of the second table;; 'Except false :- Project [id#632, age#633, 
> name#634] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation 
> `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [id#632, age#633, name#634] +- Project [id#635, age#636, name#637] +- 
> SubqueryAlias `default`.`test3` +- HiveTableRelation `default`.`test3`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#635, age#636, 
> name#637] (state=,code=0)
> {code}
> The message for query 1 is correct, but query 2 reports the wrong column: it 
> should be the third column.
> The column index here is wrong:
> +Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the *2th* 
> column of the second table+
> We may need to change it to the following:
> +Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the *third* 
> column of the second table+
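> A minimal sketch of the kind of helper that would produce the suggested wording 
> (illustrative only, not the actual Spark code): map a 1-based column index to an 
> English ordinal before interpolating it into the AnalysisException message.
> {code:scala}
> // Hedged sketch: "first", "second", "third" for the common cases, falling back
> // to the usual "4th"/"21st" style suffixes for larger indexes.
> def ordinal(i: Int): String = i match {
>   case 1 => "first"
>   case 2 => "second"
>   case 3 => "third"
>   case _ =>
>     val suffix =
>       if (i % 100 / 10 == 1) "th"
>       else i % 10 match {
>         case 1 => "st"
>         case 2 => "nd"
>         case 3 => "rd"
>         case _ => "th"
>       }
>     s"$i$suffix"
> }
> // e.g. s"... at the ${ordinal(3)} column of the second table" yields "third"
> {code}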






[jira] [Commented] (SPARK-32131) union and set operations have wrong exception infomation

2020-06-30 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149108#comment-17149108
 ] 

Dongjoon Hyun commented on SPARK-32131:
---

I also verified that this bug exists at 2.1.3 ~ 2.3.7 and updated the affected 
versions.

> union and set operations have wrong exception infomation
> 
>
> Key: SPARK-32131
> URL: https://issues.apache.org/jira/browse/SPARK-32131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: philipse
>Priority: Minor
>
> Union and set operations can only be performed on tables with compatible 
> column types, but when there are more than two columns, the error message 
> reports the wrong column index. Steps to reproduce:
> Step 1: prepare test data
> {code:java}
> drop table if exists test1; 
> drop table if exists test2; 
> drop table if exists test3;
> create table if not exists test1(id int, age int, name timestamp);
> create table if not exists test2(id int, age timestamp, name timestamp);
> create table if not exists test3(id int, age int, name int);
> insert into test1 select 1,2,'2020-01-01 01:01:01';
> insert into test2 select 1,'2020-01-01 01:01:01','2020-01-01 01:01:01'; 
> insert into test3 select 1,3,4;
> {code}
> Step 2: run the queries:
> {code:java}
> Query1:
> select * from test1 except select * from test2;
> Result1:
> Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. timestamp <> int at the second 
> column of the second table;; 'Except false :- Project [id#620, age#621, 
> name#622] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation 
> `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [id#620, age#621, name#622] +- Project [id#623, age#624, name#625] +- 
> SubqueryAlias `default`.`test2` +- HiveTableRelation `default`.`test2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#623, age#624, 
> name#625] (state=,code=0)
> Query2:
> select * from test1 except select * from test3;
> Result2:
> Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the 2th 
> column of the second table;; 'Except false :- Project [id#632, age#633, 
> name#634] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation 
> `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [id#632, age#633, name#634] +- Project [id#635, age#636, name#637] +- 
> SubqueryAlias `default`.`test3` +- HiveTableRelation `default`.`test3`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#635, age#636, 
> name#637] (state=,code=0)
> {code}
> The message for query 1 is correct, but query 2 reports the wrong column: it 
> should be the third column.
> The column index here is wrong:
> +Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the *2th* 
> column of the second table+
> We may need to change it to the following:
> +Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the *third* 
> column of the second table+






[jira] [Updated] (SPARK-32131) union and set operations have wrong exception infomation

2020-06-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32131:
--
Affects Version/s: 2.1.3

> union and set operations have wrong exception infomation
> 
>
> Key: SPARK-32131
> URL: https://issues.apache.org/jira/browse/SPARK-32131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: philipse
>Priority: Minor
>
> Union and set operations can only be performed on tables with compatible 
> column types, but when there are more than two columns, the error message 
> reports the wrong column index. Steps to reproduce:
> Step 1: prepare test data
> {code:java}
> drop table if exists test1; 
> drop table if exists test2; 
> drop table if exists test3;
> create table if not exists test1(id int, age int, name timestamp);
> create table if not exists test2(id int, age timestamp, name timestamp);
> create table if not exists test3(id int, age int, name int);
> insert into test1 select 1,2,'2020-01-01 01:01:01';
> insert into test2 select 1,'2020-01-01 01:01:01','2020-01-01 01:01:01'; 
> insert into test3 select 1,3,4;
> {code}
> Step 2: run the queries:
> {code:java}
> Query1:
> select * from test1 except select * from test2;
> Result1:
> Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. timestamp <> int at the second 
> column of the second table;; 'Except false :- Project [id#620, age#621, 
> name#622] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation 
> `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [id#620, age#621, name#622] +- Project [id#623, age#624, name#625] +- 
> SubqueryAlias `default`.`test2` +- HiveTableRelation `default`.`test2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#623, age#624, 
> name#625] (state=,code=0)
> Query2:
> select * from test1 except select * from test3;
> Result2:
> Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the 2th 
> column of the second table;; 'Except false :- Project [id#632, age#633, 
> name#634] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation 
> `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [id#632, age#633, name#634] +- Project [id#635, age#636, name#637] +- 
> SubqueryAlias `default`.`test3` +- HiveTableRelation `default`.`test3`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#635, age#636, 
> name#637] (state=,code=0)
> {code}
> The message for query 1 is correct, but query 2 reports the wrong column: it 
> should be the third column.
> The column index here is wrong:
> +Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the *2th* 
> column of the second table+
> We may need to change it to the following:
> +Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the *third* 
> column of the second table+






[jira] [Updated] (SPARK-32131) union and set operations have wrong exception infomation

2020-06-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32131:
--
Affects Version/s: 2.2.3

> union and set operations have wrong exception infomation
> 
>
> Key: SPARK-32131
> URL: https://issues.apache.org/jira/browse/SPARK-32131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: philipse
>Priority: Minor
>
> Union and set operations can only be performed on tables with compatible 
> column types, but when there are more than two columns, the error message 
> reports the wrong column index. Steps to reproduce:
> Step 1: prepare test data
> {code:java}
> drop table if exists test1; 
> drop table if exists test2; 
> drop table if exists test3;
> create table if not exists test1(id int, age int, name timestamp);
> create table if not exists test2(id int, age timestamp, name timestamp);
> create table if not exists test3(id int, age int, name int);
> insert into test1 select 1,2,'2020-01-01 01:01:01';
> insert into test2 select 1,'2020-01-01 01:01:01','2020-01-01 01:01:01'; 
> insert into test3 select 1,3,4;
> {code}
> Step 2: run the queries:
> {code:java}
> Query1:
> select * from test1 except select * from test2;
> Result1:
> Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. timestamp <> int at the second 
> column of the second table;; 'Except false :- Project [id#620, age#621, 
> name#622] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation 
> `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [id#620, age#621, name#622] +- Project [id#623, age#624, name#625] +- 
> SubqueryAlias `default`.`test2` +- HiveTableRelation `default`.`test2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#623, age#624, 
> name#625] (state=,code=0)
> Query2:
> select * from test1 except select * from test3;
> Result2:
> Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the 2th 
> column of the second table;; 'Except false :- Project [id#632, age#633, 
> name#634] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation 
> `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [id#632, age#633, name#634] +- Project [id#635, age#636, name#637] +- 
> SubqueryAlias `default`.`test3` +- HiveTableRelation `default`.`test3`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#635, age#636, 
> name#637] (state=,code=0)
> {code}
> The message for query 1 is correct, but query 2 reports the wrong column: it 
> should be the third column.
> The column index here is wrong:
> +Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the *2th* 
> column of the second table+
> We may need to change it to the following:
> +Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the *third* 
> column of the second table+






[jira] [Updated] (SPARK-32131) union and set operations have wrong exception infomation

2020-06-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32131:
--
Affects Version/s: 2.3.4

> union and set operations have wrong exception infomation
> 
>
> Key: SPARK-32131
> URL: https://issues.apache.org/jira/browse/SPARK-32131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.6, 3.0.0
>Reporter: philipse
>Priority: Minor
>
> Union and set operations can only be performed on tables with compatible 
> column types, but when there are more than two columns, the error message 
> reports the wrong column index. Steps to reproduce:
> Step 1: prepare test data
> {code:java}
> drop table if exists test1; 
> drop table if exists test2; 
> drop table if exists test3;
> create table if not exists test1(id int, age int, name timestamp);
> create table if not exists test2(id int, age timestamp, name timestamp);
> create table if not exists test3(id int, age int, name int);
> insert into test1 select 1,2,'2020-01-01 01:01:01';
> insert into test2 select 1,'2020-01-01 01:01:01','2020-01-01 01:01:01'; 
> insert into test3 select 1,3,4;
> {code}
> Step 2: run the queries:
> {code:java}
> Query1:
> select * from test1 except select * from test2;
> Result1:
> Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. timestamp <> int at the second 
> column of the second table;; 'Except false :- Project [id#620, age#621, 
> name#622] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation 
> `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [id#620, age#621, name#622] +- Project [id#623, age#624, name#625] +- 
> SubqueryAlias `default`.`test2` +- HiveTableRelation `default`.`test2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#623, age#624, 
> name#625] (state=,code=0)
> Query2:
> select * from test1 except select * from test3;
> Result2:
> Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the 2th 
> column of the second table;; 'Except false :- Project [id#632, age#633, 
> name#634] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation 
> `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [id#632, age#633, name#634] +- Project [id#635, age#636, name#637] +- 
> SubqueryAlias `default`.`test3` +- HiveTableRelation `default`.`test3`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#635, age#636, 
> name#637] (state=,code=0)
> {code}
> The message for query 1 is correct, but query 2 reports the wrong column: it 
> should be the third column.
> The column index here is wrong:
> +Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the *2th* 
> column of the second table+
> We may need to change it to the following:
> +Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the *third* 
> column of the second table+






[jira] [Updated] (SPARK-32131) union and set operations have wrong exception infomation

2020-06-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32131:
--
Affects Version/s: 2.4.6

> union and set operations have wrong exception infomation
> 
>
> Key: SPARK-32131
> URL: https://issues.apache.org/jira/browse/SPARK-32131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: philipse
>Priority: Minor
>
> Union and set operations can only be performed on tables with compatible 
> column types, but when there are more than two columns, the error message 
> reports the wrong column index. Steps to reproduce:
> Step 1: prepare test data
> {code:java}
> drop table if exists test1; 
> drop table if exists test2; 
> drop table if exists test3;
> create table if not exists test1(id int, age int, name timestamp);
> create table if not exists test2(id int, age timestamp, name timestamp);
> create table if not exists test3(id int, age int, name int);
> insert into test1 select 1,2,'2020-01-01 01:01:01';
> insert into test2 select 1,'2020-01-01 01:01:01','2020-01-01 01:01:01'; 
> insert into test3 select 1,3,4;
> {code}
> Step 2: run the queries:
> {code:java}
> Query1:
> select * from test1 except select * from test2;
> Result1:
> Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. timestamp <> int at the second 
> column of the second table;; 'Except false :- Project [id#620, age#621, 
> name#622] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation 
> `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [id#620, age#621, name#622] +- Project [id#623, age#624, name#625] +- 
> SubqueryAlias `default`.`test2` +- HiveTableRelation `default`.`test2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#623, age#624, 
> name#625] (state=,code=0)
> Query2:
> select * from test1 except select * from test3;
> Result2:
> Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the 2th 
> column of the second table;; 'Except false :- Project [id#632, age#633, 
> name#634] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation 
> `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [id#632, age#633, name#634] +- Project [id#635, age#636, name#637] +- 
> SubqueryAlias `default`.`test3` +- HiveTableRelation `default`.`test3`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#635, age#636, 
> name#637] (state=,code=0)
> {code}
> The message for query 1 is correct, but query 2 reports the wrong column: it 
> should be the third column.
> The column index here is wrong:
> +Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the *2th* 
> column of the second table+
> We may need to change it to the following:
> +Error: org.apache.spark.sql.AnalysisException: Except can only be performed 
> on tables with the compatible column types. int <> timestamp at the *third* 
> column of the second table+






[jira] [Commented] (SPARK-32136) Spark producing incorrect groupBy results when key is a struct with nullable properties

2020-06-30 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149104#comment-17149104
 ] 

L. C. Hsieh commented on SPARK-32136:
-

Thanks for the ping. Will look at this.

> Spark producing incorrect groupBy results when key is a struct with nullable 
> properties
> ---
>
> Key: SPARK-32136
> URL: https://issues.apache.org/jira/browse/SPARK-32136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Moore
>Priority: Major
>
> I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm 
> noticing a behaviour change in a particular aggregation we're doing, and I 
> think I've tracked it down to how Spark is now treating nullable properties 
> within the column being grouped by.
>  
> Here's a simple test I've been able to set up to repro it:
>  
> {code:scala}
> case class B(c: Option[Double])
> case class A(b: Option[B])
> val df = Seq(
> A(None),
> A(Some(B(None))),
> A(Some(B(Some(1.0))))
> ).toDF
> val res = df.groupBy("b").agg(count("*"))
> {code}
> Spark 2.4.6 has the expected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       1|
> | null|       1|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],1]
> [null,1]
> [[1.0],1]
> {noformat}
> But Spark 3.0.0 has an unexpected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       2|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],2]
> [[1.0],1]
> {noformat}
> Notice how it has keyed one of the values in `b` as `[null]`; that is, as an 
> instance of B with a null value for the `c` property instead of a null for 
> the overall value itself.
> Is this an intended change?






[jira] [Commented] (SPARK-32136) Spark producing incorrect groupBy results when key is a struct with nullable properties

2020-06-30 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149103#comment-17149103
 ] 

Wenchen Fan commented on SPARK-32136:
-

cc [~viirya]

> Spark producing incorrect groupBy results when key is a struct with nullable 
> properties
> ---
>
> Key: SPARK-32136
> URL: https://issues.apache.org/jira/browse/SPARK-32136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Moore
>Priority: Major
>
> I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm 
> noticing a behaviour change in a particular aggregation we're doing, and I 
> think I've tracked it down to how Spark is now treating nullable properties 
> within the column being grouped by.
>  
> Here's a simple test I've been able to set up to repro it:
>  
> {code:scala}
> case class B(c: Option[Double])
> case class A(b: Option[B])
> val df = Seq(
> A(None),
> A(Some(B(None))),
> A(Some(B(Some(1.0))))
> ).toDF
> val res = df.groupBy("b").agg(count("*"))
> {code}
> Spark 2.4.6 has the expected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       1|
> | null|       1|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],1]
> [null,1]
> [[1.0],1]
> {noformat}
> But Spark 3.0.0 has an unexpected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       2|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],2]
> [[1.0],1]
> {noformat}
> Notice how it has keyed one of the values in `b` as `[null]`; that is, as an 
> instance of B with a null value for the `c` property instead of a null for 
> the overall value itself.
> Is this an intended change?






[jira] [Commented] (SPARK-32143) Fast fail when the AQE skew join produce too many splits

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149098#comment-17149098
 ] 

Apache Spark commented on SPARK-32143:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/28961

> Fast fail when the AQE skew join produce too many splits
> 
>
> Key: SPARK-32143
> URL: https://issues.apache.org/jira/browse/SPARK-32143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> When handling a skewed SortMergeJoin, if the matching partitions on the left 
> side and the right side are both skewed and are split into too many partitions, 
> plan generation may take a very long time.
> Even after falling back to a normal SMJ, the query cannot succeed either, so we 
> should fail this query fast.
> In the logs below we can see that it took over 1 hour to generate the plan in 
> AQE when handling a skewed join that produced too many splits.
> {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
> Thread-821384] adaptive.OptimizeSkewedJoin:54 :
>  20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, 
> split it into *39150* parts.
>  20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
> split it into *17022* parts.
>  20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, 
> split it into 17 parts.
> ...
>  20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
> {quote}
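> One possible shape of the fast-fail check (a sketch under assumed names, not the 
> actual patch): bound the number of splits the skew-join optimization may produce 
> for a single partition and abort planning early once the bound is exceeded.
> {code:scala}
> // Hedged sketch: `maxSplitsPerPartition` and the exception type are assumptions
> // made only to illustrate failing fast instead of planning for over an hour.
> val maxSplitsPerPartition = 2048
> 
> def checkSplits(side: String, partitionId: Int, numSplits: Int): Unit = {
>   if (numSplits > maxSplitsPerPartition) {
>     throw new IllegalStateException(
>       s"$side side partition $partitionId would be split into $numSplits parts, " +
>       s"exceeding the limit of $maxSplitsPerPartition; failing fast instead of " +
>       "generating an excessively large skew-join plan.")
>   }
> }
> {code}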






[jira] [Assigned] (SPARK-32143) Fast fail when the AQE skew join produce too many splits

2020-06-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32143:


Assignee: (was: Apache Spark)

> Fast fail when the AQE skew join produce too many splits
> 
>
> Key: SPARK-32143
> URL: https://issues.apache.org/jira/browse/SPARK-32143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> When handling a skewed SortMergeJoin, if the matching partitions on the left 
> side and the right side are both skewed and are split into too many partitions, 
> plan generation may take a very long time.
> Even after falling back to a normal SMJ, the query cannot succeed either, so we 
> should fail this query fast.
> In the logs below we can see that it took over 1 hour to generate the plan in 
> AQE when handling a skewed join that produced too many splits.
> {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
> Thread-821384] adaptive.OptimizeSkewedJoin:54 :
>  20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, 
> split it into *39150* parts.
>  20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
> split it into *17022* parts.
>  20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, 
> split it into 17 parts.
> ...
>  20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
> {quote}






[jira] [Commented] (SPARK-32143) Fast fail when the AQE skew join produce too many splits

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149095#comment-17149095
 ] 

Apache Spark commented on SPARK-32143:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/28961

> Fast fail when the AQE skew join produce too many splits
> 
>
> Key: SPARK-32143
> URL: https://issues.apache.org/jira/browse/SPARK-32143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> When handling a skewed SortMergeJoin, if the matching partitions on the left 
> side and the right side are both skewed and are split into too many partitions, 
> plan generation may take a very long time.
> Even after falling back to a normal SMJ, the query cannot succeed either, so we 
> should fail this query fast.
> In the logs below we can see that it took over 1 hour to generate the plan in 
> AQE when handling a skewed join that produced too many splits.
> {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
> Thread-821384] adaptive.OptimizeSkewedJoin:54 :
>  20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, 
> split it into *39150* parts.
>  20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
> split it into *17022* parts.
>  20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, 
> split it into 17 parts.
> ...
>  20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
> {quote}






[jira] [Assigned] (SPARK-32143) Fast fail when the AQE skew join produce too many splits

2020-06-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32143:


Assignee: Apache Spark

> Fast fail when the AQE skew join produce too many splits
> 
>
> Key: SPARK-32143
> URL: https://issues.apache.org/jira/browse/SPARK-32143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Major
>
> When handling a skewed SortMergeJoin, if the matching partitions on the left 
> side and the right side are both skewed and are split into too many partitions, 
> plan generation may take a very long time.
> Even after falling back to a normal SMJ, the query cannot succeed either, so we 
> should fail this query fast.
> In the logs below we can see that it took over 1 hour to generate the plan in 
> AQE when handling a skewed join that produced too many splits.
> {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
> Thread-821384] adaptive.OptimizeSkewedJoin:54 :
>  20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, 
> split it into *39150* parts.
>  20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
> split it into *17022* parts.
>  20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, 
> split it into 17 parts.
> ...
>  20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
> {quote}






[jira] [Updated] (SPARK-32143) Fast fail when the AQE skew join produce too many splits

2020-06-30 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32143:
---
Description: 
When handling a skewed SortMergeJoin, if the matching partitions on the left side 
and the right side are both skewed and are split into too many partitions, plan 
generation may take a very long time.

Even after falling back to a normal SMJ, the query cannot succeed either, so we 
should fail this query fast.

In the logs below we can see that it took over 1 hour to generate the plan in AQE 
when handling a skewed join that produced too many splits.
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
Thread-821384] adaptive.OptimizeSkewedJoin:54 :
 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split 
it into *39150* parts.
 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
split it into *17022* parts.
 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split 
it into 17 parts.

...
 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
{quote}

  was:
In handling skewed SortMergeJoin, when matching partitions from the left side 
and the right side both have skew, the plan generation may take a very long 
time.

Even fallback to normal SMJ, the query cannot success either. So we should fast 
fail this query.

In below logs we can see that it took over 1 hour to generate the plan in AQE 
when handle a skewed join which produced too many splits.
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
Thread-821384] adaptive.OptimizeSkewedJoin:54 :
 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split 
it into *39150* parts.
 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
split it into *17022* parts.
 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split 
it into 17 parts.

...
 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
{quote}


> Fast fail when the AQE skew join produce too many splits
> 
>
> Key: SPARK-32143
> URL: https://issues.apache.org/jira/browse/SPARK-32143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> When handling a skewed SortMergeJoin, if the matching partitions on the left 
> side and the right side are both skewed and are split into too many partitions, 
> plan generation may take a very long time.
> Even after falling back to a normal SMJ, the query cannot succeed either, so we 
> should fail this query fast.
> In the logs below we can see that it took over 1 hour to generate the plan in 
> AQE when handling a skewed join that produced too many splits.
> {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
> Thread-821384] adaptive.OptimizeSkewedJoin:54 :
>  20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, 
> split it into *39150* parts.
>  20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
> split it into *17022* parts.
>  20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, 
> split it into 17 parts.
> ...
>  20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
> {quote}






[jira] [Updated] (SPARK-32143) Fast fail when the AQE skew join produce too many splits

2020-06-30 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32143:
---
Description: 
When handling a skewed SortMergeJoin, if the matching partitions on the left side 
and the right side are both skewed, plan generation may take a very long time.

Even after falling back to a normal SMJ, the query cannot succeed either, so we 
should fail this query fast.

In the logs below we can see that it took over 1 hour to generate the plan in AQE 
when handling a skewed join that produced too many splits.
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
Thread-821384] adaptive.OptimizeSkewedJoin:54 :
 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split 
it into *39150* parts.
 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
split it into *17022* parts.
 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split 
it into 17 parts.

...
 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
{quote}

  was:
In handling skewed SortMergeJoin, when matching partitions from the left side 
and the right side both have skew, the the plan generation may take a very long 
time.

Even fallback to normal SMJ, the query cannot success either. So we should fast 
fail this query.


 In below logs we can see that it took over 1 hour to generate the plan in AQE 
when handle a skewed join which produced too many splits.
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
Thread-821384] adaptive.OptimizeSkewedJoin:54 :
 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split 
it into *39150* parts.
 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
split it into *17022* parts.
 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split 
it into 17 parts.

...
 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
{quote}


> Fast fail when the AQE skew join produce too many splits
> 
>
> Key: SPARK-32143
> URL: https://issues.apache.org/jira/browse/SPARK-32143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> When handling a skewed SortMergeJoin, if the matching partitions on the left 
> side and the right side are both skewed, plan generation may take a very 
> long time.
> Even after falling back to a normal SMJ, the query cannot succeed either, so we 
> should fail this query fast.
> In the logs below we can see that it took over 1 hour to generate the plan in 
> AQE when handling a skewed join that produced too many splits.
> {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
> Thread-821384] adaptive.OptimizeSkewedJoin:54 :
>  20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, 
> split it into *39150* parts.
>  20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
> split it into *17022* parts.
>  20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, 
> split it into 17 parts.
> ...
>  20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
> {quote}






[jira] [Commented] (SPARK-32143) Fast fail when the AQE skew join produce too many splits

2020-06-30 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149084#comment-17149084
 ] 

Lantao Jin commented on SPARK-32143:


A PR will be submitted soon.

> Fast fail when the AQE skew join produce too many splits
> 
>
> Key: SPARK-32143
> URL: https://issues.apache.org/jira/browse/SPARK-32143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> When handling a skewed SortMergeJoin, if the matching partitions on the left 
> side and the right side are both skewed, plan generation may take a very 
> long time.
> Even after falling back to a normal SMJ, the query cannot succeed either, so we 
> should fail this query fast.
> In the logs below we can see that it took over 1 hour to generate the plan in 
> AQE when handling a skewed join that produced too many splits.
> {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
> Thread-821384] adaptive.OptimizeSkewedJoin:54 :
>  20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, 
> split it into *39150* parts.
>  20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
> split it into *17022* parts.
>  20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, 
> split it into 17 parts.
> ...
>  20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
> {quote}






[jira] [Updated] (SPARK-32143) Fast fail when the AQE skew join produce too many splits

2020-06-30 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32143:
---
Description: 
When handling a skewed SortMergeJoin, if the matching partitions on the left side 
and the right side are both skewed, plan generation may take a very long time.

Even after falling back to a normal SMJ, the query cannot succeed either, so we 
should fail this query fast.

In the logs below we can see that it took over 1 hour to generate the plan in AQE 
when handling a skewed join that produced too many splits.
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
Thread-821384] adaptive.OptimizeSkewedJoin:54 :
 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split 
it into *39150* parts.
 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
split it into *17022* parts.
 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split 
it into 17 parts.

...
 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
{quote}

  was:
In handling skewed SortMergeJoin, when matching partitions from the left side 
and the right side both have skew, the the plan generation may take a very long 
time. Actually, this query cannot success.
 In below logs we can see that it took over 1 hour to generate the plan in AQE 
when handle a skewed join which produced too many splits.
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
Thread-821384] adaptive.OptimizeSkewedJoin:54 :
 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split 
it into *39150* parts.
 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
split it into *17022* parts.
 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split 
it into 17 parts.

...
 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
{quote}


> Fast fail when the AQE skew join produce too many splits
> 
>
> Key: SPARK-32143
> URL: https://issues.apache.org/jira/browse/SPARK-32143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> When handling a skewed SortMergeJoin, if the matching partitions on the left 
> side and the right side are both skewed, plan generation may take a very 
> long time.
> Even after falling back to a normal SMJ, the query cannot succeed either, so we 
> should fail this query fast.
> In the logs below we can see that it took over 1 hour to generate the plan in 
> AQE when handling a skewed join that produced too many splits.
> {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
> Thread-821384] adaptive.OptimizeSkewedJoin:54 :
>  20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, 
> split it into *39150* parts.
>  20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
> split it into *17022* parts.
>  20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, 
> split it into 17 parts.
> ...
>  20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
> {quote}






[jira] [Created] (SPARK-32143) Fast fail when the AQE skew join produce too many splits

2020-06-30 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32143:
--

 Summary: Fast fail when the AQE skew join produce too many splits
 Key: SPARK-32143
 URL: https://issues.apache.org/jira/browse/SPARK-32143
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Lantao Jin


When handling a skewed SortMergeJoin, if the matching partitions on the left side 
and the right side are both skewed, plan generation may take a very long time. 
Actually, this query cannot succeed.
In the logs below we can see that it took over 1 hour to generate the plan in AQE 
when handling a skewed join that produced too many splits.
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
Thread-821384] adaptive.OptimizeSkewedJoin:54 :
 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split 
it into *39150* parts.
 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
split it into *17022* parts.
 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split 
it into 17 parts.

...
 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
{quote}






[jira] [Commented] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade

2020-06-30 Thread Jason Moore (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149083#comment-17149083
 ] 

Jason Moore commented on SPARK-27780:
-

I'm encountering this on 3.0.0 running with a much older shuffle service (I think 
around 2.1.1). Is there any documentation on which shuffle service versions 3.0.0 
will work with?

> Shuffle server & client should be versioned to enable smoother upgrade
> --
>
> Key: SPARK-27780
> URL: https://issues.apache.org/jira/browse/SPARK-27780
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Imran Rashid
>Priority: Major
>
> The external shuffle service is often upgraded at a different time than spark 
> itself.  However, this causes problems when the protocol changes between the 
> shuffle service and the spark runtime -- this forces users to upgrade 
> everything simultaneously.
> We should add versioning to the shuffle client & server, so they know what 
> messages the other will support.  This would allow better handling of mixed 
> versions, from better error msgs to allowing some mismatched versions (with 
> reduced capabilities).
> This originally came up in a discussion here: 
> https://github.com/apache/spark/pull/24565#issuecomment-493496466
> There are a few ways we could do the versioning which we still need to 
> discuss:
> 1) Version specified by config.  This allows for mixed versions across the 
> cluster and rolling upgrades.  It also will let a spark 3.0 client talk to a 
> 2.4 shuffle service.  But it may be a nuisance for users to get this right.
> 2) Auto-detection during registration with local shuffle service.  This makes 
> the versioning easy for the end user, and can even handle a 2.4 shuffle 
> service though it does not support the new versioning.  However, it will not 
> handle a rolling upgrade correctly -- if the local shuffle service has been 
> upgraded, but other nodes in the cluster have not, it will get the version 
> wrong.
> 3) Exchange versions per-connection.  When a connection is opened, the server 
> & client could first exchange messages with their versions, so they know how 
> to continue communication after that.
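
For illustration only, a minimal sketch of what option 3 could look like; the message shape, version numbers, and negotiation rule are assumptions, not Spark's actual shuffle protocol:
{code:scala}
// Hypothetical handshake: each side announces the highest protocol version it
// supports, and both agree to speak the lower of the two.
case class VersionHandshake(maxSupportedVersion: Int)

def negotiate(client: VersionHandshake, server: VersionHandshake): Int =
  math.min(client.maxSupportedVersion, server.maxSupportedVersion)

// e.g. a newer client against an older shuffle service falls back to the older version
val agreed = negotiate(VersionHandshake(2), VersionHandshake(1))  // agreed == 1
{code}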



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32136) Spark producing incorrect groupBy results when key is a struct with nullable properties

2020-06-30 Thread Jason Moore (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149073#comment-17149073
 ] 

Jason Moore commented on SPARK-32136:
-

Here is a similar test, and why it's a problem for what I need to do:


{noformat}
case class C(d: Double)
case class B(c: Option[C])
case class A(b: Option[B])
val df = Seq(
  A(None),
  A(Some(B(None))),
  A(Some(B(Some(C(1.0)))))
).toDF
val res = df.groupBy("b").agg(count("*"))

> res.show
+---++
|  b|count(1)|
+---++
|   [[]]|   2|
|[[1.0]]|   1|
+---++

> res.as[(Option[B], Long)].collect
java.lang.RuntimeException: Error while decoding: 
java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Double", name: "d")
- option value class: "C"
- field (class: "scala.Option", name: "c")
- option value class: "B"
- field (class: "scala.Option", name: "_1")
- root class: "scala.Tuple2"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
newInstance(class scala.Tuple2)
{noformat}

Interestingly, and potentially useful to know: using an Int instead of a 
Double above works as expected.
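
For reference, a sketch of that Int variant (untested here; assumes the same spark-shell session as above, where the implicits are in scope):
{code:scala}
case class C2(d: Int)
case class B2(c: Option[C2])
case class A2(b: Option[B2])
val df2 = Seq(A2(None), A2(Some(B2(None))), A2(Some(B2(Some(C2(1)))))).toDF
// Reportedly decodes without the NullPointerException seen with Double above
df2.groupBy("b").agg(count("*")).as[(Option[B2], Long)].collect
{code}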

> Spark producing incorrect groupBy results when key is a struct with nullable 
> properties
> ---
>
> Key: SPARK-32136
> URL: https://issues.apache.org/jira/browse/SPARK-32136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Moore
>Priority: Major
>
> I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm 
> noticing a behaviour change in a particular aggregation we're doing, and I 
> think I've tracked it down to how Spark is now treating nullable properties 
> within the column being grouped by.
>  
> Here's a simple test I've been able to set up to repro it:
>  
> {code:scala}
> case class B(c: Option[Double])
> case class A(b: Option[B])
> val df = Seq(
> A(None),
> A(Some(B(None))),
> A(Some(B(Some(1.0))))
> ).toDF
> val res = df.groupBy("b").agg(count("*"))
> {code}
> Spark 2.4.6 has the expected result:
> {noformat}
> > res.show
> +-++
> |b|count(1)|
> +-++
> |   []|   1|
> | null|   1|
> |[1.0]|   1|
> +-++
> > res.collect.foreach(println)
> [[null],1]
> [null,1]
> [[1.0],1]
> {noformat}
> But Spark 3.0.0 has an unexpected result:
> {noformat}
> > res.show
> +-++
> |b|count(1)|
> +-++
> |   []|   2|
> |[1.0]|   1|
> +-++
> > res.collect.foreach(println)
> [[null],2]
> [[1.0],1]
> {noformat}
> Notice how it has keyed one of the values of b as `[null]`; that is, an 
> instance of B with a null value for the `c` property instead of a null for 
> the overall value itself.
> Is this an intended change?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32088) test of pyspark.sql.functions.timestamp_seconds failed if non-american timezone setting

2020-06-30 Thread huangtianhua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149044#comment-17149044
 ] 

huangtianhua commented on SPARK-32088:
--

the test succeeds with the modification 
https://github.com/apache/spark/pull/28959, [~maxgekk], thanks very much :)

> test of pyspark.sql.functions.timestamp_seconds failed if non-american 
> timezone setting
> ---
>
> Key: SPARK-32088
> URL: https://issues.apache.org/jira/browse/SPARK-32088
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: huangtianhua
>Assignee: philipse
>Priority: Major
> Fix For: 3.1.0
>
>
> The python test failed for aarch64 job, see 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/405/console
>  since the commit 
> https://github.com/apache/spark/commit/f0e6d0ec13d9cdadf341d1b976623345bcdb1028#diff-c8de34467c555857b92875bf78bf9d49
>  merged:
> **
> File 
> "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/sql/functions.py",
>  line 1435, in pyspark.sql.functions.timestamp_seconds
> Failed example:
> time_df.select(timestamp_seconds(time_df.unix_time).alias('ts')).collect()
> Expected:
> [Row(ts=datetime.datetime(2008, 12, 25, 7, 30))]
> Got:
> [Row(ts=datetime.datetime(2008, 12, 25, 23, 30))]
> **
>1 of   3 in pyspark.sql.functions.timestamp_seconds
> ***Test Failed*** 1 failures.
> But this is not an arm64-related issue; I ran the test on an x86 instance with 
> the timezone set to UTC and it failed too, so I think the expected 
> datetime assumes an America/** timezone, but it seems we have not set the timezone when 
> running these timezone-sensitive python tests.
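
As a side note, a hedged sketch (in Scala for brevity; the actual doctest fix is on the Python side) of pinning the session time zone so the rendered timestamp is deterministic regardless of the host's setting:
{code:scala}
// Pin the session time zone; timestamp_seconds then renders consistently.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("SELECT timestamp_seconds(1230219000) AS ts").show(false)
// 1230219000 s = 2008-12-25 15:30 UTC, shown as 2008-12-25 07:30 in the pinned zone
{code}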



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32142) Keep the original tests and codes to avoid potential conflicts in dev in ParquetFilterSuite and ParquetIOSuite

2020-06-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32142:


Assignee: (was: Apache Spark)

> Keep the original tests and codes to avoid potential conflicts in dev in 
> ParquetFilterSuite and ParquetIOSuite
> --
>
> Key: SPARK-32142
> URL: https://issues.apache.org/jira/browse/SPARK-32142
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> There are some unnecessary diffs at 
> https://github.com/apache/spark/pull/27728#discussion_r397655390, which could 
> cause confusion. They will create more diffs when other people match 
> their code against them.
> It might be best to address the comment and minimize the diff against 
> other branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32142) Keep the original tests and codes to avoid potential conflicts in dev in ParquetFilterSuite and ParquetIOSuite

2020-06-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32142:


Assignee: Apache Spark

> Keep the original tests and codes to avoid potential conflicts in dev in 
> ParquetFilterSuite and ParquetIOSuite
> --
>
> Key: SPARK-32142
> URL: https://issues.apache.org/jira/browse/SPARK-32142
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> There are some unnecessary diffs at 
> https://github.com/apache/spark/pull/27728#discussion_r397655390, which could 
> cause confusion. They will create more diffs when other people match 
> their code against them.
> It might be best to address the comment and minimize the diff against 
> other branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32142) Keep the original tests and codes to avoid potential conflicts in dev in ParquetFilterSuite and ParquetIOSuite

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149035#comment-17149035
 ] 

Apache Spark commented on SPARK-32142:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28955

> Keep the original tests and codes to avoid potential conflicts in dev in 
> ParquetFilterSuite and ParquetIOSuite
> --
>
> Key: SPARK-32142
> URL: https://issues.apache.org/jira/browse/SPARK-32142
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> There are some unnecessary diffs at 
> https://github.com/apache/spark/pull/27728#discussion_r397655390, which could 
> cause confusion. They will create more diffs when other people match 
> their code against them.
> It might be best to address the comment and minimize the diff against 
> other branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32142) Keep the original tests and codes to avoid potential conflicts in dev in ParquetFilterSuite and ParquetIOSuite

2020-06-30 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32142:


 Summary: Keep the original tests and codes to avoid potential 
conflicts in dev in ParquetFilterSuite and ParquetIOSuite
 Key: SPARK-32142
 URL: https://issues.apache.org/jira/browse/SPARK-32142
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


There are some unnecessary diffs at 
https://github.com/apache/spark/pull/27728#discussion_r397655390, which could 
cause confusion. They will create more diffs when other people match their 
code against them.

It might be best to address the comment and minimize the diff against 
other branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32141) Repartition leads to out of memory

2020-06-30 Thread Lekshmi Nair (Jira)
Lekshmi Nair created SPARK-32141:


 Summary: Repartition leads to out of memory
 Key: SPARK-32141
 URL: https://issues.apache.org/jira/browse/SPARK-32141
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 2.4.4
Reporter: Lekshmi Nair


We have an application that does aggregation on 7 columns. In order to avoid 
extra shuffles we thought of repartitioning on those 7 columns. It works well with 
1 to 4 TB of data. When it gets over 4 TB, it fails with OOM or runs out of disk space.

Do we have a better approach to reduce the shuffle? For our biggest dataset, 
the Spark job has never run successfully with repartition. We are out of options.

We have a 24-node cluster with r5.24X machines and 1 TB of disk. Our shuffle 
partition count is set to 6912.
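
For readers hitting the same wall, a minimal sketch of the pattern described above (column names, the measure, and the input path are placeholders, not taken from this report):
{code:scala}
import org.apache.spark.sql.functions._

val df = spark.read.parquet("/data/input")  // hypothetical input

// Repartition by the grouping keys so the aggregation can reuse that shuffle
// instead of adding another one; 6912 matches the shuffle partition setting above.
val keys = Seq("k1", "k2", "k3", "k4", "k5", "k6", "k7").map(col)
val aggregated = df
  .repartition(6912, keys: _*)
  .groupBy(keys: _*)
  .agg(sum("amount").as("total"))  // "amount" is a hypothetical measure column
{code}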



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149019#comment-17149019
 ] 

Jungtaek Lim commented on SPARK-32130:
--

There might be some tricks to make type inference for timestamps fail faster, 
but that would depend on the values, and the worst case would be the same (or even 
worse, given the overhead of applying the tricks). Unless the performance becomes on 
par or similar in every case, turning it on by default doesn't seem 
ideal.

[~dongjoon] [~cloud_fan] [~hyukjin.kwon] [~maxgekk] Could we please revisit 
this?
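
In the meantime, a common user-side mitigation is to skip inference entirely by supplying an explicit schema; a sketch against the two column names visible in the report below, with the remaining fields omitted:
{code:scala}
import org.apache.spark.sql.types._

// Providing a schema avoids the full-data inference pass that dominates the
// Spark 3.0 timings reported below.
val schema = new StructType()
  .add("created", LongType)
  .add("id", StringType)
  // ...declare the other five fields here as well

val df = spark.read.schema(schema).json("/data/20200528")
{code}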

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: SPARK 32130 - replication and findings.ipynb, 
> small-anon.tar
>
>
> We are planning to move to Spark 3, but the read performance of our json files 
> is unacceptable. Following are the performance numbers compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching sample data (please delete it once you are able to 
> reproduce the issue) that is much smaller than the actual size, but the 
> performance comparison can still be verified.
> The sample tar contains a bunch of json.gz files; each line of each file is a 
> self-contained json doc as shown below
> To reproduce the issue please untar the attachment - it will have multiple 
> .json.gz files whose contents will look similar to following
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>  Arris FastPath Speed 
> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>  HNC IGD 
> 

[jira] [Commented] (SPARK-32140) Add summary to FMClassificationModel

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148985#comment-17148985
 ] 

Apache Spark commented on SPARK-32140:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28960

> Add summary to FMClassificationModel
> 
>
> Key: SPARK-32140
> URL: https://issues.apache.org/jira/browse/SPARK-32140
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32140) Add summary to FMClassificationModel

2020-06-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32140:


Assignee: (was: Apache Spark)

> Add summary to FMClassificationModel
> 
>
> Key: SPARK-32140
> URL: https://issues.apache.org/jira/browse/SPARK-32140
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32140) Add summary to FMClassificationModel

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148984#comment-17148984
 ] 

Apache Spark commented on SPARK-32140:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28960

> Add summary to FMClassificationModel
> 
>
> Key: SPARK-32140
> URL: https://issues.apache.org/jira/browse/SPARK-32140
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32140) Add summary to FMClassificationModel

2020-06-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32140:


Assignee: Apache Spark

> Add summary to FMClassificationModel
> 
>
> Key: SPARK-32140
> URL: https://issues.apache.org/jira/browse/SPARK-32140
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28664) ORDER BY in aggregate function

2020-06-30 Thread Will Zimmerman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148952#comment-17148952
 ] 

Will Zimmerman commented on SPARK-28664:


[~yumwang] - Would this allow changing the NULL ordering (e.g. NULLS 
FIRST or NULLS LAST)? Currently, when evaluating the MIN() of a struct, NULL 
appears to be taken as the minimum possible value, whereas it would be nice if 
the NULL ordering could be changed, or if evaluation of structs could behave like 
MIN() of a scalar value (where NULLs are ignored unless that is the only option). Also, 
I can't find in the Spark documentation what the expected behavior of NULL values 
within a struct is. The documentation I'm referring to is 
[https://spark.apache.org/docs/3.0.0-preview/sql-ref-null-semantics.html].


{code:java}
SELECT ID 
   ,COLLECT_SET(STRUCT(x,y)) AS collection
   ,MIN(x) AS min_of_x
   ,MIN(y) AS min_of_y
   ,MIN(STRUCT(x,y)) AS min_of_collection
   ,MAX(STRUCT(x,y)) AS max_of_collection
FROM (values(1234390, 12.0, 'string_1'), (1234390, 37.4, 'string_2'), (1234390, 
6.9, NULL), (1234390, 3.1, 'string_3'), (1234390, NULL, 'string_4'), (1234390, 
NULL, NULL)) AS d(ID,x,y)
GROUP BY 1
{code}
The result of this Spark SQL query in Spark 2.4.4 can be seen below; the desired 
outcome would be (3.1, string_3).

!image-2020-06-30-15-49-46-796.png!
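
A possible workaround (untested here), expressed with the DataFrame API: make MIN skip rows whose leading field is null by wrapping the struct in a conditional, since MIN ignores null inputs.
{code:scala}
import org.apache.spark.sql.functions._

// when() yields null where x is null, and MIN skips null inputs, so the
// minimum is taken only over rows with a non-null x.
val minOfCollection = min(when(col("x").isNotNull, struct(col("x"), col("y"))))
  .as("min_of_collection")
// usage (df being the data above): df.groupBy("ID").agg(minOfCollection)
{code}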

> ORDER BY in aggregate function
> --
>
> Key: SPARK-28664
> URL: https://issues.apache.org/jira/browse/SPARK-28664
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: image-2020-06-30-15-49-46-796.png
>
>
> {code:sql}
> SELECT min(x ORDER BY y) FROM (VALUES(1, NULL)) AS d(x,y);
> SELECT min(x ORDER BY y) FROM (VALUES(1, 2)) AS d(x,y);
> {code}
> https://github.com/postgres/postgres/blob/44e95b5728a4569c494fa4ea4317f8a2f50a206b/src/test/regress/sql/aggregates.sql#L978-L982



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28664) ORDER BY in aggregate function

2020-06-30 Thread Will Zimmerman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Zimmerman updated SPARK-28664:
---
Attachment: image-2020-06-30-15-49-46-796.png

> ORDER BY in aggregate function
> --
>
> Key: SPARK-28664
> URL: https://issues.apache.org/jira/browse/SPARK-28664
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: image-2020-06-30-15-49-46-796.png
>
>
> {code:sql}
> SELECT min(x ORDER BY y) FROM (VALUES(1, NULL)) AS d(x,y);
> SELECT min(x ORDER BY y) FROM (VALUES(1, 2)) AS d(x,y);
> {code}
> https://github.com/postgres/postgres/blob/44e95b5728a4569c494fa4ea4317f8a2f50a206b/src/test/regress/sql/aggregates.sql#L978-L982



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20249) Add summary for LinearSVCModel

2020-06-30 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-20249:
---
Parent: SPARK-32139
Issue Type: Sub-task  (was: Improvement)

> Add summary for LinearSVCModel
> --
>
> Key: SPARK-20249
> URL: https://issues.apache.org/jira/browse/SPARK-20249
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Jeff Zhang
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.1.0
>
>
> I'd like to get the ObjectiveHistory of each iteration, so I want to add a 
> summary for LinearSVCModel, as LogisticRegression has. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23631) Add summary to RandomForestClassificationModel

2020-06-30 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-23631:
---
Parent: SPARK-32139
Issue Type: Sub-task  (was: New Feature)

> Add summary to RandomForestClassificationModel
> --
>
> Key: SPARK-23631
> URL: https://issues.apache.org/jira/browse/SPARK-23631
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Evan Zamir
>Priority: Major
>  Labels: bulk-closed
>
> I'm using the RandomForestClassificationModel and noticed that there is no 
> summary attribute like there is for LogisticRegressionModel. Specifically, 
> I'd like to have the roc and pr curves. Is that on the Spark roadmap 
> anywhere? Is there a reason it hasn't been implemented?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32140) Add summary to FMClassificationModel

2020-06-30 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-32140:
--

 Summary: Add summary to FMClassificationModel
 Key: SPARK-32140
 URL: https://issues.apache.org/jira/browse/SPARK-32140
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.1.0
Reporter: Huaxin Gao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31893) Add a generic ClassificationSummary trait

2020-06-30 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-31893:
---
Parent: SPARK-32139
Issue Type: Sub-task  (was: Improvement)

> Add a generic ClassificationSummary trait
> -
>
> Key: SPARK-31893
> URL: https://issues.apache.org/jira/browse/SPARK-31893
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.1.0
>
>
> Add a generic ClassificationSummary trait, so all the classification models 
> can use it to implement model summary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32139) Unify Classification Training Summary

2020-06-30 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-32139:
--

 Summary: Unify Classification Training Summary
 Key: SPARK-32139
 URL: https://issues.apache.org/jira/browse/SPARK-32139
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.1.0
Reporter: Huaxin Gao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148936#comment-17148936
 ] 

Sean R. Owen commented on SPARK-32130:
--

So, is the issue that it's trying and failing to parse millions of things as 
timestamps, as part of schema inference, and almost all of that is wasted 
because they're not timestamps? Is it possible to make the type inference for 
timestamps fail far faster on inputs that are obviously not dates (e.g. ones that 
don't start with a number)?
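
A toy illustration of that idea (purely a sketch, not the actual JSON inference code path): reject values that obviously cannot be timestamps before paying for a full parse.
{code:scala}
// Cheap pre-check: a timestamp literal is at least "yyyy-MM-dd" long and starts
// with a digit, so strings like "ARRIS HNC IGD" are rejected in constant time.
def couldBeTimestamp(s: String): Boolean =
  s.length >= 10 && s.head.isDigit

def inferAsTimestamp(s: String): Boolean =
  couldBeTimestamp(s) && scala.util.Try(java.sql.Timestamp.valueOf(s)).isSuccess
{code}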

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: SPARK 32130 - replication and findings.ipynb, 
> small-anon.tar
>
>
> We are planning to move to Spark 3, but the read performance of our json files 
> is unacceptable. Following are the performance numbers compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching sample data (please delete it once you are able to 
> reproduce the issue) that is much smaller than the actual size, but the 
> performance comparison can still be verified.
> The sample tar contains a bunch of json.gz files; each line of each file is a 
> self-contained json doc as shown below
> To reproduce the issue please untar the attachment - it will have multiple 
> .json.gz files whose contents will look similar to following
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>  Arris FastPath Speed 
> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>  HNC IGD 
> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
>  Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx 

[jira] [Commented] (SPARK-32088) test of pyspark.sql.functions.timestamp_seconds failed if non-american timezone setting

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148913#comment-17148913
 ] 

Apache Spark commented on SPARK-32088:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28959

> test of pyspark.sql.functions.timestamp_seconds failed if non-american 
> timezone setting
> ---
>
> Key: SPARK-32088
> URL: https://issues.apache.org/jira/browse/SPARK-32088
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: huangtianhua
>Assignee: philipse
>Priority: Major
> Fix For: 3.1.0
>
>
> The python test failed for aarch64 job, see 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/405/console
>  since the commit 
> https://github.com/apache/spark/commit/f0e6d0ec13d9cdadf341d1b976623345bcdb1028#diff-c8de34467c555857b92875bf78bf9d49
>  merged:
> **
> File 
> "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/sql/functions.py",
>  line 1435, in pyspark.sql.functions.timestamp_seconds
> Failed example:
> time_df.select(timestamp_seconds(time_df.unix_time).alias('ts')).collect()
> Expected:
> [Row(ts=datetime.datetime(2008, 12, 25, 7, 30))]
> Got:
> [Row(ts=datetime.datetime(2008, 12, 25, 23, 30))]
> **
>1 of   3 in pyspark.sql.functions.timestamp_seconds
> ***Test Failed*** 1 failures.
> But this is not an arm64-related issue; I ran the test on an x86 instance with 
> the timezone set to UTC and it failed too, so I think the expected 
> datetime assumes an America/** timezone, but it seems we have not set the timezone when 
> running these timezone-sensitive python tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on Windows

2020-06-30 Thread Cheng Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Pan updated SPARK-32121:
--
Issue Type: Bug  (was: Test)

> ExternalShuffleBlockResolverSuite failed on Windows
> ---
>
> Key: SPARK-32121
> URL: https://issues.apache.org/jira/browse/SPARK-32121
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0, 3.0.1
> Environment: Windows 10
>Reporter: Cheng Pan
>Priority: Minor
>
> The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} 
> should consider the Windows file separator.
> {code}
> [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 
> s <<< FAILURE! - in 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite
> [ERROR] 
> testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite)
>   Time elapsed: 0 s  <<< FAILURE!
> org.junit.ComparisonFailure: expected: but 
> was:
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149)
> {code}
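
A hedged sketch of the kind of fix implied above, going through java.io.File so the platform separator is respected (not the actual Spark patch):
{code:scala}
import java.io.File

// Building the path through File uses File.separator, so the same expectation
// holds on both Linux ("/") and Windows ("\").
def normalizedPath(parent: String, child: String): String =
  new File(parent, child).getPath
{code}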



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31336) Support Oracle Kerberos login in JDBC connector

2020-06-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31336.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28863
[https://github.com/apache/spark/pull/28863]

> Support Oracle Kerberos login in JDBC connector
> ---
>
> Key: SPARK-31336
> URL: https://issues.apache.org/jira/browse/SPARK-31336
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31935) Hadoop file system config should be effective in data source options

2020-06-30 Thread Cheng Lian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-31935:
---
Affects Version/s: (was: 3.0.1)
   (was: 3.1.0)
   2.4.6
   3.0.0

> Hadoop file system config should be effective in data source options 
> -
>
> Key: SPARK-31935
> URL: https://issues.apache.org/jira/browse/SPARK-31935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> Data source options should be propagated into the hadoop configuration of 
> method `checkAndGlobPathIfNecessary`
> From org.apache.hadoop.fs.FileSystem.java:
> {code:java}
>   public static FileSystem get(URI uri, Configuration conf) throws 
> IOException {
> String scheme = uri.getScheme();
> String authority = uri.getAuthority();
> if (scheme == null && authority == null) { // use default FS
>   return get(conf);
> }
> if (scheme != null && authority == null) { // no authority
>   URI defaultUri = getDefaultUri(conf);
>   if (scheme.equals(defaultUri.getScheme())// if scheme matches 
> default
>   && defaultUri.getAuthority() != null) {  // & default has authority
> return get(defaultUri, conf);  // return default
>   }
> }
> 
> String disableCacheName = String.format("fs.%s.impl.disable.cache", 
> scheme);
> if (conf.getBoolean(disableCacheName, false)) {
>   return createFileSystem(uri, conf);
> }
> return CACHE.get(uri, conf);
>   }
> {code}
> With this, we can specify URI scheme- and authority-related configurations for 
> scanning file systems.
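
A hedged usage sketch of what this enables; the bucket, option keys, and environment variables are purely illustrative:
{code:scala}
// Per-read Hadoop FS settings passed as data source options, picked up when
// the paths are resolved and globbed.
val df = spark.read
  .option("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .option("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .parquet("s3a://my-bucket/events/")
{code}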



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31336) Support Oracle Kerberos login in JDBC connector

2020-06-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31336:
-

Assignee: Gabor Somogyi

> Support Oracle Kerberos login in JDBC connector
> ---
>
> Key: SPARK-31336
> URL: https://issues.apache.org/jira/browse/SPARK-31336
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32132) Thriftserver interval returns "4 weeks 2 days" in 2.4 and "30 days" in 3.0

2020-06-30 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148848#comment-17148848
 ] 

Juliusz Sompolski commented on SPARK-32132:
---

Also, 2.4 adds "interval" at the start, while 3.0 does not, e.g. "interval 3 
days" in 2.4 and "3 days" in 3.0.
I actually think that the new 3.0 results are better / more standard, and I 
haven't heard anyone complaining that it broke the way they parse it.

Edit: [~cloud_fan] posting now the above comment that I thought I had posted 
yesterday, but it was left unsent in an open tab. It causes some 
issues with unit tests, but I think it shouldn't cause real-world problems, and 
in any case the new format is likely better for the future. Thanks for 
explaining.

> Thriftserver interval returns "4 weeks 2 days" in 2.4 and "30 days" in 3.0
> --
>
> Key: SPARK-32132
> URL: https://issues.apache.org/jira/browse/SPARK-32132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> In https://github.com/apache/spark/pull/26418, a setting 
> spark.sql.dialect.intervalOutputStyle was implemented, to control interval 
> output style. This PR also removed "toString" from CalendarInterval. This 
> change got reverted in https://github.com/apache/spark/pull/27304, and the 
> CalendarInterval.toString got implemented back in 
> https://github.com/apache/spark/pull/26572.
> But it behaves differently now: In 2.4 "4 weeks 2 days" are returned, and 3.0 
> returns "30 days".
> Thriftserver uses HiveResults.toHiveString, which uses 
> CalendarInterval.toString to return interval results as string.  The results 
> are now different in 3.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31935) Hadoop file system config should be effective in data source options

2020-06-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31935:

Issue Type: Bug  (was: Improvement)

> Hadoop file system config should be effective in data source options 
> -
>
> Key: SPARK-31935
> URL: https://issues.apache.org/jira/browse/SPARK-31935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> Data source options should be propagated into the hadoop configuration of 
> method `checkAndGlobPathIfNecessary`
> From org.apache.hadoop.fs.FileSystem.java:
> {code:java}
>   public static FileSystem get(URI uri, Configuration conf) throws 
> IOException {
> String scheme = uri.getScheme();
> String authority = uri.getAuthority();
> if (scheme == null && authority == null) { // use default FS
>   return get(conf);
> }
> if (scheme != null && authority == null) { // no authority
>   URI defaultUri = getDefaultUri(conf);
>   if (scheme.equals(defaultUri.getScheme())// if scheme matches 
> default
>   && defaultUri.getAuthority() != null) {  // & default has authority
> return get(defaultUri, conf);  // return default
>   }
> }
> 
> String disableCacheName = String.format("fs.%s.impl.disable.cache", 
> scheme);
> if (conf.getBoolean(disableCacheName, false)) {
>   return createFileSystem(uri, conf);
> }
> return CACHE.get(uri, conf);
>   }
> {code}
> With this, we can specify URI scheme- and authority-related configurations for 
> scanning file systems.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32132) Thriftserver interval returns "4 weeks 2 days" in 2.4 and "30 days" in 3.0

2020-06-30 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148836#comment-17148836
 ] 

Wenchen Fan commented on SPARK-32132:
-

We did it intentionally in 
https://github.com/apache/spark/commit/fb60c2a170b40e80285c1255f0635d5ab319cd35 
.

INTERVAL is an internal data type and can't be written out, and we don't 
guarantee the stability of the string representation of interval values. An interval 
string like "4 weeks 2 days" is indeed hard to read, and I think it's better 
to use "30 days". Does it cause issues on the Thrift server side?

> Thriftserver interval returns "4 weeks 2 days" in 2.4 and "30 days" in 3.0
> --
>
> Key: SPARK-32132
> URL: https://issues.apache.org/jira/browse/SPARK-32132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> In https://github.com/apache/spark/pull/26418, a setting 
> spark.sql.dialect.intervalOutputStyle was implemented, to control interval 
> output style. This PR also removed "toString" from CalendarInterval. This 
> change got reverted in https://github.com/apache/spark/pull/27304, and the 
> CalendarInterval.toString got implemented back in 
> https://github.com/apache/spark/pull/26572.
> But it behaves differently now: In 2.4 "4 weeks 2 days" are returned, and 3.0 
> returns "30 days".
> Thriftserver uses HiveResults.toHiveString, which uses 
> CalendarInterval.toString to return interval results as string.  The results 
> are now different in 3.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32026) Add PrometheusServletSuite

2020-06-30 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-32026:
---
Description: 
This Jira aims to add _PrometheusServletSuite_.

*Note:* This Jira was originally created to propose a _Prometheus_ driver metric format 
change for consistency with the executor metric format. However, currently, 
_PrometheusServlet_ follows the _Spark 3.0 JMX Sink + Prometheus JMX Exporter_ 
format, so the scope of this Jira has been reduced to just 
_PrometheusServletSuite_ in light of the discussion on 
https://github.com/apache/spark/pull/28865.
 

  was:
Spark 3.0 introduces a native Prometheus Sink for both driver and executor 
metrics. However, they need consistency in format (e.g. `applicationId`). 
Currently, the driver embeds `applicationId` in the metric name. If it can be extracted as in 
the executor metric format, that would improve consistency and make querying easier.

*Driver*
{code:java}
metrics_local_1592242896665_driver_BlockManager_memory_memUsed_MB_Value{type="gauges"}
 0{code}

*Executor*
{code:java}
metrics_executor_memoryUsed_bytes{application_id="local-1592242896665", 
application_name="apache-spark-fundamentals", executor_id="driver"} 24356{code}

*Proposed Driver Format*
{code:java}
metrics_driver_BlockManager_memory_memUsed_MB_Value{application_id="local-1592242896665",
 type="gauges"} 0{code}


> Add PrometheusServletSuite
> --
>
> Key: SPARK-32026
> URL: https://issues.apache.org/jira/browse/SPARK-32026
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Eren Avsarogullari
>Priority: Major
>
> This Jira aims to add _PrometheusServletSuite_.
> *Note:* This Jira was originally created to propose a _Prometheus_ driver metric format 
> change for consistency with the executor metric format. However, currently, 
> _PrometheusServlet_ follows the _Spark 3.0 JMX Sink + Prometheus JMX Exporter_ 
> format, so the scope of this Jira has been reduced to just 
> _PrometheusServletSuite_ in light of the discussion on 
> https://github.com/apache/spark/pull/28865.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster

2020-06-30 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148790#comment-17148790
 ] 

Kousuke Saruta commented on SPARK-32119:


Yeah, I know it works with extraClassPath, but as you mention, it's a limitation.

> ExecutorPlugin doesn't work with Standalone Cluster
> ---
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ExecutorPlugin doesn't work with a Standalone Cluster (and maybe with other cluster 
> managers too, except YARN) 
>  when a jar containing the plugins, and files used by the plugins, are added via the 
> --jars and --files options of spark-submit.
> This is because jars and files added by --jars and --files are not loaded at 
> Executor initialization.
>  I confirmed it works with YARN because the jars/files are distributed via the 
> distributed cache.
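
A hedged sketch of the extraClassPath workaround mentioned in the comment above; the plugin class and jar path are hypothetical, and the jar must already be present on every worker:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.plugins", "com.example.MyExecutorPlugin")               // hypothetical plugin class
  .config("spark.executor.extraClassPath", "/opt/plugins/my-plugin.jar") // jar pre-deployed on each node
  .getOrCreate()
{code}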



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in codes and documentation

2020-06-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32138:


Assignee: Apache Spark

> Drop Python 2, 3.4 and 3.5 in codes and documentation
> -
>
> Key: SPARK-32138
> URL: https://issues.apache.org/jira/browse/SPARK-32138
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in codes and documentation

2020-06-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32138:


Assignee: (was: Apache Spark)

> Drop Python 2, 3.4 and 3.5 in codes and documentation
> -
>
> Key: SPARK-32138
> URL: https://issues.apache.org/jira/browse/SPARK-32138
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in codes and documentation

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148766#comment-17148766
 ] 

Apache Spark commented on SPARK-32138:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28957

> Drop Python 2, 3.4 and 3.5 in codes and documentation
> -
>
> Key: SPARK-32138
> URL: https://issues.apache.org/jira/browse/SPARK-32138
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in codes and documentation

2020-06-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32138:
-
Summary: Drop Python 2, 3.4 and 3.5 in codes and documentation  (was: Drop 
Python 2, 3.4 and 3.5 in the main and dev codes)

> Drop Python 2, 3.4 and 3.5 in codes and documentation
> -
>
> Key: SPARK-32138
> URL: https://issues.apache.org/jira/browse/SPARK-32138
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29802) Update remaining python scripts in repo to python3 shebang

2020-06-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29802:
-
Summary: Update remaining python scripts in repo to python3 shebang  (was: 
update remaining python scripts in repo to python3 shebang)

> Update remaining python scripts in repo to python3 shebang
> --
>
> Key: SPARK-29802
> URL: https://issues.apache.org/jira/browse/SPARK-29802
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Priority: Major
>
> there are a bunch of scripts in the repo that need to have their shebang 
> updated to python3:
> {noformat}
> dev/create-release/releaseutils.py:#!/usr/bin/env python
> dev/create-release/generate-contributors.py:#!/usr/bin/env python
> dev/create-release/translate-contributors.py:#!/usr/bin/env python
> dev/github_jira_sync.py:#!/usr/bin/env python
> dev/merge_spark_pr.py:#!/usr/bin/env python
> python/pyspark/version.py:#!/usr/bin/env python
> python/pyspark/find_spark_home.py:#!/usr/bin/env python
> python/setup.py:#!/usr/bin/env python{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29919) Remove python2 test execution in Jenkins environment

2020-06-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29919:
-
Summary: Remove python2 test execution in Jenkins environment  (was: remove 
python2 test execution)

> Remove python2 test execution in Jenkins environment
> 
>
> Key: SPARK-29919
> URL: https://issues.apache.org/jira/browse/SPARK-29919
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
>
> remove python2.7 (including pypy2) test executables from 'python/run-tests.py'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29803) remove all instances of 'from __future__ import print_function'

2020-06-30 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148755#comment-17148755
 ] 

Hyukjin Kwon edited comment on SPARK-29803 at 6/30/20, 3:07 PM:


I will do it at SPARK-32138


was (Author: hyukjin.kwon):
I will do it at SPARK-29909 

> remove all instances of 'from __future__ import print_function' 
> 
>
> Key: SPARK-29803
> URL: https://issues.apache.org/jira/browse/SPARK-29803
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Priority: Major
> Attachments: print_function_list.txt
>
>
> there are 135 python files in the spark repo that need to have `from 
> __future__ import print_function` removed (see attached file 
> 'print_function_list.txt').
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in the main and dev codes

2020-06-30 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32138:


 Summary: Drop Python 2, 3.4 and 3.5 in the main and dev codes
 Key: SPARK-32138
 URL: https://issues.apache.org/jira/browse/SPARK-32138
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29919) remove python2 test execution

2020-06-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29919:
-
Parent: SPARK-29909
Issue Type: Sub-task  (was: Improvement)

> remove python2 test execution
> -
>
> Key: SPARK-29919
> URL: https://issues.apache.org/jira/browse/SPARK-29919
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
>
> remove python2.7 (including pypy2) test executables from 'python/run-tests.py'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29919) remove python2 test execution

2020-06-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29919:
-
Parent: (was: SPARK-29909)
Issue Type: Improvement  (was: Sub-task)

> remove python2 test execution
> -
>
> Key: SPARK-29919
> URL: https://issues.apache.org/jira/browse/SPARK-29919
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
>
> remove python2.7 (including pypy2) test executables from 'python/run-tests.py'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29803) remove all instances of 'from __future__ import print_function'

2020-06-30 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148755#comment-17148755
 ] 

Hyukjin Kwon commented on SPARK-29803:
--

I will do it at SPARK-29909 

> remove all instances of 'from __future__ import print_function' 
> 
>
> Key: SPARK-29803
> URL: https://issues.apache.org/jira/browse/SPARK-29803
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Priority: Major
> Attachments: print_function_list.txt
>
>
> there are 135 python files in the spark repo that need to have `from 
> __future__ import print_function` removed (see attached file 
> 'print_function_list.txt').
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29803) remove all instances of 'from __future__ import print_function'

2020-06-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29803:
-
Parent: (was: SPARK-29909)
Issue Type: Improvement  (was: Sub-task)

> remove all instances of 'from __future__ import print_function' 
> 
>
> Key: SPARK-29803
> URL: https://issues.apache.org/jira/browse/SPARK-29803
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Priority: Major
> Attachments: print_function_list.txt
>
>
> there are 135 python files in the spark repo that need to have `from 
> __future__ import print_function` removed (see attached file 
> 'print_function_list.txt').
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29803) remove all instances of 'from __future__ import print_function'

2020-06-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29803.
--
Resolution: Duplicate

> remove all instances of 'from __future__ import print_function' 
> 
>
> Key: SPARK-29803
> URL: https://issues.apache.org/jira/browse/SPARK-29803
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Priority: Major
> Attachments: print_function_list.txt
>
>
> there are 135 python files in the spark repo that need to have `from 
> __future__ import print_function` removed (see attached file 
> 'print_function_list.txt').
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster

2020-06-30 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148746#comment-17148746
 ] 

Thomas Graves commented on SPARK-32119:
---

You can specify the jars in extraClassPath, but that requires them to already 
be present on the node.
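
For illustration only (hypothetical plugin class and jar path; assumes the 
Spark 3 plugin framework's spark.plugins config), that workaround might look 
roughly like:

{code:python}
from pyspark.sql import SparkSession

# The jar must already exist at this path on every node, since
# extraClassPath does not ship files the way --jars does.
spark = (
    SparkSession.builder
    .config("spark.plugins", "com.example.MyPlugin")
    .config("spark.driver.extraClassPath", "/opt/spark-plugins/my-plugin.jar")
    .config("spark.executor.extraClassPath", "/opt/spark-plugins/my-plugin.jar")
    .getOrCreate()
)
{code}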

 

> ExecutorPlugin doesn't work with Standalone Cluster
> ---
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ExecutorPlugin can't work with Standalone Cluster (and probably with other 
> cluster managers too, except YARN)
>  when a jar that contains the plugins, and files used by the plugins, are added 
> via the --jars and --files options of spark-submit.
> This is because jars and files added by --jars and --files are not loaded on 
> Executor initialization.
>  I confirmed it works with YARN because jars/files are distributed via the 
> distributed cache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32135) Show Spark Driver name on Spark history web page

2020-06-30 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148742#comment-17148742
 ] 

Thomas Graves commented on SPARK-32135:
---

[~gaurangi] can you please clarify what you mean by "driver host" and where you 
want to see it? The history server is an independent process and could be on 
a separate host from everything else.

If you are talking about a single application, you can see its driver host by 
going to the executors page.

> Show Spark Driver name on Spark history web page
> 
>
> Key: SPARK-32135
> URL: https://issues.apache.org/jira/browse/SPARK-32135
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Gaurangi Saxena
>Priority: Minor
>
> We would like to see spark driver host on the history server web page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31816) Create high level description about JDBC connection providers for users/developers

2020-06-30 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148712#comment-17148712
 ] 

Gabor Somogyi commented on SPARK-31816:
---

Originally I thought it would be enough to add a developer-facing description, but 
since I've received several questions from users about what's supported, I've 
reconsidered. I'm now planning to add both a user and a developer description.

> Create high level description about JDBC connection providers for 
> users/developers
> --
>
> Key: SPARK-31816
> URL: https://issues.apache.org/jira/browse/SPARK-31816
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31816) Create high level description about JDBC connection providers for users/developers

2020-06-30 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-31816:
--
Summary: Create high level description about JDBC connection providers for 
users/developers  (was: Create high level description about JDBC connection 
providers for developers)

> Create high level description about JDBC connection providers for 
> users/developers
> --
>
> Key: SPARK-31816
> URL: https://issues.apache.org/jira/browse/SPARK-31816
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32068) Spark 3 UI task launch time show in error time zone

2020-06-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-32068:
-

Assignee: JinxinTang

> Spark 3 UI task launch time show in error time zone
> ---
>
> Key: SPARK-32068
> URL: https://issues.apache.org/jira/browse/SPARK-32068
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Smith Cruise
>Assignee: JinxinTang
>Priority: Major
>  Labels: easyfix
> Fix For: 3.0.1, 3.1.0
>
> Attachments: correct.png, incorrect.png
>
>
> For example,
> In this link: history/app-20200623133209-0015/stages/ , stage submit time is 
> correct (CST)
>  
> But in this link: 
> history/app-20200623133209-0015/stages/stage/?id=0=0 , task launch 
> time is incorrect(UTC)
>  
> The same problem exists in port 4040 Web UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32068) Spark 3 UI task launch time show in error time zone

2020-06-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-32068.
---
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

> Spark 3 UI task launch time show in error time zone
> ---
>
> Key: SPARK-32068
> URL: https://issues.apache.org/jira/browse/SPARK-32068
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Smith Cruise
>Priority: Major
>  Labels: easyfix
> Fix For: 3.0.1, 3.1.0
>
> Attachments: correct.png, incorrect.png
>
>
> For example,
> In this link: history/app-20200623133209-0015/stages/ , stage submit time is 
> correct (CST)
>  
> But in this link: 
> history/app-20200623133209-0015/stages/stage/?id=0=0 , task launch 
> time is incorrect(UTC)
>  
> The same problem exists in port 4040 Web UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31797) Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148694#comment-17148694
 ] 

Apache Spark commented on SPARK-31797:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/28956

> Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions
> ---
>
> Key: SPARK-31797
> URL: https://issues.apache.org/jira/browse/SPARK-31797
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: [PR link|https://github.com/apache/spark/pull/28534]
>Reporter: JinxinTang
>Assignee: JinxinTang
>Priority: Major
> Fix For: 3.1.0
>
>
> 1.Add and register three new functions: {{TIMESTAMP_SECONDS}}, 
> {{TIMESTAMP_MILLIS}} and {{TIMESTAMP_MICROS}}
> Reference: 
> [BigQuery|https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds]
> 2. People will have a convenient way to get timestamps from seconds, 
> milliseconds and microseconds.
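
A hedged PySpark usage sketch (assumes a Spark build that already contains this 
change, per the 3.1.0 fix version):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Each function converts an integer epoch value to a timestamp.
spark.sql("""
    SELECT timestamp_seconds(1230219000)      AS from_seconds,
           timestamp_millis(1230219000123)    AS from_millis,
           timestamp_micros(1230219000123456) AS from_micros
""").show(truncate=False)
{code}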



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31797) Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148696#comment-17148696
 ] 

Apache Spark commented on SPARK-31797:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/28956

> Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions
> ---
>
> Key: SPARK-31797
> URL: https://issues.apache.org/jira/browse/SPARK-31797
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: [PR link|https://github.com/apache/spark/pull/28534]
>Reporter: JinxinTang
>Assignee: JinxinTang
>Priority: Major
> Fix For: 3.1.0
>
>
> 1.Add and register three new functions: {{TIMESTAMP_SECONDS}}, 
> {{TIMESTAMP_MILLIS}} and {{TIMESTAMP_MICROS}}
> Reference: 
> [BigQuery|https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds]
> 2. People will have a convenient way to get timestamps from seconds, 
> milliseconds and microseconds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Sanjeev Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148663#comment-17148663
 ] 

Sanjeev Mishra edited comment on SPARK-32130 at 6/30/20, 1:35 PM:
--

I tried to load the entire dataset using the above suggestions and the times did come 
close, but Spark 2.4 still beats Spark 3 by ~3+ seconds (maybe an acceptable range, 
but the question is why?). I definitely agree with the comment above that if the 
default behavior is changed the community MUST be notified; otherwise it is a huge 
waste of time for everyone. Here are the various results:

*Spark 2.4* 

spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false"){color}.json("/data/20200528/").count)
 Time taken: {color:#ff}29706 ms{color}
 res0: Long = 2605349

 

spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false").option("prefersDecimal","false"{color}).json("/data/20200528/").count)
 Time taken: {color:#de350b}31431 ms{color}
 res0: Long = 2605349

 

*Spark 3.0*

spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false"{color}).json("/data/20200528/").count)
 Time taken: {color:#de350b}32826 ms{color}
 res0: Long = 2605349

 

spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false").option("prefersDecimal","false"{color}).json("/data/20200528/").count)
 Time taken: {color:#de350b}34011 ms{color}
 res0: Long = 2605349

 

{color:#de350b}*Note:*{color}
 # Make sure {color:#de350b}you never set prefersDecimal to true{color} even 
when inferTimestamp is false; it again takes a huge amount of time.
 # Spark 3.0 + JDK 11 is slower than Spark 3.0 + JDK 8 by almost 6 sec.

 

 


was (Author: lotus2you):
I tried to load entire dataset using above suggestions and time did come close 
but Spark 2.4 still beats Spark 3 by ~3+ sec (may be in acceptable range but 
question is why?). I would definitely agree above comment that if the default 
behavior is changed the community MUST be notified otherwise it is a huge time 
waste for all. Here are the various results

*Spark 2.4* 

spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false"){color}.json("/data/20200528/").count)
Time taken: {color:#FF}29706 ms{color}
res0: Long = 2605349

 

spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false").option("prefersDecimal","false"{color}).json("/data/20200528/").count)
Time taken: {color:#de350b}31431 ms{color}
res0: Long = 2605349

 

*Spark 3.0*

spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false"{color}).json("/data/20200528/").count)
Time taken: {color:#de350b}32826 ms{color}
res0: Long = 2605349

 

spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false").option("prefersDecimal","false"{color}).json("/data/20200528/").count)
Time taken: {color:#de350b}34011 ms{color}
res0: Long = 2605349

 

{color:#de350b}*Note:*{color}

Make sure {color:#de350b}you never turn on prefersDecimal to true{color} even 
when inferTimestamp is false, it again takes huge amount of time.

 

 

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: SPARK 32130 - replication and findings.ipynb, 
> small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following is the performance numbers 

[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Sanjeev Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148663#comment-17148663
 ] 

Sanjeev Mishra commented on SPARK-32130:


I tried to load entire dataset using above suggestions and time did come close 
but Spark 2.4 still beats Spark 3 by ~3+ sec (may be in acceptable range but 
question is why?). I would definitely agree above comment that if the default 
behavior is changed the community MUST be notified otherwise it is a huge time 
waste for all. Here are the various results

*Spark 2.4* 

spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false"){color}.json("/data/20200528/").count)
Time taken: {color:#FF}29706 ms{color}
res0: Long = 2605349

 

spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false").option("prefersDecimal","false"{color}).json("/data/20200528/").count)
Time taken: {color:#de350b}31431 ms{color}
res0: Long = 2605349

 

*Spark 3.0*

spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false"{color}).json("/data/20200528/").count)
Time taken: {color:#de350b}32826 ms{color}
res0: Long = 2605349

 

spark.time(spark.read.{color:#0747a6}option("inferTimestamp","false").option("prefersDecimal","false"{color}).json("/data/20200528/").count)
Time taken: {color:#de350b}34011 ms{color}
res0: Long = 2605349

 

{color:#de350b}*Note:*{color}

Make sure {color:#de350b}you never turn on prefersDecimal to true{color} even 
when inferTimestamp is false, it again takes huge amount of time.

 

 

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: SPARK 32130 - replication and findings.ipynb, 
> small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following is the performance numbers when compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching a sample data (please delete is once you are able to 
> reproduce the issue) that is much smaller than the actual size but the 
> performance comparison can still be verified.
> The sample tar contains bunch of json.gz files, each line of the file is self 
> contained json doc as shown below
> To reproduce the issue please untar the attachment - it will have multiple 
> .json.gz files whose contents will look similar to following
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> 

[jira] [Updated] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Sanjeev Mishra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanjeev Mishra updated SPARK-32130:
---
Description: 
We are planning to move to Spark 3, but the read performance of our json files 
is unacceptable. The following are the performance numbers compared to Spark 2.4:

 
 Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
 Time taken: {color:#ff}19691 ms{color}
 res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
more fields]
 scala> spark.time(res61.count())
 Time taken: {color:#ff}7113 ms{color}
 res64: Long = 2605349
 Spark 3.0
 scala> spark.time(spark.read.json("/data/20200528"))
 20/06/29 08:06:53 WARN package: Truncated the string representation of a plan 
since it was too large. This behavior can be adjusted by setting 
'spark.sql.debug.maxToStringFields'.
 Time taken: {color:#ff}849652 ms{color}
 res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more 
fields]
 scala> spark.time(res0.count())
 Time taken: {color:#ff}8201 ms{color}
 res2: Long = 2605349
  
  
 I am attaching sample data (please delete it once you are able to reproduce 
the issue) that is much smaller than the actual size, but the performance 
comparison can still be verified.

The sample tar contains a bunch of json.gz files; each line of a file is a 
self-contained JSON doc as shown below.

To reproduce the issue, please untar the attachment - it will have multiple 
.json.gz files whose contents will look similar to the following:

 
{quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS
 HNC IGD","Annex F 
Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
 Arris FastPath Speed 
Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
 HNC IGD 
EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
 Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[\\{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":\\{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":\\{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":\\{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":\\{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":\\{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":\\{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":\\{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}{color}
{quote}
{quote}{color:#741b47}{"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":\{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS
 HNC IGD","Annex F 
Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports
 Arris FastPath Speed 
Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
 HNC IGD 

[jira] [Updated] (SPARK-29919) remove python2 test execution

2020-06-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29919:
-
Parent: (was: SPARK-27884)
Issue Type: Bug  (was: Sub-task)

> remove python2 test execution
> -
>
> Key: SPARK-29919
> URL: https://issues.apache.org/jira/browse/SPARK-29919
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
>
> remove python2.7 (including pypy2) test executables from 'python/run-tests.py'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29919) remove python2 test execution

2020-06-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29919:
-
Parent: SPARK-29909
Issue Type: Sub-task  (was: Bug)

> remove python2 test execution
> -
>
> Key: SPARK-29919
> URL: https://issues.apache.org/jira/browse/SPARK-29919
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
>
> remove python2.7 (including pypy2) test executables from 'python/run-tests.py'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31060) Handle column names containing `dots` in data source `Filter`

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148594#comment-17148594
 ] 

Apache Spark commented on SPARK-31060:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28955

> Handle column names containing `dots` in data source `Filter`
> -
>
> Key: SPARK-31060
> URL: https://issues.apache.org/jira/browse/SPARK-31060
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31026) Parquet predicate pushdown on columns with dots

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148593#comment-17148593
 ] 

Apache Spark commented on SPARK-31026:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28955

> Parquet predicate pushdown on columns with dots
> ---
>
> Key: SPARK-31026
> URL: https://issues.apache.org/jira/browse/SPARK-31026
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> Parquet predicate pushdown on columns with dots was disabled in -SPARK-20364- 
> because Parquet's APIs don't support it. A new set of APIs is proposed in 
> PARQUET-1809 to generalize the support of nested columns, which can address this 
> issue. The implementation will be merged into the Spark repo first, until we get 
> a new release from the Parquet community.
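
For illustration only (made-up data and path), a PySpark sketch of a column 
whose name contains a dot and how to check whether its predicate is pushed down:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A column whose name literally contains a dot.
spark.createDataFrame([(1,), (5,)], ["a.b"]) \
    .write.mode("overwrite").parquet("/tmp/dots_demo")

df = spark.read.parquet("/tmp/dots_demo")
# The dotted name must be quoted with backticks; inspect the plan's
# PushedFilters entry to see whether the predicate reached Parquet.
df.filter("`a.b` > 4").explain()
{code}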



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31060) Handle column names containing `dots` in data source `Filter`

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148592#comment-17148592
 ] 

Apache Spark commented on SPARK-31060:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28955

> Handle column names containing `dots` in data source `Filter`
> -
>
> Key: SPARK-31060
> URL: https://issues.apache.org/jira/browse/SPARK-31060
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31026) Parquet predicate pushdown on columns with dots

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148590#comment-17148590
 ] 

Apache Spark commented on SPARK-31026:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28955

> Parquet predicate pushdown on columns with dots
> ---
>
> Key: SPARK-31026
> URL: https://issues.apache.org/jira/browse/SPARK-31026
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> Parquet predicate pushdown on columns with dots was disabled in -SPARK-20364- 
> because Parquet's APIs don't support it. A new set of APIs is proposed in 
> PARQUET-1809 to generalize the support of nested columns, which can address this 
> issue. The implementation will be merged into the Spark repo first, until we get 
> a new release from the Parquet community.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17636) Parquet predicate pushdown for nested fields

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148589#comment-17148589
 ] 

Apache Spark commented on SPARK-17636:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28955

> Parquet predicate pushdown for nested fields
> 
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 1.6.3, 2.0.2
>Reporter: Mitesh
>Assignee: DB Tsai
>Priority: Critical
> Fix For: 3.0.0
>
>
> There's a *PushedFilters* for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a Spark limitation because of 
> Parquet, or only a Spark limitation.
> {noformat}
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {noformat}
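
A hedged PySpark sketch (hypothetical input path; column names mirror the plan 
above) for checking whether a nested-field predicate shows up as a pushed filter:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/tmp/sales_parquet").select("day_timestamp", "sale_id")

# Top-level column: the plan is expected to list PushedFilters: [GreaterThan(sale_id,4)].
df.filter("sale_id > 4").explain()

# Field nested inside a struct: without nested pushdown, no filter is pushed.
df.filter("day_timestamp.timestamp > 4").explain()
{code}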



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17636) Parquet predicate pushdown for nested fields

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148588#comment-17148588
 ] 

Apache Spark commented on SPARK-17636:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28955

> Parquet predicate pushdown for nested fields
> 
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 1.6.3, 2.0.2
>Reporter: Mitesh
>Assignee: DB Tsai
>Priority: Critical
> Fix For: 3.0.0
>
>
> There's a *PushedFilters* for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a Spark limitation because of 
> Parquet, or only a Spark limitation.
> {noformat}
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25556) Predicate Pushdown for Nested fields

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148587#comment-17148587
 ] 

Apache Spark commented on SPARK-25556:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28955

> Predicate Pushdown for Nested fields
> 
>
> Key: SPARK-25556
> URL: https://issues.apache.org/jira/browse/SPARK-25556
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> This is an umbrella JIRA to support predicate pushdown for nested fields.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32083) Unnecessary tasks are launched when input is empty with AQE

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148585#comment-17148585
 ] 

Apache Spark commented on SPARK-32083:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/28954

> Unnecessary tasks are launched when input is empty with AQE
> ---
>
> Key: SPARK-32083
> URL: https://issues.apache.org/jira/browse/SPARK-32083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> [https://github.com/apache/spark/pull/28226] meant to avoid launching 
> unnecessary tasks for 0-size partitions when AQE is enabled. However, when 
> all partitions are empty, the number of partitions will be 
> `spark.sql.adaptive.coalescePartitions.initialPartitionNum` and (a lot of) 
> unnecessary tasks are launched in this case.
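
A hedged PySpark sketch (example values only) of the configurations involved:

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")
    .getOrCreate()
)

# With an entirely empty input, the shuffle stage can still be planned with
# up to initialPartitionNum (here 1000) empty partitions, i.e. no-op tasks.
spark.range(0).groupBy("id").count().collect()
{code}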



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32083) Unnecessary tasks are launched when input is empty with AQE

2020-06-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148584#comment-17148584
 ] 

Apache Spark commented on SPARK-32083:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/28954

> Unnecessary tasks are launched when input is empty with AQE
> ---
>
> Key: SPARK-32083
> URL: https://issues.apache.org/jira/browse/SPARK-32083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> [https://github.com/apache/spark/pull/28226] meant to avoid launching 
> unnecessary tasks for 0-size partitions when AQE is enabled. However, when 
> all partitions are empty, the number of partitions will be 
> `spark.sql.adaptive.coalescePartitions.initialPartitionNum` and (a lot of) 
> unnecessary tasks are launched in this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32137) AttributeError: Can only use .dt accessor with datetimelike values

2020-06-30 Thread David Lacalle Castillo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lacalle Castillo updated SPARK-32137:
---
Priority: Critical  (was: Major)

> AttributeError: Can only use .dt accessor with datetimelike values
> --
>
> Key: SPARK-32137
> URL: https://issues.apache.org/jira/browse/SPARK-32137
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.5
>Reporter: David Lacalle Castillo
>Priority: Critical
>
> I was using a pandas UDF with a dataframe containing a date object. I was 
> using the latest version of pyarrow, 0.17.0.
> I set this variable on the Zeppelin Spark interpreter:
> ARROW_PRE_0_15_IPC_FORMAT=1
>  
> However, I was getting the following error:
> Job aborted due to stage failure: Task 0 in stage 19.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 19.0 (TID 1619, 10.20.0.5, executor 
> 1): org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
>  process()
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in 
> process
>  serializer.dump_stream(func(split_index, iterator), outfile)
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 290, 
> in dump_stream
>  for series in iterator:
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 311, 
> in load_stream
>  yield [self.arrow_to_pandas(c) for c in 
> pa.Table.from_batches([batch]).itercolumns()]
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 311, 
> in 
>  yield [self.arrow_to_pandas(c) for c in 
> pa.Table.from_batches([batch]).itercolumns()]
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 278, 
> in arrow_to_pandas
>  s = _check_series_convert_date(s, from_arrow_type(arrow_column.type))
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1692, in 
> _check_series_convert_date
>  return series.dt.date
>  File "/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py", line 
> 5270, in getattr
>  return object.getattribute(self, name)
>  File "/usr/local/lib/python3.7/dist-packages/pandas/core/accessor.py", line 
> 187, in get
>  accessor_obj = self._accessor(obj)
>  File 
> "/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/accessors.py", 
> line 338, in new
>  raise AttributeError("Can only use .dt accessor with datetimelike values")
> AttributeError: Can only use .dt accessor with datetimelike values
> at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
>  at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
>  at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
>  at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Task.scala:123)
>  at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>  at 
> 

[jira] [Created] (SPARK-32137) AttributeError: Can only use .dt accessor with datetimelike values

2020-06-30 Thread David Lacalle Castillo (Jira)
David Lacalle Castillo created SPARK-32137:
--

 Summary: AttributeError: Can only use .dt accessor with 
datetimelike values
 Key: SPARK-32137
 URL: https://issues.apache.org/jira/browse/SPARK-32137
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.4.5
Reporter: David Lacalle Castillo


I was using a pandas UDF with a dataframe containing a date object. I was using 
the latest version of pyarrow, 0.17.0.

I set this variable on the Zeppelin Spark interpreter:

ARROW_PRE_0_15_IPC_FORMAT=1

 

However, I was getting the following error:

Job aborted due to stage failure: Task 0 in stage 19.0 failed 4 times, most 
recent failure: Lost task 0.3 in stage 19.0 (TID 1619, 10.20.0.5, executor 1): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
 File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
 process()
 File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in 
process
 serializer.dump_stream(func(split_index, iterator), outfile)
 File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 290, in 
dump_stream
 for series in iterator:
 File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 311, in 
load_stream
 yield [self.arrow_to_pandas(c) for c in 
pa.Table.from_batches([batch]).itercolumns()]
 File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 311, in 

 yield [self.arrow_to_pandas(c) for c in 
pa.Table.from_batches([batch]).itercolumns()]
 File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 278, in 
arrow_to_pandas
 s = _check_series_convert_date(s, from_arrow_type(arrow_column.type))
 File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1692, in 
_check_series_convert_date
 return series.dt.date
 File "/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py", line 
5270, in getattr
 return object.getattribute(self, name)
 File "/usr/local/lib/python3.7/dist-packages/pandas/core/accessor.py", line 
187, in get
 accessor_obj = self._accessor(obj)
 File 
"/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/accessors.py", line 
338, in new
 raise AttributeError("Can only use .dt accessor with datetimelike values")
AttributeError: Can only use .dt accessor with datetimelike values

at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
 at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
 at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
 at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
 at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
 Source)
 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:123)
 at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
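
For context, a hedged reproduction sketch (made-up column name; Spark 2.4 style 
scalar pandas UDF) of the kind of code that exercises the series.dt.date 
conversion shown in the traceback:

{code:python}
import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(datetime.date(2020, 6, 30),)], ["event_date"])

@pandas_udf("date", PandasUDFType.SCALAR)
def identity_date(s):
    # s arrives as a pandas Series; returning it unchanged still goes through
    # the Arrow date conversion on the way back to the JVM.
    return s

df.select(identity_date("event_date")).show()
{code}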

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148570#comment-17148570
 ] 

Jungtaek Lim commented on SPARK-32130:
--

Looks like we already saw the difference, but we missed considering the 
performance regression.

Please look at the table in PR description of 
https://github.com/apache/spark/pull/23653.

- inferTimestamp=default (true) & prefersDecimal=default (false) took 6.1 
minutes
- inferTimestamp=false & prefersDecimal=default (false) took 49 seconds

The difference simply came from inferTimestamp and it was noticeable.

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: SPARK 32130 - replication and findings.ipynb, 
> small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following is the performance numbers when compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching a sample data (please delete is once you are able to 
> reproduce the issue) that is much smaller than the actual size but the 
> performance comparison can still be verified.
> The sample tar contains bunch of json.gz files, each line of the file is self 
> contained json doc as shown below
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>  Arris FastPath Speed 
> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>  HNC IGD 
> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
>  Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
> 

[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148559#comment-17148559
 ] 

Jungtaek Lim commented on SPARK-32130:
--

So anyone can reproduce this by running spark-shell on Spark 3.0.0, running the 
first query, shutting down, re-running spark-shell, and running the second query.

{code}
spark.time(spark.read.option("inferTimestamp", "false").json("datadir").count())
{code}

{code}
spark.time(spark.read.option("inferTimestamp", "true").json("datadir").count())
{code}

{quote}
it might be of use to consult the entire community by writing to the user group 
if we are changing the default behaviour. Because it might affect a lot of 
production code.
{quote}

The default behavior was changed in Spark 3.0.0 - I agree it's a bad user 
experience to flip it back and forth, but if the option silently costs 
performance, IMHO it should ideally be turned off by default so that end 
users turn it on by their own intention.
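
A minimal sketch (spark-shell Scala, with "datadir" as a placeholder path) of setting the 
option explicitly so application code does not depend on which release default is in effect:

{code:scala}
// Hedged sketch: pin inferTimestamp explicitly instead of relying on the release default.
// "datadir" is a placeholder path, not a dataset from this ticket.
val withoutTimestampInference = spark.read
  .option("inferTimestamp", "false")   // skip timestamp detection during schema inference
  .json("datadir")

val withTimestampInference = spark.read
  .option("inferTimestamp", "true")    // opt in only when timestamp columns are needed
  .json("datadir")
{code}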

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: SPARK 32130 - replication and findings.ipynb, 
> small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following are the performance numbers when compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching sample data (please delete it once you are able to 
> reproduce the issue) that is much smaller than the actual size but the 
> performance comparison can still be verified.
> The sample tar contains a bunch of json.gz files, each line of the file is a 
> self-contained json doc as shown below
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>  Arris FastPath Speed 
> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>  HNC IGD 
> 

[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148546#comment-17148546
 ] 

Jungtaek Lim commented on SPARK-32130:
--

For me it's reproduced consistently. Please make sure you run the query in 
spark-shell with the default settings of Spark 2.4.5 vs Spark 3.0.0. To remove any 
variance, don't add the "multiLine" option.

{code}
spark.time(spark.read.json("datadir").count())
{code}

Spark 2.4.5: 3215ms
Spark 3.0.0: 13317ms

Adding `.option("inferTimestamp", "false")` before `.json` brings the elapsed time 
closer - 4716 ms - but there is still a difference.
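
One hedged way to take schema inference out of the comparison entirely is to pass an 
explicit schema up front; the fields below are placeholders taken from the description 
("created: bigint, id: string ..."), not the full real schema:

{code:scala}
import org.apache.spark.sql.types._

// Hedged sketch: with an explicit (and here deliberately partial) schema no inference
// pass runs, so any remaining gap should reflect parsing alone.
val schema = new StructType()
  .add("id", StringType)
  .add("created", LongType)

spark.time(spark.read.schema(schema).json("datadir").count())
{code}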

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: SPARK 32130 - replication and findings.ipynb, 
> small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following are the performance numbers when compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching sample data (please delete it once you are able to 
> reproduce the issue) that is much smaller than the actual size but the 
> performance comparison can still be verified.
> The sample tar contains a bunch of json.gz files, each line of the file is a 
> self-contained json doc as shown below
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>  Arris FastPath Speed 
> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>  HNC IGD 
> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
>  Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
> 

[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148501#comment-17148501
 ] 

JinxinTang commented on SPARK-32130:


[~gourav.sengupta] Nice notebook, it seems the row count() is 33447 instead of 11. 
If the record count were 11 rather than 33447, a ~3s difference could be explained 
by JVM warm-up.
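
A hedged sketch of a warm-up-aware measurement: run the same read twice in one session 
and compare only the second timing, so JVM and class-loading effects do not dominate a 
small dataset:

{code:scala}
// Hedged sketch: discard the first timing as warm-up; "datadir" is a placeholder path.
def readAndCount(): Long =
  spark.read.option("inferTimestamp", "false").json("datadir").count()

spark.time(readAndCount())   // warm-up run, ignore this timing
spark.time(readAndCount())   // steady-state timing to compare across versions
{code}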

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: SPARK 32130 - replication and findings.ipynb, 
> small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following are the performance numbers when compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching sample data (please delete it once you are able to 
> reproduce the issue) that is much smaller than the actual size but the 
> performance comparison can still be verified.
> The sample tar contains a bunch of json.gz files, each line of the file is a 
> self-contained json doc as shown below
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>  Arris FastPath Speed 
> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>  HNC IGD 
> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
>  Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
> 

[jira] [Comment Edited] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Gourav (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148459#comment-17148459
 ] 

Gourav edited comment on SPARK-32130 at 6/30/20, 9:02 AM:
--

[~JinxinTang] and [~lotus2you]

I think it is only the first time that SPARK 3.x takes longer; in subsequent runs 
it is faster, as the schema would have already been determined. Setting 
inferTimestamp to False does not improve the performance either. 
[^SPARK 32130 - replication and findings.ipynb]

A notebook with details and full reproducibility is attached.

[~kabhwan] it might be of use to consult the entire community by writing to the 
user group if we are changing the default behaviour, because it might affect a 
lot of production code.


was (Author: gourav.sengupta):
[~JinxinTang] and [~lotus2you]

I think it is only the first time that SPARK 3.x takes longer; in subsequent runs 
it is faster, as the schema would have already been determined. Setting 
inferTimestamp to False does not improve the performance either. 
[^SPARK 32130 - replication and findings.ipynb]

A notebook with details is attached.

[~kabhwan] it might be of use to consult the entire community by writing to the 
user group if we are changing the default behaviour, because it might affect a 
lot of production code.

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: SPARK 32130 - replication and findings.ipynb, 
> small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following are the performance numbers when compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching sample data (please delete it once you are able to 
> reproduce the issue) that is much smaller than the actual size but the 
> performance comparison can still be verified.
> The sample tar contains a bunch of json.gz files, each line of the file is a 
> self-contained json doc as shown below
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>  Arris FastPath Speed 
> 

[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Gourav (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148459#comment-17148459
 ] 

Gourav commented on SPARK-32130:


[~JinxinTang] and [~lotus2you]

I think it is only the first time that SPARK 3.x takes longer; in subsequent runs 
it is faster, as the schema would have already been determined. Setting 
inferTimestamp to False does not improve the performance either. 
[^SPARK 32130 - replication and findings.ipynb]

A notebook with details is attached.

[~kabhwan] it might be of use to consult the entire community by writing to the 
user group if we are changing the default behaviour, because it might affect a 
lot of production code.
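
If the speed-up on later runs comes from the schema already being known, a hedged way to 
get the same effect deliberately is to infer once and reuse the captured schema 
("datadir" is a placeholder path):

{code:scala}
// Hedged sketch: pay the inference cost once, then reuse the schema for later reads.
val inferredSchema = spark.read.option("inferTimestamp", "false").json("datadir").schema
val df = spark.read.schema(inferredSchema).json("datadir")   // no inference pass here
spark.time(df.count())
{code}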

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: SPARK 32130 - replication and findings.ipynb, 
> small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following are the performance numbers when compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching sample data (please delete it once you are able to 
> reproduce the issue) that is much smaller than the actual size but the 
> performance comparison can still be verified.
> The sample tar contains a bunch of json.gz files, each line of the file is a 
> self-contained json doc as shown below
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>  Arris FastPath Speed 
> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>  HNC IGD 
> 

[jira] [Updated] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Gourav (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gourav updated SPARK-32130:
---
Attachment: SPARK 32130 - replication and findings.ipynb

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: SPARK 32130 - replication and findings.ipynb, 
> small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following are the performance numbers when compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching sample data (please delete it once you are able to 
> reproduce the issue) that is much smaller than the actual size but the 
> performance comparison can still be verified.
> The sample tar contains a bunch of json.gz files, each line of the file is a 
> self-contained json doc as shown below
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>  Arris FastPath Speed 
> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>  HNC IGD 
> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
>  Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
> 

[jira] [Updated] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Gourav (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gourav updated SPARK-32130:
---
Attachment: (was: SPARK 32130 - replication and findings.ipynb)

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following are the performance numbers when compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching sample data (please delete it once you are able to 
> reproduce the issue) that is much smaller than the actual size but the 
> performance comparison can still be verified.
> The sample tar contains a bunch of json.gz files, each line of the file is a 
> self-contained json doc as shown below
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>  Arris FastPath Speed 
> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>  HNC IGD 
> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
>  Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
> 

[jira] [Updated] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Gourav (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gourav updated SPARK-32130:
---
Attachment: SPARK 32130 - replication and findings.ipynb

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: SPARK 32130 - replication and findings.ipynb, 
> small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following are the performance numbers when compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching sample data (please delete it once you are able to 
> reproduce the issue) that is much smaller than the actual size but the 
> performance comparison can still be verified.
> The sample tar contains a bunch of json.gz files, each line of the file is a 
> self-contained json doc as shown below
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>  Arris FastPath Speed 
> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>  HNC IGD 
> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
>  Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
> 

[jira] [Updated] (SPARK-32136) Spark producing incorrect groupBy results when key is a struct with nullable properties

2020-06-30 Thread Jason Moore (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Moore updated SPARK-32136:

Description: 
I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm 
noticing a behaviour change in a particular aggregation we're doing, and I 
think I've tracked it down to how Spark is now treating nullable properties 
within the column being grouped by.
 
Here's a simple test I've been able to set up to repro it:
 

{code:scala}
case class B(c: Option[Double])
case class A(b: Option[B])
val df = Seq(
  A(None),
  A(Some(B(None))),
  A(Some(B(Some(1.0))))
).toDF
val res = df.groupBy("b").agg(count("*"))
{code}

Spark 2.4.6 has the expected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       1|
| null|       1|
|[1.0]|       1|
+-----+--------+

> res.collect.foreach(println)
[[null],1]
[null,1]
[[1.0],1]
{noformat}

But Spark 3.0.0 has an unexpected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       2|
|[1.0]|       1|
+-----+--------+

> res.collect.foreach(println)
[[null],2]
[[1.0],1]
{noformat}

Notice how it has keyed one of the values in `b` as `[null]`; that is, an 
instance of B with a null value for the `c` property instead of a null for the 
overall value itself.

Is this an intended change?

  was:
I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm 
noticing a behaviour change in a particular aggregation we're doing, and I 
think I've tracked it down to how Spark is now treating nullable properties 
within the column being grouped by.
 
Here's a simple test I've been able to set up to repro it:
 

{code:scala}
case class B(c: Option[Double])
case class A(b: Option[B])
val df = Seq(
  A(None),
  A(Some(B(None))),
  A(Some(B(Some(1.0))))
).toDF
val res = df.groupBy("b").agg(count("*"))
{code}

Spark 2.4.6 has the expected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       1|
| null|       1|
|[1.0]|       1|
+-----+--------+
> res.collect.foreach(println)
[[null],1]
[null,1]
[[1.0],1]
{noformat}

But Spark 3.0.0 has an unexpected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       2|
|[1.0]|       1|
+-----+--------+

> res.collect.foreach(println)
[[null],2]
[[1.0],1]
{noformat}

Notice how it has keyed one of the values in `b` as `[null]`; that is, an 
instance of B with a null value for the `c` property instead of a null for the 
overall value itself.

Is this an intended change?


> Spark producing incorrect groupBy results when key is a struct with nullable 
> properties
> ---
>
> Key: SPARK-32136
> URL: https://issues.apache.org/jira/browse/SPARK-32136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Moore
>Priority: Major
>
> I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm 
> noticing a behaviour change in a particular aggregation we're doing, and I 
> think I've tracked it down to how Spark is now treating nullable properties 
> within the column being grouped by.
>  
> Here's a simple test I've been able to set up to repro it:
>  
> {code:scala}
> case class B(c: Option[Double])
> case class A(b: Option[B])
> val df = Seq(
>   A(None),
>   A(Some(B(None))),
>   A(Some(B(Some(1.0))))
> ).toDF
> val res = df.groupBy("b").agg(count("*"))
> {code}
> Spark 2.4.6 has the expected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       1|
> | null|       1|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],1]
> [null,1]
> [[1.0],1]
> {noformat}
> But Spark 3.0.0 has an unexpected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       2|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],2]
> [[1.0],1]
> {noformat}
> Notice how it has keyed one of the values in `b` as `[null]`; that is, an 
> instance of B with a null value for the `c` property instead of a null for 
> the overall value itself.
> Is this an intended change?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32136) Spark producing incorrect groupBy results when key is a struct with nullable properties

2020-06-30 Thread Jason Moore (Jira)
Jason Moore created SPARK-32136:
---

 Summary: Spark producing incorrect groupBy results when key is a 
struct with nullable properties
 Key: SPARK-32136
 URL: https://issues.apache.org/jira/browse/SPARK-32136
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Jason Moore


I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm 
noticing a behaviour change in a particular aggregation we're doing, and I 
think I've tracked it down to how Spark is now treating nullable properties 
within the column being grouped by.
 
Here's a simple test I've been able to set up to repro it:
 

{code:scala}
case class B(c: Option[Double])
case class A(b: Option[B])
val df = Seq(
  A(None),
  A(Some(B(None))),
  A(Some(B(Some(1.0))))
).toDF
val res = df.groupBy("b").agg(count("*"))
{code}

Spark 2.4.6 has the expected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       1|
| null|       1|
|[1.0]|       1|
+-----+--------+
> res.collect.foreach(println)
[[null],1]
[null,1]
[[1.0],1]
{noformat}

But Spark 3.0.0 has an unexpected result:

{noformat}
> res.show
+-----+--------+
|    b|count(1)|
+-----+--------+
|   []|       2|
|[1.0]|       1|
+-----+--------+

> res.collect.foreach(println)
[[null],2]
[[1.0],1]
{noformat}

Notice how it has keyed one of the values in `b` as `[null]`; that is, an 
instance of B with a null value for the `c` property instead of a null for the 
overall value itself.

Is this an intended change?
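
While the behaviour is being clarified, a hedged workaround sketch (assuming the `df` from 
the repro above) is to group on an explicit null flag for the struct together with its 
inner field, so a missing struct and a struct holding a null field stay distinct:

{code:scala}
import org.apache.spark.sql.functions._

// Hedged workaround sketch, not the project's fix: make the "struct is null" case explicit.
val grouped = df
  .groupBy(col("b").isNull.as("b_is_null"), col("b.c").as("c"))
  .agg(count("*"))
grouped.show()
{code}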



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


