[jira] [Commented] (SPARK-31480) Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-28 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095093#comment-17095093
 ] 

Xiao Li commented on SPARK-31480:
-

let me assign it to [~dkbiswal] 

> Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
> ---
>
> Key: SPARK-31480
> URL: https://issues.apache.org/jira/browse/SPARK-31480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> Below is the EXPLAIN output when using *DSV2*.
> *Output of EXPLAIN EXTENDED* 
> {code:java}
> +- BatchScan[col.dots#39L] JsonScan DataFilters: [isnotnull(col.dots#39L), 
> (col.dots#39L = 500)], Location: 
> InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-7dad6f63-dc...,
>  PartitionFilters: [], ReadSchema: struct
> {code}
> *Output of EXPLAIN FORMATTED* 
> {code:java}
>  (1) BatchScan
> Output [1]: [col.dots#39L]
> Arguments: [col.dots#39L], 
> JsonScan(org.apache.spark.sql.test.TestSparkSession@45eab322,org.apache.spark.sql.execution.datasources.InMemoryFileIndex@72065f16,StructType(StructField(col.dots,LongType,true)),StructType(StructField(col.dots,LongType,true)),StructType(),org.apache.spark.sql.util.CaseInsensitiveStringMap@8822c5e0,Vector(),List(isnotnull(col.dots#39L),
>  (col.dots#39L = 500)))
> {code}
> When using *DSV1*, the output is much cleaner than the output of DSV2, 
> especially for EXPLAIN FORMATTED.
> *Output of EXPLAIN EXTENDED* 
> {code:java}
> +- FileScan json [col.dots#37L] Batched: false, DataFilters: 
> [isnotnull(col.dots#37L), (col.dots#37L = 500)], Format: JSON, Location: 
> InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-59...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(`col.dots`), 
> EqualTo(`col.dots`,500)], ReadSchema: struct 
> {code}
> *Output of EXPLAIN FORMATTED* 
> {code:java}
>  (1) Scan json 
> Output [1]: [col.dots#37L]
> Batched: false
> Location: InMemoryFileIndex 
> [file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-5971-4a96-bf10-0730873f6ce0]
> PushedFilters: [IsNotNull(`col.dots`), EqualTo(`col.dots`,500)]
> ReadSchema: struct{code}
>  
>  
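For reference, a minimal spark-shell sketch (the input path and filter column are placeholders) of how the two kinds of output compared above are produced; whether a JSON read goes through the DSV2 BatchScan or the DSV1 FileScan depends on the spark.sql.sources.useV1SourceList setting.

{code:scala}
// Produce the plan text quoted in this ticket (placeholder path and column).
val df = spark.read.json("/tmp/json-input").filter($"id" === 500)

df.explain("extended")    // the single-line FileScan / BatchScan description
df.explain("formatted")   // the per-node breakdown discussed above
{code}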



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31567) Update AppVeyor Rtools to 4.0.0

2020-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31567:


Assignee: Dongjoon Hyun

> Update AppVeyor Rtools to 4.0.0
> ---
>
> Key: SPARK-31567
> URL: https://issues.apache.org/jira/browse/SPARK-31567
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> This is a preparation for the upgrade to R.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31567) Update AppVeyor Rtools to 4.0.0

2020-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31567.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28358
[https://github.com/apache/spark/pull/28358]

> Update AppVeyor Rtools to 4.0.0
> ---
>
> Key: SPARK-31567
> URL: https://issues.apache.org/jira/browse/SPARK-31567
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> This is a preparation for the upgrade to R.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31567) Update AppVeyor Rtools to 4.0.0

2020-04-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31567:
--
Summary: Update AppVeyor Rtools to 4.0.0  (was: Update AppVeyor R version 
to 4.0.0)

> Update AppVeyor Rtools to 4.0.0
> ---
>
> Key: SPARK-31567
> URL: https://issues.apache.org/jira/browse/SPARK-31567
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31567) Update AppVeyor Rtools to 4.0.0

2020-04-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31567:
--
Description: This is a preparation for the upgrade to R.

> Update AppVeyor Rtools to 4.0.0
> ---
>
> Key: SPARK-31567
> URL: https://issues.apache.org/jira/browse/SPARK-31567
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This is a preparation for the upgrade to R.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-31591) namePrefix could be null in Utils.createDirectory

2020-04-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-31591.
-

> namePrefix could be null in Utils.createDirectory
> -
>
> Key: SPARK-31591
> URL: https://issues.apache.org/jira/browse/SPARK-31591
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Minor
>
> In our production, we find that many shuffle files could be located in
> /hadoop/2/yarn/local/usercache/b_carmel/appcache/application_1586487864336_4602/*null*-107d4e9c-d3c7-419e-9743-a21dc4eaeb3f/3a
> The Utils.createDirectory() method uses a default parameter value of "spark":
> {code}
>   def createDirectory(root: String, namePrefix: String = "spark"): File = {
> {code}
> But in some cases, the actual namePrefix is null. If the method is called 
> with null, then the default value would not be applied.
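For illustration, a standalone sketch (not Spark's actual Utils) of why an explicitly passed null bypasses the Scala default argument:

{code:scala}
// Defaults only apply when the argument is omitted at the call site,
// so an explicit null overrides namePrefix.
def createDirectory(root: String, namePrefix: String = "spark"): String =
  s"$root/$namePrefix-<uuid>"

createDirectory("/tmp")        // "/tmp/spark-<uuid>"  -- default applied
createDirectory("/tmp", null)  // "/tmp/null-<uuid>"   -- default NOT applied
{code}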



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31182) PairRDD support aggregateByKeyWithinPartitions

2020-04-28 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-31182.
--
Resolution: Not A Problem

> PairRDD support aggregateByKeyWithinPartitions
> --
>
> Key: SPARK-31182
> URL: https://issues.apache.org/jira/browse/SPARK-31182
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> When implementing `RobustScaler`, I was looking for a way to guarantee that 
> the {{QuantileSummaries}} in {{aggregateByKey}} are compressed before 
> network communication.
> I only found a tricky workaround (which was not applied in the end); there 
> was no direct method for this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31600) Error message from DataFrame creation is misleading.

2020-04-28 Thread Olexiy Oryeshko (Jira)
Olexiy Oryeshko created SPARK-31600:
---

 Summary: Error message from DataFrame creation is misleading.
 Key: SPARK-31600
 URL: https://issues.apache.org/jira/browse/SPARK-31600
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.5
 Environment: DataBricks 6.4, Spark 2.4.5, Scala 2.11
Reporter: Olexiy Oryeshko


*Description:*

DataFrame creation from pandas.DataFrame fails when one of the features 
contains only NaN values (which is ok).

However, the error message mentions the wrong feature as the culprit, which makes it 
hard to find the root cause.

*How to reproduce:*

 
{code:java}
import numpy as np
import pandas as pd
df2 = pd.DataFrame({'a': np.array([np.nan, np.nan], dtype=np.object_), 'b': 
[np.nan, 'aaa']})
display(spark.createDataFrame(df2[['b']]))   # Works fine
spark.createDataFrame(df2)            # Raises TypeError.
{code}
In the code above, column 'a' is bad. However, the `TypeError` raised in the 
last command mentions feature 'b' as the culprit:

TypeError: field b: Can not merge type  
and 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30261) Should not change owner of hive table for some commands like 'alter' operation

2020-04-28 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-30261.
--
Target Version/s: 2.4.3, 2.3.0  (was: 2.3.0, 2.4.3)
  Resolution: Duplicate

> Should not change owner of hive table  for  some commands like 'alter' 
> operation
> 
>
> Key: SPARK-30261
> URL: https://issues.apache.org/jira/browse/SPARK-30261
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 2.4.3
>Reporter: chenliang
>Priority: Critical
>
> For Spark SQL, when we do some alter operations on a hive table, the owner of 
> the hive table is changed to whoever invoked the operation, which is 
> unreasonable. In fact, the owner should not change in a real 
> production environment, otherwise the authorization check is broken.
> The problem can be reproduced as described below:
> 1. First I create a table as user 'xie' and then run {{desc formatted table}}; 
> the owner shown is 'xie'.
> {code:java}
> spark-sql> desc formatted bigdata_test.tt1; 
> col_name data_type comment c int NULL 
> # Detailed Table Information 
> Database bigdata_test Table tt1 
> Owner xie 
> Created Time Wed Sep 11 11:30:49 CST 2019 
> Last Access Thu Jan 01 08:00:00 CST 1970 
> Created By Spark 2.2 or prior 
> Type MANAGED 
> Provider hive 
> Table Properties [PART_LIMIT=1, transient_lastDdlTime=1568172649, 
> LEVEL=1, TTL=60] 
> Location hdfs://NS1/user/hive_admin/warehouse/bigdata_test.db/tt1 
> Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde 
> InputFormat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat 
> OutputFormat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat 
> Storage Properties [serialization.format=1] 
> Partition Provider Catalog Time taken: 0.371 seconds, Fetched 18 row(s)
> {code}
>  2. Then, as another user 'johnchen', I execute {{alter table 
> bigdata_test.tt1 set location 
> 'hdfs://NS1/user/hive_admin/warehouse/bigdata_test.db/tt1'}}; afterwards the owner 
> of the hive table is 'johnchen', which is unreasonable.
> {code:java}
> spark-sql> desc formatted bigdata_test.tt1; 
> col_name data_type comment c int NULL 
> # Detailed Table Information 
> Database bigdata_test 
> Table tt1 
> Owner johnchen 
> Created Time Wed Sep 11 11:30:49 CST 2019 
> Last Access Thu Jan 01 08:00:00 CST 1970 
> Created By Spark 2.2 or prior 
> Type MANAGED 
> Provider hive 
> Table Properties [transient_lastDdlTime=1568871017, PART_LIMIT=1, 
> LEVEL=1, TTL=60] 
> Location hdfs://NS1/user/hive_admin/warehouse/bigdata_test.db/tt1 
> Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde 
> InputFormat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat 
> OutputFormat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat 
> Storage Properties [serialization.format=1] 
> Partition Provider Catalog 
> Time taken: 0.041 seconds, Fetched 18 row(s){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31591) namePrefix could be null in Utils.createDirectory

2020-04-28 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31591.
--
Resolution: Not A Problem

See: https://github.com/apache/spark/pull/28385#issuecomment-620941771

> namePrefix could be null in Utils.createDirectory
> -
>
> Key: SPARK-31591
> URL: https://issues.apache.org/jira/browse/SPARK-31591
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Minor
>
> In our production, we find that many shuffle files could be located in
> /hadoop/2/yarn/local/usercache/b_carmel/appcache/application_1586487864336_4602/*null*-107d4e9c-d3c7-419e-9743-a21dc4eaeb3f/3a
> The Utils.createDirectory() method uses a default parameter value of "spark":
> {code}
>   def createDirectory(root: String, namePrefix: String = "spark"): File = {
> {code}
> But in some cases, the actual namePrefix is null. If the method is called 
> with null, then the default value would not be applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31595) Spark sql cli should allow unescaped quote mark in quoted string

2020-04-28 Thread Adrian Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094980#comment-17094980
 ] 

Adrian Wang commented on SPARK-31595:
-

[~Ankitraj] Thanks, I have already created a pull request on this.

> Spark sql cli should allow unescaped quote mark in quoted string
> 
>
> Key: SPARK-31595
> URL: https://issues.apache.org/jira/browse/SPARK-31595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Adrian Wang
>Priority: Major
>
> spark-sql> select "'";
> spark-sql> select '"';
> In the Spark parser, if we pass the text `select "'";`, there will be a 
> ParserCancellationException, which is then handled by PredictionMode.LL. By 
> dropping the `;` correctly we can avoid that retry.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31584) NullPointerException when parsing event log with InMemoryStore

2020-04-28 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-31584.

Target Version/s: 3.0.0, 3.0.1, 3.1.0  (was: 3.0.1)
Assignee: Baohe Zhang
  Resolution: Fixed

The issue is resolved in https://github.com/apache/spark/pull/28378

> NullPointerException when parsing event log with InMemoryStore
> --
>
> Key: SPARK-31584
> URL: https://issues.apache.org/jira/browse/SPARK-31584
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Baohe Zhang
>Assignee: Baohe Zhang
>Priority: Minor
> Fix For: 3.0.1
>
> Attachments: errorstack.txt
>
>
> I compiled the current branch-3.0 source and tested it on macOS. A 
> java.lang.NullPointerException is thrown when the conditions below are met: 
>  # Using InMemoryStore as the kvstore when parsing the event log file (e.g., when 
> spark.history.store.path is unset). 
>  # At least one stage in the event log has a task count greater than 
> spark.ui.retainedTasks (100,000 by default). In this case, the kvstore needs to 
> delete the extra task records.
>  # The job has more than one stage, so parentToChildrenMap in 
> InMemoryStore.java will have more than one key.
> The java.lang.NullPointerException is thrown at InMemoryStore.java:296, in 
> the method deleteParentIndex().
> {code:java}
> private void deleteParentIndex(Object key) {
>   if (hasNaturalParentIndex) {
>     for (NaturalKeys v : parentToChildrenMap.values()) {
>       if (v.remove(asKey(key))) {
>         // `v` can be empty after removing the natural key and we can remove it from
>         // `parentToChildrenMap`. However, `parentToChildrenMap` is a ConcurrentMap and such
>         // checking and deleting can be slow.
>         // This method is to delete one object with certain key, let's make it simple here.
>         break;
>       }
>     }
>   }
> }{code}
> In "if (v.remove(asKey(key)))", if the key is not contained in v, 
> "v.remove(asKey(key))" returns null, and Java throws a 
> NullPointerException when that null Boolean is unboxed in the if condition.
> An exception stack trace is attached.
> This issue can be fixed by updating if statement to
> {code:java}
> if (v.remove(asKey(key)) != null){code}
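For readers unfamiliar with the unboxing behaviour, a minimal sketch (a plain ConcurrentHashMap standing in for Spark's NaturalKeys) of why the unguarded condition throws:

{code:scala}
import java.util.concurrent.ConcurrentHashMap

val v = new ConcurrentHashMap[String, java.lang.Boolean]()

// remove() returns null when the key is absent; using the result directly as a
// condition unboxes the null Boolean and throws a NullPointerException.
// if (v.remove("missing-key")) println("removed")         // would throw NPE

// The guarded form from the proposed fix is safe:
if (v.remove("missing-key") != null) println("removed")    // no NPE, nothing removed
{code}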



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31556) Document LIKE clause in SQL Reference

2020-04-28 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31556.
--
Fix Version/s: 3.0.0
 Assignee: Huaxin Gao
   Resolution: Fixed

Resolved by https://issues.apache.org/jira/browse/SPARK-31556

> Document LIKE clause in SQL Reference
> -
>
> Key: SPARK-31556
> URL: https://issues.apache.org/jira/browse/SPARK-31556
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> Document LIKE clause in SQL Reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code

2020-04-28 Thread Lorenzo Pisani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094932#comment-17094932
 ] 

Lorenzo Pisani commented on SPARK-26365:


I'm also seeing this behavior specifically with a "cluster" deploy mode. The 
driver pod is failing properly but the pod that executed spark-submit is 
exiting with a status code of 0. This makes it very difficult to monitor the 
job and detect failures.

> spark-submit for k8s cluster doesn't propagate exit code
> 
>
> Key: SPARK-26365
> URL: https://issues.apache.org/jira/browse/SPARK-26365
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Submit
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Oscar Bonilla
>Priority: Minor
>
> When launching apps using spark-submit in a Kubernetes cluster, if the Spark 
> application fails (returns exit code = 1, for example), spark-submit will 
> still exit gracefully and return exit code = 0.
> This is problematic, since there's no way to know if there's been a problem 
> with the Spark application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction

2020-04-28 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094925#comment-17094925
 ] 

Jungtaek Lim commented on SPARK-31599:
--

Oh sorry, I should have pointed you to the user@ mailing list, my bad. Please take some time to 
go through the page http://spark.apache.org/community.html 

> Reading from S3 (Structured Streaming Bucket) Fails after Compaction
> 
>
> Key: SPARK-31599
> URL: https://issues.apache.org/jira/browse/SPARK-31599
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I have an S3 bucket which has data streamed to it (in Parquet format) by the Spark 
> Structured Streaming framework from Kafka. Periodically I try to run 
> compaction on this bucket (a separate Spark job), and on successful 
> compaction delete the non-compacted (parquet) files. After that I get the 
> following error on Spark jobs which read from that bucket:
>  *Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*
> How do we run compaction on Structured Streaming S3 buckets? Also I need 
> to delete the un-compacted files after successful compaction to save space.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction

2020-04-28 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094917#comment-17094917
 ] 

Felix Kizhakkel Jose commented on SPARK-31599:
--

How do I do that? 

> Reading from S3 (Structured Streaming Bucket) Fails after Compaction
> 
>
> Key: SPARK-31599
> URL: https://issues.apache.org/jira/browse/SPARK-31599
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I have an S3 bucket which has data streamed to it (in Parquet format) by the Spark 
> Structured Streaming framework from Kafka. Periodically I try to run 
> compaction on this bucket (a separate Spark job), and on successful 
> compaction delete the non-compacted (parquet) files. After that I get the 
> following error on Spark jobs which read from that bucket:
>  *Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*
> How do we run compaction on Structured Streaming S3 buckets? Also I need 
> to delete the un-compacted files after successful compaction to save space.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction

2020-04-28 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094915#comment-17094915
 ] 

Jungtaek Lim commented on SPARK-31599:
--

Please post to the dev@ mailing list; this looks to be a question 
rather than an actual bug report.

> Reading from S3 (Structured Streaming Bucket) Fails after Compaction
> 
>
> Key: SPARK-31599
> URL: https://issues.apache.org/jira/browse/SPARK-31599
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I have an S3 bucket which has data streamed to it (in Parquet format) by the Spark 
> Structured Streaming framework from Kafka. Periodically I try to run 
> compaction on this bucket (a separate Spark job), and on successful 
> compaction delete the non-compacted (parquet) files. After that I get the 
> following error on Spark jobs which read from that bucket:
>  *Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*
> How do we run compaction on Structured Streaming S3 buckets? Also I need 
> to delete the un-compacted files after successful compaction to save space.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8333) Spark failed to delete temp directory created by HiveContext

2020-04-28 Thread Sunil Kumar Chakrapani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-8333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094893#comment-17094893
 ] 

Sunil Kumar Chakrapani commented on SPARK-8333:
---

Any plans to fix this issue for Spark 2.4.5? The issue still exists on Windows 10:

 

20/04/26 12:39:12 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: C:\Users\\AppData\Local\Temp\2\spark-1583d46e-c31f-444a-91f1-572c0726b6b1
java.io.IOException: Failed to delete: C:\Users\\AppData\Local\Temp\2\spark-1583d46e-c31f-444a-91f1-572c0726b6b1\userFiles-b001454b-80e1-4414-896b-6aee986174e5\test_jar_2.11-0.1.jar
 at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingJavaIO(JavaUtils.java:144)
 at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:118)
 at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingJavaIO(JavaUtils.java:128)
 at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:118)
 at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingJavaIO(JavaUtils.java:128)
 at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:118)
 at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:91)
 at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1062)
 at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:65)
 at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:62)
 at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
 at org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:62)
 at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
 at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
 at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
 at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
 at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
 at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
 at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
 at scala.util.Try$.apply(Try.scala:192)
 at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
 at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
 at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

> Spark failed to delete temp directory created by HiveContext
> 
>
> Key: SPARK-8333
> URL: https://issues.apache.org/jira/browse/SPARK-8333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Windows7 64bit
>Reporter: sheng
>Priority: Minor
>  Labels: Hive, bulk-closed, metastore, sparksql
> Attachments: test.tar
>
>
> Spark 1.4.0 failed to stop SparkContext.
> {code:title=LocalHiveTest.scala|borderStyle=solid}
>  val sc = new SparkContext("local", "local-hive-test", new SparkConf())
>  val hc = Utils.createHiveContext(sc)
>  ... // execute some HiveQL statements
>  sc.stop()
> {code}
> sc.stop() failed to execute, it threw the following exception:
> {quote}
> 15/06/13 03:19:06 INFO Utils: Shutdown hook called
> 15/06/13 03:19:06 INFO Utils: Deleting directory 
> C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
> 15/06/13 03:19:06 ERROR Utils: Exception while deleting Spark temp dir: 
> C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
> java.io.IOException: Failed to delete: 
> C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:963)
>   at 
> org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:204)
>   at 
> org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:201)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at org.apache.spark.util.Utils$$anonfun$1.apply$mcV$sp(Utils.scala:201)
>   at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2292)
>   at 
> 

[jira] [Updated] (SPARK-31549) Pyspark SparkContext.cancelJobGroup do not work correctly

2020-04-28 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-31549:
--
Target Version/s: 3.0.0

> Pyspark SparkContext.cancelJobGroup do not work correctly
> -
>
> Key: SPARK-31549
> URL: https://issues.apache.org/jira/browse/SPARK-31549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Weichen Xu
>Priority: Critical
>
> PySpark SparkContext.cancelJobGroup does not work correctly. This issue has 
> existed for a long time. It happens because the PySpark thread is not pinned to a 
> JVM thread when invoking Java-side methods, which means none of the PySpark APIs 
> that rely on Java thread-local variables work correctly 
> (including `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription` 
> and so on).
> This is a serious issue. There is an experimental PySpark 'PIN_THREAD' mode 
> added in Spark 3.0 which addresses it, but the 'PIN_THREAD' mode has two 
> issues:
> * It is disabled by default; we need to set an additional environment variable 
> to enable it.
> * There is a memory leak issue which hasn't been addressed.
> A number of projects like hyperopt-spark and spark-joblib rely on the 
> `sc.cancelJobGroup` API (they use it to stop running jobs in their code), so it 
> is critical to address this issue, and we hope it works under the default PySpark 
> mode. An optional approach is implementing methods like 
> `rdd.setGroupAndCollect`.
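For context, a sketch of how the job-group API is intended to be used (Scala shown here; the PySpark API mirrors it, and the group id, job, and timing below are purely illustrative):

{code:scala}
val sc = spark.sparkContext

// The job group is carried in thread-local properties, which is exactly what
// the Python side fails to propagate without the PIN_THREAD mode.
val worker = new Thread(() => {
  sc.setJobGroup("demo-group", "long running job", interruptOnCancel = true)
  sc.parallelize(1 to 1000000, 8).map { i => Thread.sleep(1); i }.count()
})
worker.start()

Thread.sleep(2000)
sc.cancelJobGroup("demo-group")   // cancels every job tagged with "demo-group"
{code}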



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31595) Spark sql cli should allow unescaped quote mark in quoted string

2020-04-28 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094830#comment-17094830
 ] 

Ankit Raj Boudh commented on SPARK-31595:
-

[~adrian-wang], can I start working on this issue?

> Spark sql cli should allow unescaped quote mark in quoted string
> 
>
> Key: SPARK-31595
> URL: https://issues.apache.org/jira/browse/SPARK-31595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Adrian Wang
>Priority: Major
>
> spark-sql> select "'";
> spark-sql> select '"';
> In the Spark parser, if we pass the text `select "'";`, there will be a 
> ParserCancellationException, which is then handled by PredictionMode.LL. By 
> dropping the `;` correctly we can avoid that retry.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31591) namePrefix could be null in Utils.createDirectory

2020-04-28 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094829#comment-17094829
 ] 

Ankit Raj Boudh commented on SPARK-31591:
-

[~cltlfcjin], it's ok. Thank you for raising the PR :)

> namePrefix could be null in Utils.createDirectory
> -
>
> Key: SPARK-31591
> URL: https://issues.apache.org/jira/browse/SPARK-31591
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Minor
>
> In our production, we find that many shuffle files could be located in
> /hadoop/2/yarn/local/usercache/b_carmel/appcache/application_1586487864336_4602/*null*-107d4e9c-d3c7-419e-9743-a21dc4eaeb3f/3a
> The Utils.createDirectory() method uses a default parameter value of "spark":
> {code}
>   def createDirectory(root: String, namePrefix: String = "spark"): File = {
> {code}
> But in some cases, the actual namePrefix is null. If the method is called 
> with null, then the default value would not be applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction

2020-04-28 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated SPARK-31599:
-
Description: 
I have a S3 bucket which has data streamed (Parquet format) to it by Spark 
Structured Streaming Framework from Kafka. Periodically I try to run compaction 
on this bucket (a separate Spark Job), and on successful compaction delete the 
non compacted (parquet) files. After which I am getting following error on 
Spark jobs which read from that bucket:
 *Caused by: java.io.FileNotFoundException: No such file or directory: 
s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*

How do we run *_c_ompaction on Structured Streaming S3 bucket_s*. Also I need 
to delete the un-compacted files after successful compaction to save space.

  was:
I have a S3 bucket which has data streamed (Parquet format) to it by Spark 
Structured Streaming Framework from Kafka. Periodically I try to run compaction 
on this bucket (a separate Spark Job), and on successful compaction delete the 
non compacted (parquet) files. After which I am getting following error on 
Spark jobs which read from that bucket:
*Caused by: java.io.FileNotFoundException: No such file or directory: 
s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*

How do we run *_c__ompaction on Structured Streaming S3 bucket_s*. Also I need 
to delete the un-compacted files after successful compaction to save space.


> Reading from S3 (Structured Streaming Bucket) Fails after Compaction
> 
>
> Key: SPARK-31599
> URL: https://issues.apache.org/jira/browse/SPARK-31599
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I have an S3 bucket which has data streamed to it (in Parquet format) by the Spark 
> Structured Streaming framework from Kafka. Periodically I try to run 
> compaction on this bucket (a separate Spark job), and on successful 
> compaction delete the non-compacted (parquet) files. After that I get the 
> following error on Spark jobs which read from that bucket:
>  *Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*
> How do we run compaction on Structured Streaming S3 buckets? Also I need 
> to delete the un-compacted files after successful compaction to save space.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction

2020-04-28 Thread Felix Kizhakkel Jose (Jira)
Felix Kizhakkel Jose created SPARK-31599:


 Summary: Reading from S3 (Structured Streaming Bucket) Fails after 
Compaction
 Key: SPARK-31599
 URL: https://issues.apache.org/jira/browse/SPARK-31599
 Project: Spark
  Issue Type: New Feature
  Components: SQL, Structured Streaming
Affects Versions: 2.4.5
Reporter: Felix Kizhakkel Jose


I have an S3 bucket which has data streamed to it (in Parquet format) by the Spark 
Structured Streaming framework from Kafka. Periodically I try to run compaction 
on this bucket (a separate Spark job), and on successful compaction delete the 
non-compacted (parquet) files. After that I get the following error on 
Spark jobs which read from that bucket:
*Caused by: java.io.FileNotFoundException: No such file or directory: 
s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*

How do we run compaction on Structured Streaming S3 buckets? Also I need 
to delete the un-compacted files after successful compaction to save space.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30741) The data returned from SAS using JDBC reader contains column label

2020-04-28 Thread Gary Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Liu updated SPARK-30741:
-
Attachment: ExamplesFromSASSupport.png

> The data returned from SAS using JDBC reader contains column label
> --
>
> Key: SPARK-30741
> URL: https://issues.apache.org/jira/browse/SPARK-30741
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark
>Affects Versions: 2.1.1, 2.3.4, 2.4.5
>Reporter: Gary Liu
>Priority: Major
> Attachments: ExamplesFromSASSupport.png, ReplyFromSASSupport.png, 
> SparkBug.png
>
>
> When reading SAS data using JDBC with the SAS SHARE driver, the returned data 
> contains column labels rather than data. 
> According to test results from SAS Support, the results are correct when using 
> Java, so they believe the problem is in Spark's reading. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30741) The data returned from SAS using JDBC reader contains column label

2020-04-28 Thread Gary Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Liu updated SPARK-30741:
-
Attachment: ReplyFromSASSupport.png

> The data returned from SAS using JDBC reader contains column label
> --
>
> Key: SPARK-30741
> URL: https://issues.apache.org/jira/browse/SPARK-30741
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark
>Affects Versions: 2.1.1, 2.3.4, 2.4.5
>Reporter: Gary Liu
>Priority: Major
> Attachments: ReplyFromSASSupport.png, SparkBug.png
>
>
> When reading SAS data using JDBC with the SAS SHARE driver, the returned data 
> contains column labels rather than data. 
> According to test results from SAS Support, the results are correct when using 
> Java, so they believe the problem is in Spark's reading. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-30741) The data returned from SAS using JDBC reader contains column label

2020-04-28 Thread Gary Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Liu reopened SPARK-30741:
--

*Problem:* The Spark JDBC reader reads SAS data incorrectly and returns the 
data with the column names as data values.

*Possible Reason:* After discussing this with the SAS Support team, they think the Spark JDBC 
reader is not compliant with the [JDBC 
spec|https://docs.oracle.com/javase/7/docs/api/java/sql/DatabaseMetaData.html#getIdentifierQuoteString()],
 where getIdentifierQuoteString() should be called to get the quote string for SQL 
identifiers used by the source system. This function in the SAS JDBC driver 
returns a blank string. The SAS Support team thinks Spark does not call this 
function but uses the default double quote '"' to generate the query, so the query 
'select var_a from table_a' is passed as 'select "var_a" from table_a', and 
the "var_a" string is populated as data values.

> The data returned from SAS using JDBC reader contains column label
> --
>
> Key: SPARK-30741
> URL: https://issues.apache.org/jira/browse/SPARK-30741
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark
>Affects Versions: 2.1.1, 2.3.4, 2.4.5
>Reporter: Gary Liu
>Priority: Major
> Attachments: SparkBug.png
>
>
> When reading SAS data using JDBC with the SAS SHARE driver, the returned data 
> contains column labels rather than data. 
> According to test results from SAS Support, the results are correct when using 
> Java, so they believe the problem is in Spark's reading. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31339) Changed PipelineModel(...) to self.cls(...) in pyspark.ml.pipeline.PipelineModelReader.load()

2020-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31339.
--
Resolution: Not A Problem

> Changed PipelineModel(...) to self.cls(...) in 
> pyspark.ml.pipeline.PipelineModelReader.load()
> -
>
> Key: SPARK-31339
> URL: https://issues.apache.org/jira/browse/SPARK-31339
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.5
>Reporter: Suraj
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> PR: [https://github.com/apache/spark/pull/28110]
>  * What changes were proposed in this pull request?
>  pyspark.ml.pipeline.py line 245: Change PipelineModel(...) to self.cls(...)
>  * Why are the changes needed?
>  This change fixes the loading of class (which inherits from PipelineModel 
> class) from file.
>  E.g. Current issue:
> {code:java}
> class CustomPipelineModel(PipelineModel):
>     def _transform(self, df):
>         ...
>  CustomPipelineModel.save('path/to/file') # works
>  CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() 
> instead of CustomPipelineModel()
>  CustomPipelineModel.transform() # wrong: results in calling 
> PipelineModel.transform() instead of CustomPipelineModel.transform(){code}
>  * Does this introduce any user-facing change?
>  No.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31165) Multiple wrong references in Dockerfile for k8s

2020-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31165.
--
Resolution: Not A Problem

> Multiple wrong references in Dockerfile for k8s 
> 
>
> Key: SPARK-31165
> URL: https://issues.apache.org/jira/browse/SPARK-31165
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Nikolay Dimolarov
>Priority: Minor
>
> I am currently trying to follow the k8s instructions for Spark: 
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html] and when I 
> clone apache/spark on GitHub on the master branch I saw multiple wrong folder 
> references after trying to build my Docker image:
>  
> *Issue 1: The comments in the Dockerfile reference the wrong folder for the 
> Dockerfile:*
> {code:java}
> # If this docker file is being used in the context of building your images 
> from a Spark # distribution, the docker build command should be invoked from 
> the top level directory # of the Spark distribution. E.g.: # docker build -t 
> spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .{code}
> Well that docker build command simply won't run. I only got the following to 
> run:
> {code:java}
> docker build -t spark:latest -f 
> resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile . 
> {code}
> which is the actual path to the Dockerfile.
>  
> *Issue 2: jars folder does not exist*
> After I read the tutorial I of course build spark first as per the 
> instructions with:
> {code:java}
> ./build/mvn -Pkubernetes -DskipTests clean package{code}
> Nonetheless, in the Dockerfile I get this error when building:
> {code:java}
> Step 5/18 : COPY jars /opt/spark/jars
> COPY failed: stat /var/lib/docker/tmp/docker-builder402673637/jars: no such 
> file or directory{code}
>  for which I may have found a similar issue here: 
> [https://stackoverflow.com/questions/52451538/spark-for-kubernetes-test-on-mac]
> I am new to Spark but I assume that this jars folder - if the build step 
> would actually make it and I ran the maven build of the master branch 
> successfully with the command I mentioned above - would exist in the root 
> folder of the project. Turns out it's here:
> spark/assembly/target/scala-2.12/jars
>  
> *Issue 3: missing entrypoint.sh and decom.sh due to wrong reference*
> While Issue 2 remains unresolved as I can't wrap my head around the missing 
> jars folder (bin and sbin got copied successfully after I made a dummy jars 
> folder) I then got stuck on these 2 steps:
> {code:java}
> COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/ COPY 
> kubernetes/dockerfiles/spark/decom.sh /opt/{code}
>  
>  with:
>   
> {code:java}
> Step 8/18 : COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
> COPY failed: stat 
> /var/lib/docker/tmp/docker-builder638219776/kubernetes/dockerfiles/spark/entrypoint.sh:
>  no such file or directory{code}
>  
>  which makes sense since the path should actually be:
>   
>  resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh
>  resource-managers/kubernetes/docker/src/main/dockerfiles/spark/decom.sh
>  
> *Issue 4: /tests/ has been renamed to /integration-tests/*
> And the location is wrong.
> {code:java}
> COPY kubernetes/tests /opt/spark/tests
> {code}
> has to be changed to:
> {code:java}
> COPY resource-managers/kubernetes/integration-tests /opt/spark/tests{code}
> *Remark*
>   
>  I only created one issue since this seems like somebody cleaned up the repo 
> and forgot to change these. Am I missing something here? If I am, I apologise 
> in advance since I am new to the Spark project. I also saw that some of these 
> references were handled through vars in previous branches: 
> [https://github.com/apache/spark/blob/branch-2.4/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile]
>  (e.g. 2.4) but that also does not run out of the box.
>   
>  I am also really not sure about the affected versions since that was not 
> transparent enough for me on GH - feel free to edit that field :) 
>   
>  Thanks in advance!
>   
>   
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31149) PySpark job not killing Spark Daemon processes after the executor is killed due to OOM

2020-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31149.
--
Resolution: Won't Fix

> PySpark job not killing Spark Daemon processes after the executor is killed 
> due to OOM
> --
>
> Key: SPARK-31149
> URL: https://issues.apache.org/jira/browse/SPARK-31149
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: Arsenii Venherak
>Priority: Major
>
> {code:java}
> 2020-03-10 10:15:00,257 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Memory usage of ProcessTree 327523 for container-id container_e25_1583
> 485217113_0347_01_42: 1.9 GB of 2 GB physical memory used; 39.5 GB of 4.2 
> GB virtual memory used
> 2020-03-10 10:15:05,135 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Memory usage of ProcessTree 327523 for container-id container_e25_1583
> 485217113_0347_01_42: 3.6 GB of 2 GB physical memory used; 41.1 GB of 4.2 
> GB virtual memory used
> 2020-03-10 10:15:05,136 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Process tree for container: container_e25_1583485217113_0347_01_42
>  has processes older than 1 iteration running over the configured limit. 
> Limit=2147483648, current usage = 3915513856
> 2020-03-10 10:15:05,136 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Container [pid=327523,containerID=container_e25_1583485217113_0347_01_
> 42] is running beyond physical memory limits. Current usage: 3.6 GB of 2 
> GB physical memory used; 41.1 GB of 4.2 GB virtual memory used. Killing 
> container.
> Dump of the process-tree for container_e25_1583485217113_0347_01_42 :
> |- 327535 327523 327523 327523 (java) 1611 111 4044427264 172306 
> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/jre/bin/java 
> -server -Xmx1024m -Djava.io.tmpdir=/data/s
> cratch/yarn/usercache/u689299/appcache/application_1583485217113_0347/container_e25_1583485217113_0347_01_42/tmp
>  -Dspark.ssl.trustStore=/opt/mapr/conf/ssl_truststore -Dspark.authenticat
> e.enableSaslEncryption=true -Dspark.driver.port=40653 
> -Dspark.network.timeout=7200 -Dspark.ssl.keyStore=/opt/mapr/conf/ssl_keystore 
> -Dspark.network.sasl.serverAlwaysEncrypt=true -Dspark.ssl
> .enabled=true -Dspark.ssl.protocol=TLSv1.2 -Dspark.ssl.fs.enabled=true 
> -Dspark.ssl.ui.enabled=false -Dspark.authenticate=true 
> -Dspark.yarn.app.container.log.dir=/opt/mapr/hadoop/hadoop-2.7.
> 0/logs/userlogs/application_1583485217113_0347/container_e25_1583485217113_0347_01_42
>  -XX:OnOutOfMemoryError=kill %p 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> spark://coarsegrainedschedu...@bd02slse0201.wellsfargo.com:40653 
> --executor-id 40 --hostname bd02slsc0519.wellsfargo.com --cores 1 --app-id 
> application_1583485217113_0347 --user-class-path
> file:/data/scratch/yarn/usercache/u689299/appcache/application_1583485217113_0347/container_e25_1583485217113_0347_01_42/__app__.jar
> {code}
>  
>  
> After that, there are lots of pyspark.daemon processes left, e.g.:
>  /apps/anaconda3-5.3.0/bin/python -m pyspark.daemon



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30133) Support DELETE Jar and DELETE File functionality in spark

2020-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30133.
--
Resolution: Won't Fix

> Support DELETE Jar and DELETE File functionality in spark
> -
>
> Key: SPARK-30133
> URL: https://issues.apache.org/jira/browse/SPARK-30133
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Sandeep Katta
>Priority: Major
>  Labels: Umbrella
>
> Spark should support a delete jar feature.
> This feature aims at solving the following use case:
> Currently the Spark add jar API supports adding a jar to the executor and driver 
> classpath at runtime, but if that jar changes there is 
> no way for the user to update it on the executor and driver classpath; the user needs 
> to restart the application, which is a costly operation.
> After this JIRA is fixed, the user can use a delete jar API to remove the jar from the driver 
> and executor classpath without restarting the Spark 
> application.
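For context, the existing add-jar entry points that the proposed DELETE JAR would mirror (the jar path is a placeholder):

{code:scala}
// Existing APIs for adding a jar at runtime; there is currently no counterpart
// for removing one.
sc.addJar("/path/to/udfs.jar")            // SparkContext API
spark.sql("ADD JAR /path/to/udfs.jar")    // SQL command (also available in the spark-sql CLI)
{code}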



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30135) Add documentation for DELETE JAR and DELETE File command

2020-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30135.
--
Resolution: Won't Fix

> Add documentation for DELETE JAR and DELETE File command
> 
>
> Key: SPARK-30135
> URL: https://issues.apache.org/jira/browse/SPARK-30135
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30134) DELETE JAR should remove from addedJars list and from classpath

2020-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30134.
--
Resolution: Won't Fix

> DELETE JAR should remove  from addedJars list and from classpath
> 
>
> Key: SPARK-30134
> URL: https://issues.apache.org/jira/browse/SPARK-30134
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30137) Support DELETE file

2020-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30137.
--
Resolution: Won't Fix

> Support DELETE file 
> 
>
> Key: SPARK-30137
> URL: https://issues.apache.org/jira/browse/SPARK-30137
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30136) DELETE JAR should also remove the jar from executor classPath

2020-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30136.
--
Resolution: Won't Fix

> DELETE JAR should also remove the jar from executor classPath
> -
>
> Key: SPARK-30136
> URL: https://issues.apache.org/jira/browse/SPARK-30136
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31598) LegacySimpleTimestampFormatter incorrectly interprets pre-Gregorian timestamps

2020-04-28 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-31598:
-

 Summary: LegacySimpleTimestampFormatter incorrectly interprets 
pre-Gregorian timestamps
 Key: SPARK-31598
 URL: https://issues.apache.org/jira/browse/SPARK-31598
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Bruce Robbins


As per discussion with [~maxgekk]:

{{LegacySimpleTimestampFormatter#parse}} misinterprets pre-Gregorian timestamps:
{noformat}
scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
res0: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> val df1 = Seq("0002-01-01 00:00:00", "1000-01-01 00:00:00", "1800-01-01 
00:00:00").toDF("expected")
df1: org.apache.spark.sql.DataFrame = [expected: string]

scala> val df2 = df1.select('expected, to_timestamp('expected, "-MM-dd 
HH:mm:ss").as("actual"))
df2: org.apache.spark.sql.DataFrame = [expected: string, actual: timestamp]

scala> df2.show(truncate=false)
+-------------------+-------------------+
|expected           |actual             |
+-------------------+-------------------+
|0002-01-01 00:00:00|0001-12-30 00:00:00|
|1000-01-01 00:00:00|1000-01-06 00:00:00|
|1800-01-01 00:00:00|1800-01-01 00:00:00|
+-------------------+-------------------+


scala> 
{noformat}
Legacy timestamp parsing with JSON and CSV files is correct, so apparently 
{{LegacyFastTimestampFormatter}} does not have this issue (need to double 
check).
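
For context, a minimal standalone sketch of the suspected calendar mismatch. It assumes (based on the class name) that the legacy simple formatter is backed by java.text.SimpleDateFormat, which uses the hybrid Julian/Gregorian calendar, whereas Spark 3.0 and java.time use the Proleptic Gregorian calendar; the snippet below is illustrative only, not the Spark code path itself.
{code:scala}
import java.text.SimpleDateFormat
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import java.util.TimeZone

// Hybrid Julian/Gregorian interpretation (what SimpleDateFormat uses for old dates).
val legacy = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
legacy.setTimeZone(TimeZone.getTimeZone("UTC"))
val hybridMillis = legacy.parse("1000-01-01 00:00:00").getTime

// Proleptic Gregorian interpretation (what java.time uses).
val prolepticMillis = LocalDateTime
  .parse("1000-01-01 00:00:00", DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss"))
  .toInstant(ZoneOffset.UTC)
  .toEpochMilli

// Around year 1000 the two calendars disagree by about five days, which lines up
// with the 1000-01-01 -> 1000-01-06 shift in the table above.
println((hybridMillis - prolepticMillis) / 86400000.0)
{code}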



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31592) bufferPoolsBySize in HeapMemoryAllocator should be thread safe

2020-04-28 Thread Yunbo Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094669#comment-17094669
 ] 

Yunbo Fan commented on SPARK-31592:
---

I checked my executor log again and found that the executor hit an NPE first
{code}
java.lang.NullPointerException
at 
org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:58)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:302)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:96)
at 
org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap:800)
...
{code}
 And later got the NoSuchElementException.
{code}
java.util.NoSuchElementException
at java.util.LinkedList.removeFirst(LinkedList.java:270)
at java.util.LinkedList.remove(LinkedList.java:685)
at 
org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:302)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:96)
at 
org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap:800)
...
{code}
But I can't figure out why the NPE occurs here. Maybe a null WeakReference was 
added?
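
For illustration only, a standalone sketch (not the actual HeapMemoryAllocator code) of how an unsynchronized check-then-remove on a shared LinkedList can end in java.util.NoSuchElementException; serializing the pool access with a single lock, or switching to a concurrent deque, removes the race:
{code:scala}
import java.util.LinkedList

// One pooled buffer, two threads racing to take it.
val pool = new LinkedList[Array[Long]]()
pool.add(new Array[Long](16))

val threads = (1 to 2).map { _ =>
  new Thread(() => {
    if (!pool.isEmpty) {   // both threads may observe a non-empty pool...
      Thread.sleep(10)     // widen the race window for the demo
      pool.removeFirst()   // ...but only one removal can succeed; the loser
    }                      //    throws java.util.NoSuchElementException
  })
}
threads.foreach(_.start())
threads.foreach(_.join())
{code}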

> bufferPoolsBySize in HeapMemoryAllocator should be thread safe
> --
>
> Key: SPARK-31592
> URL: https://issues.apache.org/jira/browse/SPARK-31592
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Yunbo Fan
>Priority: Major
>
> Currently, bufferPoolsBySize in HeapMemoryAllocator uses a Map type whose 
> value type is LinkedList.
> LinkedList is not thread safe and may hit the error below
> {code:java}
> java.util.NoSuchElementException
> at java.util.LinkedList.removeFirst(LinkedList.java:270) 
> at java.util.LinkedList.remove(LinkedList.java:685)
> at 
> org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31592) bufferPoolsBySize in HeapMemoryAllocator should be thread safe

2020-04-28 Thread Yunbo Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yunbo Fan updated SPARK-31592:
--
Affects Version/s: (was: 2.4.5)
   2.4.3

> bufferPoolsBySize in HeapMemoryAllocator should be thread safe
> --
>
> Key: SPARK-31592
> URL: https://issues.apache.org/jira/browse/SPARK-31592
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Yunbo Fan
>Priority: Major
>
> Currently, bufferPoolsBySize in HeapMemoryAllocator uses a Map type whose 
> value type is LinkedList.
> LinkedList is not thread safe and may hit the error below
> {code:java}
> java.util.NoSuchElementException
> at java.util.LinkedList.removeFirst(LinkedList.java:270) 
> at java.util.LinkedList.remove(LinkedList.java:685)
> at 
> org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29458) Document scalar functions usage in APIs in SQL getting started.

2020-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29458.
--
Fix Version/s: 3.0.0
 Assignee: Huaxin Gao
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28290

> Document scalar functions usage in APIs in SQL getting started.
> ---
>
> Key: SPARK-29458
> URL: https://issues.apache.org/jira/browse/SPARK-29458
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Dilip Biswal
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31583) grouping_id calculation should be improved

2020-04-28 Thread Costas Piliotis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094651#comment-17094651
 ] 

Costas Piliotis edited comment on SPARK-31583 at 4/28/20, 4:16 PM:
---

[~maropu] I'm trying to avoid referencing SPARK-21858, which already 
addresses the flipped bits.  Specifically, this is about how Spark decides where 
to allocate the grouping_id bits based on the ordinal position in the grouping sets 
rather than the ordinal position in the select clause.   Does that make sense?

So if I have  SELECT a,b,c,d FROM... GROUPING SETS (  (a,b,d), (a,b,c) ) the 
grouping_id bit order would be determined as cdba instead of dcba.   I 
believe if we look at most RDBMSs that implement grouping sets, my only 
suggestion is that it would be more predictable if the bit order in the 
grouping_id were determined by the ordinal position in the select.   

The flipped bits are a separate ticket, and I do believe the implementation 
should predictably match other established RDBMS SQL 
implementations, where 1=included and 0=excluded, but that matter is closed to 
discussion.   


was (Author: cpiliotis):
[~maropu] I'm trying to avoid referencing the SPARK-21858 which already 
addresses the flipped bits.  Specifically this is about how spark decides where 
to allocate the grouping_id based on the ordinal position in the grouping sets 
rather than the ordinal position in the select clause.   Does that make sense?

So if I have  SELECT a,b,c,d FROM... GROUPING SETS (  (a,b,d), (a,b,c) ) the 
grouping_id would be abdc instead of abcd.I believe if we look at most 
RDBMS that has grouping sets identified, my only suggestion is that it would be 
more predictable if the bit order in the grouping_id were determined by the 
ordinal position in the select.   

> grouping_id calculation should be improved
> --
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Costas Piliotis
>Priority: Minor
>
> Unrelated to SPARK-21858 which identifies that grouping_id is determined by 
> exclusion from a grouping_set rather than inclusion, when performing complex 
> grouping_sets that are not in the order of the base select statement, 
> flipping the bit in the grouping_id seems to happen when the grouping set 
> is identified rather than when the columns are selected in the SQL.   I will 
> of course use the exclusion strategy identified in SPARK-21858 as the 
> baseline for this.  
>  
> {code:scala}
> import spark.implicits._
> val df= Seq(
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> expected to have these references in the grouping_id:
>  d=1
>  c=2
>  b=4
>  a=8
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
>  
>  In other words, I would have expected the excluded values one way but I 
> received them excluded in the order they were first seen in the specified 
> grouping sets.
>  a,b,d included = excludes c = 2; expected gid=2, received gid=1
>  a,d included = excludes b=4, c=2 expected gid=6, received gid=5
> The grouping_id that actually is expected is (a,b,d,c) 
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, 
> bin(grouping_id(a,b,d,c)) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
>  columns forming grouping_id seem to be created as the grouping sets are 
> identified rather than by ordinal position in the parent query.
> I'd like to at least point out that grouping_id is documented in many other 
> RDBMSs, and I believe the Spark project should use a policy of flipping the 
> bits so 1=inclusion; 0=exclusion in the grouping set.
> However, many RDBMSs that do have a grouping_id implement it 
> by the ordinal position of the fields recognized in the select clause, rather 
> than allocating them as they are observed in the grouping sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (SPARK-31583) grouping_id calculation should be improved

2020-04-28 Thread Costas Piliotis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094651#comment-17094651
 ] 

Costas Piliotis commented on SPARK-31583:
-

[~maropu] I'm trying to avoid referencing SPARK-21858, which already 
addresses the flipped bits.  Specifically, this is about how Spark decides where 
to allocate the grouping_id bits based on the ordinal position in the grouping sets 
rather than the ordinal position in the select clause.   Does that make sense?

So if I have  SELECT a,b,c,d FROM... GROUPING SETS (  (a,b,d), (a,b,c) ) the 
grouping_id would be abdc instead of abcd.   I believe if we look at most 
RDBMSs that implement grouping sets, my only suggestion is that it would be 
more predictable if the bit order in the grouping_id were determined by the 
ordinal position in the select.   
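
For what it's worth, a small standalone sketch of the two orderings being discussed, using the exclusion-based encoding and the grouping set (a,b,d) from the description (the helper {{gid}} is hypothetical, only meant to make the bit assignment explicit):
{code:scala}
// Exclusion-based grouping_id: walk the bit order from the most significant bit;
// a column contributes 1 when it is NOT part of the grouping set.
def gid(bitOrder: Seq[String], included: Set[String]): Int =
  bitOrder.foldLeft(0)((acc, col) => (acc << 1) | (if (included(col)) 0 else 1))

// Select-clause order a,b,c,d: only c is excluded -> gid = 2 (the expected value).
gid(Seq("a", "b", "c", "d"), Set("a", "b", "d"))   // 2

// First-appearance order a,b,d,c from GROUPING SETS ((), (a,b,d), (a,c), (a,d)):
// c ends up in the least significant bit -> gid = 1 (the value currently returned).
gid(Seq("a", "b", "d", "c"), Set("a", "b", "d"))   // 1
{code}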

> grouping_id calculation should be improved
> --
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Costas Piliotis
>Priority: Minor
>
> Unrelated to SPARK-21858 which identifies that grouping_id is determined by 
> exclusion from a grouping_set rather than inclusion, when performing complex 
> grouping_sets that are not in the order of the base select statement, 
> flipping the bit in the grouping_id seems to happen when the grouping set 
> is identified rather than when the columns are selected in the SQL.   I will 
> of course use the exclusion strategy identified in SPARK-21858 as the 
> baseline for this.  
>  
> {code:scala}
> import spark.implicits._
> val df= Seq(
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> expected to have these references in the grouping_id:
>  d=1
>  c=2
>  b=4
>  a=8
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
>  
>  In other words, I would have expected the excluded values one way but I 
> received them excluded in the order they were first seen in the specified 
> grouping sets.
>  a,b,d included = excludes c = 2; expected gid=2, received gid=1
>  a,d included = excludes b=4, c=2 expected gid=6, received gid=5
> The grouping_id that actually is expected is (a,b,d,c) 
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, 
> bin(grouping_id(a,b,d,c)) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
>  columns forming grouping_id seem to be created as the grouping sets are 
> identified rather than by ordinal position in the parent query.
> I'd like to at least point out that grouping_id is documented in many other 
> RDBMSs, and I believe the Spark project should use a policy of flipping the 
> bits so 1=inclusion; 0=exclusion in the grouping set.
> However, many RDBMSs that do have a grouping_id implement it 
> by the ordinal position of the fields recognized in the select clause, rather 
> than allocating them as they are observed in the grouping sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31519:
--
Labels: correctness  (was: )

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Cast in having aggregate expressions returns the wrong result.
> See the below tests: 
> {code:java}
> scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val query = """
>  | select sum(a) as b, '2020-01-01' as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+----------+
> |  b|      fake|
> +---+----------+
> |  2|2020-01-01|
> +---+----------+
> scala> val query = """
>  | select sum(a) as b, cast('2020-01-01' as date) as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+----+
> |  b|fake|
> +---+----+
> +---+----+
> {code}
> The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING 
> query, and Spark has a special analyzer rule ResolveAggregateFunctions to 
> resolve the aggregate functions and grouping columns in the Filter operator.
>  
> It works for simple cases in a very tricky way as it relies on rule execution 
> order:
> 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes 
> inside aggregate functions, but the function itself is still unresolved as 
> it's an UnresolvedFunction. This stops resolving the Filter operator as the 
> child Aggregate operator is still unresolved.
> 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate 
> operator resolved.
> 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child 
> is a resolved Aggregate. This rule can correctly resolve the grouping columns.
>  
> In the example query, I put a CAST, which needs to be resolved by rule 
> ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 
> 3 as the Aggregate operator is unresolved at that time. Then the analyzer 
> starts next round and the Filter operator is resolved by ResolveReferences, 
> which wrongly resolves the grouping columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31534) Text for tooltip should be escaped

2020-04-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31534:
--
Fix Version/s: 3.0.0

> Text for tooltip should be escaped
> --
>
> Key: SPARK-31534
> URL: https://issues.apache.org/jira/browse/SPARK-31534
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.0.0, 3.1.0
>
>
> Timeline View for application and job, and DAG Viz for job show tooltip but 
> its text are not escaped for HTML so they cannot be shown properly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29048) Query optimizer slow when using Column.isInCollection() with a large size collection

2020-04-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094627#comment-17094627
 ] 

Dongjoon Hyun commented on SPARK-29048:
---

This is reverted via 
https://github.com/apache/spark/commit/b7cabc80e6df523f0377b651fdbdc2a669c11550

> Query optimizer slow when using Column.isInCollection() with a large size 
> collection
> 
>
> Key: SPARK-29048
> URL: https://issues.apache.org/jira/browse/SPARK-29048
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Weichen Xu
>Priority: Major
>
> Query optimizer slow when using Column.isInCollection() with a large size 
> collection.
> The query optimizer takes a long time to do its thing and on the UI all I see 
> is "Running commands". This can take from 10s of minutes to 11 hours 
> depending on how many values there are.
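
A minimal sketch of the reported pattern (the sizes and the SparkSession {{spark}} are hypothetical); with a collection far larger than spark.sql.optimizer.inSetConversionThreshold, the time is spent analyzing and optimizing the huge predicate rather than executing it:
{code:scala}
import spark.implicits._

// Hypothetical large membership list; reports mention tens of thousands of values.
val ids = (1 to 100000).map(_.toString)
val df = Seq("1", "42", "999999").toDF("x")

// explain() alone already exercises the analysis/optimization path that is slow here.
df.filter($"x".isInCollection(ids)).explain()
{code}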



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29048) Query optimizer slow when using Column.isInCollection() with a large size collection

2020-04-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29048:
--
Fix Version/s: (was: 3.0.0)

> Query optimizer slow when using Column.isInCollection() with a large size 
> collection
> 
>
> Key: SPARK-29048
> URL: https://issues.apache.org/jira/browse/SPARK-29048
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Weichen Xu
>Priority: Major
>
> Query optimizer slow when using Column.isInCollection() with a large size 
> collection.
> The query optimizer takes a long time to do its thing and on the UI all I see 
> is "Running commands". This can take from 10s of minutes to 11 hours 
> depending on how many values there are.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-29048) Query optimizer slow when using Column.isInCollection() with a large size collection

2020-04-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-29048:
---
  Assignee: (was: Weichen Xu)

> Query optimizer slow when using Column.isInCollection() with a large size 
> collection
> 
>
> Key: SPARK-29048
> URL: https://issues.apache.org/jira/browse/SPARK-29048
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Query optimizer slow when using Column.isInCollection() with a large size 
> collection.
> The query optimizer takes a long time to do its thing and on the UI all I see 
> is "Running commands". This can take from 10s of minutes to 11 hours 
> depending on how many values there are.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31404) file source backward compatibility after calendar switch

2020-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31404:

Summary: file source backward compatibility after calendar switch  (was: 
file source backward compatibility issues after switching to Proleptic 
Gregorian calendar)

> file source backward compatibility after calendar switch
> 
>
> Key: SPARK-31404
> URL: https://issues.apache.org/jira/browse/SPARK-31404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> In Spark 3.0, we switch to the Proleptic Gregorian calendar by using the Java 
> 8 datetime APIs. This makes Spark follow the ISO and SQL standard, but 
> introduces some backward compatibility problems:
> 1. may read wrong data from the data files written by Spark 2.4
> 2. may have perf regression



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31404) file source backward compatibility issues after switching to Proleptic Gregorian calendar

2020-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31404:

Summary: file source backward compatibility issues after switching to 
Proleptic Gregorian calendar  (was: backward compatibility issues after 
switching to Proleptic Gregorian calendar)

> file source backward compatibility issues after switching to Proleptic 
> Gregorian calendar
> -
>
> Key: SPARK-31404
> URL: https://issues.apache.org/jira/browse/SPARK-31404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> In Spark 3.0, we switch to the Proleptic Gregorian calendar by using the Java 
> 8 datetime APIs. This makes Spark follow the ISO and SQL standard, but 
> introduces some backward compatibility problems:
> 1. may read wrong data from the data files written by Spark 2.4
> 2. may have perf regression



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31597) extracting day from intervals should be interval.days + days in interval.microsecond

2020-04-28 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094560#comment-17094560
 ] 

Kent Yao commented on SPARK-31597:
--

work log manually [https://github.com/apache/spark/pull/28396]

> extracting day from intervals should be interval.days + days in 
> interval.microsecond
> 
>
> Key: SPARK-31597
> URL: https://issues.apache.org/jira/browse/SPARK-31597
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Checked with both Presto and PostgreSQL: one implements intervals with 
> ANSI-style year-month/day-time, the other is mixed and non-ANSI. They both 
> add the whole days carried in the interval's time part to the total days 
> returned when extracting DAY from interval values.
>  
> ```sql
> presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - 
> cast('2020-01-01 00:00:00' as timestamp)));
>  _col0
> ---
>  14
> (1 row)
> Query 20200428_135239_0_ahn7x, FINISHED, 1 node
> Splits: 17 total, 17 done (100.00%)
> 0:01 [0 rows, 0B] [0 rows/s, 0B/s]
> presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - 
> cast('2020-01-01 00:00:01' as timestamp)));
>  _col0
> ---
>  13
> (1 row)
> Query 20200428_135246_1_ahn7x, FINISHED, 1 node
> Splits: 17 total, 17 done (100.00%)
> 0:00 [0 rows, 0B] [0 rows/s, 0B/s]
> presto>
> ```
> ```sql
> postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) 
> - cast('2020-01-01 00:00:00' as timestamp)));
>  date_part
> ---
>  14
> (1 row)
> postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) 
> - cast('2020-01-01 00:00:01' as timestamp)));
>  date_part
> ---
>  13
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31597) extracting day from intervals should be interval.days + days in interval.microsecond

2020-04-28 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094560#comment-17094560
 ] 

Kent Yao edited comment on SPARK-31597 at 4/28/20, 2:30 PM:


work logged manually [https://github.com/apache/spark/pull/28396]


was (Author: qin yao):
work log manually [https://github.com/apache/spark/pull/28396]

> extracting day from intervals should be interval.days + days in 
> interval.microsecond
> 
>
> Key: SPARK-31597
> URL: https://issues.apache.org/jira/browse/SPARK-31597
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Checked with both Presto and PostgreSQL: one implements intervals with 
> ANSI-style year-month/day-time, the other is mixed and non-ANSI. They both 
> add the whole days carried in the interval's time part to the total days 
> returned when extracting DAY from interval values.
>  
> ```sql
> presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - 
> cast('2020-01-01 00:00:00' as timestamp)));
>  _col0
> ---
>  14
> (1 row)
> Query 20200428_135239_0_ahn7x, FINISHED, 1 node
> Splits: 17 total, 17 done (100.00%)
> 0:01 [0 rows, 0B] [0 rows/s, 0B/s]
> presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - 
> cast('2020-01-01 00:00:01' as timestamp)));
>  _col0
> ---
>  13
> (1 row)
> Query 20200428_135246_1_ahn7x, FINISHED, 1 node
> Splits: 17 total, 17 done (100.00%)
> 0:00 [0 rows, 0B] [0 rows/s, 0B/s]
> presto>
> ```
> ```sql
> postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) 
> - cast('2020-01-01 00:00:00' as timestamp)));
>  date_part
> ---
>  14
> (1 row)
> postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) 
> - cast('2020-01-01 00:00:01' as timestamp)));
>  date_part
> ---
>  13
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31553) Wrong result of isInCollection for large collections

2020-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31553:
---

Assignee: Maxim Gekk

> Wrong result of isInCollection for large collections
> 
>
> Key: SPARK-31553
> URL: https://issues.apache.org/jira/browse/SPARK-31553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: correctness
>
> If the size of a collection passed to isInCollection is bigger than 
> spark.sql.optimizer.inSetConversionThreshold, the method can return wrong 
> results for some inputs. For example:
> {code:scala}
> val set = (0 to 20).map(_.toString).toSet
> val data = Seq("1").toDF("x")
> println(set.contains("1"))
> data.select($"x".isInCollection(set).as("isInCollection")).show()
> {code}
> {code}
> true
> +--------------+
> |isInCollection|
> +--------------+
> |         false|
> +--------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31553) Wrong result of isInCollection for large collections

2020-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31553.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28388
[https://github.com/apache/spark/pull/28388]

> Wrong result of isInCollection for large collections
> 
>
> Key: SPARK-31553
> URL: https://issues.apache.org/jira/browse/SPARK-31553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> If the size of a collection passed to isInCollection is bigger than 
> spark.sql.optimizer.inSetConversionThreshold, the method can return wrong 
> results for some inputs. For example:
> {code:scala}
> val set = (0 to 20).map(_.toString).toSet
> val data = Seq("1").toDF("x")
> println(set.contains("1"))
> data.select($"x".isInCollection(set).as("isInCollection")).show()
> {code}
> {code}
> true
> +--------------+
> |isInCollection|
> +--------------+
> |         false|
> +--------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31586) Replace expression TimeSub(l, r) with TimeAdd(l -r)

2020-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31586.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28381
[https://github.com/apache/spark/pull/28381]

> Replace expression TimeSub(l, r) with TimeAdd(l -r)
> ---
>
> Key: SPARK-31586
> URL: https://issues.apache.org/jira/browse/SPARK-31586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.1.0
>
>
> The implementation of TimeSub for the operation of subtracting an interval from 
> a timestamp largely duplicates TimeAdd. We can replace it with TimeAdd(l, 
> -r) since the two are equivalent. 
> Suggestion from 
> https://github.com/apache/spark/pull/28310#discussion_r414259239
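
A hedged illustration of the equivalence the rewrite relies on (assuming a SparkSession {{spark}}): subtracting an interval from a timestamp should give the same result as adding the negated interval.
{code:scala}
// ts - i
spark.sql("SELECT timestamp'2020-01-15 00:00:00' - interval 1 day").show(false)

// ts + (-i): negating the interval and adding it should be equivalent, which is
// what replacing TimeSub(l, r) with TimeAdd(l, -r) depends on.
spark.sql("SELECT timestamp'2020-01-15 00:00:00' + (- interval 1 day)").show(false)
{code}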



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31586) Replace expression TimeSub(l, r) with TimeAdd(l -r)

2020-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31586:
---

Assignee: Kent Yao

> Replace expression TimeSub(l, r) with TimeAdd(l -r)
> ---
>
> Key: SPARK-31586
> URL: https://issues.apache.org/jira/browse/SPARK-31586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
>
> The implementation of TimeSub for the operation of subtracting an interval from 
> a timestamp largely duplicates TimeAdd. We can replace it with TimeAdd(l, 
> -r) since the two are equivalent. 
> Suggestion from 
> https://github.com/apache/spark/pull/28310#discussion_r414259239



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31597) extracting day from intervals should be interval.days + days in interval.microsecond

2020-04-28 Thread Kent Yao (Jira)
Kent Yao created SPARK-31597:


 Summary: extracting day from intervals should be interval.days + 
days in interval.microsecond
 Key: SPARK-31597
 URL: https://issues.apache.org/jira/browse/SPARK-31597
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao


Checked with both Presto and PostgreSQL: one implements intervals with 
ANSI-style year-month/day-time, the other is mixed and non-ANSI. They both add 
the whole days carried in the interval's time part to the total days returned 
when extracting DAY from interval values.

 

```sql

presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - 
cast('2020-01-01 00:00:00' as timestamp)));
 _col0
---
 14
(1 row)

Query 20200428_135239_0_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - 
cast('2020-01-01 00:00:01' as timestamp)));
 _col0
---
 13
(1 row)

Query 20200428_135246_1_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

presto>

```

```sql

postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - 
cast('2020-01-01 00:00:00' as timestamp)));
 date_part
---
 14
(1 row)

postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - 
cast('2020-01-01 00:00:01' as timestamp)));
 date_part
---
 13

```
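
For reference, a sketch of the corresponding Spark-side query (assuming a SparkSession {{spark}}; the current Spark output is deliberately not shown). The proposal is that DAY should come out as 13 here, matching Presto and PostgreSQL, by adding the whole days carried in interval.microseconds to interval.days:
{code:scala}
spark.sql("""
  SELECT EXTRACT(DAY FROM (timestamp'2020-01-15 00:00:00' -
                           timestamp'2020-01-01 00:00:01'))
""").show(false)
// Presto and PostgreSQL both return 13 for the equivalent expression (see above).
{code}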



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31596) Generate SQL Configurations from hive module to configuration doc

2020-04-28 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-31596:
-
Description: ATT

> Generate SQL Configurations from hive module to configuration doc
> -
>
> Key: SPARK-31596
> URL: https://issues.apache.org/jira/browse/SPARK-31596
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> ATT



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31596) Generate SQL Configurations from hive module to configuration doc

2020-04-28 Thread Kent Yao (Jira)
Kent Yao created SPARK-31596:


 Summary: Generate SQL Configurations from hive module to 
configuration doc
 Key: SPARK-31596
 URL: https://issues.apache.org/jira/browse/SPARK-31596
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31595) Spark sql cli should allow unescaped quote mark in quoted string

2020-04-28 Thread Adrian Wang (Jira)
Adrian Wang created SPARK-31595:
---

 Summary: Spark sql cli should allow unescaped quote mark in quoted 
string
 Key: SPARK-31595
 URL: https://issues.apache.org/jira/browse/SPARK-31595
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Adrian Wang


spark-sql> select "'";
spark-sql> select '"';

In the Spark parser, if we pass the text `select "'";`, there will be a 
ParserCancellationException, which will be handled by PredictionMode.LL. By 
dropping the trailing `;` correctly we can avoid that retry.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31594) Do not display rand/randn seed numbers in schema

2020-04-28 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094362#comment-17094362
 ] 

Takeshi Yamamuro commented on SPARK-31594:
--

I'm working on this https://github.com/apache/spark/pull/28392

> Do not display rand/randn seed numbers in schema
> 
>
> Key: SPARK-31594
> URL: https://issues.apache.org/jira/browse/SPARK-31594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31594) Do not display rand/randn seed numbers in schema

2020-04-28 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-31594:
-
Summary: Do not display rand/randn seed numbers in schema  (was: Do not 
display rand/randn seed in schema)

> Do not display rand/randn seed numbers in schema
> 
>
> Key: SPARK-31594
> URL: https://issues.apache.org/jira/browse/SPARK-31594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31594) Do not display rand/randn seed in schema

2020-04-28 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-31594:


 Summary: Do not display rand/randn seed in schema
 Key: SPARK-31594
 URL: https://issues.apache.org/jira/browse/SPARK-31594
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Takeshi Yamamuro






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31593) Remove unnecessary streaming query progress update

2020-04-28 Thread Genmao Yu (Jira)
Genmao Yu created SPARK-31593:
-

 Summary: Remove unnecessary streaming query progress update
 Key: SPARK-31593
 URL: https://issues.apache.org/jira/browse/SPARK-31593
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.5, 3.0.0
Reporter: Genmao Yu


The Structured Streaming progress reporter will always report an `empty` progress 
update when there is no new data. By design, we should only provide such progress 
updates every 10s (default) when there is no new data.
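
For context, a hedged sketch of the knob behind the 10s default mentioned above; the setting name spark.sql.streaming.noDataProgressEventInterval (in milliseconds) and the SparkSession {{spark}} are assumptions here, not confirmed by this ticket:
{code:scala}
// When a trigger sees no new data, a progress update should only be emitted if at
// least this much time has passed since the last no-data update (default 10000 ms).
spark.conf.set("spark.sql.streaming.noDataProgressEventInterval", "30000")
{code}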



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31592) bufferPoolsBySize in HeapMemoryAllocator should be thread safe

2020-04-28 Thread Yunbo Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yunbo Fan updated SPARK-31592:
--
Description: 
Currently, bufferPoolsBySize in HeapMemoryAllocator uses a Map type whose value 
type is LinkedList.

LinkedList is not thread safe and may hit the error below
{code:java}
java.util.NoSuchElementException
at java.util.LinkedList.removeFirst(LinkedList.java:270) 
at java.util.LinkedList.remove(LinkedList.java:685)
at 
org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57){code}
 

 

  was:
Currently, bufferPoolsBySize in HeapMemoryAllocator uses a Map type whose value 
type is LinkedList.

LinkedList is not thread safe and may hit the error below

 
{code:java}
java.util.NoSuchElementException
at java.util.LinkedList.removeFirst(LinkedList.java:270) 
at java.util.LinkedList.remove(LinkedList.java:685)
at 
org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57){code}
 

 


> bufferPoolsBySize in HeapMemoryAllocator should be thread safe
> --
>
> Key: SPARK-31592
> URL: https://issues.apache.org/jira/browse/SPARK-31592
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Yunbo Fan
>Priority: Major
>
> Currently, bufferPoolsBySize in HeapMemoryAllocator uses a Map type whose 
> value type is LinkedList.
> LinkedList is not thread safe and may hit the error below
> {code:java}
> java.util.NoSuchElementException
> at java.util.LinkedList.removeFirst(LinkedList.java:270) 
> at java.util.LinkedList.remove(LinkedList.java:685)
> at 
> org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31592) bufferPoolsBySize in HeapMemoryAllocator should be thread safe

2020-04-28 Thread Yunbo Fan (Jira)
Yunbo Fan created SPARK-31592:
-

 Summary: bufferPoolsBySize in HeapMemoryAllocator should be thread 
safe
 Key: SPARK-31592
 URL: https://issues.apache.org/jira/browse/SPARK-31592
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.5
Reporter: Yunbo Fan


Currently, bufferPoolsBySize in HeapMemoryAllocator uses a Map type whose value 
type is LinkedList.

LinkedList is not thread safe and may hit the error below

 
{code:java}
java.util.NoSuchElementException
at java.util.LinkedList.removeFirst(LinkedList.java:270) 
at java.util.LinkedList.remove(LinkedList.java:685)
at 
org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57){code}
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26924) Fix CRAN hack as soon as Arrow is available on CRAN

2020-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26924.
--
Resolution: Duplicate

> Fix CRAN hack as soon as Arrow is available on CRAN
> ---
>
> Key: SPARK-26924
> URL: https://issues.apache.org/jira/browse/SPARK-26924
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Arrow optimization was added but Arrow is not available on CRAN.
> So, it had to add some hacks to avoid CRAN check in SparkR side. For example, 
> see 
> https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1
> These should be removed to properly check CRAN in SparkR
> See also ARROW-3204



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31583) grouping_id calculation should be improved

2020-04-28 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094284#comment-17094284
 ] 

Takeshi Yamamuro commented on SPARK-31583:
--

[~cpiliotis] Hi, thanks for your report! Just a question; you proposed the two 
things below in this JIRA?

 - reordering bit positions in grouping_id corresponding to a projection list 
in select
 - flipping the current output in grouping_id

> grouping_id calculation should be improved
> --
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Costas Piliotis
>Priority: Minor
>
> Unrelated to SPARK-21858 which identifies that grouping_id is determined by 
> exclusion from a grouping_set rather than inclusion, when performing complex 
> grouping_sets that are not in the order of the base select statement, 
> flipping the bit in the grouping_id seems to happen when the grouping set 
> is identified rather than when the columns are selected in the SQL.   I will 
> of course use the exclusion strategy identified in SPARK-21858 as the 
> baseline for this.  
>  
> {code:scala}
> import spark.implicits._
> val df= Seq(
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> expected to have these references in the grouping_id:
>  d=1
>  c=2
>  b=4
>  a=8
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
>  
>  In other words, I would have expected the excluded values one way but I 
> received them excluded in the order they were first seen in the specified 
> grouping sets.
>  a,b,d included = excludes c = 2; expected gid=2, received gid=1
>  a,d included = excludes b=4, c=2 expected gid=6, received gid=5
> The grouping_id that actually is expected is (a,b,d,c) 
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, 
> bin(grouping_id(a,b,d,c)) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
>  columns forming grouping_id seem to be created as the grouping sets are 
> identified rather than by ordinal position in the parent query.
> I'd like to at least point out that grouping_id is documented in many other 
> RDBMSs, and I believe the Spark project should use a policy of flipping the 
> bits so 1=inclusion; 0=exclusion in the grouping set.
> However, many RDBMSs that do have a grouping_id implement it 
> by the ordinal position of the fields recognized in the select clause, rather 
> than allocating them as they are observed in the grouping sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31583) grouping_id calculation should be improved

2020-04-28 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094284#comment-17094284
 ] 

Takeshi Yamamuro edited comment on SPARK-31583 at 4/28/20, 8:28 AM:


[~cpiliotis] Hi, thanks for your report! Just to check; you proposed the two 
things below in this JIRA?

 - reordering bit positions in grouping_id corresponding to a projection list 
in select
 - flipping the current output in grouping_id


was (Author: maropu):
[~cpiliotis] Hi, thanks for your report! Just a question; you proposed the two 
things below in this JIRA?

 - reordering bit positions in grouping_id corresponding to a projection list 
in select
 - flipping the current output in grouping_id

> grouping_id calculation should be improved
> --
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Costas Piliotis
>Priority: Minor
>
> Unrelated to SPARK-21858 which identifies that grouping_id is determined by 
> exclusion from a grouping_set rather than inclusion, when performing complex 
> grouping_sets that are not in the order of the base select statement, 
> flipping the bit in the grouping_id seems to happen when the grouping set 
> is identified rather than when the columns are selected in the SQL.   I will 
> of course use the exclusion strategy identified in SPARK-21858 as the 
> baseline for this.  
>  
> {code:scala}
> import spark.implicits._
> val df= Seq(
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> expected to have these references in the grouping_id:
>  d=1
>  c=2
>  b=4
>  a=8
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
>  
>  In other words, I would have expected the excluded values one way but I 
> received them excluded in the order they were first seen in the specified 
> grouping sets.
>  a,b,d included = excludes c = 2; expected gid=2, received gid=1
>  a,d included = excludes b=4, c=2 expected gid=6, received gid=5
> The grouping_id that actually is expected is (a,b,d,c) 
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, 
> bin(grouping_id(a,b,d,c)) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
>  columns forming grouping_id seem to be created as the grouping sets are 
> identified rather than by ordinal position in the parent query.
> I'd like to at least point out that grouping_id is documented in many other 
> RDBMSs, and I believe the Spark project should use a policy of flipping the 
> bits so 1=inclusion; 0=exclusion in the grouping set.
> However, many RDBMSs that do have a grouping_id implement it 
> by the ordinal position of the fields recognized in the select clause, rather 
> than allocating them as they are observed in the grouping sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31573) Use fixed=TRUE where possible for internal efficiency

2020-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31573:
-
Issue Type: Bug  (was: Documentation)

> Use fixed=TRUE where possible for internal efficiency
> -
>
> Key: SPARK-31573
> URL: https://issues.apache.org/jira/browse/SPARK-31573
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 2.4.5
>Reporter: Michael Chirico
>Assignee: Michael Chirico
>Priority: Minor
> Fix For: 3.0.0
>
>
> gsub('_', '', x) is more efficient if we signal there's no regex: gsub('_', 
> '', x, fixed = TRUE)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31573) Use fixed=TRUE where possible for internal efficiency

2020-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31573.
--
  Assignee: Michael Chirico
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/28367

> Use fixed=TRUE where possible for internal efficiency
> -
>
> Key: SPARK-31573
> URL: https://issues.apache.org/jira/browse/SPARK-31573
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 2.4.5
>Reporter: Michael Chirico
>Assignee: Michael Chirico
>Priority: Minor
>
> gsub('_', '', x) is more efficient if we signal there's no regex: gsub('_', 
> '', x, fixed = TRUE)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31573) Use fixed=TRUE where possible for internal efficiency

2020-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31573:
-
Fix Version/s: 3.0.0

> Use fixed=TRUE where possible for internal efficiency
> -
>
> Key: SPARK-31573
> URL: https://issues.apache.org/jira/browse/SPARK-31573
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 2.4.5
>Reporter: Michael Chirico
>Assignee: Michael Chirico
>Priority: Minor
> Fix For: 3.0.0
>
>
> gsub('_', '', x) is more efficient if we signal there's no regex: gsub('_', 
> '', x, fixed = TRUE)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31519.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28294
[https://github.com/apache/spark/pull/28294]

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> Cast in having aggregate expressions returns the wrong result.
> See the below tests: 
> {code:java}
> scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val query = """
>  | select sum(a) as b, '2020-01-01' as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+----------+
> |  b|      fake|
> +---+----------+
> |  2|2020-01-01|
> +---+----------+
> scala> val query = """
>  | select sum(a) as b, cast('2020-01-01' as date) as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+----+
> |  b|fake|
> +---+----+
> +---+----+
> {code}
> The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING 
> query, and Spark has a special analyzer rule ResolveAggregateFunctions to 
> resolve the aggregate functions and grouping columns in the Filter operator.
>  
> It works for simple cases in a very tricky way as it relies on rule execution 
> order:
> 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes 
> inside aggregate functions, but the function itself is still unresolved as 
> it's an UnresolvedFunction. This stops resolving the Filter operator as the 
> child Aggregate operator is still unresolved.
> 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate 
> operator resolved.
> 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child 
> is a resolved Aggregate. This rule can correctly resolve the grouping columns.
>  
> In the example query, I put a CAST, which needs to be resolved by rule 
> ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 
> 3 as the Aggregate operator is unresolved at that time. Then the analyzer 
> starts next round and the Filter operator is resolved by ResolveReferences, 
> which wrongly resolves the grouping columns.
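Until the fix lands, one possible workaround for queries like the one above, where the HAVING predicate only references the grouping column, is to filter before aggregating with WHERE; for that case the query is equivalent and never hits the ResolveAggregateFunctions path. A sketch (illustrative only, not part of the patch):

{code:scala}
// Hypothetical rewrite of the failing query: `b` in the HAVING clause is meant to
// resolve to the grouping column t.b, so filtering before the aggregation is
// semantically equivalent here.
spark.sql("""
  select sum(a) as b, cast('2020-01-01' as date) as fake
  from t
  where b > 10
  group by b
""").show()
// Expected output, matching the non-CAST query above:
// +---+----------+
// |  b|      fake|
// +---+----------+
// |  2|2020-01-01|
// +---+----------+
{code}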



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31519:
---

Assignee: Yuanjian Li

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>
> Cast in having aggregate expressions returns the wrong result.
> See the below tests: 
> {code:java}
> scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val query = """
>  | select sum(a) as b, '2020-01-01' as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+--+
> |  b|  fake|
> +---+--+
> |  2|2020-01-01|
> +---+--+
> scala> val query = """
>  | select sum(a) as b, cast('2020-01-01' as date) as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---++
> |  b|fake|
> +---++
> +---++
> {code}
> The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING 
> query, and Spark has a special analyzer rule ResolveAggregateFunctions to 
> resolve the aggregate functions and grouping columns in the Filter operator.
>  
> It works for simple cases in a very tricky way as it relies on rule execution 
> order:
> 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes 
> inside aggregate functions, but the function itself is still unresolved as 
> it's an UnresolvedFunction. This stops resolving the Filter operator as the 
> child Aggregate operator is still unresolved.
> 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate 
> operator resolved.
> 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child 
> is a resolved Aggregate. This rule can correctly resolve the grouping columns.
>  
> In the example query, I put a CAST, which needs to be resolved by rule 
> ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 
> 3 as the Aggregate operator is unresolved at that time. Then the analyzer 
> starts next round and the Filter operator is resolved by ResolveReferences, 
> which wrongly resolves the grouping columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30868) Throw Exception if runHive(sql) failed

2020-04-28 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094274#comment-17094274
 ] 

Yuming Wang commented on SPARK-30868:
-

Issue resolved by pull request 27644
https://github.com/apache/spark/pull/27644

> Throw Exception if runHive(sql) failed
> --
>
> Key: SPARK-30868
> URL: https://issues.apache.org/jira/browse/SPARK-30868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Jackey Lee
>Assignee: Jackey Lee
>Priority: Major
> Fix For: 3.0.0
>
>
> At present, HiveClientImpl.runHive does not throw an exception when a command 
> fails, so the error is silently swallowed and never reported back to the 
> caller.
> Example
> {code:scala}
> spark.sql("add jar file:///tmp/test.jar")
> spark.sql("show databases").show()
> {code}
> /tmp/test.jar does not exist, so the ADD JAR command fails. However, this code 
> runs to completion without causing an application failure.
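A minimal sketch of the behaviour the fix is after, using hypothetical stand-ins (NativeResult and runNative are not the real Hive driver API): check the command status and surface the error instead of swallowing it.

{code:scala}
// `NativeResult` and `runNative` are hypothetical stand-ins for the Hive driver
// call inside HiveClientImpl.runHive; the point is the status check and the throw.
case class NativeResult(code: Int, error: String, output: Seq[String])

def runAndCheck(cmd: String)(runNative: String => NativeResult): Seq[String] = {
  val result = runNative(cmd)
  if (result.code != 0) {
    // Fail loudly so callers such as spark.sql("add jar ...") see the error.
    throw new RuntimeException(
      s"Hive command failed (exit code ${result.code}): $cmd\n${result.error}")
  }
  result.output
}
{code}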



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30868) Throw Exception if runHive(sql) failed

2020-04-28 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-30868.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Throw Exception if runHive(sql) failed
> --
>
> Key: SPARK-30868
> URL: https://issues.apache.org/jira/browse/SPARK-30868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Jackey Lee
>Assignee: Jackey Lee
>Priority: Major
> Fix For: 3.0.0
>
>
> At present, HiveClientImpl.runHive does not throw an exception when a command 
> fails, so the error is silently swallowed and never reported back to the 
> caller.
> Example
> {code:scala}
> spark.sql("add jar file:///tmp/test.jar")
> spark.sql("show databases").show()
> {code}
> /tmp/test.jar does not exist, so the ADD JAR command fails. However, this code 
> runs to completion without causing an application failure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30868) Throw Exception if runHive(sql) failed

2020-04-28 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-30868:
---

Assignee: Jackey Lee

> Throw Exception if runHive(sql) failed
> --
>
> Key: SPARK-30868
> URL: https://issues.apache.org/jira/browse/SPARK-30868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Jackey Lee
>Assignee: Jackey Lee
>Priority: Major
>
> At present, HiveClientImpl.runHive does not throw an exception when a command 
> fails, so the error is silently swallowed and never reported back to the 
> caller.
> Example
> {code:scala}
> spark.sql("add jar file:///tmp/test.jar")
> spark.sql("show databases").show()
> {code}
> /tmp/test.jar does not exist, so the ADD JAR command fails. However, this code 
> runs to completion without causing an application failure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31524) Add metric to the split number for skew partition when enable AQE

2020-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31524.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28109
[https://github.com/apache/spark/pull/28109]

> Add metric to the split  number for skew partition when enable AQE
> --
>
> Key: SPARK-31524
> URL: https://issues.apache.org/jira/browse/SPARK-31524
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.1.0
>
>
> Add detailed metrics for the number of splits per skewed partition when AQE 
> and skew join optimization are enabled.
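For context, the new metric only applies when adaptive execution and its skew-join handling are both enabled. A configuration sketch using the Spark 3.x keys (the exact wording of the metric in the UI is up to the PR):

{code:scala}
// Enable AQE and skew-join optimization so skewed partitions are split at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

// After a join over skewed data runs, the SQL UI node for the join should
// additionally report how many splits each skewed partition was divided into.
{code}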



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31524) Add metric to the split number for skew partition when enable AQE

2020-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31524:
---

Assignee: Ke Jia

> Add metric to the split  number for skew partition when enable AQE
> --
>
> Key: SPARK-31524
> URL: https://issues.apache.org/jira/browse/SPARK-31524
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
>
> Add detailed metrics for the number of splits per skewed partition when AQE 
> and skew join optimization are enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26199) Long expressions cause mutate to fail

2020-04-28 Thread Michael Chirico (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094232#comment-17094232
 ] 

Michael Chirico commented on SPARK-26199:
-

Just saw this. 

https://issues.apache.org/jira/browse/SPARK-31517

is a duplicate of this.

PR to fix it is here:

https://github.com/apache/spark/pull/28386

I'll tag this Jira as well.

> Long expressions cause mutate to fail
> -
>
> Key: SPARK-26199
> URL: https://issues.apache.org/jira/browse/SPARK-26199
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: João Rafael
>Priority: Minor
>
> Calling {{mutate(df, field = expr)}} fails when expr is very long.
> Example:
> {code:R}
> df <- mutate(df, field = ifelse(
> lit(TRUE),
> lit("A"),
> ifelse(
> lit(T),
> lit("BB"),
> lit("C")
> )
> ))
> {code}
> Stack trace:
> {code:R}
> FATAL subscript out of bounds
>   at .handleSimpleError(function (obj) 
> {
> level = sapply(class(obj), sw
>   at FUN(X[[i]], ...)
>   at lapply(seq_along(args), function(i) {
> if (ns[[i]] != "") {
> at lapply(seq_along(args), function(i) {
> if (ns[[i]] != "") {
> at mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T), lit("BBB
>   at #78: mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T
> {code}
> The root cause is in: 
> [DataFrame.R#L2182|https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L2182]
> When the expression is long {{deparse}} returns multiple lines, causing 
> {{args}} to have more elements than {{ns}}. The solution could be to set 
> {{nlines = 1}} or to collapse the lines together.
> A simple work around exists, by first placing the expression in a variable 
> and using it instead:
> {code:R}
> tmp <- ifelse(
> lit(TRUE),
> lit("A"),
> ifelse(
> lit(T),
> lit("BB"),
> lit("C")
> )
> )
> df <- mutate(df, field = tmp)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31586) Replace expression TimeSub(l, r) with TimeAdd(l -r)

2020-04-28 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094231#comment-17094231
 ] 

Kent Yao commented on SPARK-31586:
--

Hi [~Ankitraj], the PR is ready: [https://github.com/apache/spark/pull/28381]

> Replace expression TimeSub(l, r) with TimeAdd(l -r)
> ---
>
> Key: SPARK-31586
> URL: https://issues.apache.org/jira/browse/SPARK-31586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> The implementation of TimeSub for subtracting an interval from a timestamp is 
> almost a duplicate of TimeAdd. We can replace it with TimeAdd(l, -r) since the 
> two are equivalent. 
> Suggestion from 
> https://github.com/apache/spark/pull/28310#discussion_r414259239
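The equivalence the ticket relies on, checked at the SQL level (an illustrative query, not the patch itself):

{code:scala}
// Subtracting an interval should give the same result as adding the negated interval.
spark.sql("""
  select
    timestamp '2020-04-28 00:00:00' - interval 1 day as via_time_sub,
    timestamp '2020-04-28 00:00:00' + interval -1 day as via_time_add
""").show(false)
// Both columns are expected to show 2020-04-27 00:00:00.
{code}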



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31589) Use `r-lib/actions/setup-r` in GitHub Action

2020-04-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31589:
--
Fix Version/s: 2.4.6

> Use `r-lib/actions/setup-r` in GitHub Action
> 
>
> Key: SPARK-31589
> URL: https://issues.apache.org/jira/browse/SPARK-31589
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> `r-lib/actions/setup-r` is a more stable and actively maintained third-party action.
> I filed this issue as a `Bug` since the branch is currently broken.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31583) grouping_id calculation should be improved

2020-04-28 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-31583:
-
Component/s: (was: Spark Core)
 SQL

> grouping_id calculation should be improved
> --
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Costas Piliotis
>Priority: Minor
>
> This is unrelated to SPARK-21858, which identifies that grouping_id is determined by 
> exclusion from a grouping set rather than inclusion. When performing complex 
> grouping sets that are not in the order of the base select statement, 
> the bit in the grouping_id seems to be flipped when the grouping set 
> is identified rather than when the columns are selected in the SQL. I will 
> of course use the exclusion strategy identified in SPARK-21858 as the 
> baseline for this.
>  
> {code:scala}
> import spark.implicits._
> val df= Seq(
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> expected to have these references in the grouping_id:
>  d=1
>  c=2
>  b=4
>  a=8
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
>  
>  In other words, I expected the excluded columns to map to fixed bit positions, but 
> the bits were assigned in the order the columns were first seen in the specified 
> grouping sets.
>  a,b,d included = excludes c=2; expected gid=2, received gid=1
>  a,d included = excludes b=4, c=2; expected gid=6, received gid=5
> The grouping_id that actually is expected is (a,b,d,c) 
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, 
> bin(grouping_id(a,b,d,c)) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
>  The columns forming the grouping_id seem to be assigned as the grouping sets are 
> identified, rather than by ordinal position in the parent query.
> I'd like to at least point out that grouping_id is documented in many other 
> RDBMSs, and I believe the Spark project should adopt a policy of flipping the 
> bits so that 1=inclusion and 0=exclusion in the grouping set.
> However, many RDBMSs that do have a grouping_id implement it 
> by the ordinal position of the fields in the select clause, rather 
> than allocating bits as the columns are observed in the grouping sets.
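Restating the expected arithmetic from the description as a small, self-contained check (plain Scala, only to make the bit weights explicit; the weights a=8, b=4, c=2, d=1 are the reporter's assumption):

{code:scala}
// gid is the sum of the weights of the *excluded* columns (Spark's current convention).
val weights = Map("a" -> 8, "b" -> 4, "c" -> 2, "d" -> 1)
def expectedGid(included: Set[String]): Int =
  weights.collect { case (col, w) if !included.contains(col) => w }.sum

expectedGid(Set("a", "b", "d"))  // 2  -> only c excluded; the report observed 1
expectedGid(Set("a", "d"))       // 6  -> b and c excluded; the report observed 5
expectedGid(Set("a", "c"))       // 5  -> b and d excluded
expectedGid(Set.empty[String])   // 15 -> every column excluded
{code}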



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31583) grouping_id calculation should be improved

2020-04-28 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-31583:
-
Affects Version/s: (was: 2.4.5)
   3.1.0

> grouping_id calculation should be improved
> --
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Costas Piliotis
>Priority: Minor
>
> This is unrelated to SPARK-21858, which identifies that grouping_id is determined by 
> exclusion from a grouping set rather than inclusion. When performing complex 
> grouping sets that are not in the order of the base select statement, 
> the bit in the grouping_id seems to be flipped when the grouping set 
> is identified rather than when the columns are selected in the SQL. I will 
> of course use the exclusion strategy identified in SPARK-21858 as the 
> baseline for this.
>  
> {code:scala}
> import spark.implicits._
> val df= Seq(
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> expected to have these references in the grouping_id:
>  d=1
>  c=2
>  b=4
>  a=8
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
>  
>  In other words, I expected the excluded columns to map to fixed bit positions, but 
> the bits were assigned in the order the columns were first seen in the specified 
> grouping sets.
>  a,b,d included = excludes c=2; expected gid=2, received gid=1
>  a,d included = excludes b=4, c=2; expected gid=6, received gid=5
> The grouping_id that actually is expected is (a,b,d,c) 
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, 
> bin(grouping_id(a,b,d,c)) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
>  The columns forming the grouping_id seem to be assigned as the grouping sets are 
> identified, rather than by ordinal position in the parent query.
> I'd like to at least point out that grouping_id is documented in many other 
> RDBMSs, and I believe the Spark project should adopt a policy of flipping the 
> bits so that 1=inclusion and 0=exclusion in the grouping set.
> However, many RDBMSs that do have a grouping_id implement it 
> by the ordinal position of the fields in the select clause, rather 
> than allocating bits as the columns are observed in the grouping sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31586) Replace expression TimeSub(l, r) with TimeAdd(l -r)

2020-04-28 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094217#comment-17094217
 ] 

Ankit Raj Boudh commented on SPARK-31586:
-

Hi Kent Yao, are you working on this issue? If not, can I start working on it?

> Replace expression TimeSub(l, r) with TimeAdd(l -r)
> ---
>
> Key: SPARK-31586
> URL: https://issues.apache.org/jira/browse/SPARK-31586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> The implementation of TimeSub for subtracting an interval from a timestamp is 
> almost a duplicate of TimeAdd. We can replace it with TimeAdd(l, -r) since the 
> two are equivalent. 
> Suggestion from 
> https://github.com/apache/spark/pull/28310#discussion_r414259239



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31591) namePrefix could be null in Utils.createDirectory

2020-04-28 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094215#comment-17094215
 ] 

Lantao Jin commented on SPARK-31591:


https://github.com/apache/spark/pull/28385

> namePrefix could be null in Utils.createDirectory
> -
>
> Key: SPARK-31591
> URL: https://issues.apache.org/jira/browse/SPARK-31591
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Minor
>
> In our production environment, we find that many shuffle files are located in
> /hadoop/2/yarn/local/usercache/b_carmel/appcache/application_1586487864336_4602/*null*-107d4e9c-d3c7-419e-9743-a21dc4eaeb3f/3a
> The Utils.createDirectory() method uses a default parameter "spark"
> {code}
>   def createDirectory(root: String, namePrefix: String = "spark"): File = {
> {code}
> But in some cases, the actual namePrefix is null. If the method is called 
> with null, then the default value would not be applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31591) namePrefix could be null in Utils.createDirectory

2020-04-28 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094214#comment-17094214
 ] 

Lantao Jin commented on SPARK-31591:


[~Ankitraj] I have already filed a PR.

> namePrefix could be null in Utils.createDirectory
> -
>
> Key: SPARK-31591
> URL: https://issues.apache.org/jira/browse/SPARK-31591
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Minor
>
> In our production environment, we find that many shuffle files are located in
> /hadoop/2/yarn/local/usercache/b_carmel/appcache/application_1586487864336_4602/*null*-107d4e9c-d3c7-419e-9743-a21dc4eaeb3f/3a
> The Utils.createDirectory() method uses a default parameter "spark"
> {code}
>   def createDirectory(root: String, namePrefix: String = "spark"): File = {
> {code}
> But in some cases, the actual namePrefix is null. If the method is called 
> with null, then the default value would not be applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31591) namePrefix could be null in Utils.createDirectory

2020-04-28 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094210#comment-17094210
 ] 

Ankit Raj Boudh commented on SPARK-31591:
-

Hi [~cltlfcjin], I will start working on this issue.

> namePrefix could be null in Utils.createDirectory
> -
>
> Key: SPARK-31591
> URL: https://issues.apache.org/jira/browse/SPARK-31591
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Minor
>
> In our production environment, we find that many shuffle files are located in
> /hadoop/2/yarn/local/usercache/b_carmel/appcache/application_1586487864336_4602/*null*-107d4e9c-d3c7-419e-9743-a21dc4eaeb3f/3a
> The Utils.createDirectory() method uses a default parameter "spark"
> {code}
>   def createDirectory(root: String, namePrefix: String = "spark"): File = {
> {code}
> But in some cases, the actual namePrefix is null. If the method is called 
> with null, then the default value would not be applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31517) SparkR::orderBy with multiple columns descending produces error

2020-04-28 Thread Michael Chirico (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094200#comment-17094200
 ] 

Michael Chirico commented on SPARK-31517:
-

The issue is the use of deparse() in mutate; over() is longer than the 
default width.cutoff, so sapply() returns > 1 element.

> SparkR::orderBy with multiple columns descending produces error
> ---
>
> Key: SPARK-31517
> URL: https://issues.apache.org/jira/browse/SPARK-31517
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.5
> Environment: Databricks Runtime 6.5
>Reporter: Ross Bowen
>Priority: Major
>
> When specifying two columns within an `orderBy()` function, in an attempt to 
> order by both columns in descending order, an error is returned.
> {code:java}
> library(magrittr) 
> library(SparkR) 
> cars <- cbind(model = rownames(mtcars), mtcars) 
> carsDF <- createDataFrame(cars) 
> carsDF %>% 
>   mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), 
> desc(column("mpg")), desc(column("disp"))))) %>% 
>   head() {code}
> This returns an error:
> {code:java}
>  Error in ns[[i]] : subscript out of bounds{code}
> This seems to be related to the more general issue that the following code, 
> which excludes the use of the `desc()` function, also fails:
> {code:java}
> carsDF %>% 
>   mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), 
> column("mpg"), column("disp" %>% 
>   head(){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31517) SparkR::orderBy with multiple columns descending produces error

2020-04-28 Thread Michael Chirico (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094193#comment-17094193
 ] 

Michael Chirico commented on SPARK-31517:
-

Separating the window into its own step works:

window = over(rank(), orderBy(windowPartitionBy(column("cyl")), 
  desc(column("mpg")), desc(column("disp"))))
carsDF %>% 
  mutate(rank = window) %>% 
  head() 

So there's something in the logic of mutate that doesn't handle the nested call.

> SparkR::orderBy with multiple columns descending produces error
> ---
>
> Key: SPARK-31517
> URL: https://issues.apache.org/jira/browse/SPARK-31517
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.5
> Environment: Databricks Runtime 6.5
>Reporter: Ross Bowen
>Priority: Major
>
> When specifying two columns within an `orderBy()` function, in an attempt to 
> order by both columns in descending order, an error is returned.
> {code:java}
> library(magrittr) 
> library(SparkR) 
> cars <- cbind(model = rownames(mtcars), mtcars) 
> carsDF <- createDataFrame(cars) 
> carsDF %>% 
>   mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), 
> desc(column("mpg")), desc(column("disp"))))) %>% 
>   head() {code}
> This returns an error:
> {code:java}
>  Error in ns[[i]] : subscript out of bounds{code}
> This seems to be related to the more general issue that the following code, 
> which excludes the use of the `desc()` function, also fails:
> {code:java}
> carsDF %>% 
>   mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), 
> column("mpg"), column("disp" %>% 
>   head(){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31583) grouping_id calculation should be improved

2020-04-28 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094188#comment-17094188
 ] 

Takeshi Yamamuro commented on SPARK-31583:
--

ok, I'll take a look.

> grouping_id calculation should be improved
> --
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Costas Piliotis
>Priority: Minor
>
> This is unrelated to SPARK-21858, which identifies that grouping_id is determined by 
> exclusion from a grouping set rather than inclusion. When performing complex 
> grouping sets that are not in the order of the base select statement, 
> the bit in the grouping_id seems to be flipped when the grouping set 
> is identified rather than when the columns are selected in the SQL. I will 
> of course use the exclusion strategy identified in SPARK-21858 as the 
> baseline for this.
>  
> {code:scala}
> import spark.implicits._
> val df= Seq(
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> expected to have these references in the grouping_id:
>  d=1
>  c=2
>  b=4
>  a=8
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
>  
>  In other words, I expected the excluded columns to map to fixed bit positions, but 
> the bits were assigned in the order the columns were first seen in the specified 
> grouping sets.
>  a,b,d included = excludes c=2; expected gid=2, received gid=1
>  a,d included = excludes b=4, c=2; expected gid=6, received gid=5
> The grouping_id that actually is expected is (a,b,d,c) 
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, 
> bin(grouping_id(a,b,d,c)) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
>  The columns forming the grouping_id seem to be assigned as the grouping sets are 
> identified, rather than by ordinal position in the parent query.
> I'd like to at least point out that grouping_id is documented in many other 
> RDBMSs, and I believe the Spark project should adopt a policy of flipping the 
> bits so that 1=inclusion and 0=exclusion in the grouping set.
> However, many RDBMSs that do have a grouping_id implement it 
> by the ordinal position of the fields in the select clause, rather 
> than allocating bits as the columns are observed in the grouping sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31591) namePrefix could be null in Utils.createDirectory

2020-04-28 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-31591:
--

 Summary: namePrefix could be null in Utils.createDirectory
 Key: SPARK-31591
 URL: https://issues.apache.org/jira/browse/SPARK-31591
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Lantao Jin


In our production environment, we find that many shuffle files are located in
/hadoop/2/yarn/local/usercache/b_carmel/appcache/application_1586487864336_4602/*null*-107d4e9c-d3c7-419e-9743-a21dc4eaeb3f/3a

The Utils.createDirectory() method uses a default parameter "spark"
{code}
  def createDirectory(root: String, namePrefix: String = "spark"): File = {
{code}
But in some cases, the actual namePrefix is null. If the method is called with 
null, then the default value would not be applied.
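The underlying Scala behaviour, shown with a simplified stand-in (not the real Utils.createDirectory body): a default parameter only applies when the argument is omitted, not when an explicit null is passed.

{code:scala}
// Simplified stand-in for Utils.createDirectory; "<uuid>" marks where the real
// method appends a random suffix.
def createDirectory(root: String, namePrefix: String = "spark"): String =
  s"$root/$namePrefix-<uuid>"

createDirectory("/hadoop/2/yarn/local")        // "/hadoop/2/yarn/local/spark-<uuid>"
createDirectory("/hadoop/2/yarn/local", null)  // "/hadoop/2/yarn/local/null-<uuid>"
{code}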



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org